Transition of CABLE to git and GitHub

The ACCESS-NRI is working towards transferring the CABLE software to GitHub. This topic is to discuss and settle the strategy that will be used to transfer CABLE to git and GitHub. Feel free to participate in the discussion by posting replies on this topic. This first post will be updated as the strategy evolves. Everything listed in this post is a proposal open for discussion unless stated otherwise.

In another topic, we’ll discuss the code management strategy for CABLE.

First, it helps to understand the advantages of moving to git and GitHub:

  • the CABLE code can be made truly public which promotes collaboration. Users will be able to access the code and its documentation without the need for an account at NCI.
  • the tools available on the GitHub platform help ensure the integrity of the CABLE code with better tools for reviews, better testing with continuous integration, and better links between code and issues (known as tickets in TRAC).
  • it is easier to collaborate with git than with SVN since different individuals can directly contribute to the same branch of the code.

It is true that git can be harder to grasp than SVN at first, especially the distinction between local and remote repositories. But the benefits of git and GitHub outlast the initial confusion.

To transfer CABLE to GitHub, we have to transfer the following elements, each requiring its own strategy:

  • The source code
  • The tickets from TRAC
  • The CABLE documentation and CABLE wiki from TRAC
  • The data stored with the source code

The strategy for each of these elements is described in individual posts in this topic.

The source code

Git has a tool called git-svn that can transform an SVN repository into a git repository. It is simple to use but decisions have to be made on what to transfer. SVN and git have different branching philosophies. Our CABLE SVN repository has a lot of stale and empty branches that we don’t want. We only want to keep:

  • the currently active branches for development
  • the branches used to run simulations, e.g. branches holding the code base for CABLE in ACCESS-ESM1.5 and CM2
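
As a rough illustration of limiting the conversion to a chosen set of branches, the sketch below assembles a `git svn clone` command line in Python without running it. The repository URL, the target directory and the branch names are placeholders, and it assumes the SVN repository follows the standard trunk/branches/tags layout, which may not match the actual CABLE repository. The ref prefix used in the `--ignore-refs` pattern depends on the git version and options, so the pattern should be checked on a test clone first.

```python
import re
import shlex


def build_git_svn_clone(svn_url, target_dir, keep_branches):
    """Return a `git svn clone` argument list that skips unwanted branches.

    Assumption: the SVN repository uses the standard trunk/branches/tags
    layout and git-svn stores remote refs under refs/remotes/origin/.
    """
    # Negative lookahead: ignore every remote ref except trunk and the
    # branches we explicitly want to keep.
    kept = "|".join(re.escape(name) for name in ["trunk", *keep_branches])
    ignore_refs = rf"^refs/remotes/origin/(?!(?:{kept})$).*$"
    return [
        "git", "svn", "clone",
        "--stdlayout",
        f"--ignore-refs={ignore_refs}",
        svn_url,
        target_dir,
    ]


if __name__ == "__main__":
    # Hypothetical URL and branch names, for illustration only.
    cmd = build_git_svn_clone(
        "https://example.org/svn/cable", "cable-git", ["feature_a", "ESM1.5"]
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps the list of branches to keep in one reviewable place, which fits with compiling that list from users before the transition.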

Some special consideration is required for long-lived branches that currently exist. They most likely fall into two categories:

  • branches with a short-to-medium-term goal to merge with the main version of CABLE
  • branches with no goal to merge with the main version of CABLE

More discussion with the maintainers of these branches is required before the transfer to ensure the solution fits their needs.

Close to the transition date, I’ll contact CABLE users to compile a list of branches of interest.

The tickets from TRAC

TRAC does not provide a way to export complete tickets to an offline document (e.g. spreadsheet, PDF etc.). The description and the metadata of each ticket (author, milestones, status, keywords etc.) can be exported to a spreadsheet, but the comments on each ticket cannot.

TRAC stores its tickets in an SQLite database and there are tools available to transfer those tickets to GitHub issues. This gives us a few options:

  1. Transfer all the tickets to the CABLE GitHub repository
  2. Transfer all the tickets to GitHub, keep the closed tickets in an archive repository and only transfer the active tickets to the CABLE GitHub repository
  3. Keep the tickets in a database and provide an interface to search and access the tickets.

More experimentation with the existing tools to transfer the tickets is needed to understand the viability of each option.
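
To give a feel for what such an experiment could look like, here is a minimal sketch that reads tickets and their comments from an SQLite database and splits them into open and closed sets, as option 2 would require. The table and column names are a simplified assumption modelled on the usual Trac layout (`ticket` and `ticket_change`) and must be checked against the real database; the demo rows are invented. It does not call the GitHub API.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the TRAC database (invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ticket (id INTEGER PRIMARY KEY, summary TEXT,
                             status TEXT, description TEXT);
        CREATE TABLE ticket_change (ticket INTEGER, time INTEGER, author TEXT,
                                    field TEXT, newvalue TEXT);
        INSERT INTO ticket VALUES
            (1, 'Example closed ticket', 'closed', 'First description'),
            (2, 'Example open ticket', 'new', 'Second description');
        INSERT INTO ticket_change VALUES
            (1, 10, 'user_a', 'comment', 'Fixed in trunk'),
            (1, 5, 'user_b', 'comment', 'Looking into it'),
            (1, 11, 'user_a', 'status', 'closed'),
            (2, 20, 'user_c', 'comment', 'Still an issue');
        """
    )
    return conn


def load_tickets(conn):
    """Return {ticket id: ticket dict}, each with its comments in time order."""
    tickets = {}
    for tid, summary, status, description in conn.execute(
        "SELECT id, summary, status, description FROM ticket ORDER BY id"
    ):
        tickets[tid] = {
            "summary": summary,
            "status": status,
            "description": description,
            "comments": [],
        }
    # Assumption: comments are the ticket_change rows whose field is 'comment'.
    for tid, author, text in conn.execute(
        "SELECT ticket, author, newvalue FROM ticket_change "
        "WHERE field = 'comment' AND newvalue <> '' ORDER BY ticket, time"
    ):
        if tid in tickets:
            tickets[tid]["comments"].append((author, text))
    return tickets


def split_by_status(tickets):
    """Separate tickets into (open, closed) dictionaries."""
    open_tickets = {k: v for k, v in tickets.items() if v["status"] != "closed"}
    closed_tickets = {k: v for k, v in tickets.items() if v["status"] == "closed"}
    return open_tickets, closed_tickets


if __name__ == "__main__":
    open_tickets, closed_tickets = split_by_status(load_tickets(make_demo_db()))
    print(sorted(open_tickets), sorted(closed_tickets))
```

Keeping the reading step separate from the upload step means the same extracted data could feed any of the three options.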

Data

Various forms of data are stored within the CABLE SVN repository in addition to the CABLE source code:

  1. Tumbarumba test case
  2. User scripts to prepare or post-process data for/from CABLE
  3. Ancillary input data for CABLE: spatial data already formatted for CABLE to use as initial conditions and static data.
  4. Archival data

We propose to separate the CABLE source code from any other data so that the CABLE repository is only for CABLE itself and its documentation.

Tumbarumba test case

The Tumbarumba test case is currently bundled with the part of CABLE’s source code specific to offline simulations. This was done so people can simply get the source code, compile it and directly run the test case. However, it makes it very hard to distinguish which files are part of the source code and which are part of the experiment setup and output. It is much cleaner to run experiments from a directory separate from the source code.

We propose to move the Tumbarumba test case to a separate git repository. This repository would contain all the data needed to run CABLE at the Tumbarumba flux site. This method allows us to create a template for sharing test cases. We can then easily extend the collection of test cases for CABLE. The template would have to be adapted for spatial simulations as we wouldn’t share large datasets (e.g. meteorological forcing) via GitHub but it could follow the same principle.

In addition, on a shared server (at NCI for example), this setup makes it possible to provide a pre-compiled CABLE executable, for example as a module. Users then don’t have to worry about the source code and can simply deal with the input information.

User scripts

There are currently no scripts distributed with the trunk version of CABLE. But other branches include various scripts in addition to the CABLE code.

For branches that need to be transferred to the GitHub CABLE repository, user scripts should ideally be moved to their own GitHub repositories.

Scripts used by a single user
In this case, the user can choose the solution they prefer. The only constraints are that the scripts cannot stay within the CABLE repository and that the SVN repository will not stay around forever. The proposed solution in this case is for the user to move their scripts to repositories under their own GitHub account.

Scripts used and developed by a team
For these scripts, it is important to keep the possibility of collaborative development. We propose that the developers of these scripts move them to repositories under the CABLE LSM GitHub organisation.

Ancillary input data

This is the data that is stored under CABLE-AUX. We propose to manage this type of data like reference datasets following the standards for data management. This would allow us to satisfy two requirements:

  1. access to the data from any machine for anyone who wants to use CABLE
  2. versioning of the data independently from the CABLE source code

This will take some time to put in place. We propose to have a transition period when the CABLE code will be on GitHub while the ancillary data will still be sourced from the SVN repository (under CABLE-AUX).

Archival data

Some people may have used the CABLE repository to archive their model setup, for example to comply with journal requirements. This means we cannot remove access to the SVN repository for some time. See the discussion about the future of the SVN repository for more details.

The proposal for the future is for users to use their institutional archival systems and/or GitHub and/or Zenodo.

Documentation and wiki

The CABLE documentation and the other information on the TRAC wiki are mostly out of date, although some of it is still relevant. Work has started to rejuvenate this documentation in a GitHub repository using MkDocs and FORD. The plan is for this GitHub repository to eventually become the home of the CABLE source code as well as its documentation. The documentation is published on GitHub Pages: Welcome to the CABLE Land Surface Model documentation - CABLE Docs

MkDocs and FORD were chosen because they let us keep the documentation with the code, so changes to the documentation can be contributed at the same time as the corresponding changes to the code. In particular, FORD allows us to document the science in the CABLE model directly within the source code of CABLE, which makes updating the scientific documentation seamless for developers.
Also, MkDocs and FORD both use the Markdown syntax (used by GitHub and this forum as well), which lowers the barrier to contribution.

Future of the SVN repository

Once the transition is completed, we need to ensure:

  • users move to the git repository quickly
  • it is clear the SVN repository is not to be used anymore

We will need to keep the SVN repository for some time after the transition to GitHub. During that time, it would be good to restrict access to read-only.

We will need to keep the SVN repository accessible in some form for 7 years as researchers may have referenced this repository in their published papers. Some more investigation is required to decide what is needed here.

Hi Claire,
Nice work on all of this. Given we’re going to need to keep the SVN repository as an archive, it seems like there’s very little to be lost in migrating to the git repo earlier rather than later. Getting users / user culture shifted over will be the issue. Once we know everyone has engaged we’ll probably know very quickly what can be left behind and what needs to be mapped across somehow. My preference would be to bring as little baggage as possible into the new system, and consult the older system if/when we need to, especially given it won’t be in the same format in the new system, so some of the context of the original material may be lost. My 2C :)

I agree. My only hesitation in transferring early is whether it is best to:

  1. Transfer now and then change the culture as we go (e.g. adding CI, PR templates with requirements for some testing, science review as well as code review,…).
  2. Add all these requirements from the start and use the transition to GitHub as the signal to shift habits.

I think the second way gives us a sharp transition that encourages the culture shift better than a slow evolution would. We can communicate and provide training more easily if everything comes at once rather than explaining each bit one by one. The danger is that users may perceive the sudden changes as too much to handle at once.

I also would like to figure out the tickets as we need the open tickets to move over, and there are too many to transfer by hand. For this, I need access to the TRAC database that NCI is in the process of setting up.

How many users of the SVN repository do we think are actually active developers? I occasionally browse some of the code via the TRAC page, but wouldn’t remember if/when I ever checked anything in or out, so perhaps it is a relatively small number whom we will need to target for the culture shift.

Looking at the commits for the last 6 months, I could count 16 different user IDs. Most are regular committers who would quickly learn the rules (hopefully for them and me). A few are occasional so it might be trickier for them. For the occasional committers, it might be easier if the changes in development practices apply all at once but they could come after the transition.
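
For anyone wanting to repeat this kind of count, here is a minimal sketch that tallies distinct committers from the XML produced by `svn log --xml`. The sample log, user IDs and cutoff date are invented for illustration, and it assumes every log entry carries an `author` and a UTC `date` with fractional seconds.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def active_committers(svn_log_xml, since):
    """Count commits per author for log entries dated on or after `since`."""
    counts = {}
    for entry in ET.fromstring(svn_log_xml).iter("logentry"):
        author = entry.findtext("author")
        date_text = entry.findtext("date")
        if author is None or date_text is None:
            continue  # skip entries without author or date information
        when = datetime.strptime(date_text, "%Y-%m-%dT%H:%M:%S.%fZ")
        when = when.replace(tzinfo=timezone.utc)
        if when >= since:
            counts[author] = counts.get(author, 0) + 1
    return counts


# Invented sample in the shape of `svn log --xml` output.
SAMPLE_LOG = """<log>
<logentry revision="90"><author>user_c</author>
<date>2022-05-01T01:00:00.000000Z</date><msg>old change</msg></logentry>
<logentry revision="101"><author>user_a</author>
<date>2023-01-10T02:00:00.000000Z</date><msg>change 1</msg></logentry>
<logentry revision="102"><author>user_b</author>
<date>2023-02-01T03:00:00.000000Z</date><msg>change 2</msg></logentry>
<logentry revision="103"><author>user_a</author>
<date>2023-03-05T04:00:00.000000Z</date><msg>change 3</msg></logentry>
</log>"""

if __name__ == "__main__":
    cutoff = datetime(2022, 12, 1, tzinfo=timezone.utc)
    counts = active_committers(SAMPLE_LOG, cutoff)
    print(len(counts), "active committers:", counts)
```

The per-author commit counts also give a rough way to separate regular from occasional committers.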