[ CATALOGS ] CMIP6 data analysis at NCI // "file_type = " issue

The CMIP6 data archive on NCI is obviously complex and there’s a lot there to get right for confident data provenance - models, experiments, ensembles, versions, variables, time frequencies, many file paths, etc.

The move towards data catalogs at NCI offers users improvements in data discovery and analysis workflows that are exciting, especially when you think about the new xarray.DataTree data structure.

ACCESS-NRI has great docs:

But I’m still a little confused by, and lacking confidence in, the NCI cmip6_fs38 catalog in particular. When working with CMIP6 data, one approach is to work out all the file paths and glob everything together. As far as I can tell, this has always been done via the “latest” directory, which I understand to be an ESGF convention that lets users look only at the latest NetCDF files for a particular CMIP dataset.
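For anyone who hasn’t seen the glob approach, here is a minimal sketch of what I mean. The directory order follows the CMIP6 DRS (activity / institution / source / experiment / member / table / variable / grid / version). The helper name `latest_files` is made up for this post, and the root path and ID strings in the example are my assumptions about the fs38 layout, so check them on Gadi before relying on them.

```python
from glob import glob
from pathlib import Path


def latest_files(root, activity, institution, source, experiment,
                 member, table, variable, grid="gn"):
    """Glob the NetCDF files under the 'latest' directory of one dataset.

    Hypothetical helper: it only builds a DRS-style path and globs it.
    Returns an empty list if the 'latest' directory does not exist.
    """
    pattern = (Path(root) / activity / institution / source / experiment
               / member / table / variable / grid / "latest" / "*.nc")
    return sorted(glob(str(pattern)))


if __name__ == "__main__":
    # Assumed root and IDs; adjust to what is actually on disk.
    files = latest_files(
        "/g/data/fs38/publications/CMIP6",
        "CMIP", "CSIRO", "ACCESS-ESM1-5", "piControl",
        "r1i1p1f1", "Amon", "tas",
    )
    print(len(files), "files found")
```

The sorted list can then be passed to something like `xarray.open_mfdataset`. If the list comes back empty, that is a hint the dataset has no “latest” directory at all.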

However, in the catalog approach I’ve seen examples where the "file_type = " search filter is used to remove all the extra versions of NetCDF files that are not “the latest”. Getting this wrong can lead to duplicates for a given search (which will likely become obvious when the file list is opened using xarray) or, worse, might point to the wrong versions of some files.

I’ve seen (and used myself) both the “f” and the “l” file_type search filters. At different times I’ve convinced myself that one or the other is the correct filter to get only the latest files. But I’ve never been able to find any NCI documentation on what “f” or “l” mean, or what the filter is intended for.

Lately I settled on “l”, as I assumed it meant “latest” or “link” (the latest directory is made up of symbolic links). But today, when processing piControl data for ACCESS-ESM1.5, “l” gave a list with duplicates where “f” gave the correct list. :thinking:
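In the meantime, one way to sanity-check a file list (however it was produced) is to look at the paths themselves. This is only a sketch: it assumes each file sits directly inside a `vYYYYMMDD` version directory, that the newest version directory holds the complete dataset, and that any paths going through “latest” can be ignored. Both function names are made up for this post.

```python
import re
from collections import Counter, defaultdict
from pathlib import PurePosixPath

VERSION_RE = re.compile(r"^v\d{8}$")  # e.g. v20191115


def duplicated_filenames(paths):
    """Return file names that appear more than once in the list."""
    counts = Counter(PurePosixPath(p).name for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


def keep_newest_version(paths):
    """Keep only files from the newest vYYYYMMDD directory of each dataset.

    Paths whose parent directory is not a dated version directory
    (for example 'latest') are skipped; that is an assumption of this sketch.
    """
    by_dataset = defaultdict(lambda: defaultdict(list))
    for p in paths:
        path = PurePosixPath(p)
        version = path.parent.name
        if not VERSION_RE.match(version):
            continue
        dataset = str(path.parent.parent)  # everything above the version dir
        by_dataset[dataset][version].append(str(path))

    selected = []
    for versions in by_dataset.values():
        newest = max(versions)  # vYYYYMMDD strings sort chronologically
        selected.extend(versions[newest])
    return sorted(selected)
```

If `duplicated_filenames` returns anything for a single-variable, single-member search, the filter has let more than one version through, and `keep_newest_version` gives a list to compare against.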

I reached out to NCI and they’ve said that, unfortunately, file_type = can’t be relied on to filter only the “latest” files. Maybe I’m the only one making the assumption that file_type provided this function? But if we are going to be able to use the CMIP6 catalog, IMO, we’ll need some way of filtering for the “latest” files only.

The good news is that NCI have mentioned (without a timeframe) that they are working on an update to the fs38 catalog which, from the sound of it, will (hopefully) implement version = "latest".

If you’ve been using file_type = thinking it was giving you only the latest files, this might be useful.

I might also note here that replicas of other ensembles don’t seem to have a “latest” link. I was poking around with historical MIROC6 monthly 3D ocean temperature (thetao) on Gadi the other day, and there seems to be only one version per ensemble member. There are definitely different versions for different ensemble members, though, and I can’t find a “latest” symlink.
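A quick way to check this for any dataset is to list what sits in the grid-level directory. A minimal sketch, assuming version directories are named `vYYYYMMDD` and that “latest”, when present, is an entry called `latest` in the same directory (the function name is made up):

```python
import re
from pathlib import Path

VERSION_RE = re.compile(r"^v\d{8}$")


def describe_versions(grid_dir):
    """List dated version directories and report whether 'latest' exists.

    grid_dir is the directory that holds the version directories,
    i.e. .../<variable>/<grid>/ in the DRS layout.
    """
    grid_dir = Path(grid_dir)
    versions = sorted(
        entry.name for entry in grid_dir.iterdir()
        if entry.is_dir() and VERSION_RE.match(entry.name)
    )
    latest = grid_dir / "latest"
    # is_symlink() also catches a dangling link that exists() would miss
    has_latest = latest.exists() or latest.is_symlink()
    return {"versions": versions, "has_latest": has_latest}


# Usage (path elided on purpose, fill in the dataset you are checking):
# print(describe_versions(".../Omon/thetao/gn"))
```

Running this over each ensemble member shows which members have a “latest” entry and which only have dated version directories.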
