The CMIP6 data archive on NCI is complex, and there’s a lot to get right for confident data provenance: models, experiments, ensembles, versions, variables, time frequencies, many file paths, etc.
The move towards data catalogs at NCI offers users exciting improvements in data discovery and analysis workflows, especially when you think about the new xarray.DataTree data structure.
ACCESS-NRI has great docs:
But I’m still a little confused about, and lacking confidence in, the NCI cmip6_fs38 catalog in particular. When working with CMIP6 data, one approach is to figure out all the file paths and glob everything together. As I see it, this has always been done via the “latest” directory, which as I understand it is an ESGF convention that lets users look only at the latest NetCDF files for a particular CMIP dataset.
However, in the catalog approach I’ve seen examples where the "file_type = " search filter is used to remove all the extra versions of NetCDF files that are not “the latest”. Getting this wrong can lead to duplicates for a given search (which will likely become obvious when the file list is opened using xarray) or, worse, might point to the wrong versions of some files.
I’ve seen (and used myself) both “f” and “l” as the file_type catalog search filter. At different times I’ve convinced myself that one or the other is the correct filter to get only the latest files. But I’ve never been able to find any NCI documentation on what “f” or “l” mean or what their purpose as search filters is.
Lately I settled on using “l”, as I assumed that meant “latest” or “link” (the latest directory is made up of logical links). But today, when processing piControl data for ACCESS-ESM1.5, “l” gave a list with duplicates where “f” gave the correct list.
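A quick sanity check for the kind of duplicates described above is to group the returned paths by file name and flag any name that appears more than once. This is only a minimal sketch using the Python standard library. The paths are made-up examples that loosely follow the CMIP6 directory layout, and `find_duplicate_filenames` is a hypothetical helper, not part of any NCI or catalog tooling.

```python
from pathlib import PurePosixPath


def find_duplicate_filenames(paths):
    """Map each file name that occurs more than once to the paths that carry it."""
    by_name = {}
    for path in paths:
        by_name.setdefault(PurePosixPath(path).name, []).append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}


# Hypothetical paths for illustration only (not taken from the real catalog).
root = "/path/to/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/piControl/r1i1p1f1/Amon/tas/gn"
fname = "tas_Amon_ACCESS-ESM1-5_piControl_r1i1p1f1_gn_010101-020012.nc"
example_paths = [
    f"{root}/v20191214/{fname}",
    f"{root}/latest/{fname}",
]

duplicates = find_duplicate_filenames(example_paths)
for name, found in duplicates.items():
    print(f"{name} appears {len(found)} times")
```

In practice the list of paths would come from whatever the catalog search returned. An empty result only shows that no file name is repeated; it does not prove that the files are the latest versions.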
I reached out to NCI and they’ve said that, unfortunately, file_type = can’t be relied on to filter only the “latest” files. Maybe I’m the only one who assumed that file_type provided this function? But if we are going to be able to use the CMIP6 catalog then, IMO, we’ll need some way of filtering for the “latest” files only.
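Until the catalog offers such a filter, one possible stopgap is to pick the newest dated version directory per dataset from the file paths themselves. The sketch below assumes the usual CMIP6 layout where each file sits directly inside a version directory named vYYYYMMDD, and it assumes that the newest dated directory is the one “latest” would point to, which may not always hold. `keep_newest_version` and the example paths are hypothetical, not NCI tooling.

```python
import re
from pathlib import PurePosixPath

# Dated version directories look like v20191214.
VERSION_DIR = re.compile(r"v\d{8}")


def keep_newest_version(paths):
    """Keep only files whose version directory is the newest for their dataset."""
    versioned = []
    newest = {}
    for path in paths:
        version_dir = PurePosixPath(path).parent
        if not VERSION_DIR.fullmatch(version_dir.name):
            continue  # skip "latest" links and anything without a dated version
        dataset = str(version_dir.parent)
        versioned.append((dataset, version_dir.name, path))
        # vYYYYMMDD strings sort chronologically when compared as text.
        if version_dir.name > newest.get(dataset, ""):
            newest[dataset] = version_dir.name
    return [path for dataset, version, path in versioned if newest[dataset] == version]


# Hypothetical paths for illustration only.
dataset_dir = "/path/to/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/piControl/r1i1p1f1/Amon/tas/gn"
candidate_paths = [
    f"{dataset_dir}/v20191214/tas_example.nc",
    f"{dataset_dir}/v20210316/tas_example.nc",
    f"{dataset_dir}/latest/tas_example.nc",
]

selected = keep_newest_version(candidate_paths)
print(selected)
```

Note that this drops any dataset that is only reachable through a “latest” path, so it is worth comparing the number of datasets before and after filtering.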
The good news is that NCI have mentioned (without a timeframe) that they are working on an update to the fs38 catalog which, from the sound of it, will (hopefully) implement version = "latest".
If you’ve been using file_type = thinking it was giving you only the latest files, this might be useful to know.