Adding WOA23 to the main catalogue

Hi all,

I noticed that the intake catalogue contains datasets from World Ocean Atlas 2013, however there’s an updated WOA23 on /g/data/av17/access-nri/OM3/woa23. Can WOA23 please be added to the main catalogue? Thanks.

@CharlesTurner

Heya Polina,

WOA23 should be in the main catalog: see attached screencap.

If you can’t see it, I think that’s going to be because you’re using an old version of analysis3 - can you let me know what version you are using? Assuming that is the case, it’ll be an easy fix and it should be available to older versions of analysis3 in a few hours.

Right, if I use the new conda environment, I can see WOA23 in the catalogue. Thanks. FYI, I was on analysis3/25.11 before. Newer versions produce very noisy warnings when loading packages and starting a dask client; it’s been brought up on Hive before, and I think @rbeucher is aware of it.

FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml

Thanks for the info.

For posterity, what’s happening is that we’ve now created a reformatted version of the catalog, which uses parquet (a different file format, with a bunch of benefits over csv) rather than csv files. Older versions of the software don’t know how to read this, so I’ve set things up to create separate catalogs - csv and parquet, with the choice of environment dictating which catalog you get.

analysis3/25.11 (and all previous versions of the environment) must be hooking into the csv version of the catalog, rather than the newer parquet version.

I’m rebuilding the csv catalog to bring it in line with the parquet one now, and I’ll update the build process to update both catalogs so they don’t get out of sync again.

I’ll update once it’s done & I’ve checked WOA23 is available in the older environments again.

Thanks for the info, Charles.
Any ideas what’s wrong here? I can’t load a variable with .to_dask().

catalog = intake.cat.access_nri 
woa = catalog.search(name='WOA23')
t_an = woa.search(variable='t_an').to_dask()

Error:

NotImplementedError                       Traceback (most recent call last)
Cell In[6], line 1
----> 1 t_an = woa.search(variable='t_an').to_dask()
      2 t_an

File /g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/intake/source/base.py:191, in DataSourceBase.to_dask(self)
    189 def to_dask(self):
    190     """Return a dask container for this data source"""
--> 191     raise NotImplementedError

NotImplementedError: 

My script’s here /g/data/x77/ps7863/DeepArgo/DeepArgo_scripts/open_woa_ds.ipynb

I’m not 100% sure, but I think the offender is this line

catalog = intake.cat.access_nri 
woa = catalog.search(name='WOA23') # HERE
t_an = woa.search(variable='t_an').to_dask()

try:

- woa = catalog.search(name='WOA23')
+ woa = catalog['WOA23']

or

- woa = catalog.search(name='WOA23')
+ woa = catalog.search(name='WOA23').to_source()
  • Searching the top level catalog (intake.cat.access_nri.search(xyz...)) returns another intake-dataframe-catalog object, which is the container for experiments (AKA ESM-Datastores).
  • Indexing into the top level catalog intake.cat.access_nri[xyz...], gives the intake-esm source you’re after.
  • Doing intake.cat.access_nri.search(xyz...).to_source() does the same thing as intake.cat.access_nri[xyz...].
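For illustration only, the container-vs-source distinction above can be mimicked with a tiny stand-in (these classes are hypothetical toys, not the real intake API):

```python
class Source:
    """Stand-in for an intake-esm datastore: the thing that can load data."""

    def __init__(self, name):
        self.name = name

    def to_dask(self):
        return f"dask dataset for {self.name}"


class Catalog:
    """Stand-in for an intake-dataframe-catalog: a container of sources."""

    def __init__(self, sources):
        self._sources = sources

    def search(self, name):
        # search() narrows the container but still returns a Catalog,
        # which has no .to_dask() -- hence the NotImplementedError above.
        return Catalog({k: v for k, v in self._sources.items() if k == name})

    def __getitem__(self, name):
        # Indexing returns the underlying source directly.
        return self._sources[name]

    def to_source(self):
        # Only valid when exactly one entry remains after searching.
        (only,) = self._sources.values()
        return only


catalog = Catalog({"WOA23": Source("WOA23")})

# Either of these yields the loadable source:
woa_a = catalog["WOA23"]
woa_b = catalog.search(name="WOA23").to_source()
print(woa_a.to_dask())  # -> "dask dataset for WOA23"
print(woa_b.to_dask())  # -> "dask dataset for WOA23"
```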

Let me know if that fixes it!

Okay, I’ve updated the csv catalog - you should be able to access WOA23 from old analysis3 versions now.

Thanks Charles!

Hey Charles, there’s an issue with loading variables from WOA23 datasets in intake. It’s a lot to copy here, so I created a notebook demonstrating the errors; please see /g/data/x77/ps7863/DeepArgo/scripts/open_woa_intake.ipynb

What I encountered is that there are two file_ids in each variable, but loading all of them using .to_dataset_dict(), as well as refining the search by file_id, gives errors.

JFYI, I got what I needed from WOA23 by reading the NetCDF files directly, so the fix isn’t urgent for me.

Just had a little prod, I can fix these errors by making the following changes:

- woa.search(variable='t_an', frequency='fx').to_dataset_dict()
+ woa.search(variable='t_an', frequency='fx').to_dataset_dict(xarray_open_kwargs={"decode_times":False})

and likewise in to_dask() calls.

I’m not quite sure what the precise cause of this will be - presumably intake-esm is setting some defaults surrounding date handling that are different to vanilla xarray. I’ll dig into that and come back and let you know (and if the defaults seem silly, we can change them).
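As a sketch of how a time-decoding failure like this can arise: the "months since" units below are my assumption about a WOA-style encoding (not confirmed from the actual files), but xarray’s standard decoder cannot convert such units to datetimes, and the error goes away when time decoding is skipped.

```python
import numpy as np
import xarray as xr

# Hypothetical in-memory reproduction: a time axis encoded with units that
# xarray's standard calendar decoding cannot handle.
ds = xr.Dataset(
    {"t_an": ("time", np.arange(3.0))},
    coords={
        "time": ("time", np.arange(3.0), {"units": "months since 1955-01-01"})
    },
)

# decode_cf with default settings is roughly what decode_times=True does
# under the hood; on these units it raises instead of decoding.
decode_failed = False
try:
    xr.decode_cf(ds)
except Exception:
    decode_failed = True
print(decode_failed)

# Skipping time decoding leaves the raw numeric axis intact:
raw = xr.decode_cf(ds, decode_times=False)
print(raw.time.values)  # [0. 1. 2.]
```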

EDIT: I’ve just double checked and in what I assume is the xarray script where you opened these files (/g/data/x77/ps7863/DeepArgo/scripts/open_woa_ds.ipynb), you’ve used decode_times=False:

woa = xr.open_mfdataset(
    '/g/data/av17/access-nri/OM3/woa23/woa23_B5C2*.nc', 
    decode_times=False,
)

If I remove this flag:

woa = xr.open_mfdataset(
    '/g/data/av17/access-nri/OM3/woa23/woa23_B5C2*.nc', 
-    decode_times=False,
)

I get the same error that intake throws.