Datastore making - unidentified realm

Hello

I am getting an error when making a datastore for processed ACCESS-OM2 files. These are density-binned files processed offline from the 01deg_jra55v140_iaf_cycle3_antarctic_tracers run. The error appears to be related to the fact that the files are not identified as having any standard intake catalog “realm”, presumably due to the offline density binning. Code and error message details are below. Does anyone have an idea of how to deal with this?

Thanks,

Aviv

p.s.1 the 01deg_jra55v140_iaf_cycle3_antarctic_tracers run is in the process of being integrated into the intake catalog.

p.s.2 the datastore making script

from access_nri_intake.source import builders
from access_nri_intake.experiment import use_datastore

FoldSigBinned = '/g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/'

FoldDBs = '/scratch/v45/as2408/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/DBs/'

BUILDER = builders.AccessOm2Builder
CATALOG_DIR = FoldDBs
DATASTORE_NAME = "Exp9tracersSigAvgs"

esm_ds = use_datastore(
    builder=BUILDER,
    experiment_dir=FoldSigBinned,
    catalog_dir=CATALOG_DIR,
    datastore_name=DATASTORE_NAME
)

esm_ds

p.s.3 The error message:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[112], line 5
      2 CATALOG_DIR = FoldDBs# Path("~/catalog_dir").expanduser() # We'll save our datastore in a directory called catalog_dir in our home dir
      3 DATASTORE_NAME = "Exp9tracersSigAvgs"
----> 5 esm_ds = use_datastore(
      6     builder=BUILDER,
      7     experiment_dir=FoldSigBinned,
      8     catalog_dir=CATALOG_DIR,
      9     datastore_name=DATASTORE_NAME
     10 )
     12 esm_ds

File /g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/experiment/main.py:122, in use_datastore(experiment_dir, builder, catalog_dir, builder_kwargs, open_ds, datastore_name, description)
    120 builder_instance: Builder = builder(path=str(experiment_dir), **builder_kwargs)
    121 print(f"{f_info}Building esm-datastore...{f_reset}")
--> 122 builder_instance.get_assets().build()
    123 print(f"{f_success}Sucessfully built esm-datastore!{f_reset}")
    124 print(
    125     f"{f_info}Saving esm-datastore to {f_path}{str(catalog_dir.absolute())}{f_reset}"
    126 )

File /g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/builders.py:237, in BaseBuilder.build(self)
    232 def build(self):
    233     """
    234     Builds a datastore from a list of netCDF files or zarr stores.
    235     """
--> 237     self.get_assets().validate_parser().parse().clean_dataframe()
    239     return self

File /g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/builders.py:227, in BaseBuilder.validate_parser(self)
    224         validate_against_schema(info, ESM_JSONSCHEMA)
    225         return self
--> 227 raise ParserError(f"""Parser returns no valid assets.
    228     Try parsing a single file with Builder.parser(file)
    229     Last failed asset: {asset}
    230     Asset parser return: {info}""")

ParserError: Parser returns no valid assets.
            Try parsing a single file with Builder.parser(file)
            Last failed asset: /g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc
            Asset parser return: {'INVALID_ASSET': '/g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc', 'TRACEBACK': 'Traceback (most recent call last):\n  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/builders.py", line 207, in _parser_catch_invalid\n    return cls.parser(file)\n           ^^^^^^^^^^^^^^^^\n  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/builders.py", line 470, in parser\n    raise ParserError(f"Cannot determine realm for file {file}")\naccess_nri_intake.source.builders.ParserError: Cannot determine realm for file /g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc\n'}

Hey Aviv,

Just got brought here via a github notification - thanks for spotting this one.

I’ll have a dig tomorrow when I’m back at work and see if I can figure out the issue.

Cheers, Charles

Great, thank you Charles!

Okay, I’ve had a prod.

Our builder for ACCESS-OM2 assumes that output is structured into the following directory structure:

/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle3_antarctic_tracers/output577
├── atmosphere
│   └── ...
├── ice
│   └── ...
└── ocean
    ├── # Left the real files in - these are the relevant ones here
    ├── adelie_xflux_adv.nc
    ├── adelie_yflux_adv.nc

And then we extract the realm from the directory name. This is why it’s not picking up the realm here.
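As a rough illustration of why the realm lookup fails for a flat directory, here is a minimal sketch of path-based realm detection. The function name `guess_realm_from_path` and the realm set are hypothetical, and this is not the actual access_nri_intake implementation; it only assumes, as described above, that the realm is taken from a directory name in the file's path.

```python
from pathlib import PurePosixPath

# Hypothetical realm names, taken from the directory tree shown above.
KNOWN_REALMS = {"atmosphere", "ice", "ocean"}


def guess_realm_from_path(path):
    """Return the nearest parent directory name that is a known realm, else None."""
    # Walk from the file's own directory outwards.
    for part in reversed(PurePosixPath(path).parent.parts):
        if part in KNOWN_REALMS:
            return part
    return None


standard = (
    "/g/data/ik11/outputs/access-om2-01/"
    "01deg_jra55v140_iaf_cycle3_antarctic_tracers/output577/ocean/adelie_xflux_adv.nc"
)
flat = (
    "/g/data/g40/access-om2/archive/"
    "01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/"
    "passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc"
)

realm_standard = guess_realm_from_path(standard)
realm_flat = guess_realm_from_path(flat)
print(realm_standard)  # ocean
print(realm_flat)      # None
```

The flat archive path has no `ocean` directory anywhere in it, so a lookup of this kind has nothing to find.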

@joshuatorrance has been working on a metadata spec for ACCESS model outputs, I believe that includes the realm as a piece of metadata but I can’t remember 100% (this is why I’ve tagged you Josh!).

I think the surrounding tooling (addmeta I think?) is now at the point where we could use it to add the metadata to those files (if that’s a viable option) and then update the OM2 builder to look for that if necessary?

Alternatively, we could update use_datastore and the surrounding functionality to allow users to specify a realm (or perhaps a realm parsing function?) as a back door for situations like this. I’m not sure how keen I am on that - it feels like we might be opening a can of worms where we start patching hacks onto hacks onto hacks…

Anyway, I’ll wait for Josh to chime in, and have a think about alternative strategies to solve this in the meantime.

Hi Aviv & Charles,

I had a super quick look at a couple of the files and there’s no metadata attached to these data at the moment. addmeta could be used to easily add the metadata but I’m pretty sure we’d still need to tweak the OM2 builder to fall back to using a given file’s metadata if the usual methods to determine realm failed (maybe we should do that regardless).

By far the quickest and easiest fix is, as Charles suggested above, put the .nc files under output000/ocean/*.nc and the builder should be able to parse the realm out of the path.
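For anyone wanting to script that move, here is a minimal sketch using only the standard library. The helper name `move_into_realm_dir` and its `dry_run` flag are made up for illustration; the `output000/ocean` layout is the one suggested above. It is demonstrated on a throwaway temporary directory, not on the real archive.

```python
import shutil
import tempfile
from pathlib import Path


def move_into_realm_dir(experiment_dir, realm="ocean", output="output000", dry_run=True):
    """Move top-level .nc files into <experiment_dir>/<output>/<realm>/.

    With dry_run=True nothing is moved; the planned (source, destination)
    pairs are returned so they can be inspected first.
    """
    experiment_dir = Path(experiment_dir)
    target = experiment_dir / output / realm
    planned = [(src, target / src.name) for src in sorted(experiment_dir.glob("*.nc"))]
    if not dry_run:
        target.mkdir(parents=True, exist_ok=True)
        for src, dst in planned:
            shutil.move(str(src), str(dst))
    return planned


# Demonstration on a temporary directory with an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    name = "passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc"
    (Path(tmp) / name).touch()
    moved = move_into_realm_dir(tmp, dry_run=False)
    new_location_exists = (Path(tmp) / "output000" / "ocean" / name).exists()
    old_location_exists = (Path(tmp) / name).exists()

print(len(moved), new_location_exists, old_location_exists)
```

Running it with `dry_run=True` first gives a chance to check the planned destinations before anything on disk changes.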

@CharlesTurner We should check where we left parsing the realm out of the .nc metadata! If it’s not already there we should tweak things to fall back on file metadata when the other parsing fails.

Josh

Yeah, I was thinking we should probably add that to the builder anyway.

If updating the directory structure is viable, I think that would be the best place to start, as it means we aren’t going to be modifying files.

I’ll drop in a PR with the metadata check in the builders at some point today.

Hi Guys

Thanks for your suggestions!

I moved the netcdf files to /g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/output000/ocean/

Now the “use_datastore” call ended without an error (although warnings were issued - see below).

Unfortunately, the variables were not identified - when I query "Exp9tracerBudgetSigAvgs_datastore.unique().variable", I only get "['xt_ocean', 'yt_ocean', 'xu_ocean', 'yu_ocean']".

The actual tracer diagnostics variables (named below) are missing from the datastore. Is there another possible immediate fix, or is it better to try one of the other suggestions you made?

Thanks,

Aviv

p.s.1

The missing variables are of the form “passive_TRACERNAME_DIAGNOSTICNAME”, where

TRACERNAME is peninsula, weddell, maud, wilkes, prydz, george, adelie, ross, or amundsen

DIAGNOSTICNAME is x/yflux_adv_isopyc, x/y/zflux_adv_diapyc, vdiffuse_impl, or an empty string

p.s.2

The warnings received in running the datastore make function (these warnings repeated a large number of times).

/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/wavespectra/output/ww3.py:25: ResourceWarning: unclosed file <_io.TextIOWrapper name='/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/wavespectra/output/ww3.yml' mode='r' encoding='UTF-8'>
  VAR_ATTRIBUTES = yaml.load(
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/utils.py:440: UserWarning: The frequency '(1, 'mon')' determined from filename does not match the frequency 'fx' determined from the file contents. Using '(1, 'mon')'.
  warnings.warn(f"{msg} Using '{frequency}'.")
/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/utils.py:279: UserWarning: Time coordinate does not include bounds information. Guessing start and end times.

p.s.3

The output of the datastore function call

Sucessfully built esm-datastore!
Saving esm-datastore to /scratch/v45/as2408/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/DBs
Successfully wrote ESM catalog json file to: file:///scratch/v45/as2408/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/DBs/Exp9tracersSigAvgs.json
Hashing catalog to prevent unnecessary rebuilds.
This may take some time...
Catalog sucessfully hashed!
Datastore sucessfully written to /scratch/v45/as2408/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/DBs/Exp9tracersSigAvgs.json!
Please note that this has not added the datastore to the access-nri-intake catalog.
To add to catalog, please run 'scaffold_catalog_entry' for help on how to do so.

Exp9tracersSigAvgs catalog with 9 dataset(s) from 19077 asset(s):

unique
filename 19077
path 19077
file_id 9
frequency 2
start_date 240
end_date 240
variable 4
variable_long_name 3
variable_standard_name 1
variable_cell_methods 1
variable_units 1
realm 1
temporal_label 1
derived_variable 0


(Sorry I didn’t see this yesterday!)

I’ve had a look, and this is because the AccessOm2Builder expects files to have names starting with either ocean or iceh:

class AccessOm2Builder(BaseBuilder):
    """Intake-ESM datastore builder for ACCESS-OM2 COSIMA datasets"""

    PATTERNS = [
        rf"^iceh.*\.({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}).*$",  # ACCESS-ESM1.5/OM2/CM2 ice
        rf"^iceh.*\.(\d{{3}})-{PATTERNS_HELPERS['not_multi_digit']}.*",  # ACCESS-OM2 ice
        rf"^ocean.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)",  # ACCESS-OM2 ocean
        r"^ocean.*[^\d]_(\d{2})$",  # A few wierd files in ACCESS-OM2 01deg_jra55v13_ryf9091
    ]
    ...

From here.

Don’t worry about the specifics if you don’t speak regex, but basically ^iceh.*... says ‘find me a string starting with iceh followed by anything else’.
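To make the prefix anchoring concrete, here is a small sketch with two simplified patterns, `^iceh.*` and `^ocean.*`. These are cut-down stand-ins for the real PATTERNS entries (the date-matching parts are dropped), and the first two filenames are invented examples; only the third is a real name from this thread.

```python
import re

# Simplified stand-ins for the builder's patterns: prefix anchoring only.
prefix_patterns = [r"^iceh.*", r"^ocean.*"]

names = [
    "iceh.1999-12",            # invented ice-style name
    "ocean_month-ym_1999_12",  # invented ocean-style name
    "passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12",  # from this thread
]


def matches_any(name):
    """True if the name matches at least one of the anchored prefix patterns."""
    return any(re.match(pattern, name) for pattern in prefix_patterns)


results = {name: matches_any(name) for name in names}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

The `^` pins the match to the start of the string, so a name beginning with `passive_` cannot match either pattern no matter what follows.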

I’m pretty sure we can just patch this to fix the issue:

from access_nri_intake.source import builders
from access_nri_intake.source.builders import PATTERNS_HELPERS
BUILDER = builders.AccessOm2Builder

BUILDER.PATTERNS.append(rf"^passives_wilkes.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")

and then the rest of the code as before. I’ve had a look at the directory structure and built a copy of the datastore to check, and I think that should capture everything you’re looking for?

Let me know if it doesn’t! I’ve put the full code block I used to build it below just in case that wasn’t super clear - note I’ve put it in a different dir, with a different name.

from access_nri_intake.source import builders
from access_nri_intake.source.builders import PATTERNS_HELPERS
from access_nri_intake.experiment import use_datastore

BUILDER = builders.AccessOm2Builder
BUILDER.PATTERNS.append(rf"^passives_wilkes.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")

esm_ds = use_datastore(
    builder=builders.AccessOm2Builder,
    experiment_dir='/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle3_antarctic_tracers/',
    catalog_dir='/scratch/tm70/ct1163/',
    datastore_name='test_cat_aviv',
    open_ds=True,
)

We will need to have a bit more of a think about how we deal with experiments where files are named in slightly less standard ways going forwards…

Hi Charles

I tried it now but the variables were not recognized. Looks like you ran it on “/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle3_antarctic_tracers/“ rather than on “/g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/“. The former is the raw run output, while the latter folder has the offline processed files, which is where I had the issues. Sounds like a good direction though, is there an additional step that can do it perhaps?

Thanks,

Aviv

Hmm, weird…

I’ve just had a closer look at the directory and I think it should be

- BUILDER.PATTERNS.append(rf"^passives_wilkes.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")
+ BUILDER.PATTERNS.append(rf"^passives_.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")

The ls I did on the directory was so long I didn’t see a bunch of the other files in the directory. There are also some slightly weird looking filenames (eg. passive_adelie_tr_DensityBinned_monthly-mean-1995to1999.nc) - if you’re after those we’ll need to add another regex to capture those.

Hopefully that gives you something to work with - are you able to let me know if making that change gets you the files you need? If not, I assume the variables you’re after are in a file which still isn’t matching, so if you can give me a sample filename for that we can see if we can figure out a regex to grab those too.

OK, I tried the new regex, but the variables were still not identified.

The main files which contain the variables of interest are of the following form:

passive_wilkes_Fx_diapyc_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_Fx_isopyc_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_Fy_diapyc_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_Fy_isopyc_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_Fz_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_diff_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_tend_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc

and similar files:

  1. with identical file names with different region names instead of “wilkes”: peninsula, weddell, maud, wilkes, prydz, george, adelie, ross, or amundsen
  2. with different year and month specification (at the end of the file name)

Anyway, I’m guessing that if we can make a datastore with any one of these, the rest might be easy.

The multi-year average files that you spotted, like passive_adelie_tr_DensityBinned_monthly-mean-1995to1999, would be nice to add to the datastore too, but are not critical.

Sorry, I’ve just reread my last response and I realised there was a typo!

The pattern should have been

- BUILDER.PATTERNS.append(rf"^passives_wilkes.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")
+ BUILDER.PATTERNS.append(rf"^passive_.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")

ie. passive_ not passives_.
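A quick way to check a candidate prefix before rebuilding anything is to test it against a few sample filenames. The sketch below makes several assumptions: `HELPERS` holds simplified stand-ins for PATTERNS_HELPERS (not the library's actual values), the optional `not_multi_digit` tail is dropped, and the patterns are matched against the filename without its `.nc` extension.

```python
import re

# Simplified stand-ins for PATTERNS_HELPERS (assumed, not the library's values).
HELPERS = {
    "ymd": r"\d{4}[_,-]\d{2}[_,-]\d{2}",
    "ym": r"\d{4}[_,-]\d{2}",
    "y": r"\d{4}",
}


def make_pattern(prefix):
    """Build a cut-down version of the appended pattern for a given prefix."""
    return (
        rf"^{prefix}.*[_,-](?:ymd|ym|y)_"
        rf"({HELPERS['ymd']}|{HELPERS['ym']}|{HELPERS['y']})$"
    )


# Filenames from this thread, with the .nc extension removed.
stems = [
    "passive_wilkes_Fx_diapyc_DensityBinned_monthly-mean-ym_1999_12",
    "passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12",
    "passive_adelie_tr_DensityBinned_monthly-mean-1995to1999",
]

counts = {}
for prefix in ("passives_wilkes", "passives_", "passive_"):
    matched = [s for s in stems if re.match(make_pattern(prefix), s)]
    counts[prefix] = len(matched)
    print(prefix, len(matched))
```

With these stand-ins, only the `passive_` prefix matches the two monthly files. The multi-year `1995to1999` name has no `ym_`/`y_` marker before its date, so it would still need a separate regex.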

When I run that, ie.

from access_nri_intake.source import builders
from access_nri_intake.source.builders import PATTERNS_HELPERS
from access_nri_intake.experiment import use_datastore

BUILDER = builders.AccessOm2Builder
BUILDER.PATTERNS.append(rf"^passive_.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")

esm_ds = use_datastore(
    builder=builders.AccessOm2Builder,
    experiment_dir='/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle3_antarctic_tracers/',
    catalog_dir='/scratch/tm70/ct1163/',
    datastore_name='test_cat_aviv',
    open_ds=True,
)

I pick up all those files correctly:

>>> sorted(esm_ds.unique().variable)
['age_global',
 'average_DT',
 'average_T1',
 'average_T2',
 'dzt',
 'evap',
 'evap_heat',
 'fprec',
 'fprec_melt_heat',
 'frazil_3d_int_z',
 'grid_xt_ocean',
 'grid_xu_ocean',
 'grid_yt_ocean',
 'grid_yu_ocean',
 'lprec',
 'lw_heat',
 'melt',
 'mld',
 'net_sfc_heating',
 'nv',
 'passive_adelie',
 'passive_adelie_xflux_adv',
 'passive_adelie_yflux_adv',
 'passive_adelie_zflux_adv',
 'passive_amundsen',
 'passive_george',
 'passive_maud',
 'passive_peninsula',
 'passive_prydz',
 'passive_prydz_xflux_adv',
 'passive_prydz_yflux_adv',
 'passive_prydz_zflux_adv',
 'passive_ross',
 'passive_ross_xflux_adv',
 'passive_ross_yflux_adv',
 'passive_ross_zflux_adv',
 'passive_weddell',
 'passive_weddell_xflux_adv',
 'passive_weddell_yflux_adv',
 'passive_weddell_zflux_adv',
 'passive_wilkes',
 'pbot_t',
 'pme_net',
 'pme_river',
 'pot_rho_0',
 'pot_rho_2',
 'potrho',
 'potrho_edges',
 'runoff',
 'salt',
 'salt_xflux_adv',
 'salt_yflux_adv',
 'sea_level',
 'sens_heat',
 'sfc_hflux_coupler',
 'sfc_hflux_from_runoff',
 'sfc_hflux_pme',
 'sfc_salt_flux_coupler',
 'sfc_salt_flux_ice',
 'sfc_salt_flux_restore',
 'st_edges_ocean',
 'st_ocean',
 'sw_edges_ocean',
 'sw_ocean',
 'swflx',
 'tau_x',
 'tau_y',
 'temp',
 'temp_xflux_adv',
 'temp_yflux_adv',
 'time',
 'time_bounds',
 'tx_trans',
 'tx_trans_int_z',
 'tx_trans_rho',
 'ty_trans',
 'ty_trans_int_z',
 'ty_trans_rho',
 'u',
 'uhrho_et',
 'v',
 'vhrho_nt',
 'wfiform',
 'wfimelt',
 'xt_ocean',
 'xt_ocean_sub01',
 'xu_ocean',
 'xu_ocean_sub01',
 'yt_ocean',
 'yt_ocean_sub01',
 'yu_ocean',
 'yu_ocean_sub01']

I’m pretty sure eg. 'passive_adelie_xflux_adv', 'passive_adelie_yflux_adv', 'passive_adelie_zflux_adv', are the variables you’re after, right?

Thanks for spotting that Charles!

I tried the fix but the variables were not recognized. Looks like you ran it on “/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle3_antarctic_tracers/“ again rather than on “/g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/“. Maybe I’m missing something, are you able to check it on the latter folder?

Sorry! Trying to juggle too many plates and that completely slipped my attention, thanks for pointing it out.

I’m reproducing the exact same issue as you now, so I’ll hopefully have an understanding of the issue and a fix in a couple hours.

Cheers, Charles

Okay, took a lot longer than I was hoping - for some reason listing all the files in that directory in g40 is extremely slow, but I have an answer for why this is occurring.

The culprit is this line:

def append_attrs(self, var: str, attrs: dict) -> None:
    """
    Append attributes to the _VarInfo object, if the attribute has a
    'long_name' key.
    """
    if "long_name" not in attrs:
        return None
    ...

It turns out that we have a line in the Builders that ignores all variables without a long name attribute - and that in these files, there is no long name.
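To show the effect of that guard in isolation, here is a toy sketch. `VarInfoSketch` is a hypothetical stand-in, not the library's _VarInfo class, and the attribute dictionaries are illustrative rather than read from the real files; it only mimics the early return on a missing `long_name`.

```python
class VarInfoSketch:
    """Hypothetical, stripped-down stand-in for the builder's variable collector."""

    def __init__(self, require_long_name):
        self.require_long_name = require_long_name
        self.variables = []

    def append_attrs(self, var, attrs):
        # The guard under discussion: silently skip variables without 'long_name'.
        if self.require_long_name and "long_name" not in attrs:
            return
        self.variables.append(var)


# Illustrative attributes: coordinates carry a long_name, the binned tracer does not.
file_vars = {
    "xt_ocean": {"long_name": "tcell longitude"},
    "yt_ocean": {"long_name": "illustrative coordinate name"},
    "passive_wilkes": {},
}

guarded = VarInfoSketch(require_long_name=True)
unguarded = VarInfoSketch(require_long_name=False)
for name, attrs in file_vars.items():
    guarded.append_attrs(name, attrs)
    unguarded.append_attrs(name, attrs)

print(guarded.variables)    # coordinates only
print(unguarded.variables)  # coordinates plus the tracer
```

This reproduces the symptom seen earlier in the thread: with the guard in place only the coordinate variables end up in the datastore.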

I’ve done a bit of git archaeology and it looks like this guard is totally vestigial, and just stuck around due to the way this was refactored. I’m opening a PR to fix it now.

Unfortunately there are gonna be a couple of complications with rebuilding the datastore with the fixed guard, due to the namespacing issues over at Build-esm-datastore failing in conda/analysis3-26.03, so rebuilding a working version of the datastore won’t be totally straightforward. In the meantime, I’ll do the same thing as there: rebuild the datastore and then change the permissions so you can copy it off somewhere.

I’ll update once I have a working datastore for you.

Hey Aviv,

Thanks for bearing with me on this one - took far longer to solve than it should have.

You should now be able to copy the datastore from /scratch/public/aviv-esm-datastore to wherever you like. There are two files you’ll need to copy in that directory:

  • /scratch/public/aviv-esm-datastore/test_cat_aviv-no-patterns.json
  • /scratch/public/aviv-esm-datastore/test_cat_aviv-no-patterns.csv

This can be opened via (with it in its current location)

import intake
intake.open_esm_datastore(
    "/scratch/public/aviv-esm-datastore/test_cat_aviv-no-patterns.json",
    columns_with_iterables=["variable", ...],
)

You should also be able to open it via use_datastore, provided you fill out the file and directories to point to it. I’m not sure if this is how people are opening already built datastores? If so, just say, and I’ll work out the syntax - I hardly ever use it.

If you rename the file (I appreciate the name I’ve used is probably more reflective of the process by which it got built than its utility), then you’ll need to change the catalog_file field in the json file:

  "id": "test_cat_aviv-no-patterns",
  "description": "esm_datastore for the model output in '/g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/'",
  "title": null,
  "last_updated": "2026-04-15T04:09:09Z",
 - "catalog_file": "test_cat_aviv-no-patterns.csv"
 + "catalog_file": "<NEW_FILENAME>.csv"

}
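The copy-and-rename step can be scripted with the standard library. This is only a sketch: the helper name `copy_and_rename_datastore` is made up, it assumes `catalog_file` is a plain filename sitting next to the json file (as in the snippet above), and the demo uses a stub json in a temporary directory rather than a real datastore. It leaves the `id` field untouched.

```python
import json
import shutil
import tempfile
from pathlib import Path


def copy_and_rename_datastore(src_json, dest_dir, new_name):
    """Copy a datastore's json and csv to dest_dir under new_name.

    The 'catalog_file' field of the copied json is updated to the new csv name.
    """
    src_json = Path(src_json)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    spec = json.loads(src_json.read_text())
    # Assumption: the csv is referenced relative to the json file.
    src_csv = src_json.parent / spec["catalog_file"]

    new_csv = dest_dir / f"{new_name}.csv"
    new_json = dest_dir / f"{new_name}.json"
    shutil.copy(src_csv, new_csv)

    spec["catalog_file"] = new_csv.name
    new_json.write_text(json.dumps(spec, indent=2))
    return new_json


# Demonstration with stub files (a real datastore json has more fields).
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "src"
    src_dir.mkdir()
    (src_dir / "test_cat_aviv-no-patterns.csv").write_text("path,variable\n")
    (src_dir / "test_cat_aviv-no-patterns.json").write_text(
        json.dumps(
            {
                "id": "test_cat_aviv-no-patterns",
                "catalog_file": "test_cat_aviv-no-patterns.csv",
            }
        )
    )
    new_json = copy_and_rename_datastore(
        src_dir / "test_cat_aviv-no-patterns.json",
        Path(tmp) / "dest",
        "Exp9tracersSigAvgs",
    )
    new_spec = json.loads(new_json.read_text())
    csv_copied = (new_json.parent / new_spec["catalog_file"]).exists()

print(new_spec["catalog_file"], csv_copied)
```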

Hi Charles

Thanks very much! I copied and renamed the datastore files and changed the json file field accordingly as you suggested. It looks like all the necessary variables are there, and it is possible to load them from the datastore, but there are a few difficulties I couldn’t figure out yet.

When I try to load a variable an error is returned that there are two such datasets. e.g.

ds = Exp9tracerBudgetSigAvgs_datastore.search(variable="passive_amundsen_yflux_adv_isopyc", frequency="1mon",temporal_label='unknown').to_dask()

ValueError: Expected exactly one dataset. Received 2 datasets. Please refine your search on file_id or use `.to_dataset_dict()`.

(full details below)

So I tried loading the variable with .to_dataset_dict() instead of .to_dask(), to look for keywords to distinguish them etc. You can see an example printout in p.s.2 below. For all variable names I examined, two datasets exist: one with 20 time samples, and another with 240 time steps (other dimensions being identical, I think). Both datasets start at the same time, i.e., the shorter appears to be redundant, but is messing up the simpler variable loading option.

The only way I could find to use the .to_dask() load option is to pass a file_id when searching for the variable. I take it from the intake_esm_dataset_key attribute of the longer dataset in the to_dataset_dict printout, after removing the last part of the string (e.g. removing “.unknown” or “.mean”). I had two issues with this. One was that I had to figure out the file_id for each variable separately (it looks like there are about 4 different keys, and I think I know which belongs to which of the ~63 variable names). The other is a weird non-repeatability: after loading some 10-20 variables, the load (search) function stops recognizing variables, even those it already loaded, with the error
“ValueError: Expected exactly one dataset. Received 0 datasets. Please refine your search on or use .to_dataset_dict().”. The kernel also sometimes dies in those cases, although I overwrite the same variable (i.e. with no increase in memory use) and it’s just a lazy call as far as I understand - I’m just cycling through loading all the variables to see that it works, with no variable processing after the load. Can you please check if you are getting similar behaviour? Or perhaps, if there is a way to get rid of the shorter (20 months/time samples) variables from the datastore, that would solve these issues?
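The key-to-file_id step described above can be written as a one-line helper. This is just a sketch of that string manipulation; `file_id_from_key` is a made-up name, and the two keys are the ones from the to_dataset_dict printout in p.s.2 below.

```python
def file_id_from_key(dataset_key):
    """Drop the trailing label (e.g. '.unknown' or '.mean') from a dataset key."""
    return dataset_key.rsplit(".", 1)[0]


keys = [
    "ocean.1mon.isopycnal_bins:19.xt_ocean:3600.yt_ocean:510.unknown",
    "ocean.1mon.isopycnal_bins:19.nv:2.xt_ocean:3600.yt_ocean:510.unknown",
]
file_ids = [file_id_from_key(key) for key in keys]
for file_id in file_ids:
    print(file_id)
```

In that printout the key containing `nv:2` is the one with 240 time steps, so its file_id is the one to pass to the search for that variable.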

I have put a minimal notebook with these steps at /scratch/public/aviv-notebookshare/ in case that helps to understand all this text.

Thanks,
Aviv


p.s. - full error message

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 1
----> 1 ds = Exp9tracerBudgetSigAvgs_datastore.search(variable="passive_amundsen_yflux_adv_isopyc", frequency="1mon",temporal_label='unknown').to_dask()
      2 ds

File /g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/intake_esm/core.py:912, in esm_datastore.to_dask(self, **kwargs)
    910 if len(self) != 1:  # quick check to fail more quickly if there are many results
    911     lens = self._aggr_lengths
--> 912     raise ValueError(
    913         f'Expected exactly one dataset. Received {len(self)} datasets. Please refine your search on {", ".join(lens.keys())} or use `.to_dataset_dict()`.'
    914     )
    915 res = self.to_dataset_dict(**{**kwargs, 'progressbar': False})
    916 if len(res) != 1:  # extra check in case kwargs did modify something

ValueError: Expected exactly one dataset. Received 2 datasets. Please refine your search on file_id or use `.to_dataset_dict()`.

---------------------------------------------------------------------------------

p.s.2 - full printout of datastore.search…to_dataset_dict
e.g., for the “passive_maud_zflux_adv” variable:

dset_dict = Exp9tracerBudgetSigAvgs_datastore.search(variable="passive_maud_zflux_adv").to_dataset_dict()
dset_dict
{'ocean.1mon.isopycnal_bins:19.xt_ocean:3600.yt_ocean:510.unknown': <xarray.Dataset> Size: 6GB
 Dimensions:                 (time: 20, isopycnal_bins: 19, yt_ocean: 510,
                              xt_ocean: 3600)
 Coordinates:
   * time                    (time) datetime64[ns] 160B 1980-01-15T12:00:00 .....
   * isopycnal_bins          (isopycnal_bins) float64 152B 1.0 36.5 ... 37.5 38.0
   * yt_ocean                (yt_ocean) float64 4kB -81.11 -81.07 ... -59.03
   * xt_ocean                (xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
 Data variables:
     passive_maud_zflux_adv  (time, isopycnal_bins, yt_ocean, xt_ocean) float64 6GB dask.array<chunksize=(1, 19, 510, 3600), meta=np.ndarray>
 Attributes: (12/13)
     UseNotes:                                 In a tracer budget, the vertica...
     NCO:                                      netCDF Operators version 5.1.3 ...
     intake_esm_vars:                          ['passive_maud_zflux_adv']
     intake_esm_attrs:file_id:                 ocean.1mon.isopycnal_bins:19.xt...
     intake_esm_attrs:frequency:               1mon
     intake_esm_attrs:variable:                isopycnal_bins,passive_maud_zfl...
     ...                                       ...
     intake_esm_attrs:variable_standard_name:  ,,,,
     intake_esm_attrs:variable_cell_methods:   ,,,,
     intake_esm_attrs:realm:                   ocean
     intake_esm_attrs:temporal_label:          unknown
     intake_esm_attrs:_data_format_:           netcdf
     intake_esm_dataset_key:                   ocean.1mon.isopycnal_bins:19.xt...,
 'ocean.1mon.isopycnal_bins:19.nv:2.xt_ocean:3600.yt_ocean:510.unknown': <xarray.Dataset> Size: 67GB
 Dimensions:                 (time: 240, isopycnal_bins: 19, yt_ocean: 510,
                              xt_ocean: 3600)
 Coordinates:
   * time                    (time) datetime64[ns] 2kB 1980-01-15T12:00:00 ......
   * isopycnal_bins          (isopycnal_bins) float64 152B 1.0 36.5 ... 37.5 38.0
   * yt_ocean                (yt_ocean) float64 4kB -81.11 -81.07 ... -59.03
   * xt_ocean                (xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
 Data variables:
     passive_maud_zflux_adv  (time, isopycnal_bins, yt_ocean, xt_ocean) float64 67GB dask.array<chunksize=(1, 19, 510, 3600), meta=np.ndarray>
 Attributes:
     UseNotes:                                 In a tracer budget, the vertica...
     intake_esm_vars:                          ['passive_maud_zflux_adv']
     intake_esm_attrs:file_id:                 ocean.1mon.isopycnal_bins:19.nv...
     intake_esm_attrs:frequency:               1mon
     intake_esm_attrs:variable:                average_DT,average_T1,average_T...
     intake_esm_attrs:variable_long_name:      ,,,,,,,,tcell longitude,tcell l...
     intake_esm_attrs:variable_standard_name:  ,,,,,,,,,
     intake_esm_attrs:variable_cell_methods:   ,,,,,,,,,
     intake_esm_attrs:realm:                   ocean
     intake_esm_attrs:temporal_label:          unknown
     intake_esm_attrs:_data_format_:           netcdf
     intake_esm_dataset_key:                   ocean.1mon.isopycnal_bins:19.nv...}


Hey Aviv,

Sorry - there’s a lot going on in there, so it’s a little tricky to keep track of all the various issues, but I think the problem you’re actually running into is probably that your dask workers are just running out of memory.

The datastore that you’re trying to open is pretty large - 19000 files. I’m guessing you’re not using a massive ARE instance? If I use an XXL, I can run your notebook without any errors.

Generally speaking, even lazy datasets incur some memory overhead to open, as the Python interpreter needs to keep track of eg. file handles, locations, how to join things, etc. These then all need to be passed to and duplicated between dask workers.

If you use a smaller ARE session, that means more files going to each dask worker - which is more likely to overload its memory.

Cheers, Charles

Thanks Charles, you are right. I tried an XXL session now and the strange errors I described did not reoccur.
As I mentioned, there are some redundant variables with 20 time samples, overlapping with same-named variables containing the full 240 time steps. To load the latter, I had to figure out the file_id for each different variable, which will be a bit messy to keep track of down the line. Is there a way to avoid specifying the file_id or, better yet, to remove the shorter variables from the datastore?