Datastore making - unidentified realm

Hello

I am getting an error when building a datastore for processed ACCESS-OM2 files. These are density-binned files processed offline from the 01deg_jra55v140_iaf_cycle3_antarctic_tracers run. The error appears to be related to the fact that the files are not identified as having any standard intake catalog “realm”, presumably due to the offline density binning. Code and error message details are below. Does anyone have an idea of how to deal with this?

Thanks,

Aviv

p.s.1 The 01deg_jra55v140_iaf_cycle3_antarctic_tracers run is in the process of being integrated into the intake catalog.

p.s.2 The datastore-building script:

from access_nri_intake.source import builders
from access_nri_intake.experiment import use_datastore

FoldSigBinned = '/g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/'
FoldDBs = '/scratch/v45/as2408/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/DBs/'

BUILDER = builders.AccessOm2Builder
CATALOG_DIR = FoldDBs
DATASTORE_NAME = 'Exp9tracersSigAvgs'

esm_ds = use_datastore(
    builder=BUILDER,
    experiment_dir=FoldSigBinned,
    catalog_dir=CATALOG_DIR,
    datastore_name=DATASTORE_NAME,
)

esm_ds

p.s.3 The error message:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[112], line 5
      2 CATALOG_DIR = FoldDBs# Path("~/catalog_dir").expanduser() # We'll save our datastore in a directory called catalog_dir in our home dir
      3 DATASTORE_NAME = "Exp9tracersSigAvgs"
----> 5 esm_ds = use_datastore(
      6     builder=BUILDER,
      7     experiment_dir=FoldSigBinned,
      8     catalog_dir=CATALOG_DIR,
      9     datastore_name=DATASTORE_NAME
     10 )
     12 esm_ds

File /g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/experiment/main.py:122, in use_datastore(experiment_dir, builder, catalog_dir, builder_kwargs, open_ds, datastore_name, description)
    120 builder_instance: Builder = builder(path=str(experiment_dir), **builder_kwargs)
    121 print(f"{f_info}Building esm-datastore...{f_reset}")
--> 122 builder_instance.get_assets().build()
    123 print(f"{f_success}Sucessfully built esm-datastore!{f_reset}")
    124 print(
    125     f"{f_info}Saving esm-datastore to {f_path}{str(catalog_dir.absolute())}{f_reset}"
    126 )

File /g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/builders.py:237, in BaseBuilder.build(self)
    232 def build(self):
    233     """
    234     Builds a datastore from a list of netCDF files or zarr stores.
    235     """
--> 237     self.get_assets().validate_parser().parse().clean_dataframe()
    239     return self

File /g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/builders.py:227, in BaseBuilder.validate_parser(self)
    224         validate_against_schema(info, ESM_JSONSCHEMA)
    225         return self
--> 227 raise ParserError(f"""Parser returns no valid assets.
    228     Try parsing a single file with Builder.parser(file)
    229     Last failed asset: {asset}
    230     Asset parser return: {info}""")

ParserError: Parser returns no valid assets.
            Try parsing a single file with Builder.parser(file)
            Last failed asset: /g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc
            Asset parser return: {'INVALID_ASSET': '/g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc', 'TRACEBACK': 'Traceback (most recent call last):\n  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/builders.py", line 207, in _parser_catch_invalid\n    return cls.parser(file)\n           ^^^^^^^^^^^^^^^^\n  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/builders.py", line 470, in parser\n    raise ParserError(f"Cannot determine realm for file {file}")\naccess_nri_intake.source.builders.ParserError: Cannot determine realm for file /g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc\n'}

Hey Aviv,

Just got brought here via a github notification - thanks for spotting this one.

I’ll have a dig tomorrow when I’m back at work and see if I can figure out the issue.

Cheers, Charles

Great, thank you Charles!

Okay, I’ve had a prod.

Our builder for ACCESS-OM2 assumes that output is structured into the following directory structure:

/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle3_antarctic_tracers/output577
├── atmosphere
│   └── ...
├── ice
│   └── ...
└── ocean
    ├── # Left the real files in - these are the relevant ones here
    ├── adelie_xflux_adv.nc
    ├── adelie_yflux_adv.nc

We then extract the realm from the directory name, which is why the realm isn’t being picked up here.
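In other words, the builder’s behaviour is roughly equivalent to the sketch below. This is a hypothetical illustration, not the actual AccessOm2Builder code: it just looks for a known realm directory among the path components.

```python
from pathlib import Path
from typing import Optional

# Hypothetical sketch of path-based realm detection, NOT the real
# AccessOm2Builder implementation.
KNOWN_REALMS = {"atmosphere", "ice", "ocean"}

def realm_from_path(path: str) -> Optional[str]:
    """Return the first path component naming a known realm, else None."""
    for part in Path(path).parts:
        if part in KNOWN_REALMS:
            return part
    return None
```

A file under output577/ocean/ resolves to 'ocean', while a file sitting directly in the flat _isopyc archive directory resolves to None, which is what then surfaces as the “Cannot determine realm” ParserError.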

@joshuatorrance has been working on a metadata spec for ACCESS model outputs, I believe that includes the realm as a piece of metadata but I can’t remember 100% (this is why I’ve tagged you Josh!).

I think the surrounding tooling (addmeta I think?) is now at the point where we could use it to add the metadata to those files (if that’s a viable option) and then update the OM2 builder to look for that if necessary?

Alternatively, we could update use_datastore and the surrounding functionality to allow users to specify a realm (or perhaps a realm parsing function?) as a back door for situations like this. I’m not sure how keen I am on that - it feels like we might be opening a can of worms where we start patching hacks onto hacks onto hacks…

Anyway, I’ll wait for Josh to chime in, and have a think about alternative strategies to solve this in the meantime.

Hi Aviv & Charles,

I had a super quick look at a couple of the files and there’s no metadata attached to these data at the moment. addmeta could be used to easily add the metadata, but I’m pretty sure we’d still need to tweak the OM2 builder to fall back to using a given file’s metadata if the usual methods to determine realm fail (maybe we should do that regardless).

By far the quickest and easiest fix is, as Charles suggested above, to put the .nc files under output000/ocean/*.nc, and the builder should be able to parse the realm out of the path.

@CharlesTurner We should check where we left parsing the realm out of the .nc metadata! If it’s not already there, we should tweak things to fall back on file metadata when the other parsing fails.
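A minimal sketch of that fallback logic, assuming the realm would be stored in a 'realm' global attribute (the attribute name and the dict-based interface here are assumptions for illustration; in practice the attributes would come from the opened netCDF file):

```python
from typing import Optional

def determine_realm(path_realm: Optional[str], global_attrs: dict) -> str:
    """Prefer the path-derived realm; fall back to a hypothetical 'realm'
    global attribute; otherwise fail as the builder currently does."""
    if path_realm is not None:
        return path_realm
    realm = global_attrs.get("realm")
    if realm is None:
        raise ValueError("Cannot determine realm")
    return realm
```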

Josh

1 Like

Yeah, I was thinking we should probably add that to the builder anyway.

If updating the directory structure is viable, I think that would be the best place to start, as it means we aren’t going to be modifying files.

I’ll drop in a PR with the metadata check in the builders at some point today.

1 Like

Hi Guys

Thanks for your suggestions!

I moved the netcdf files to /g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/output000/ocean/

Now the use_datastore call ends without an error (although warnings were issued - see below).

Unfortunately, the variables were not identified - when I query Exp9tracerBudgetSigAvgs_datastore.unique().variable, I only get ['xt_ocean', 'yt_ocean', 'xu_ocean', 'yu_ocean'].

The actual tracer diagnostic variables (named below) are missing from the datastore. Is there another possible immediate fix, or is it better to try one of the other suggestions you made?

Thanks,

Aviv

p.s.1

The missing variables are of the form "passive_TRACERNAME_DIAGNOSTICNAME", where

TRACERNAME is peninsula, weddell, maud, wilkes, prydz, george, adelie, ross, or amundsen

DIAGNOSTICNAME is x/yflux_adv_isopyc, x/y/zflux_adv_diapyc, vdiffuse_impl, or an empty string
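For reference, the full set of missing names implied by that scheme can be enumerated as below (the exact diagnostic spellings are my reading of the description, so treat them as assumptions):

```python
# Enumerate the expected passive-tracer variable names from the scheme above.
tracers = ["peninsula", "weddell", "maud", "wilkes", "prydz",
           "george", "adelie", "ross", "amundsen"]
diagnostics = ["xflux_adv_isopyc", "yflux_adv_isopyc",
               "xflux_adv_diapyc", "yflux_adv_diapyc", "zflux_adv_diapyc",
               "vdiffuse_impl", ""]  # "" is the bare tracer concentration

missing = [f"passive_{t}_{d}" if d else f"passive_{t}"
           for t in tracers for d in diagnostics]
# 9 tracers x 7 diagnostic forms = 63 expected variables
```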

p.s.2

The warnings received when running the datastore build function (these warnings were repeated a large number of times):

/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/wavespectra/output/ww3.py:25: ResourceWarning: unclosed file <_io.TextIOWrapper name='/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/wavespectra/output/ww3.yml' mode='r' encoding='UTF-8'>
  VAR_ATTRIBUTES = yaml.load(
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/utils.py:440: UserWarning: The frequency '(1, 'mon')' determined from filename does not match the frequency 'fx' determined from the file contents. Using '(1, 'mon')'.
  warnings.warn(f"{msg} Using '{frequency}'.")
/g/data/xp65/public/apps/med_conda/envs/analysis3-26.01/lib/python3.11/site-packages/access_nri_intake/source/utils.py:279: UserWarning: Time coordinate does not include bounds information. Guessing start and end times.

p.s.3

The output of the datastore function call:

Sucessfully built esm-datastore!
Saving esm-datastore to /scratch/v45/as2408/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/DBs
Successfully wrote ESM catalog json file to: file:///scratch/v45/as2408/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/DBs/Exp9tracersSigAvgs.json
Hashing catalog to prevent unnecessary rebuilds.
This may take some time...
Catalog sucessfully hashed!
Datastore sucessfully written to /scratch/v45/as2408/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/DBs/Exp9tracersSigAvgs.json!
Please note that this has not added the datastore to the access-nri-intake catalog.
To add to catalog, please run 'scaffold_catalog_entry' for help on how to do so.

Exp9tracersSigAvgs catalog with 9 dataset(s) from 19077 asset(s):

unique
filename 19077
path 19077
file_id 9
frequency 2
start_date 240
end_date 240
variable 4
variable_long_name 3
variable_standard_name 1
variable_cell_methods 1
variable_units 1
realm 1
temporal_label 1
derived_variable 0


(Sorry I didn’t see this yesterday!)

I’ve had a look, and this is because the AccessOm2Builder expects files to have names starting with either ocean or iceh:

class AccessOm2Builder(BaseBuilder):
    """Intake-ESM datastore builder for ACCESS-OM2 COSIMA datasets"""

    PATTERNS = [
        rf"^iceh.*\.({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}).*$",  # ACCESS-ESM1.5/OM2/CM2 ice
        rf"^iceh.*\.(\d{{3}})-{PATTERNS_HELPERS['not_multi_digit']}.*",  # ACCESS-OM2 ice
        rf"^ocean.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)",  # ACCESS-OM2 ocean
        r"^ocean.*[^\d]_(\d{2})$",  # A few wierd files in ACCESS-OM2 01deg_jra55v13_ryf9091
    ]
    ...

From here.

Don’t worry about the specifics if you don’t speak regex, but basically ^iceh.* says ‘find me a string starting with iceh followed by anything else’.
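As a simplified illustration (the two prefixes below stand in for the full PATTERNS above), the density-binned filenames match neither prefix:

```python
import re

# Simplified stand-in for the builder's filename patterns: the real ones
# all require an iceh or ocean prefix.
prefix = re.compile(r"^(iceh|ocean)")

prefix.match("iceh.1999-12.nc")   # matches -> accepted by the builder
prefix.match("ocean_month.nc")    # matches -> accepted by the builder
prefix.match("passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc")  # None
```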

I’m pretty sure we can just patch this to fix the issue:

from access_nri_intake.source import builders
from access_nri_intake.source.builders import PATTERNS_HELPERS

BUILDER = builders.AccessOm2Builder

BUILDER.PATTERNS.append(rf"^passives_wilkes.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")

and then the rest of the code as before. I’ve had a look at the directory structure and built a copy of the datastore to check, and I think that should capture everything you’re looking for?

Let me know if it doesn’t! I’ve put the full code block I used to build it below just in case that wasn’t super clear - note I’ve put it in a different dir, with a different name.

from access_nri_intake.source import builders
from access_nri_intake.source.builders import PATTERNS_HELPERS
from access_nri_intake.experiment import use_datastore

BUILDER = builders.AccessOm2Builder
BUILDER.PATTERNS.append(rf"^passives_wilkes.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")

esm_ds = use_datastore(
    builder=builders.AccessOm2Builder,
    experiment_dir='/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle3_antarctic_tracers/',
    catalog_dir='/scratch/tm70/ct1163/',
    datastore_name='test_cat_aviv',
    open_ds=True,
)

We will need to have a bit more of a think about how we deal with experiments where files are named in slightly less standard ways going forwards…

Hi Charles

I tried it now but the variables were not recognized. It looks like you ran it on "/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle3_antarctic_tracers/" rather than on "/g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/". The former is the raw run output, while the latter folder has the offline-processed files, which is where I had the issues. It sounds like a good direction though - is there an additional step that could do it, perhaps?

Thanks,

Aviv

Hmm, weird…

I’ve just had a closer look at the directory and I think it should be

- BUILDER.PATTERNS.append(rf"^passives_wilkes.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")
+ BUILDER.PATTERNS.append(rf"^passives_.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")

The ls I did on the directory was so long that I didn’t see a bunch of the other files. There are also some slightly weird-looking filenames (eg. passive_adelie_tr_DensityBinned_monthly-mean-1995to1999.nc) - if you’re after those, we’ll need to add another regex to capture them.

Hopefully that gives you something to work with - are you able to let me know if making that change gets you the files you need? If not, I assume the variables you’re after are in a file which still isn’t matching, so if you can give me a sample filename for that, we can see if we can figure out a regex to grab those too.

OK, I tried the new regex, but the variables were still not identified.

The main files which contain the variables of interest are of the following form:

passive_wilkes_Fx_diapyc_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_Fx_isopyc_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_Fy_diapyc_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_Fy_isopyc_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_Fz_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_diff_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_tend_DensityBinned_monthly-mean-ym_1999_12.nc
passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12.nc

and similar files:

  1. with identical file names but a different region name in place of “wilkes”: peninsula, weddell, maud, wilkes, prydz, george, adelie, ross, or amundsen
  2. with different year and month specification (at the end of the file name)

Anyway, I’m guessing that if we can make a datastore with any one of these, the rest might be easy.

The multi-year average files that you spotted, like passive_adelie_tr_DensityBinned_monthly-mean-1995to1999, would be nice to have in the datastore too, but are not critical.

Sorry, I’ve just reread my last response and I realised there was a typo!

The pattern should have been

- BUILDER.PATTERNS.append(rf"^passives_wilkes.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")
+ BUILDER.PATTERNS.append(rf"^passive_.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")

ie. passive_ not passives_.
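You can sanity-check the difference with stand-in values for PATTERNS_HELPERS (these regex fragments are my approximations for illustration; the real ones live in access_nri_intake.source.builders):

```python
import re

# Approximate stand-ins for PATTERNS_HELPERS, assumed for illustration only.
H = {
    "ymd": r"\d{4}[_,-]\d{2}[_,-]\d{2}",
    "ym": r"\d{4}[_,-]\d{2}",
    "y": r"\d{4}",
    "not_multi_digit": r"(?:\d(?!\d))|(?:\D)",
}

def pattern(prefix: str) -> str:
    return (rf"^{prefix}.*[_,-](?:ymd|ym|y)_"
            rf"({H['ymd']}|{H['ym']}|{H['y']})"
            rf"(?:$|[_,-]{H['not_multi_digit']}.*)")

stem = "passive_wilkes_tr_DensityBinned_monthly-mean-ym_1999_12"  # .nc stripped

good = re.match(pattern("passive_"), stem)   # matches, captures "1999_12"
bad = re.match(pattern("passives_"), stem)   # typo'd prefix: no match
```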

When I run that, ie.

from access_nri_intake.source import builders
from access_nri_intake.source.builders import PATTERNS_HELPERS
from access_nri_intake.experiment import use_datastore

BUILDER = builders.AccessOm2Builder
BUILDER.PATTERNS.append(rf"^passive_.*[_,-](?:ymd|ym|y)_({PATTERNS_HELPERS['ymd']}|{PATTERNS_HELPERS['ym']}|{PATTERNS_HELPERS['y']})(?:$|[_,-]{PATTERNS_HELPERS['not_multi_digit']}.*)")

esm_ds = use_datastore(
    builder=builders.AccessOm2Builder,
    experiment_dir='/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle3_antarctic_tracers/',
    catalog_dir='/scratch/tm70/ct1163/',
    datastore_name='test_cat_aviv',
    open_ds=True,
)

I pick up all those files correctly:

>>> sorted(esm_ds.unique().variable)
['age_global',
 'average_DT',
 'average_T1',
 'average_T2',
 'dzt',
 'evap',
 'evap_heat',
 'fprec',
 'fprec_melt_heat',
 'frazil_3d_int_z',
 'grid_xt_ocean',
 'grid_xu_ocean',
 'grid_yt_ocean',
 'grid_yu_ocean',
 'lprec',
 'lw_heat',
 'melt',
 'mld',
 'net_sfc_heating',
 'nv',
 'passive_adelie',
 'passive_adelie_xflux_adv',
 'passive_adelie_yflux_adv',
 'passive_adelie_zflux_adv',
 'passive_amundsen',
 'passive_george',
 'passive_maud',
 'passive_peninsula',
 'passive_prydz',
 'passive_prydz_xflux_adv',
 'passive_prydz_yflux_adv',
 'passive_prydz_zflux_adv',
 'passive_ross',
 'passive_ross_xflux_adv',
 'passive_ross_yflux_adv',
 'passive_ross_zflux_adv',
 'passive_weddell',
 'passive_weddell_xflux_adv',
 'passive_weddell_yflux_adv',
 'passive_weddell_zflux_adv',
 'passive_wilkes',
 'pbot_t',
 'pme_net',
 'pme_river',
 'pot_rho_0',
 'pot_rho_2',
 'potrho',
 'potrho_edges',
 'runoff',
 'salt',
 'salt_xflux_adv',
 'salt_yflux_adv',
 'sea_level',
 'sens_heat',
 'sfc_hflux_coupler',
 'sfc_hflux_from_runoff',
 'sfc_hflux_pme',
 'sfc_salt_flux_coupler',
 'sfc_salt_flux_ice',
 'sfc_salt_flux_restore',
 'st_edges_ocean',
 'st_ocean',
 'sw_edges_ocean',
 'sw_ocean',
 'swflx',
 'tau_x',
 'tau_y',
 'temp',
 'temp_xflux_adv',
 'temp_yflux_adv',
 'time',
 'time_bounds',
 'tx_trans',
 'tx_trans_int_z',
 'tx_trans_rho',
 'ty_trans',
 'ty_trans_int_z',
 'ty_trans_rho',
 'u',
 'uhrho_et',
 'v',
 'vhrho_nt',
 'wfiform',
 'wfimelt',
 'xt_ocean',
 'xt_ocean_sub01',
 'xu_ocean',
 'xu_ocean_sub01',
 'yt_ocean',
 'yt_ocean_sub01',
 'yu_ocean',
 'yu_ocean_sub01']

I’m pretty sure eg. 'passive_adelie_xflux_adv', 'passive_adelie_yflux_adv', 'passive_adelie_zflux_adv', are the variables you’re after, right?

Thanks for spotting that Charles!

I tried the fix but the variables were not recognized. It looks like you ran it on "/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf_cycle3_antarctic_tracers/" again, rather than on "/g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers_isopyc/". Maybe I’m missing something - are you able to check it on the latter folder?