Building intake catalog: "Parser returns no valid assets" error

Hi there, I am trying to build an intake datastore from someone else’s existing ESM1.5 experiments, in the hope of building on existing code. However, I get the error below, and I get the same error when I run it through the terminal, as suggested by Charles here. Could I please check whether I’m misunderstanding something or whether something is missing? Thank you!!

%%time

builder = AccessEsm15Builder(
    path="/g/data/e14/afp599/access-esm/fs38_processed",
    ensemble=False # We could use this to pass multiple paths for different ensemble members "/g/data/e14/afp599/access-esm/post_processed_mw2",
).build()
--------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
File <timed exec>:4

File /g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/access_nri_intake/source/builders.py:203, in BaseBuilder.build(self)
    198 def build(self):
    199     """
    200     Builds a datastore from a list of netCDF files or zarr stores.
    201     """
--> 203     self.get_assets().validate_parser().parse().clean_dataframe()
    205     return self

File /g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/access_nri_intake/source/builders.py:191, in BaseBuilder.validate_parser(self)
    188         validate_against_schema(info, ESM_JSONSCHEMA)
    189         return self
--> 191 raise ParserError(
    192     f"""Parser returns no valid assets.
    193     Try parsing a single file with Builder.parser(file)
    194     Last failed asset: {asset}
    195     Asset parser return: {info}"""
    196 )

ParserError: Parser returns no valid assets.
            Try parsing a single file with Builder.parser(file)
            Last failed asset: /g/data/e14/afp599/access-esm/fs38_processed/wfo_Omon_ACCESS-ESM1-5_ssp585_r9i1p1f1_2015-2100_r360x180.nc
            Asset parser return: {'INVALID_ASSET': '/g/data/e14/afp599/access-esm/fs38_processed/wfo_Omon_ACCESS-ESM1-5_ssp585_r9i1p1f1_2015-2100_r360x180.nc', 'TRACEBACK': 'Traceback (most recent call last):\n  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/access_nri_intake/source/builders.py", line 663, in parser\n    match_groups = re.match(r".*/([^/]*)/history/([^/]*)/.*\\.nc", file).groups()\n                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nAttributeError: \'NoneType\' object has no attribute \'groups\'\n'}

Hi Ellie, I’m today’s triager. I’ll try to find a suitable helper for you.

Hi Ellie,

Typically this error results from files being named in ways that the builder doesn’t expect. I’ve requested to join e14 & I’ll update when I can check exactly what the issue is.

Okay, I’ve taken a look - these are all CMIP-formatted data, and the ESM1.5 builder doesn’t recognise their name patterns. When the information isn’t available elsewhere, the builder falls back to the filenames to work out things like frequencies, and that is what fails here.

I think we can pretty straightforwardly add a builder which handles CMIP formatted data - it’s a fairly minimal job now. I’m a bit tied up today - @joshuatorrance are you able to take a look at this?

also cc @dougiesquire, if you think adding a CmipBuilder is a terrible idea for any particular reason can you let us know?

I think adding this builder would be something of a stopgap measure - we’re aiming to completely break the filename dependency - but I don’t see any harm in having a builder tailored to this format of output in the meantime.

My only thought is that you may want to make it a Cmip6Builder as I think the file structure is different across different CMIP eras.

I think Charles has the gist of it.

The builder is expecting to see a path that looks something like this one:
/g/data/p73/archive/non-CMIP/ACCESS-ESM1-5/PI-GWL-B2035/history/atm/netCDF/*.nc
The builder’s regex is looking at the directories before and after history to determine the experiment_id and realm for each file.
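To make that concrete, here is a minimal sketch that applies the regex quoted in the traceback above to both kinds of path. The helper name `history_groups` and the file name `example.nc` are made up for illustration; only the pattern itself comes from the traceback.

```python
import re

# Pattern copied from the traceback above (access_nri_intake/source/builders.py, line 663).
HISTORY_PATTERN = re.compile(r".*/([^/]*)/history/([^/]*)/.*\.nc")


def history_groups(path):
    """Return (directory before history, directory after history), or None if no match."""
    match = HISTORY_PATTERN.match(path)
    return match.groups() if match else None


# Layout the builder expects; "example.nc" is a made-up file name for illustration.
expected_layout = (
    "/g/data/p73/archive/non-CMIP/ACCESS-ESM1-5/PI-GWL-B2035/history/atm/netCDF/example.nc"
)
# The CMIP-style file from the error message: there is no "history" directory in the path.
cmip_file = (
    "/g/data/e14/afp599/access-esm/fs38_processed/"
    "wfo_Omon_ACCESS-ESM1-5_ssp585_r9i1p1f1_2015-2100_r360x180.nc"
)

print(history_groups(expected_layout))  # ('PI-GWL-B2035', 'atm')
print(history_groups(cmip_file))        # None
```

In the builder, the `None` result is what leads to the `AttributeError: 'NoneType' object has no attribute 'groups'` shown in the error message.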

Since this is CMIP data we can presumably pull the info we need straight out of each file’s metadata. Adding a Builder for CMIP is probably worthwhile and shouldn’t be difficult.
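As a rough illustration of how much information a CMIP6-style name already carries, the sketch below splits the file name from the error message into its leading fields. This is not how Cmip6Builder is implemented (the suggestion above is to read each file's metadata instead), and `leading_facets` is a hypothetical helper. It assumes the name starts with the first five fields of the CMIP6 file name template and ignores whatever follows, since the post-processed file here orders its trailing fields differently from the standard template.

```python
from pathlib import PurePosixPath

# Leading fields of the CMIP6 file name template:
# <variable_id>_<table_id>_<source_id>_<experiment_id>_<member_id>_...
LEADING_FACETS = ("variable_id", "table_id", "source_id", "experiment_id", "member_id")


def leading_facets(path):
    """Split a CMIP6-style file name into its first five facets (hypothetical helper)."""
    fields = PurePosixPath(path).stem.split("_")
    if len(fields) < len(LEADING_FACETS):
        raise ValueError(f"not a CMIP6-style file name: {path}")
    # zip stops after the five named facets, so trailing fields are ignored.
    return dict(zip(LEADING_FACETS, fields))


facets = leading_facets(
    "/g/data/e14/afp599/access-esm/fs38_processed/"
    "wfo_Omon_ACCESS-ESM1-5_ssp585_r9i1p1f1_2015-2100_r360x180.nc"
)
print(facets["variable_id"], facets["experiment_id"], facets["member_id"])
```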

I’ve applied for e14 too, I’ll start on a CMIP/CMIP6 builder. EDIT: Charles beat me to it!

The builder is done (Josh made some handy changes a few weeks back that made it very fast to implement).

We’ll try to get it released and into the conda/analysis3 environment ASAP.

Hi Ellie,

The builder (Cmip6Builder) is now available in the conda/analysis3-25.10 environment.

Give it a crack and let us know how you go!

Thank you Charles, I am able to make a datastore now!

I would like to clarify some best practices though. Here the datastore search (fs38_processed_datastore.search(variable="thetao", frequency='1mon').df) has picked up many files, because the nc files are saved separately for each ensemble member. But when I do fs38_processed_datastore.search(variable="thetao", frequency='1mon').to_dask() I only get one field, and I don’t know which ensemble member is being loaded. Am I just not supposed to store different ensemble members in the same folder when using the builders?
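One way to check which ensemble members a search has picked up is to pull the variant label (for example r9i1p1f1) out of each file name. The sketch below is a minimal, standalone version of that idea: `members_in` is a hypothetical helper, the paths are made-up examples modelled on the file in the error message, and it assumes the label appears between underscores in the file name.

```python
import re
from collections import Counter

# CMIP6 variant labels look like r9i1p1f1 (realization, initialization, physics, forcing).
VARIANT_LABEL = re.compile(r"_(r\d+i\d+p\d+f\d+)_")


def members_in(paths):
    """Count how many files belong to each ensemble member, based on the file name."""
    labels = []
    for path in paths:
        file_name = path.rsplit("/", 1)[-1]
        found = VARIANT_LABEL.search(file_name)
        labels.append(found.group(1) if found else "unknown")
    return Counter(labels)


# Hypothetical file names, modelled on the one in the error message above.
example_paths = [
    "/some/dir/thetao_Omon_ACCESS-ESM1-5_ssp585_r1i1p1f1_2015-2100_r360x180.nc",
    "/some/dir/thetao_Omon_ACCESS-ESM1-5_ssp585_r9i1p1f1_2015-2100_r360x180.nc",
    "/some/dir/wfo_Omon_ACCESS-ESM1-5_ssp585_r9i1p1f1_2015-2100_r360x180.nc",
]

print(members_in(example_paths))
```

With a real datastore, the list of paths could presumably come from the `.df` of the search result, assuming it has a `path` column.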

Perhaps also a better example dir for the Cmip6 builder would have been /g/data/fs38/publications/CMIP6/ScenarioMIP/CSIRO/ACCESS-ESM1-5/ssp585/ since this follows the file structure of many ensemble members filed separately? I think most CSIRO maintained CMIP data is stored like this. I tried out the Cmip6Builder on this as below to see and ended up with a ValueError: asset list provided is None. Please run `.get_assets()` first error.

import os

base_dir = "/g/data/fs38/publications/CMIP6/ScenarioMIP/CSIRO/ACCESS-ESM1-5/ssp585"
# One path per ensemble member directory under the experiment
path_str = [os.path.join(base_dir, name) for name in os.listdir(base_dir)]
builder = Cmip6Builder(
    path=path_str,
    ensemble=True,
).build()

Can you give me the command you used to build the datastore - is it the same as the one in your last post but with the old path? I’ll rebuild one in my scratch space and see what’s wrong.

I think most likely the issue is that I forgot to add the ensemble keyword to the Cmip6Builder, so that argument is being ignored.

Assuming this is the case, we can fix this & push a bugfix release pretty quickly - probably even today.

%%time

builder = Cmip6Builder(
    path="/g/data/e14/afp599/access-esm/fs38_processed",
    ensemble=False # We could use this to pass multiple paths for different ensemble members "/g/data/e14/afp599/access-esm/post_processed_mw2",
).build()

I realise ensemble = False, but based on this I thought the ensemble members had to be in different folders.

I think we’ll want to change ensemble to True - I don’t think it’ll matter too much right now.

That documentation isn’t too clear - I’ll update it.

I’m taking a look at building that datastore now - I’ll let you know how it goes.

Turns out I jumped the gun a little on this - we’ll need to do a bugfix release.

In the meantime, you should be able to use the test datastore that I used which is here - you should be able to read it, but let me know if I’ve got the permissions wrong: /scratch/tm70/ct1163/ellie_cmip6/experiment_datastore_ensemble.json

Thank you, I’ve asked to join tm70, but if that’s not allowed the datastore could probably go on e14? Cheers

I’ve copied the datastore into /scratch/e14/ct1163/ellie-ds/ - lemme know if it works for you!

It does work, and with the correct ensemble members. Thank you!!

Hey Ellie,

Could you mark this as resolved if it is (I think I’m right in saying that)?

I’m at the limit of how many topics I can be assigned and need to close out some old ones :sweat_smile:

Cheers!