Hi there, I am trying to build an intake datastore using someone else’s existing ESM1.5 experiments. The hope is to be able to build on existing code with this. However, I get this error below, and it also fails with the same error when I do it through the terminal, as suggested by Charles here. Could I please check if I’m misunderstanding something or if something is missing? Thank you!!
%%time
builder = AccessEsm15Builder(
    path="/g/data/e14/afp599/access-esm/fs38_processed",
    ensemble=False  # We could use this to pass multiple paths for different ensemble members "/g/data/e14/afp599/access-esm/post_processed_mw2",
).build()
--------------------------------------------------------------------------
ParserError Traceback (most recent call last)
File <timed exec>:4
File /g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/access_nri_intake/source/builders.py:203, in BaseBuilder.build(self)
198 def build(self):
199 """
200 Builds a datastore from a list of netCDF files or zarr stores.
201 """
--> 203 self.get_assets().validate_parser().parse().clean_dataframe()
205 return self
File /g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/access_nri_intake/source/builders.py:191, in BaseBuilder.validate_parser(self)
188 validate_against_schema(info, ESM_JSONSCHEMA)
189 return self
--> 191 raise ParserError(
192 f"""Parser returns no valid assets.
193 Try parsing a single file with Builder.parser(file)
194 Last failed asset: {asset}
195 Asset parser return: {info}"""
196 )
ParserError: Parser returns no valid assets.
Try parsing a single file with Builder.parser(file)
Last failed asset: /g/data/e14/afp599/access-esm/fs38_processed/wfo_Omon_ACCESS-ESM1-5_ssp585_r9i1p1f1_2015-2100_r360x180.nc
Asset parser return: {'INVALID_ASSET': '/g/data/e14/afp599/access-esm/fs38_processed/wfo_Omon_ACCESS-ESM1-5_ssp585_r9i1p1f1_2015-2100_r360x180.nc', 'TRACEBACK': 'Traceback (most recent call last):\n File "/g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/access_nri_intake/source/builders.py", line 663, in parser\n match_groups = re.match(r".*/([^/]*)/history/([^/]*)/.*\\.nc", file).groups()\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nAttributeError: \'NoneType\' object has no attribute \'groups\'\n'}
Typically this error results from files being named in ways that the builder doesn’t expect. I’ve requested to join e14 & I’ll update when I can check exactly what the issue is.
Okay, I’ve taken a look - these are all CMIP formatted data, which the ESM1.5 Builder doesn’t recognise the name patterns for - we fall back to using the filenames to work out things like frequencies.
I think we can pretty straightforwardly add a builder which handles CMIP formatted data - it’s a fairly minimal job now. I’m a bit tied up today - @joshuatorrance are you able to take a look at this?
also cc @dougiesquire, if you think adding a CmipBuilder is a terrible idea for any particular reason can you let us know?
I think adding this builder would be something of a stopgap measure - we’re aiming to completely break the filename dependency - but I don’t see any harm in having a builder tailored to this format of output in the meantime.
The builder is expecting to see a path that looks something like this one: /g/data/p73/archive/non-CMIP/ACCESS-ESM1-5/PI-GWL-B2035/history/atm/netCDF/*.nc
The builder’s regex looks at the directories immediately before and after `history` to determine the experiment_id and realm for each file. The CMIP-formatted path has no `history` directory, so the regex finds no match.
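To make the failure concrete, here is a minimal sketch that applies the regex from the traceback to both kinds of path. The filename `example.nc` in the first path is made up for illustration; the second path is the failing asset reported in the error.

```python
import re

# Pattern copied from the traceback (builders.py, line 663)
PATTERN = r".*/([^/]*)/history/([^/]*)/.*\.nc"

# Layout the ESM1.5 builder expects; "example.nc" is a hypothetical filename
esm_path = "/g/data/p73/archive/non-CMIP/ACCESS-ESM1-5/PI-GWL-B2035/history/atm/netCDF/example.nc"
# The CMIP-formatted asset reported in the ParserError
cmip_path = "/g/data/e14/afp599/access-esm/fs38_processed/wfo_Omon_ACCESS-ESM1-5_ssp585_r9i1p1f1_2015-2100_r360x180.nc"

esm_match = re.match(PATTERN, esm_path)
cmip_match = re.match(PATTERN, cmip_path)

# Directory before "history" is the experiment, directory after it is the realm
print(esm_match.groups())
# With no "history" directory there is no match object at all, so calling
# .groups() on this result raises the AttributeError seen in the traceback
print(cmip_match)
```

Running the suggested `Builder.parser(file)` on a single file should surface the same `AttributeError` without scanning the whole directory.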
Since this is CMIP data we can presumably pull the info we need straight out of each file’s metadata. Adding a Builder for CMIP is probably worthwhile and shouldn’t be difficult.
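As a rough sketch of what reading from metadata could look like (this is not the actual Cmip6Builder implementation): CMIP6 files carry global attributes such as `variable_id`, `table_id`, `experiment_id` and `variant_label`. The helper name `record_from_attrs` and the example attribute values are hypothetical. In practice the dict would come from something like `xarray.open_dataset(path).attrs`, and post-processed files may not keep every attribute.

```python
# Global attribute names defined by the CMIP6 conventions
CMIP6_KEYS = (
    "variable_id",
    "table_id",
    "source_id",
    "experiment_id",
    "variant_label",
    "frequency",
    "realm",
)


def record_from_attrs(attrs):
    """Hypothetical helper: pick the CMIP6 identifiers out of a file's attributes."""
    missing = [key for key in CMIP6_KEYS if key not in attrs]
    if missing:
        # Fail loudly rather than guessing from the filename
        raise KeyError(f"missing CMIP6 attributes: {missing}")
    return {key: attrs[key] for key in CMIP6_KEYS}


# Illustrative values only, shaped like the failing asset in this thread
example_attrs = {
    "variable_id": "wfo",
    "table_id": "Omon",
    "source_id": "ACCESS-ESM1-5",
    "experiment_id": "ssp585",
    "variant_label": "r9i1p1f1",
    "frequency": "mon",
    "realm": "ocean",
    "title": "unrelated attribute that the helper ignores",
}

record = record_from_attrs(example_attrs)
print(record)
```

Raising on missing attributes keeps the sketch independent of the filename, which is the dependency the thread wants to move away from.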
I’ve applied for e14 too, I’ll start on a CMIP/CMIP6 builder. EDIT: Charles beat me to it!
Thank you Charles, I am able to make a datastore now!
I would like to clarify some best practices though. Here the datastore search (fs38_processed_datastore.search(variable="thetao", frequency="1mon").df) has picked up many files, because the nc files are saved separately for each ensemble member. But when I call fs38_processed_datastore.search(variable="thetao", frequency="1mon").to_dask() I only get one field, and I can’t tell which ensemble member is being loaded. Am I just not supposed to store different ensemble members in the same folder when using the builders?
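One way to see which ensemble members a search has matched is to group the file paths by the member label in each filename. A minimal sketch, assuming the filenames contain a CMIP6-style label such as `r9i1p1f1`. The helper name `group_by_member` is made up, the second path (r10i1p1f1) is invented for illustration, and the list stands in for the path column of the `.df` output, which is assumed to hold full file paths.

```python
import re
from collections import defaultdict

# CMIP6-style member label, e.g. r9i1p1f1
MEMBER_RE = re.compile(r"_(r\d+i\d+p\d+f\d+)_")


def group_by_member(paths):
    """Hypothetical helper: map each member label to the files that carry it."""
    groups = defaultdict(list)
    for path in paths:
        found = MEMBER_RE.search(path)
        # Files without a recognisable label are collected under "unknown"
        groups[found.group(1) if found else "unknown"].append(path)
    return dict(groups)


paths = [
    "/g/data/e14/afp599/access-esm/fs38_processed/wfo_Omon_ACCESS-ESM1-5_ssp585_r9i1p1f1_2015-2100_r360x180.nc",
    # Hypothetical second member, for illustration only
    "/g/data/e14/afp599/access-esm/fs38_processed/wfo_Omon_ACCESS-ESM1-5_ssp585_r10i1p1f1_2015-2100_r360x180.nc",
]

members = group_by_member(paths)
print(sorted(members))
```

If more than one label comes back for a single variable and frequency, the search spans several members, and a single loaded field cannot represent all of them.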
Perhaps a better example directory for the Cmip6Builder would have been /g/data/fs38/publications/CMIP6/ScenarioMIP/CSIRO/ACCESS-ESM1-5/ssp585/, since it stores each ensemble member in its own directory? I think most CSIRO-maintained CMIP data is stored like this. I tried the Cmip6Builder on it as below and ended up with this error: ValueError: asset list provided is None. Please run `.get_assets()` first
import os

# Each entry under ssp585 is expected to be one ensemble member directory
base_dir = "/g/data/fs38/publications/CMIP6/ScenarioMIP/CSIRO/ACCESS-ESM1-5/ssp585"
path_str = [os.path.join(base_dir, member) for member in os.listdir(base_dir)]
builder = Cmip6Builder(
    path=path_str,
    ensemble=True,
).build()
Can you give me the command you used to build the datastore - is it the same as the one in your last post but with the old path? I’ll rebuild one in my scratch space and see what’s wrong.
I think most likely the issue is that I forgot to add the ensemble keyword to the Cmip6Builder, so that argument is being ignored.
Assuming this is the case, we can fix this & push a bugfix release pretty quickly - probably even today.
%%time
builder = Cmip6Builder(
    path="/g/data/e14/afp599/access-esm/fs38_processed",
    ensemble=False  # We could use this to pass multiple paths for different ensemble members "/g/data/e14/afp599/access-esm/post_processed_mw2",
).build()
I realise I set ensemble=False, but based on this I thought the ensemble members had to be in different folders.
Turns out I jumped the gun a little on this - we’ll need to do a bugfix release.
In the meantime, you should be able to use the test datastore that I used which is here - you should be able to read it, but let me know if I’ve got the permissions wrong: /scratch/tm70/ct1163/ellie_cmip6/experiment_datastore_ensemble.json