So, the way the Builder objects that build these ESM datastores work is that they open every file in the directory and extract the various metadata needed to catalogue it - essentially all the fields you see in the datastore.
For variable_cell_methods (and all the other variable_xyz fields), these are extracted by looping over all the variables in each file & pulling out their attributes.
So, e.g. for a variable v, the entry that ends up in variable_cell_methods is ds['v'].attrs.get("cell_methods", "").
The .get("cell_methods", "") means that if no cell methods are found for that variable, an empty string is stored as the fallback value - which is where those empty strings are coming from. Potentially it would be better to just remove it.
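In other words, the extraction is roughly the sketch below - this is a simplified illustration rather than the actual Builder code, and the exact set of fields (e.g. variable_units) is just indicative:

```python
import xarray as xr


def extract_variable_attrs(path: str) -> dict:
    """Open a single file and pull out per-variable metadata,
    roughly mirroring what the Builder does when cataloguing.
    (Illustrative sketch only, not the real Builder implementation.)"""
    ds = xr.open_dataset(path, chunks={})
    variables = list(ds.data_vars)
    # One list entry per variable; a missing attribute falls back to "",
    # which is where the empty strings in the datastore come from.
    variable_cell_methods = [
        ds[var].attrs.get("cell_methods", "") for var in variables
    ]
    variable_units = [ds[var].attrs.get("units", "") for var in variables]
    return {
        "variable": variables,
        "variable_cell_methods": variable_cell_methods,
        "variable_units": variable_units,
    }
```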
Reading it back, I’m not sure I asked exactly what I was trying to ask. The complication is that I don’t think it’s straightforward to disambiguate based on variable cell methods - imagine the following scenario:
- Imagine we have three files: the first with v_mean and v_max, the second with v_max, and the third with v_mean (this is going to seem a bit contrived, but stay with me).
- All are on the same grid.
- File 1 will have variable_cell_methods = ['time: mean', 'time: max']; file 2 will have variable_cell_methods = ['time: max']; and file 3 will have variable_cell_methods = ['time: mean'].
In this scenario, if we try to disambiguate on variable_cell_methods - that is, split up what we consider a dataset based on that - we will wind up telling intake-esm that none of these files can be combined.
I think we would actually want to be able to combine 1 & 2 or 1 & 3, but not 2 & 3.
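To make that concrete, here's a toy illustration (plain Python, not intake-esm internals) of what happens if the full variable_cell_methods list becomes part of the key that decides what counts as one dataset:

```python
# Toy illustration: if the whole variable_cell_methods tuple is used as
# the grouping key, the three files each get a distinct key, so nothing
# gets combined - even though file1+file2 or file1+file3 would be fine.
files = {
    "file1.nc": ("time: mean", "time: max"),  # v_mean and v_max
    "file2.nc": ("time: max",),               # v_max only
    "file3.nc": ("time: mean",),              # v_mean only
}

groups = {}
for fname, cell_methods in files.items():
    groups.setdefault(cell_methods, []).append(fname)

print(groups)
# {('time: mean', 'time: max'): ['file1.nc'],
#  ('time: max',): ['file2.nc'],
#  ('time: mean',): ['file3.nc']}
# i.e. three separate "datasets".
```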
Arguably this is the safest way of handling things, but it might also become a bit inconvenient. I was looking through the full catalog & I found a bunch of datasets where it looks like the cell methods only differ on things like nv and rho2. For an Eulerian velocity field, for example, we probably wouldn’t want to split it into separate datasets based on whether one file contains a maximum on density surfaces and another doesn’t?
I thiiink the behaviour that we want is something like:
- If a specific variable has been searched for (i.e. esm_datastore.search(variable='xyz').to_dask()), then for each file that we open in that dataset, ensure all the variables in that dataset have compatible cell methods?
Intake-ESM already has some code that handles this sort of stuff & we could extend that - if you search for variable x it will know to only put x (and necessary coordinate variables) into the dataset it returns from to_dask.
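The sort of check I have in mind would look something like the sketch below - check_cell_methods_consistent is a hypothetical helper, not existing intake-esm API, and it assumes we have the per-file datasets for the searched variable in hand before they get concatenated:

```python
import xarray as xr


def check_cell_methods_consistent(datasets: list[xr.Dataset], variable: str) -> None:
    """Hypothetical post-search check: given the per-file datasets that
    would be combined for a searched variable, make sure that variable's
    cell_methods agree across files before concatenation."""
    cell_methods = {
        ds[variable].attrs.get("cell_methods", "")
        for ds in datasets
        if variable in ds
    }
    if len(cell_methods) > 1:
        raise ValueError(
            f"Incompatible cell_methods for {variable!r}: {sorted(cell_methods)}"
        )
```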