Intake (bug?)

Hi! I’ve run into a silent problem in intake. I was trying to load a variable, v, which, in the experiment I was using, comes with 2 different “cell methods”: ['time: mean', 'time: mean_pow(02)']

I was not aware of this, and intake did not prompt me with a warning or error: it loaded the variable without complaint and averaged them all together. The result was a very weird-looking diagnostic - I noticed it was not OK, but the student I was working with had not. It would be great if intake failed with an error instead of collating diagnostics without the user knowing. Or maybe a warning? “Careful, you are averaging X number of variables”

Minimal example here: intake_problem.ipynb · GitHub

This is very scary.

I’ll get right on this and get a fix out ASAP.

Thanks for the very clear example notebook.

In this case, I think we can solve the problem by forcing intake to disambiguate on cell methods. Do you have any idea whether that is going to work in general - eg. do we reliably know that if two files contain:

  1. The same variable - v in this instance
  2. On the same grid
  3. But with different cell methods: ['time: mean', 'time: mean_pow(02)'] in this instance,

That they are in fact, and always will be (or should at least default to being), different variables? I think this should be the case, right?

Thank you Charles!!

In this case I reliably know it is the same variable on the same grid. There’s just one cell_id, one v in the variable list, and actually 3 (?) cell methods. One of them is just an empty string '' and I have no idea what it means.

I am trying to think if there would be any exceptions for this, but I don’t know.

There’s another possible (but unlikely) case that comes to mind, for the time dimension. It is possible to output diagnostics as an average in time or as a snapshot. If you happened to have a variable with both daily mean and daily snapshot available, I don’t think the 1day argument would distinguish them? Having said that, I am not aware of any simulations right now that have such diagnostics output. But it could happen in the future.

So the way that the Builder objects that build these ESM-Datastores work is that they open every file in the directory, and extract various metadata needed for cataloguing it - all the fields you see in the datastore, basically.

The variable_cell_methods field (and all the other variable_xyz fields) is extracted by looping over all the variables in the files and reading their attributes.

So, for eg. the v entry: variable_cell_methods = [ds[var].attrs.get("cell_methods", "") for var in ds.data_vars].

The .get("cell_methods", "") means that if no cell methods are found for that variable, an empty string will be used as the value - which is where that empty entry is coming from. Potentially it would be better to just remove it.
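To make the fallback behaviour concrete, here's a minimal stand-in for that extraction loop. A real Builder would open the file with xarray; plain dicts of variable attributes (a hypothetical simplification) are enough to show where the empty-string entries come from:

```python
# Minimal sketch of the metadata-extraction loop described above.
# `variables` maps each variable name to its attribute dict, standing in
# for what a Builder would read from a netCDF file via xarray.
def extract_cell_methods(variables: dict[str, dict]) -> list[str]:
    # .get("cell_methods", "") falls back to an empty string when a
    # variable has no cell_methods attribute at all - which is exactly
    # where the mysterious '' entries in the catalog come from.
    return [attrs.get("cell_methods", "") for attrs in variables.values()]


file_vars = {
    "v": {"cell_methods": "time: mean"},
    "w": {},  # no cell_methods attribute
}
print(extract_cell_methods(file_vars))  # ['time: mean', '']
```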

I’m not sure I asked exactly what I was trying to, reading it back. The complication is that I don’t think it’s straightforward to disambiguate based on variable cell methods - imagine the following scenario:

  1. Imagine we have three files: the first with v_mean and v_max, the second with v_max and the third with v_mean (this is going to seem a bit contrived, but stay with me).
  2. All are on the same grid.
  3. File 1 will have variable_cell_methods = ['time: mean', 'time: max']; file 2 variable_cell_methods = ['time: max']; file 3 variable_cell_methods = ['time: mean'].

In this scenario, if we try to disambiguate on variable_cell_methods - that is, split up what we consider a dataset based on that - we will wind up telling intake-esm that none of these files can be combined.

I think we would actually want to be able to combine 1 & 2 or 1 & 3, but not 2 & 3.
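One hypothetical rule that captures that combination pattern (a sketch for discussion, not intake-esm code): treat two files as combinable when one file's set of cell methods is contained in the other's, so a file that only holds v_max can join one holding both, but two files asserting different methods for the same variable cannot.

```python
# Hypothetical compatibility rule for the three-file scenario above:
# combinable when one set of cell methods is a subset of the other.
def combinable(a: set[str], b: set[str]) -> bool:
    return a <= b or b <= a


file1 = {"time: mean", "time: max"}
file2 = {"time: max"}
file3 = {"time: mean"}

print(combinable(file1, file2))  # True  - file 2 is a subset of file 1
print(combinable(file1, file3))  # True  - file 3 is a subset of file 1
print(combinable(file2, file3))  # False - disjoint methods for the same v
```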

Arguably this is the safest way of handling things, but it might also become a bit inconvenient. I was looking through the full catalog and I found a bunch of datasets where it looks like the cell methods only differ on things like nv and rho2. For eg. an Eulerian velocity field, we probably wouldn’t want to split into separate datasets based on whether one file contains a maximum on density surfaces and another doesn’t?

I thiiink the behaviour that we want is something like:

  • If a specific variable has been searched for (ie. esm_datastore.search(variable = 'xyz').to_dask()), then for each file that we open in that dataset, ensure all the variables in the dataset have compatible cell methods?

Intake-ESM already has some code that handles this sort of stuff & we could extend that - if you search for variable x it will know to only put x (and necessary coordinate variables) into the dataset it returns from to_dask.
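A toy version of that post-search consistency check might look like the following. The function name, inputs, and error message are all hypothetical - this is a sketch of the intended behaviour, not intake-esm's actual API:

```python
# Hypothetical post-search check: after searching for one variable, verify
# that every file contributing to the dataset reports the same cell method
# for it, rather than silently averaging different methods together.
def check_consistent(files: dict[str, str], variable: str) -> str:
    # `files` maps file path -> cell method recorded for `variable`
    methods = set(files.values())
    if len(methods) > 1:
        raise ValueError(
            f"Careful: {variable!r} has multiple cell methods "
            f"{sorted(methods)}; refusing to combine them silently."
        )
    return methods.pop()


print(check_consistent({"a.nc": "time: mean", "b.nc": "time: mean"}, "v"))
```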

That sounds good to me Charles.


Once I have a working implementation of this (it turns out to be irritatingly complicated to implement) I’ll generate a demo of the updated behaviour & give you a ping to check it’s exactly what it should be, if that’s okay!


Thank you Charles! Sorry I missed the messages above!

Sorry this is taking a while!

I’ve rebuilt a catalog against a fork with the necessary changes here, which basically ‘explodes’ iterable columns (somewhat amusingly, that’s actually a technical term).

The TL;DR is that previously, all cell methods would be attached to the single netCDF file entry in the catalog - this made searching on specific cell methods a bit of a faff, and if eg. u had time: mean and v had time: mean_pow(02) in the same file, there would be no way of separating the two out - because they’re attached to the same entry.

Now we’ve exploded out all the iterables, we have one entry per variable in each catalog, which makes the catalogs look a bit bigger, but splits the data up in a more fine-grained way.
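For anyone unfamiliar with the term, 'explode' here is the pandas operation of that name. A small sketch with made-up column names (not the exact catalog schema) shows the before/after:

```python
import pandas as pd

# One file-level catalog row whose variable and cell-method columns hold
# lists - the pre-explode state described above.
cat = pd.DataFrame(
    {
        "path": ["output.nc"],
        "variable": [["u", "v"]],
        "variable_cell_methods": [["time: mean", "time: mean_pow(02)"]],
    }
)

# Exploding both iterable columns together yields one row per variable,
# so each cell method can now be searched for independently.
exploded = cat.explode(["variable", "variable_cell_methods"])
print(exploded.to_string(index=False))
```

After exploding, u / time: mean and v / time: mean_pow(02) sit on separate rows that share the same path, rather than being fused into one entry.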

I’ve looked into changing the way that datasets are grouped, and I’m not sure it’s going to make sense to force all datasets to have the same variable_cell_methods, as it will split up a bunch of datasets that should be together - in our tests, it seems that adding that split leaves datastores with roughly one dataset per file - which is no good to anyone.

Due to the way our test data is constructed, the only way to really find out whether this is in fact going to be a major issue is to build a full catalog & check - which is ongoing. I’m hoping the test data is misleadingly making it look like more of an issue than it really is - the tests are designed to have minimal duplication, so it’s plausible it’s just the tests making things look bad.

If it turns out that it does split things too much, we’ll have to update intake-esm to warn if multiple cell methods are merged.

I’ll update when I have all the necessary rebuilt catalogs to assess what the best path forward is!

I think this is beyond my understanding these days :sweat_smile: but a solution underway sounds awesome! Thank you