As you can see, intake is pulling from both the mean and pow02 files, so the resulting u is actually 1/4 of what it should be. No error is thrown for this.
The workaround I found is to use xr.open_mfdataset instead:
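Something along these lines, sketched here with synthetic throwaway files so it is self-contained; on Gadi you would point the glob at the actual output paths instead (e.g. a pattern ending in `-mean-ym*`), so only one reduction ever matches:

```python
import os
import tempfile

import numpy as np
import xarray as xr

tmpdir = tempfile.mkdtemp()

# Write one "mean" and one "pow02" file for the same variable and times,
# mimicking the two reductions that intake was silently mixing.
for kind, value in [("mean", 1.0), ("pow02", 0.5)]:
    ds = xr.Dataset(
        {"u": ("time", np.full(3, value))},
        coords={"time": np.arange(3)},
    )
    ds.to_netcdf(os.path.join(tmpdir, f"ocean-3d-u-1-monthly-{kind}-ym_1990.nc"))

# Glob only the mean files, so the two reductions are never combined.
u = xr.open_mfdataset(os.path.join(tmpdir, "*-mean-ym*.nc"))["u"]
print(float(u.mean()))  # 1.0
```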
This then works. I think @JuliaN was having this issue too? It might be worth checking if this error exists in other runs (e.g. RYF) and also for other variables (e.g. v, w).
It would be nice if Intake provided some warning that the reduction method is inconsistent between files, or (even better) halted with an error message saying that disambiguation is needed.
It'd be useful to have a large red warning about this in the Intake tutorial while the fix is in progress. I imagine students could easily get by without noticing the problem (we only noticed because we know how the velocities in our specific analysis should look).
One of the inevitable trade-offs of automating data reading and removing intermediate steps is that it reduces both how much the user needs to know about the data and the number of points where a visual check might be made.
Given this type of problem recurs, do we need to think more deeply as a community about the types of additional safeguards that could be included in automated data-reading like intake?
My tendency is to go verbose. For instance, we could print the unique file paths that intake is pulling from when calling a variable, e.g.
Now loading:
/g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle4/output732/ocean/ocean-3d-u-1-monthly-mean-ym*
/g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle4/output732/ocean/ocean-3d-u-1-monthly-pow02-ym*
But admittedly that might be too narrow a solution for a bigger problem?
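The "verbose" idea could look something like the sketch below, which collapses the per-file paths behind a search result into their `-ym*` patterns before loading. The `report_paths` helper and the demo dataframe are hypothetical; the real intake-esm datastore exposes its file list via a dataframe with a `path` column, which is what this stands in for:

```python
import pandas as pd

def report_paths(df: pd.DataFrame) -> list:
    """Print the unique '-ym*' file patterns behind a search result."""
    stems = sorted({p.split("-ym")[0] + "-ym*" for p in df["path"]})
    print("Now loading:")
    for stem in stems:
        print(" ", stem)
    return stems

# Illustrative dataframe standing in for an intake-esm datastore's .df
demo = pd.DataFrame({"path": [
    "/d/ocean-3d-u-1-monthly-mean-ym_1990_01.nc",
    "/d/ocean-3d-u-1-monthly-mean-ym_1990_02.nc",
    "/d/ocean-3d-u-1-monthly-pow02-ym_1990_01.nc",
]})
stems = report_paths(demo)
```

Printing two distinct patterns for a single variable, as here, is exactly the red flag a user would want to see.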
Sorry, yeah, this is a bit of a nasty problem that we’re working on. The solution was actually done a while ago but got blocked behind some complicated build-related upgrades and a bunch of other urgent stuff. Since this has reared its ugly head again, I’m going to revert all of those changes so we can push the fix through, and then we’ll reapply the changes as we go.
As a bit of a stopgap (this was meant to be held back a bit longer until we were happier with how it works, so please don’t ignore the big yellow warning banner), I’ve also been working on a better way to access and share data through the intake catalog, which you can find at interactive-catalog-spa (currently - it’ll move somewhere more official soon). Most of the complicated build issues the fix got stuck behind were related to changing infrastructure to make this viewer possible - I’m hoping it will be a step change in intake usability.
In this online viewer tool, the filters you can apply reflect what remains in the filtered dataset, so it should be more straightforward to check for this issue. This morning I’ll also add a warning for cases where .to_dask() would return more than one variable_cell_methods option - it should be a ~30 minute job.
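That check could work roughly like the sketch below. The `variable_cell_methods` column name matches the catalog's, but the helper, the demo values, and the warning text are all illustrative, not the actual implementation:

```python
import warnings

import pandas as pd

def warn_on_mixed_cell_methods(df: pd.DataFrame) -> set:
    """Warn when a selection spans more than one cell-methods reduction."""
    methods = set(df["variable_cell_methods"])
    if len(methods) > 1:
        warnings.warn(
            "Selection mixes cell methods (" + ", ".join(sorted(methods))
            + ") - filter the catalog further before calling .to_dask()."
        )
    return methods

# Illustrative selection mixing two reductions of the same variable
demo = pd.DataFrame({"variable_cell_methods": ["time: mean", "time: pow02"]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    methods = warn_on_mixed_cell_methods(demo)
```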
Touch wood, fixing the variable cell methods stuff in the catalog should be done by early next week - the fix is mostly done, there’s just a lot of related groundwork to cover.
EDIT: The online catalog explorer I linked should now tell you whether you are going to run into variable_cell_methods issues.
Please let me know if you get any unexpected behaviours. I’m just going through and updating the docs today.
TL;DR: Catalogs now have a temporal_sample field that should disambiguate these time aggregations. This is quite a far-reaching fix, so there is some potential for weird edge cases that I haven’t found. However, it should now be safe by default!
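To illustrate how the new field disambiguates, here is a toy stand-in for the catalog dataframe; the `temporal_sample` column is from the fix above, but the other columns, values, and the search call in the comment are illustrative assumptions:

```python
import pandas as pd

# Stand-in for the catalog dataframe: same variable, two reductions.
cat_df = pd.DataFrame({
    "variable": ["u", "u"],
    "temporal_sample": ["mean", "pow02"],
    "path": ["u-mean.nc", "u-pow02.nc"],
})

# Selecting a single temporal_sample keeps the aggregations from mixing
# (in intake terms, presumably something like
# catalog.search(variable="u", temporal_sample="mean")).
mean_rows = cat_df[
    (cat_df["variable"] == "u") & (cat_df["temporal_sample"] == "mean")
]
print(mean_rows["path"].tolist())  # ['u-mean.nc']
```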
The online catalog viewer I linked above won’t mirror these changes quite yet, but should soon - realistically probably January. It’s still a work in progress and I haven’t set it up to auto-mirror yet, so just be aware that it’s temporarily out of date. Once we’re happy with it, we’ll do a proper release to let everyone know.