Building a COSIMA dataset on time-averaged files

Hi All,
I am monthly-averaging daily ACCESS diagnostics and saving the results to files (one sample per output file). I then add time and time_bounds variables, with the relevant attributes, to each file. I can build a COSIMA dataset from these files without any error messages, but using that dataset to load the actual variables fails with the following message:
ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation.
The same error occurs whether I specify the variable frequency or ncfile = '%monthly%' in the getvar function. I provide details below, as well as in a notebook which can be found here:
/home/552/as2408/NotebookShare/DatabaseTest_V1.ipynb. If anyone wants to actually run the notebook parts where files are modified, I can put clean NetCDF files somewhere. Any advice would be appreciated.

This is how the averaging is done per file; it is pretty straightforward (with some obvious variable definitions):

import xarray as xr

ds = xr.open_dataset(FileNameIn)  # load a file with 3 months of daily data
ds = ds.sel(time=time_slice)      # limit to the desired month
Var = ds[varname].mean('time')    # choose one variable of interest and average it
Var.to_netcdf(FileNameOut)        # save to file
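
Note that the .mean('time') reduction above drops the time dimension entirely, so the file written by to_netcdf contains no time coordinate at all. A quick check (a sketch reusing the names above):

print(Var.dims)              # time no longer appears among the dimensions
print('time' in Var.coords)  # False: the index coordinate is dropped by the reduction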

This is an example of how I add time variables to one file:

import numpy as np
import pandas as pd
from netCDF4 import Dataset

fldBinnedTest = '/g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers/test/exptest/'
month = 6; monthstr = str(month).zfill(2); year = 1981; yearstr = str(year)
fndate = yearstr + '_' + monthstr
fn0 = 'passive_adelie-monthly-mean-ym_' + fndate + '.nc'  # e.g. 'passive_adelie-monthly-mean-ym_1981_06.nc'

dsf = Dataset(fldBinnedTest + fn0, 'r+')
date1 = yearstr + '-' + monthstr + '-15'
print(date1)
t = pd.date_range(start=date1, end=date1, periods=1)
print(t)

from datetime import date
d0 = date(1900, 1, 1)
d1 = date(year, month, 15)
delta = d1 - d0
print(delta.days)

dsf.createDimension('time', None)
time = dsf.createVariable('time', np.int32, ('time',))
time[:] = delta.days
time.standard_name = 'time'  # optional
time.long_name = 'time'
time.units = 'days since 1900-01-01 00:00:00'
time.cartesian_axis = 'T'
time.calendar_type = 'GREGORIAN'
time.calendar = 'GREGORIAN'
time.bounds = 'time_bounds'

d1 = date(year, month, 1)
delta = d1 - d0
time_bound_1 = delta.days
if month < 12:
    d1 = date(year, month + 1, 1)
else:
    d1 = date(year + 1, 1, 1)
delta = d1 - d0
time_bound_2 = delta.days
print(time_bound_1)
print(time_bound_2)


dsf.createDimension('nv', 2)
time_bounds = dsf.createVariable('time_bounds', np.int32, ('time', 'nv'))
time_bounds[:] = [time_bound_1, time_bound_2]
time_bounds.standard_name = 'time_bounds'
time_bounds.long_name = 'time_bounds'
time_bounds.units = 'days since 1900-01-01 00:00:00'
time_bounds.calendar = 'GREGORIAN'

dsf.close()
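
A quick sanity check at this point (a sketch reusing the path defined above) is to reopen the modified file with xarray and see whether the added time metadata decodes cleanly:

import xarray as xr

check = xr.open_dataset(fldBinnedTest + fn0)  # xarray will attempt to decode the new time axis
print(check['time'])
print(check['time_bounds'])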

Finally, I build a dataset, e.g.,

sessionTestTime = cc.database.create_session(FoldDBs + 'Exp9tracersTestTime.db')
cc.database.build_index(fldBinnedTest, sessionTestTime)

which ends without error and also shows up without error in the explorer, except that the times (and cell_methods) are not identified.

And as I wrote above, getvar fails whether I use the following forms or some variations of them:

passive_adelie = cc.querying.getvar(expt='exptest', variable='passive_adelie', 
                          session=sessionTestTime, frequency='1 monthly',
                          start_time='1980-01-01',  end_time='1995-09-09')
passive_adelie = cc.querying.getvar(expt='exptest', variable='passive_adelie', 
                          session=sessionTestTime, ncfile = '%monthly%')

e.g.,

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 passive_adelie = cc.querying.getvar(expt='exptest', variable='passive_adelie', 
      2                           session=sessionTestTime, frequency='1 monthly')

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/lib/python3.9/site-packages/cosima_cookbook/querying.py:368, in getvar(expt, variable, session, ncfile, start_time, end_time, n, frequency, attrs, attrs_unique, return_dataset, **kwargs)
    364     return d[variables]
    366 ncfiles = list(str(f.NCFile.ncfile_path) for f in ncfiles)
--> 368 ds = xr.open_mfdataset(
    369     ncfiles,
    370     parallel=True,
    371     combine="by_coords",
    372     preprocess=_preprocess,
    373     **xr_kwargs,
    374 )
    376 if return_dataset:
    377     da = ds

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/lib/python3.9/site-packages/xarray/backends/api.py:1026, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
   1013     combined = _nested_combine(
   1014         datasets,
   1015         concat_dims=concat_dim,
   (...)
   1021         combine_attrs=combine_attrs,
   1022     )
   1023 elif combine == "by_coords":
   1024     # Redo ordering from coordinates, ignoring how they were ordered
   1025     # previously
-> 1026     combined = combine_by_coords(
   1027         datasets,
   1028         compat=compat,
   1029         data_vars=data_vars,
   1030         coords=coords,
   1031         join=join,
   1032         combine_attrs=combine_attrs,
   1033     )
   1034 else:
   1035     raise ValueError(
   1036         "{} is an invalid option for the keyword argument"
   1037         " ``combine``".format(combine)
   1038     )

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/lib/python3.9/site-packages/xarray/core/combine.py:982, in combine_by_coords(data_objects, compat, data_vars, coords, fill_value, join, combine_attrs, datasets)
    980     concatenated_grouped_by_data_vars = []
    981     for vars, datasets_with_same_vars in grouped_by_vars:
--> 982         concatenated = _combine_single_variable_hypercube(
    983             list(datasets_with_same_vars),
    984             fill_value=fill_value,
    985             data_vars=data_vars,
    986             coords=coords,
    987             compat=compat,
    988             join=join,
    989             combine_attrs=combine_attrs,
    990         )
    991         concatenated_grouped_by_data_vars.append(concatenated)
    993 return merge(
    994     concatenated_grouped_by_data_vars,
    995     compat=compat,
   (...)
    998     combine_attrs=combine_attrs,
    999 )

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/lib/python3.9/site-packages/xarray/core/combine.py:629, in _combine_single_variable_hypercube(datasets, fill_value, data_vars, coords, compat, join, combine_attrs)
    623 if len(datasets) == 0:
    624     raise ValueError(
    625         "At least one Dataset is required to resolve variable names "
    626         "for combined hypercube."
    627     )
--> 629 combined_ids, concat_dims = _infer_concat_order_from_coords(list(datasets))
    631 if fill_value is None:
    632     # check that datasets form complete hypercube
    633     _check_shape_tile_ids(combined_ids)

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/lib/python3.9/site-packages/xarray/core/combine.py:149, in _infer_concat_order_from_coords(datasets)
    144             tile_ids = [
    145                 tile_id + (position,) for tile_id, position in zip(tile_ids, order)
    146             ]
    148 if len(datasets) > 1 and not concat_dims:
--> 149     raise ValueError(
    150         "Could not find any dimension coordinates to use to "
    151         "order the datasets for concatenation"
    152     )
    154 combined_ids = dict(zip(tile_ids, datasets))
    156 return combined_ids, concat_dims

ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation

I don't think your issue has anything to do with the cookbook; it's just how you've added the time dimension to your NetCDF files. If I try to open them directly with xr.open_dataset, it fails with an error like:

ValueError: Failed to decode variable 'time': unable to decode time units 'days since 1900-01-01 00:00:00' with "calendar 'GREGORIAN'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

But actually, if you go back up to the root error, it's

OverflowError: Python int too large to convert to C long

This is because you've created the time variable as an int32, which seems to stay as a fairly restricted type down the decoding chain. I think you should make it a double (that's how it is in the files output directly from the model).
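
For reference, that change amounts to switching the dtype in the createVariable calls from the earlier post (a minimal sketch reusing the same names):

# as above, but with a floating-point time axis and bounds
time = dsf.createVariable('time', np.float64, ('time',))
time_bounds = dsf.createVariable('time_bounds', np.float64, ('time', 'nv'))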

Having said that, I don't know what your workflow is for generating these passive_adelie-monthly-mean... files; maybe there's a way to retain the time information at some prior step so that you don't have to manually add it back in.

Thanks Angus. I'm not sure why open_dataset returned an error for you. For me it has been working fine with these files, and I verified again just now that this is the case. Anyway, I remade them with the time and time_bounds variables as type double, and they can be found here: /g/data/g40/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers/test/exptest/output444/ocean
I deleted and rebuilt the dataset but am getting the same error, unfortunately. Perhaps adding average_T1/T2 variables would do it. I'll try that.

Before the present attempt to add the time information properly in the notebook, the files were created without time information using a Python script that is accessible here: /home/552/as2408/NotebookShare/TimeAvgZ_Fold_V4.py. I time-averaged and saved to file one diagnostic variable from an opened dataset, but I can switch to averaging and saving the whole dataset if there is a way to do that while retaining the full time-related variables/attributes information, which I could not find.

I have now added the variables average_T1, average_T2 and average_DT, plus the relevant attributes on the passive_adelie variable, based on how they appeared in the original dataset before monthly-averaging and adjusted to the monthly average (some of the variables used below were defined earlier in the thread):

average_T1 = dsf.createVariable('average_T1', np.double, dimensions=('time',), fill_value=1.e+20)
average_T1[:] = time_bound_1
average_T1.long_name = 'Start time for average period'
average_T1.units = 'days since 1900-01-01 00:00:00'
average_T1.missing_value = 1.e+20

average_T2 = dsf.createVariable('average_T2', np.double, dimensions=('time',), fill_value=1.e+20)
average_T2[:] = time_bound_2
average_T2.long_name = 'End time for average period'
average_T2.units = 'days since 1900-01-01 00:00:00'
average_T2.missing_value = 1.e+20

average_DT = dsf.createVariable('average_DT', np.double, dimensions=('time',), fill_value=1.e+20)
average_DT[:] = time_bound_2 - time_bound_1
average_DT.long_name = 'Length of average period'
average_DT.units = 'days'
average_DT.missing_value = 1.e+20

p = dsf['passive_adelie']
p.long_name = "passive (adelie)"
p.units = "dimensionless"
p.cell_methods = "time: mean"
p.time_avg_info = "average_T1,average_T2,average_DT"

dsf.close()

I have rebuilt the dataset, but I'm still getting the same error, i.e., Error loading variable passive_adelie data: Could not find any dimension coordinates to use to order the datasets for concatenation. Any suggestions?

Thanks for checking that out. Strangely I still can't open your files using the analysis3-unstable environment with xarray unless I reset the time dimension:

ncap2 -s 'time=double(time)' in.nc out.nc

Even using doubles for time gives the same overflow error. Maybe something is encoding the files in an incompatible way.

Anyway, I think the issue is more likely that xarray can't figure out how to concatenate your variables, as the error message suggests:

        float passive_adelie(st_ocean, yt_ocean, xt_ocean) ;
                passive_adelie:_FillValue = NaNf ;
                passive_adelie:long_name = "passive (adelie)" ;
                passive_adelie:units = "dimensionless" ;
                passive_adelie:cell_methods = "time: mean" ;
                passive_adelie:time_avg_info = "average_T1,average_T2,average_DT" ;

Because there's no time dimension here, it doesn't know what to do.

Here's a little example of how you could maintain more of the coordinate information in xarray, so you don't have to resort to manually fiddling with the file after writing it out:

import xarray as xr
d = xr.open_dataset("passive_adelie.nc")
dm = (d
    .passive_adelie.mean("time", keep_attrs=True)
    .assign_coords(time=d.time.isel(time=0)) # or whatever your time value should be
    .expand_dims("time")
)
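
A possible follow-up (a sketch only; the output file name is illustrative) is to apply the same pattern to the whole dataset rather than a single variable, and write the result straight back out:

dsm = (d
    .mean("time", keep_attrs=True)            # averages every variable that has a time dimension
    .assign_coords(time=d.time.isel(time=0))  # keep a representative time value
    .expand_dims("time")                      # restore time as a length-1 dimension
)
# note: expand_dims adds the length-1 time dimension to all data variables,
# including any that were originally static
dsm.to_netcdf("passive_adelie-monthly-mean-ym_1981_06.nc")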

Also, you don't strictly need the average_T1, average_T2 or average_DT variables for anything to do with the cookbook or xarray; they're just there to give you the full picture when you're doing analysis.

I'll recreate the files in this manner, Angus. Thanks!

Update: this has worked, in that the created files are readable from the explorer or getvar, with the caveat that their time_frequency appears as "static". Selecting start_time/end_time still appears to work well though.

Can you please mark the relevant reply as the solution? That way it makes it easier for others who might have the same problem to find the answer. Thanks!

Correction: using cc.querying.getvar with frequency='static' on the dataset built as in the above solution only loads one file, i.e. one month. To specify start_time and end_time, I find I need to specify that the saved files contain the string "monthly" via the getvar argument ncfile = '%monthly%'.

For example:

passive_amundsen = cc.querying.getvar(expt=expt, variable='passive_amundsen',
                          session=sessionTestTime,ncfile = '%monthly%',
                          start_time='1995-01', 
                          end_time='1997-09')

Sorry, I think I misled you on this. I'd forgotten that the time bounds are used in one instance: to determine the temporal frequency of a file with only a single timestep. Obviously this would be useful in this case! I'm glad you found the workaround to match only the relevant files instead.
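
If the frequency detection were wanted, one possible approach (an untested sketch: it reuses dm from the earlier example, the bounds values and file name are illustrative, and it assumes a CF-style bounds variable is what gets picked up) would be to attach time_bounds before writing:

import numpy as np

out = dm.to_dataset()
# attach a CF-style bounds variable describing the averaging period (values illustrative)
out["time_bounds"] = (
    ("time", "nv"),
    np.array([["1981-06-01", "1981-07-01"]], dtype="datetime64[ns]"),
)
out["time"].attrs["bounds"] = "time_bounds"
out.to_netcdf("passive_adelie-monthly-mean-ym_1981_06.nc")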

Thanks Angus! That only worked after the modification to the file creation that you recommended, btw.