Intake loading aice_m as two separate datasets

I’m trying to load the variable aice_m from the 01deg_jra55v140_iaf_cycle4 experiment using intake, but intake is returning two separate datasets for this one diagnostic. This is the code I’m using to read the data:

dictionary = catalog['01deg_jra55v140_iaf_cycle4'].search(variable='aice_m').to_dataset_dict(
    xarray_open_kwargs={"chunks": -1, "decode_coords": False},
    xarray_combine_by_coords_kwargs={"compat": "override",
                                     "data_vars": "minimal",
                                     "coords": "minimal"},
)

And this is the output:

{'seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700.nkice:4.mean': <xarray.Dataset> Size: 2GB
 Dimensions:  (time: 59, nj: 2700, ni: 3600)
 Coordinates:
   * time     (time) datetime64[ns] 472B 1958-02-01 1958-03-01 ... 2017-01-01
 Dimensions without coordinates: nj, ni
 Data variables:
     aice_m   (time, nj, ni) float32 2GB dask.array<chunksize=(1, 2700, 3600), meta=np.ndarray>
 Attributes: (12/14)
     title:                            sea ice model output for CICE
     contents:                         Diagnostic and Prognostic Variables
     source:                           Los Alamos Sea Ice Model (CICE) Version 5
     time_period_freq:                 month_1
     comment3:                         seconds elapsed into model date:      0
     conventions:                      CF-1.0
     ...                               ...
     intake_esm_attrs:file_id:         seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700.n...
     intake_esm_attrs:frequency:       1mon
     intake_esm_attrs:realm:           seaIce
     intake_esm_attrs:temporal_label:  mean
     intake_esm_attrs:_data_format_:   netcdf
     intake_esm_dataset_key:           seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700.n...,
 'seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700.mean': <xarray.Dataset> Size: 26GB
 Dimensions:  (time: 673, nj: 2700, ni: 3600)
 Coordinates:
   * time     (time) datetime64[ns] 5kB 1960-01-01 1960-02-01 ... 2019-01-01
 Dimensions without coordinates: nj, ni
 Data variables:
     aice_m   (time, nj, ni) float32 26GB dask.array<chunksize=(1, 2700, 3600), meta=np.ndarray>
 Attributes: (12/14)
     title:                            sea ice model output for CICE
     contents:                         Diagnostic and Prognostic Variables
     source:                           Los Alamos Sea Ice Model (CICE) Version 5
     time_period_freq:                 month_1
     comment3:                         seconds elapsed into model date:      0
     conventions:                      CF-1.0
     ...                               ...
     intake_esm_attrs:file_id:         seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700
     intake_esm_attrs:frequency:       1mon
     intake_esm_attrs:realm:           seaIce
     intake_esm_attrs:temporal_label:  mean
     intake_esm_attrs:_data_format_:   netcdf
     intake_esm_dataset_key:           seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700.mean}

The first dataset in this dictionary contains aice_m data from 1958 to 1959 and 2014 to 2017, while the second dataset contains all other time steps. @aekiss and I think this is because, in the 1958-1959 and 2014-2017 years of the simulation, certain sea ice diagnostics were saved that use a grid coordinate called nkice. Even though aice_m doesn’t use this coordinate, intake doesn’t seem to recognise this, and instead groups the files into separate datasets. Is there a way to make intake recognise that this is one single dataset, @CharlesTurner?

Yup, what you said about the cause of the issue is bang on.
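As a toy illustration of that grouping (a sketch only, assuming files are bucketed by their file_id string, as the dictionary keys in the output above suggest; the file names here are made up and this is not the catalog's actual code):

```python
from collections import defaultdict

# Toy stand-in for catalog rows: (file_id, filename).
# The file names are hypothetical; only the file_id strings come from the output above.
rows = [
    ("seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700.nkice:4", "iceh.1958-02.nc"),
    ("seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700", "iceh.1960-01.nc"),
    ("seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700.nkice:4", "iceh.2014-01.nc"),
    ("seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700", "iceh.1960-02.nc"),
]


def group_by_file_id(rows):
    """Bucket file names by file_id, one bucket per would-be dataset."""
    groups = defaultdict(list)
    for file_id, filename in rows:
        groups[file_id].append(filename)
    return dict(groups)


groups = group_by_file_id(rows)
for file_id, files in groups.items():
    print(file_id, files)
```

Because the file_id records every dimension present in a file, the files written with nkice land in their own bucket even though aice_m itself never uses that dimension.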

This has been a bit of a gripe of mine with our current catalog build system for a while, but a clean solution which doesn’t cause more problems than it solves has been eluding me.

You can (hackily) fix it as follows (I’ve tested this and it works):

import intake

# The catalog you had:
catalog = intake.cat.access_nri
esm_ds = catalog['01deg_jra55v140_iaf_cycle4'].search(variable='aice_m')

# Rewrite the file IDs so that all entries are loadable as one dataset.
# regex=False forces a literal match ('.' is a wildcard when treated as a regex).
df = esm_ds.df
df['file_id'] = df['file_id'].str.replace(
    'seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700.nkice:4',
    'seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700',
    regex=False,
)
esm_ds.esmcat._df = df  # _df is a private attribute, hence "hacky"

ds = esm_ds.to_dask(
    xarray_open_kwargs={"chunks": -1, "decode_coords": False},
    xarray_combine_by_coords_kwargs={"compat": "override",
                                     "data_vars": "minimal",
                                     "coords": "minimal"},
)

which overwrites the file IDs so that both sets of files share a single ID and are grouped into one dataset.
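If you would rather not hard-code the full ID string, a slightly more general variant might strip the nkice token with a regular expression. This is an untested sketch: strip_dim_token is a hypothetical helper (not part of intake or the catalog), and it assumes the IDs always follow the `.<dim>:<size>` pattern seen in the output above.

```python
import re


def strip_dim_token(file_id, dim="nkice"):
    """Remove a '.<dim>:<size>' token from a file_id string.

    Hypothetical helper: assumes dimension tokens are dot-separated and
    written as '<name>:<integer>', as in the IDs shown in this thread.
    """
    # The lookahead ensures a whole token is matched, not a prefix of another one.
    pattern = rf"\.{re.escape(dim)}:\d+(?=\.|$)"
    return re.sub(pattern, "", file_id)


with_nkice = "seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700.nkice:4"
without_nkice = "seaIce.1mon.d2:2.nc:5.ni:3600.nj:2700"

print(strip_dim_token(with_nkice))
print(strip_dim_token(without_nkice))
```

It could then be applied with something like df['file_id'] = df['file_id'].map(strip_dim_token) in place of the str.replace call. Only do this for variables that don't actually depend on the stripped dimension.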

It would be fairly straightforward to add a merge_datasets option (or something similar) that handles this for you, which might serve as a decent halfway house for the time being.

I’ll scratch my head a bit more and see if I can come up with a cleaner solution!