Issues loading ACCESS-OM2-01 data from cycle 4

Hello everyone! I am reposting here a question I asked via the CLEX Slack channel.

I am trying to access sea ice concentration data from the 4th cycle of ACCESS-OM2-01. I am using the COSIMA cookbook to do this, but it is taking a long time to do anything. No data is loaded even after waiting for as long as 20 minutes, and restarting the kernel several times has not helped. I also tried to load monthly sea ice concentration data, and after waiting for about 15 minutes I still do not have the data loaded into my Jupyter notebook.

For reference, loading data usually took 2-5 minutes using exactly the same method. I have also tried using two different conda environments (22.07 and 22.10) and that has not made a difference either.

I used the following code:

import cosima_cookbook as cc
import xarray

session = cc.database.create_session()
sic = cc.querying.getvar('01deg_jra55v140_iaf_cycle4', 'aice', session, start_time='1968')

I am using the gadi_jupyter script to access GADI. As I mentioned, I have done this before without any issues and did not have to wait this long. Are there any known issues with GADI or the COSIMA cookbook? Or should I use ARE to run my scripts instead?

Additionally, there is at least one other person who experienced the same issue while loading data today. They decided to switch tasks because the cookbook was not loading any of the data they needed. Unlike me, they were using ARE to access GADI.

Any help will be appreciated.

Denisse


I’m using analysis3-unstable; this is also taking a really long time for me and consuming an unreasonable amount of memory (over 25 GB).

CPU times: user 17min 27s, sys: 21min 5s, total: 38min 32s
Wall time: 38min 24s

By adding a few arguments to skip the coordinate verification, this was relatively fast for me (on a login node):

In [42]: %time sis = cc.querying.getvar(
                       "01deg_jra55v140_iaf_cycle4", "aice", s,
                       start_time="1968", compat="override", coords="minimal"
                     )    
CPU times: user 2min 45s, sys: 1min 40s, total: 4min 26s
Wall time: 2min 54s

The long sys time indicates that it spent a long time doing IO, probably because it’s reading four 2D grids out of every file and then comparing them all. Maybe this doesn’t come up as badly for the ocean data because it’s not output on a curvilinear grid?
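
For illustration (the file name here is just a placeholder; any single ice history file from the experiment would do), each CICE output file carries the same four 2D grid arrays as coordinate variables, which the default open_mfdataset checks read and compare for every file:

import xarray as xr

# Open one CICE history file (placeholder name) and list its coordinates;
# expect time plus the 2D TLON, TLAT, ULON and ULAT grid arrays.
ds = xr.open_dataset("iceh.1968-01.nc")
print(ds.coords)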

This was brought up before:

The suggested fix there was to use decode_coords=False, as demonstrated in the ice plotting recipe. This gives me very similar timing to the above:

In [61]: %time sis = cc.querying.getvar(
                       "01deg_jra55v140_iaf_cycle4", "aice", s,
                       start_time="1968", decode_coords=False
                     )
CPU times: user 2min 39s, sys: 1min 31s, total: 4min 11s
Wall time: 2min 34s

The difference between the two methods is that the first one will give you TLON, TLAT, ULON, ULAT (in addition to time) as coordinates on the resulting DataArray; whereas the second one will only give you a time coordinate – the spatial dimensions just have integer indices.
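
If you take the decode_coords=False route but still want the geographic coordinates back for plotting, one option is to pull the 2D grid arrays from a single file afterwards and attach them to the result. This is only a rough, untested sketch with a placeholder file name:

import xarray as xr

# Placeholder path: any one ice history file from the experiment.
grid = xr.open_dataset("iceh.1968-01.nc")

# Attach the 2D grid coordinates to the DataArray loaded with
# decode_coords=False, skipping the per-file comparison cost.
sis = sis.assign_coords(
    TLON=grid["TLON"], TLAT=grid["TLAT"],
    ULON=grid["ULON"], ULAT=grid["ULAT"],
)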


Wondering if we can test whether this is:

  1. a database problem, in that the cookbook is taking a long time to find the files/metadata; or
  2. a structural problem with the way ice data is saved (either in cycle 4, or more generally?)

@rbeucher and @dougiesquire are looking at ways to improve data catalogues, and it would be nice to know if this problem depends on the database itself …

In this case, it’s option 2: it’s just the way CICE’s sea ice data works (the GitHub issue has a little more technical detail – it would more generally apply to any output with 2D coordinates that get brought in). We could work around it in the cookbook, by doing things like:

  1. different defaults for coords, compat, etc. options for open_mfdataset
  2. default to decode_coords=False (I don’t think this is a good idea)
  3. detect if a variable’s coordinates are 2D at query time and throw a warning, with suggestions for options to speed up the load and/or a link to a notebook demonstrating the difference (a rough sketch of such a check follows below)
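
For option 3, something along these lines could run at query time. This is only a sketch (not existing cookbook code, and the function name is made up):

import warnings
import xarray as xr

def warn_if_2d_coords(path, variable):
    # Peek at one file for the queried variable and warn if any of its
    # coordinates are 2D, since that makes open_mfdataset's default
    # compatibility checks expensive.
    with xr.open_dataset(path) as ds:
        two_d = [name for name, coord in ds[variable].coords.items()
                 if coord.ndim >= 2]
    if two_d:
        warnings.warn(
            f"{variable} has 2D coordinates {two_d}; consider passing "
            "compat='override', coords='minimal' or decode_coords=False "
            "to speed up loading."
        )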

Part of the issue is that maybe we can’t (shouldn’t?) necessarily rely on all the files within an experiment being self-consistent to the point that we can use the nested method of concatenating files and assuming that the coordinates all line up. That probably falls more on the data cataloguing side of things.

Yes, I wonder if checking that all the data in an experiment can be concatenated could be something that is done once when the experiment is added to the database? Then compat="override" and coords="minimal" could safely be used by default.
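
As a rough sketch (not existing cookbook code), such a one-off check at indexing time might look like the following: confirm the 2D grid coordinates are identical across all of an experiment's files, after which compat="override" and coords="minimal" would be safe defaults for that experiment.

import xarray as xr

def grid_coords_consistent(paths, coord_names=("TLON", "TLAT", "ULON", "ULAT")):
    # Compare the 2D grid coordinates in every file against the first one.
    with xr.open_dataset(paths[0]) as first:
        reference = {name: first[name].load() for name in coord_names}
    for path in paths[1:]:
        with xr.open_dataset(path) as ds:
            if not all(reference[name].equals(ds[name]) for name in coord_names):
                return False
    return True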