Nci_era5grib no longer working

Hi Team,

I’m trying to run a suite that has previously worked for me… using ERA5 as the driving model over the Antarctic coastline (I run it twice - once on an un-rotated grid to generate the ERA5 files, then on a rotated grid for the model runs).

The nci_era5grib task now runs for about 50 minutes and then crashes with no helpful error message (to me at least):

[FAIL] (module use /g/data/hh5/public/modules; module load conda; nci_era5grib.py --mask $MASK --output $OUTDIR --start $START --count $COUNT --freq $FREQ --era5land $ERA5LAND --polar $POLAR) # return-code=1

Does anyone know if anything has changed within this workflow that would cause it to stop running as it used to?

Cheers,
Sonya

To be clear, I’m running the UM RNS RAL3.1 plus some bug fixes.

Try using the previous conda environment; it may be that an upgrade has broken something.

Hi @sonyafiddes. Thanks for letting us know. There is an issue with CDO in the current conda/analysis3 environment (Error using '-t ecmwf' - CDO - Project Management Service). Unfortunately, this was only discovered after the environment became the stable one; it was resolved in the unstable environment yesterday. There is also a separate issue with the newer analysis environments and the underlying era5grib package. Switching back to analysis3/23.01 resolves it for now. @Paola-CMS is looking at refactoring the era5grib package, which should clear up this issue as well as the CDO dependency.
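For context, the step that breaks is era5grib handing its temporary netCDF file to CDO to convert to GRIB, which is roughly an invocation of this form (the filenames here are placeholders, and the exact options may differ from what the package actually runs):

cdo -t ecmwf -f grb copy era5_fields.nc era5_fields.grib

The -t ecmwf option selects the ECMWF parameter table, and that is the part that currently errors in the stable environment.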

Thanks @Scott and @dale.roberts. I’ve changed the app/nci_era5grib/rose-app.conf file to module load conda/analysis3-23.01, but it’s still not producing any grib files (it’s been running for about 20 minutes with nothing - it should only take a few minutes per file…) - have I missed something?

I also changed site/nci-gadi/suite-adds.rc to module load conda/analysis3/23.01 too…
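For reference, the command line in rose-app.conf now looks something along these lines (paraphrasing from the failing command above, so the exact layout may differ):

[command]
default=module use /g/data/hh5/public/modules; module load conda/analysis3-23.01; nci_era5grib.py --mask $MASK --output $OUTDIR --start $START --count $COUNT --freq $FREQ --era5land $ERA5LAND --polar $POLAR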

Hi @sonyafiddes. I’ve noticed when using nci_era5grib that the runtime can be quite variable. I think the main cause of this is that a) era5grib writes the dataset as netCDF to $TMPDIR in order to have cdo convert it to grib format, and b) cylc overrides $TMPDIR such that these temporary netCDF files end up on /scratch or /g/data. I’ve found that if I reset TMPDIR to $PBS_JOBFS in the task’s [[[environment]]] section, the runtime is much more consistent.
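As a rough sketch, that looks something like the following in the suite’s runtime config (the task name here is just a stand-in for whatever the era5grib task is called in your suite):

[runtime]
    [[nci_era5grib]]
        [[[environment]]]
            TMPDIR = $PBS_JOBFS

With that in place, the temporary netCDF files land on the node-local jobfs disk instead of /scratch or /g/data.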

Hi @dale.roberts, thanks for this.

I’ve set TMPDIR=$PBS_JOBFS under # TASK RUNTIME ENVIRONMENT: in the job file (I then just qsub it manually), but it has still only created two grib files in two hours. Have I done something wrong here? Or is there another way to speed this up?

Hi @sonyafiddes. At this stage the only thing that comes to mind is the interpreter line in app/nci_era5grib/bin/nci_era5grib.py. If the first line of nci_era5grib.py looks something like this:

#!/g/data/hh5/public/apps/miniconda3/envs/analysis3/bin/python

then it’ll be using the default analysis3 env regardless of what module is loaded. If you change that to

#!/usr/bin/env python

then it’ll use the loaded analysis environment instead. If it’s already set to that, then that’s not the problem.
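A quick way to double-check which interpreter the script will pick up once the module is loaded (run from the suite directory; the module path is the same one the suite already uses):

module use /g/data/hh5/public/modules
module load conda/analysis3-23.01
which python
head -1 app/nci_era5grib/bin/nci_era5grib.py

which python shows the Python from the environment that was just loaded, and head -1 prints the script’s current interpreter line.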

That’s done it! Thanks so much Dale! (PS - it was set to the unstable environment… so maybe an issue for the upcoming one too?)

No worries. Yeah, we’re aware it’s an issue in everything after 23.04. I think it’s a Dask problem, based on the logging I can see when the job runs, but it’s hard to tell. We’ll keep working on it; being stuck on analysis3-23.01 isn’t sustainable in the long run.

Might be worth checking whether Dask has matured enough that the climtas I/O is no longer required; that goes into the guts of Dask, so it’s fragile to updates.