All netcdf_conversion jobs failing since last Saturday

Hi Team

All my netcdf_conversion jobs within the ACCESS-AM3 alpha release run have failed since last Saturday. Before that, all jobs succeeded. I am wondering whether there have been changes to the Gadi Python environment, or whether too many concurrent jobs were tying up Python.

It seems the netcdf_conversion task never actually starts its work, and just sits there until it exceeds its walltime.

You can find the logs for a failed job here: /scratch/public/qg8515/jobf.out, and the logs for a successful job here: /scratch/public/qg8515/jobs.out

Any insights would be appreciated.

Regards, Qinggang

Hi @qinggangg,

Can you please also share the error log for the failed run?

Also, the /scratch/public/qg8515/jobs.out path doesn’t seem to exist

EDIT: Found! It’s at /scratch/public/qg8515/.jobs.out

Hi @qinggangg, which branch of the configurations were you using as your starting point for this experiment? I can’t replicate your errors from the current dev-n96e branch.

Would you be able to show what additions you’ve made to your .bash_profile? The PATH in jobf.out looks a bit odd. There are some miniconda paths in there, which could be causing conflicts with the um2netcdf4 environment.
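If it helps, one quick way to list the miniconda entries in your PATH from a login shell (purely illustrative):

# Print PATH one entry per line and pick out anything miniconda-related
echo "$PATH" | tr ':' '\n' | grep -n miniconda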

Hi @atteggiani Thank you. I copied the successful and failed job.err files to the folder /scratch/public/qg8515 as well.

Hi @lachlanswhyborn I added the following two lines to my .bashrc, but that was done a long while ago, not last Saturday.

module load ncview
module load netcdf

I can also share the whole .bashrc, but it hasn’t been modified recently and mainly contains alias and export commands.

There is also a conda section:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/563/qg8515/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/563/qg8515/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/563/qg8515/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/563/qg8515/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

Would it also be easy to run netcdf_conversion offline?

I created my own branch from dev-n96e in a forked repository. I added many output variables, which makes the netcdf_conversion job very memory-heavy.

Hmm, the netcdf_conversion job normally takes on the order of a few minutes, so it seems unlikely that adding any reasonable number of variables would cause it to take 4 hours. I see you added a couple more log files to that shared scratch directory, which suggests the netcdf conversion was successful. Did you make any changes to achieve this?

The successful netcdf_conversion jobs were run before last Saturday, and they took around 2.5 hours. Since last Saturday, all netcdf_conversion jobs have failed by exceeding their walltime.

The version of conda/analysis3 differs between the runs: the successful run used conda/analysis3-25.08, while the failed run used conda/analysis3-26.01. I know that ants was removed from the conda environments for all versions past 25.08, as it was placing untenable restrictions on other package versions.
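For reference, a quick way to confirm this from the shared logs (assuming the module loads are echoed into the job output, and using the filenames shared earlier in this thread):

# Check which conda/analysis3 module each job log records
grep -H "conda/analysis3" /scratch/public/qg8515/jobf.out /scratch/public/qg8515/.jobs.out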

The conda environment is actually loaded when loading pythonlib/um2netcdf4/xp65. This modulefile was updated on Friday afternoon, which lines up with the behaviour you’re seeing. We’ll do some investigation internally to work out whether it is actually this change in version causing the problem, and whether there’s a way around it.
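If you want to see what the modulefile pulls in yourself, the standard module inspection commands should show it (a sketch, assuming the usual Environment Modules setup on Gadi):

# Show what the um2netcdf4 modulefile sets up, including any nested module loads
module show pythonlib/um2netcdf4/xp65
# Confirm which conda/analysis3 version ends up loaded
module list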

In the meantime, a temporary workaround may be to add module use /g/data/xp65/public/modules and module load conda/analysis3-25.08 to the NetCDF conversion task in suite.rc, before the module load pythonlib/um2netcdf4/xp65 line. um2netcdf4 only loads the default conda/analysis3 if a version of it is not already loaded, so loading a specific version first should restore the previous behaviour; a sketch is below.
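Something like the following, assuming the task's modules are loaded via a pre-script (the exact quoting and indentation will depend on your suite.rc):

[[netcdf_conversion]]
    pre-script = """
        module use /g/data/xp65/public/modules
        module load conda/analysis3-25.08
        module load pythonlib/um2netcdf4/xp65
    """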

Thank you. I will check now.

Hi @lachlanswhyborn Thank you for the suggestion, but the job failed again by exceeding its walltime. I copied the failed and successful logs to /scratch/public/qg8515, in netcdf_conversionf and netcdf_conversions, for comparison.

Looks like it still used conda/analysis3-26.01, so it must be checking for that specific module version rather than for any version. Can you try the following:

  1. Remove the current contents of the pre-script in the [[netcdf_conversion]] task, leaving only module use /g/data/xp65/public/modules and module load conda/analysis3-25.08, so that we load the specific version of conda/analysis3.
  2. Copy the files um2netcdf4.py and stashvar_cmip6.py from /g/data/access/apps/pythonlib/um2netcdf4/2.1 into the app/netcdf_conversion/file directory in the configuration (a sketch follows this list). The other thing the original um2netcdf4 modulefile did was add this directory to PYTHONPATH, but we can bypass that by putting the files on the working path.
  3. Re-run the NetCDF conversion task.
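For step 2, the copy might look something like this, run from the root of your configuration checkout (paths as given above; adjust the destination if your layout differs):

# Copy the conversion scripts so they sit on the working path of the task
cp /g/data/access/apps/pythonlib/um2netcdf4/2.1/um2netcdf4.py \
   /g/data/access/apps/pythonlib/um2netcdf4/2.1/stashvar_cmip6.py \
   app/netcdf_conversion/file/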

If it still uses conda/analysis3-26.01, then I’ll be very confused and will have to call for some backup.

I’ll check now.

This should now be fixed on the default configuration, with a reversion of the um2netcdf4/xp65 modulefile.

Thank you. This issue is fixed for me now.