Uncollated ACCESS-CM2 output files

Hi earth system users. I’ve run a year of the u-cy339 suite in ACCESS-CM2 following these instructions, and I think the job completed successfully (this is the job.out file):

However, the output files seem to be uncollated in the work directory (e.g. the ocean files are here: /scratch/jk72/hd4873/cylc-run/u-cy339/work/09500101/coupled/OCN_RUNDIR/HISTORY), and I have no archive directory on scratch, which the instructions suggest I should have.

Should the output files automatically be collated and moved into an archive directory, or do I need to create the archive directory myself and run a post-processing script for this step?

Hi @hrsdawson,

If you followed the linked instructions (i.e. you did not change any archive paths), it is strange that you don’t have the archive directory on scratch.

I don’t have access to jk72 yet (I requested it) so I cannot check it myself, but I suspect some tasks did not complete successfully.

Can you please check the coupled/NN/job.err log file to see if there were any errors in the coupled task?
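
For example, the command would be something like this (the exact path is an assumption based on the standard cylc-run log layout, with NN being a link to the latest submission):

less /scratch/jk72/hd4873/cylc-run/u-cy339/log/job/09500101/coupled/NN/job.err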

If that shows no error, I might have to wait for the jk72 approval to take a closer look at what happened, because I need to check whether any tasks that ran after coupled failed (for example, I can see a filemove task in u-cy339 that might be the one responsible for moving output from the work directory to the archive, but I need to check the files to be sure).

Thank you.

Davide

Thanks @atteggiani for following this up. Yep, I followed the instructions quite closely. The only thing I changed was the cycling frequency and run length (to 1 year each, if I recall correctly).

I can see several warnings in the job.err file but no errors.

It seems the ocean_ke_check task failed (see the /scratch/jk72/hd4873/cylc-run/u-cy339/log/suite/log file, lines 70-77):

2024-08-26T09:06:05Z CRITICAL - [ocean_ke_check.09500101] status=running: (received)failed/EXIT at 2024-08-26T09:06:03Z for job(01)
2024-08-26T09:06:05Z CRITICAL - [ocean_ke_check.09500101] -job(01) failed
2024-08-26T09:06:05Z WARNING - suite stalled
2024-08-26T09:06:05Z WARNING - Unmet prerequisites for filemove.09500101:
2024-08-26T09:06:05Z WARNING -  * ocean_ke_check.09500101 succeeded
2024-08-26T09:06:05Z WARNING - Unmet prerequisites for housekeep.09500101:
2024-08-26T09:06:05Z WARNING -  * history_postprocess.09500101 succeeded
2024-08-26T09:06:05Z WARNING -  * netcdf_conversion.09500101 succeeded

To find the nature of the problem you’ll have to check the /scratch/jk72/hd4873/cylc-run/u-cy339/log/job/09500101/ocean_ke_check/NN/job.err file:

Traceback (most recent call last):
  File "/home/561/hd4873/cylc-run/u-cy339/share/access-cm2-drivers/src/ocean_ke_check.py", line 5, in <module>
    import sys, netCDF4, argparse
ModuleNotFoundError: No module named 'netCDF4'
2024-08-26T09:06:03Z CRITICAL - failed/EXIT

So it seems that the ocean_ke_check job failed because it cannot find the netCDF4 module in the environment.

This is weird, because the runtime suiterc file (/scratch/jk72/hd4873/cylc-run/u-cy339/log/suiterc/20240826T043259Z-run.rc) at lines 1287-1291 loads the hh5 conda/analysis3 module (which has netCDF4 as a dependency) as a pre-script to the ocean_ke_check task:

pre-script = """
    module use /g/data/hh5/public/modules
    module unload python
    module load conda/analysis3
"""

So I am not sure why it fails to find the netCDF4 module.
Might need @MartinDix’s help here, sorry.
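
In the meantime, a quick sanity check would be to load the same module from a login node and try the import directly (a sketch reusing the pre-script’s own commands):

module use /g/data/hh5/public/modules
module unload python
module load conda/analysis3
python3 -c "import netCDF4; print(netCDF4.__version__)"

If that prints a version number, the environment itself is fine and the problem is in how the task picks it up.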

Thanks for looking into this @atteggiani! Really appreciate it. Hopefully @MartinDix has some further insights (no problem if it’s after the NRI workshop).

Hi @atteggiani

What does the batch system line mean? For [[ocean_ke_check]] it is:

batch system = background

but for other jobs it is batch system = pbs.

I am wondering if background tasks run directly in the persistent session, and maybe the persistent session doesn’t have access to hh5?

Thanks

Hi All,

The batch system setting determines how a task’s job is submitted by Cylc.

  • background means the job runs locally, as a background process in the same shell session as Cylc
  • pbs means the job is run through a PBS job, with the directives specified in the [[[directives]]] section.

More info here.
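
For illustration, the relevant suite.rc settings look roughly like this (a sketch, not copied from u-cy339; the queue directive is hypothetical):

[runtime]
    [[ocean_ke_check]]
        [[[job]]]
            batch system = background    # runs on the suite host, outside any scheduler
    [[coupled]]
        [[[job]]]
            batch system = pbs           # submitted to the PBS scheduler
        [[[directives]]]
            -q = normal                  # hypothetical PBS queue directive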

Indeed, it runs exactly in the persistent session. However, the persistent session should have access to any project that the user who starts it has access to (if I remember correctly).
Might be worth double-checking though; I’ll run some tests.

Cheers
Davide

I just tried ssh-ing to the persistent session and I can see all projects, as if I were on a Gadi login node.
So I don’t think that’s the cause of this issue.
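
For reference, the ssh command is of the form below, with the placeholders filled in for your own session:

ssh <persistent-session-name>.$USER.$PROJECT.ps.gadi.nci.org.au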

Since I cannot see anything wrong here, I’m going to try and run the configuration myself, and see if I get the error too.

@hrsdawson, I am going to run the configuration as you did.

This means I will:

  1. checkout u-cy339
  2. change cycling frequency and run length to 1 year
  3. run the configuration

If you recall doing anything different, please let me know.

Cheers
Davide

Thanks @atteggiani for trying this. I don’t recall making any other changes than those you’ve listed.

Hi @hrsdawson,

My run completed without issues.
The ocean_ke_check task completes fine.

Can you please try running the suite again and see if the problem persists?

To avoid weird issues please do the following:

  1. Clean suite data
  2. Delete the suite
  3. Checkout a copy of the suite
  4. Change run length to 1 month
  5. Run the suite

All of the above can be executed with the following commands:

cd ~/roses/u-cy339
rose suite-clean -y                                                   # 1. clean suite data
cd ~/roses && rm -rf u-cy339                                          # 2. delete the suite copy
rosie checkout u-cy339                                                # 3. check out a fresh copy
sed -i "s|\(^RUNLEN=\).*|\1'P1M'|" ~/roses/u-cy339/rose-suite.conf    # 4. set run length to 1 month
rose suite-run -C ~/roses/u-cy339                                     # 5. run the suite
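
To confirm step 4 took effect before launching the run, a quick check is:

grep '^RUNLEN=' ~/roses/u-cy339/rose-suite.conf
# expected output: RUNLEN='P1M'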

Thank you.

Cheers
Davide

Okay, I’ll try this. Do I need to run the above from a persistent session? If yes, the same one I created before?

In your failed attempt, did you run your suite from inside a persistent-session?

In general there should never be a need to ssh to a persistent session (very few exceptions apply).

All the commands above can be run from a Gadi login node (or ARE).

The only important things are:

  1. before loading the cylc7 module (not included in the commands above, but needed as a prerequisite to run rose/cylc suites), the file ~/.persistent-sessions/cylc-session must exist and contain the following line:
    <persistent-session-name>.$USER.$PROJECT.ps.gadi.nci.org.au
    
  2. Before running the suite, a persistent session named <persistent-session-name> must be running.
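
As a concrete (entirely hypothetical) example, for a session named cm2-session started by user abc123 under project jk72, the file would contain:

cm2-session.abc123.jk72.ps.gadi.nci.org.au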

If you have already set up your persistent session, there shouldn’t be any need to change anything.
Just check whether the persistent session is still running: persistent-session list.
If it is not running, start it with persistent-session start <persistent-session-name>.

Cheers
Davide

Okay, it’s running now. No, I don’t think I ran the previous one from inside a persistent session. Hopefully it was just some strange one-off last time. I’ll post once it’s finished.

@atteggiani the above code runs just 1 month of the model, right? For comparison, how long did it take to run for you?

If you set RUNLEN='P1M', yes.

I ran a full year.
I am not entirely sure how long it took, because the run stopped due to a server error and I had to restart it from where it stopped, but in total it was something like 12 h.

A 1-month run should complete in ~2 h, I’d guess.

It finished, but unfortunately the post-processing failed again with the same error.

The default conda/analysis3 module is loaded each time I log onto Gadi (through my .bashrc file). Could this be interfering somehow?
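
For reference, the lines in question are along these lines (a sketch of the .bashrc setup described, reusing the module commands from the pre-script; not the exact file contents):

# In ~/.bashrc (sketch)
module use /g/data/hh5/public/modules
module load conda/analysis3

Depending on how a job shell is started, lines like these can be active in non-interactive shells too, which is presumably why they could in principle interact with a task’s pre-script.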