ACCESS-AM3 1.0 Alpha Feedback

This topic is a catch-all location for feedback for the ACCESS-AM3 1.0 Alpha Release.

Please reply to this topic if you have feedback on the ACCESS-AM3 Alpha. We are primarily looking for feedback on the usability of the build system and configuration, and the documentation. We are also happy to receive science-related feedback, which we will address later in the release process. Feedback can point out problems encountered, or highlight what worked well.

If your feedback is involved, please make an issue on the configuration repository (see this post if you do not yet have access to this repository).

If you’re not sure, reply here and your query can be moved to a GitHub issue if required.

I was able to run the model following the instructions with minor changes (see below) :tada:

A few comments:

README in dev-n96e:

Also, following the instructions to change projects in rose-suite.conf_nci_gadi didn’t work. The suite fails with:

[FAIL] [Errno 13] Permission denied: '/scratch/tm70'

I also had to change the project in rose-suite.conf.

2 Likes

Thanks for the work again. I could run the original suite and one with modifications. I spot a few issues.

  1. The jobs atmos_main and netcdf_conversion fail irregularly, once every few years of simulation. For example, see /home/563/qg8515/scratch/cylc-run/access-am3-configs/log/job/19911101T0000Z/atmos_main/01 and /home/563/qg8515/scratch/cylc-run/am3-plus4k/log/job/19860601T0000Z/netcdf_conversion/01. The log files do not provide much information. I am not sure whether it is a gadi problem or whether I output too many variables. In any case, the job succeeds after being rerun.

  2. Unfortunately, the failed jobs do not resubmit themselves, so I have to babysit them and trigger a rerun after they fail. They also do not send an email notification about the failure. I thought I could set execution retry delays in the file /home/563/qg8515/roses/access-am3-configs/site/nci_gadi.rc as 10*PT1M, but it does not work.

  3. I tried to set EXPT_AEROSOLS='aeroclim' in rose-suite.conf to run with climatological aerosols. It again failed without much information (just a segmentation fault). @clairecarouge already helped to look into it, but it is still unresolved. The suite is here: /home/563/qg8515/roses/am3-climaerosol, and the log is here: /home/563/qg8515/scratch/cylc-run/am3-climaerosol/log/job/19820101T0000Z/atmos_main/01. I am looking into it. If you have any ideas, I’m happy to implement them.
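For reference, this is the kind of setting I tried for issue 2 (a sketch in Cylc 7 suite.rc syntax; the task name and exact section layout in nci_gadi.rc are assumptions and may differ):

```
[runtime]
    [[atmos_main]]
        [[[job]]]
            # retry up to 10 times, one minute apart
            execution retry delays = 10*PT1M
        [[[events]]]
            # email on failure and on each retry attempt
            mail events = failed, retry
```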

The error associated with netcdf_conversion is:

FATAL:   container creation failed: mount /proc/self/fd/10->/opt/nci/singularity/3.11.3/var/singularity/mnt/session/overlay-images/0 error: while mounting image /proc/self/fd/10: failed to find loop device: could not attach image file to loop device: failed to attach loop device: transient error, please retry: resource temporarily unavailable

It happened to me a few times today. I don’t think there is a problem with the suite; it can be annoying, but if you trigger the job again it should work.

I may be wrong, but I think this applies when the job fails to be submitted to the queue, not when the job itself fails.

I agree with Pao for issues 1 and 2:

The container creation failed error is something that I have experienced with all UM suites I’ve run on gadi. It seems to be a persistent transient (and annoying! Scott can confirm) error, but not an issue with the individual suite itself.

If you manually re-trigger the job it should usually then run to success. (In my experience this can even sometimes take a few tries).

cylc mon is a way to monitor and trigger tasks in a terminal (if you have shut down your GUI or don’t want to view the suite in the GUI).
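If it helps, the basic commands look like this (Cylc 7 syntax; the suite and task names below are placeholders, replace them with your own):

```shell
# watch the suite's task states in the terminal
cylc mon access-am3-configs

# manually re-trigger a failed task at a given cycle point
cylc trigger access-am3-configs atmos_main.19911101T0000Z
```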

Containers like the xp65 environments use ‘loop devices’ to load. Depending on which node you get put on, there can be a limited number of these loop devices available: some nodes have a couple of hundred, some have only 12.

[saw562@gadi-cpu-clx-2854 tmp]$ ls /dev/loop*
/dev/loop-control  /dev/loop0  /dev/loop1  /dev/loop10  /dev/loop11  /dev/loop2  /dev/loop3  /dev/loop4  /dev/loop5  /dev/loop6  /dev/loop7  /dev/loop8  /dev/loop9

If you get put on a node with only a few loop devices available then there may not be a slot available for you and mounting the container will fail.

You can see how the loop devices are being used with

$ losetup --list
NAME         SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                                                                 DIO LOG-SEC
/dev/loop1 11380555776  36864         1  1 /g/data/dk92/apps/pet/2025.08/libexec/pet:2025.08.sif                       0     512
/dev/loop4  9030176768  40960         1  1 /g/data/dk92/apps/NCI-data-analysis/2024.05/libexec/nci-data-analysis.sif   0     512
/dev/loop2 13988851712      0         1  1 /g/data/xp65/public/apps/med_conda/envs/analysis3-25.10.sqsh                0     512
/dev/loop0        8192  40960         1  1 /g/data/xp65/public/apps/med_conda/etc/base.sif                             0     512
/dev/loop3 14036488192      0         1  1 /g/data/xp65/public/apps/med_conda/envs/analysis3-25.11.sqsh                0     512
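Putting the two together, a quick sketch to gauge the headroom on the current node (uses only standard util-linux/coreutils tools; no special privileges are needed just to list):

```shell
# count the loop device nodes present on this node
total=$(ls /dev/loop[0-9]* 2>/dev/null | wc -l)
# count how many are currently attached to an image
used=$(losetup --list --noheadings 2>/dev/null | wc -l)
echo "loop devices: ${used} in use of ${total}"
```

If `used` is close to `total`, a container mount is likely to hit the "failed to find loop device" error above.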
4 Likes

Thank you all. execution retry delays works as expected for rAM3, just not yet for AM3.

Talking to the NCI folks, the loop devices should get created automatically by the container runtime, so it shouldn’t matter how many are listed before you load xp65. Something to try is to increase the number of CPUs requested so there are fewer jobs on a single node.

1 Like

Interesting, thanks Scott.

Hi @MartinDix Hope you had a nice start to the new year. I think you could be of great help to us in configuring climatological aerosols for AM3, so I tag you here :wink: (sorry if you are already busy with all your other duties). Claire pointed me to your post Run with aerosol climatologies, and I assume you managed to run AM3-N96 with climatological aerosols with some changes in aeroclim-new-ancils.

May I ask, would you suggest modifying the alpha release in the same way as you did in the branch aeroclim-new-ancils to run with climatological aerosols? Or is there a simpler way, with the minimum changes necessary to make it run? Under NEW_ANCIL_DIR, there are only two folders (n216e and n96e), so would it be more complicated for the high-res n512e?

Another question for NRI: as I mentioned to @lachlanswhyborn long ago, the current ancillaries normally extend only up to 2014. Could we get NRI support to extend the simulation to 2024/25 for a better comparison with recent observations (e.g. Himawari)?

(It seems I am keeping everyone busy during the holiday season, sorry for that ;)

@qinggangg We are planning for a beta release of ACCESS-AM3 in the next 6 months, maybe earlier. There is a lot to do before the beta release, so the release date is still very vague.

We will try to implement as much of the feedback from the alpha-release as possible into the next release. At this point, it is hard to tell when any part of the work will be done. With the holidays, we have yet to meet and decide on prioritisation of the tasks for the beta release.

This means I don’t know when we will have the time to work on any of this, but we will keep you updated on timelines when they become clearer for us. Keep asking questions as we might be able to provide temporary solutions.

1 Like

Thank you @clairecarouge Sure, that makes sense, I fully understand. I will keep posting issues as I find them, so they may be resolved in the beta release, or I may receive some suggestions. Feel free to decide your priorities; I will also try to find workarounds myself.

8 posts were split to a new topic: Running AM3 with COSP2

Hi, just a side question: is there a plan for ACCESS-NRI to dedicate some effort to finding the most cost-efficient setup for running AM3 N96? Including the CPU decomposition, IO server, and model executable compilation options, like what @Paul.Gregory did for rAM3?

Hi @qinggangg,

I’ve done this for AM3 n512, and I can help with activating the IO server for n96. Finding the most cost-efficient decomposition requires a lot of trial and error; I think it is safe to keep using the default configuration (already tested by many) unless you see the model running very inefficiently.

Hi @paocorrales thank you. That sounds good. I will leave the decomposition as default then.

How did you activate the IO server? Does it help to speed up the model output?

When you activate the IO server you allocate processors to the task of reading/writing files. This way the model writes the output files while the integration continues (instead of having to stop the integration to write the output). You will see a reduction in the wall time needed.

The main things you need to do are to use 2 or more OpenMP threads and define ios_nproc:

  • rose-app.conf
+MAIN_IOS_NPROC=32    # number of cores for IO
+MAIN_OMPTHR_ATM=2 # two OpenMP threads
  • app/um/rose-app.conf
[namelist:io_control]
io_alltoall_readflds=.true.
io_external_control=.false.
io_filesystem_profile=0
+io_timing=1
l_postp=.true.
print_memory_info=.false.
+print_runtime_info=.true.  # Only for diagnostics

[namelist:ioscntl]
ios_acquire_model_prsts=.true.
ios_as_concurrency=20
ios_async_levs_per_pack=76
ios_backoff_interval=1000
ios_buffer_size=1500
ios_concurrency=40
ios_concurrency_max_mem=40
ios_decomp_model=0
ios_enable_mpiio=.false.
+ios_interleave=.false.
ios_local_ro_files=.true.
ios_lock_meter=.false.
ios_no_barrier_fileops=.true.
ios_offset=0
ios_print_start_time=.false.
ios_relaytoslaves=.false.
ios_serialise_mpi_calls=.false.
+ios_spacing=24                 # ideally one per node
+ios_tasks_per_server=4         # MAIN_IOS_NPROC / ios_tasks_per_server = n -> write n files in parallel
ios_thread_0_calls_mpi=.false.
ios_timeout=120
ios_together_end=.false.
+ios_unit_alloc_policy=5
ios_use_async_dump=.false.
ios_use_async_stash=.false.
ios_use_helpers=.false.
+ios_verbosity=5

This is more or less what we are using for n512 but it’s going to change according to the final config.

As a first step, activating the IO server will save a good amount of walltime/SUs. Then we can refine the configuration (number of processors, tasks per server, etc.), which depends on the output.
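To make the ios_tasks_per_server comment above concrete, using the example values from the snippet (assuming MAIN_IOS_NPROC divides evenly by ios_tasks_per_server):

```shell
# example values from the n512 snippet above
MAIN_IOS_NPROC=32
ios_tasks_per_server=4
# number of output files the IO server can write in parallel
echo $(( MAIN_IOS_NPROC / ios_tasks_per_server ))   # prints 8
```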

1 Like

Thank you @paocorrales I’ll have a look.