ACCESS-AM3 1.0 Alpha Feedback

This topic is a catch-all location for feedback for the ACCESS-AM3 1.0 Alpha Release.

Please reply to this topic if you have feedback on the ACCESS-AM3 Alpha. We are primarily looking for feedback on the usability of the build system and configuration, and the documentation. We are also happy to receive science-related feedback, which we will address later in the release process. Feedback can point out problems encountered, or highlight what worked well.

If your feedback is involved, please make an issue on the configuration repository (see this post if you do not yet have access to this repository).

If you’re not sure, reply here and your query can be moved to a GitHub issue if required.

I was able to run the model following the instructions with minor changes (see below) :tada:

A few comments:

README in dev-n96e:

Also, following the instructions to change projects in rose-suite.conf_nci_gadi didn’t work. The suite fails with:

[FAIL] [Errno 13] Permission denied: '/scratch/tm70'

I also had to change the project in rose-suite.conf.

2 Likes

Thanks for the work again. I could run the original suite and one with modifications. I spot a few issues.

  1. The jobs atmos_main and netcdf_conversion fail irregularly, once every few years of simulation. For example, see /home/563/qg8515/scratch/cylc-run/access-am3-configs/log/job/19911101T0000Z/atmos_main/01 and /home/563/qg8515/scratch/cylc-run/am3-plus4k/log/job/19860601T0000Z/netcdf_conversion/01. The log files do not provide much information. I am not sure whether it is a gadi problem or whether I output too many variables. In any case, the job succeeds after being rerun.

  2. Unfortunately, the failed jobs do not resubmit themselves, so I have to babysit them and trigger a rerun after they fail. They also do not send an email notification about the failure. I thought I could set execution retry delays in the file /home/563/qg8515/roses/access-am3-configs/site/nci_gadi.rc as 10*PT1M, but it does not work.

  3. I tried to set EXPT_AEROSOLS='aeroclim' in rose-suite.conf to run with climatological aerosols. It again failed without much information (just a segmentation fault). @clairecarouge already helped to look into it, but it is still unresolved. The suite is here: /home/563/qg8515/roses/am3-climaerosol, and the log is here: /home/563/qg8515/scratch/cylc-run/am3-climaerosol/log/job/19820101T0000Z/atmos_main/01. I am looking into it. If you have any ideas, I’m happy to implement them.
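For reference, this is the kind of setting I tried for issue 2 (a sketch in Cylc 7 suite.rc syntax; the task name and exact section layout in nci_gadi.rc are assumptions and may differ):

```
[runtime]
    [[atmos_main]]
        [[[job]]]
            # retry up to 10 times, one minute apart
            execution retry delays = 10*PT1M
        [[[events]]]
            # email on failure and on each retry attempt
            mail events = failed, retry
```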

The error associated with netcdf_conversion is:

FATAL:   container creation failed: mount /proc/self/fd/10->/opt/nci/singularity/3.11.3/var/singularity/mnt/session/overlay-images/0 error: while mounting image /proc/self/fd/10: failed to find loop device: could not attach image file to loop device: failed to attach loop device: transient error, please retry: resource temporarily unavailable

It happened to me a few times today. I don’t think there is a problem with the suite; it can be annoying, but if you trigger the job again it should work.

I may be wrong, but I think this applies when the job fails to be submitted to the queue, not when the job itself fails.

I agree with Pao for issues 1 and 2:

The container creation failed error is something that I have experienced with all UM suites I’ve run on gadi. It seems to be a persistent transient (and annoying! Scott can confirm) error, but not an issue with the individual suite itself.

If you manually re-trigger the job it should usually then run to success. (In my experience this can even sometimes take a few tries).

cylc mon is a way to monitor and trigger tasks in a terminal (if you have shut down your GUI or don’t want to view the suite in the GUI).
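If it helps, the basic commands look like this (Cylc 7 syntax; the suite and task names below are placeholders, replace them with your own):

```shell
# watch the suite's task states in the terminal
cylc mon access-am3-configs

# manually re-trigger a failed task at a given cycle point
cylc trigger access-am3-configs atmos_main.19911101T0000Z
```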

Containers like the xp65 environments use ‘loop devices’ to load. Depending on which node you get put on, there can be a limited number of these loop devices available: some nodes have a couple of hundred, some have only 12.

[saw562@gadi-cpu-clx-2854 tmp]$ ls /dev/loop*
/dev/loop-control  /dev/loop0  /dev/loop1  /dev/loop10  /dev/loop11  /dev/loop2  /dev/loop3  /dev/loop4  /dev/loop5  /dev/loop6  /dev/loop7  /dev/loop8  /dev/loop9

If you get put on a node with only a few loop devices available then there may not be a slot available for you and mounting the container will fail.

You can see how the loop devices are being used with

$ losetup --list
NAME         SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                                                                 DIO LOG-SEC
/dev/loop1 11380555776  36864         1  1 /g/data/dk92/apps/pet/2025.08/libexec/pet:2025.08.sif                       0     512
/dev/loop4  9030176768  40960         1  1 /g/data/dk92/apps/NCI-data-analysis/2024.05/libexec/nci-data-analysis.sif   0     512
/dev/loop2 13988851712      0         1  1 /g/data/xp65/public/apps/med_conda/envs/analysis3-25.10.sqsh                0     512
/dev/loop0        8192  40960         1  1 /g/data/xp65/public/apps/med_conda/etc/base.sif                             0     512
/dev/loop3 14036488192      0         1  1 /g/data/xp65/public/apps/med_conda/envs/analysis3-25.11.sqsh                0     512
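Putting the two together, a quick sketch to gauge the headroom on the current node (uses only standard util-linux/coreutils tools; no special privileges are needed just to list):

```shell
# count the loop device nodes present on this node
total=$(ls /dev/loop[0-9]* 2>/dev/null | wc -l)
# count how many are currently attached to an image
used=$(losetup --list --noheadings 2>/dev/null | wc -l)
echo "loop devices: ${used} in use of ${total}"
```

If `used` is close to `total`, a container mount is likely to hit the "failed to find loop device" error above.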
4 Likes

Thank you all. execution retry delays works as expected for rAM3, just not yet for AM3.

Talking to the NCI folks, the loop devices should get created automatically by the container runtime, so it shouldn’t matter how many are listed before you load xp65. Something to try is to increase the number of CPUs requested so there are fewer jobs on a single node.

1 Like

Interesting, thanks Scott.

Hi @MartinDix Hope you had a nice start to the new year. I think you could be of great help to us in configuring climatological aerosols for AM3, so I tag you here :wink: (sorry if you are already busy with all your other duties). Claire pointed me to your post Run with aerosol climatologies, and I assume you managed to run AM3-N96 with climatological aerosols with some changes in aeroclim-new-ancils.

May I ask, would you suggest modifying the alpha release in the same way as you did in the branch aeroclim-new-ancils to run with climatological aerosols? Or is there a simpler way, with the minimum changes necessary to make it run? Under NEW_ANCIL_DIR, there are only two folders (n216e and n96e), so would it be more complicated for the high-res n512e?

Another question for NRI: as I mentioned to @lachlanswhyborn long ago, the current ancillaries normally extend only up to 2014. Could we get NRI support to extend the simulation to 2024/25 for a better comparison with recent observations (e.g. Himawari)?

(It seems I am keeping everyone busy during the holiday season, sorry for that ;)

@qinggangg We are planning for a beta release of ACCESS-AM3 in the next 6 months, maybe earlier. There is a lot to do before the beta release, so the release date is still very vague.

We will try to implement as much of the feedback from the alpha-release as possible into the next release. At this point, it is hard to tell when any part of the work will be done. With the holidays, we have yet to meet and decide on prioritisation of the tasks for the beta release.

This means I don’t know when we will have the time to work on any of this, but we will keep you updated on timelines when they become clearer for us. Keep asking questions as we might be able to provide temporary solutions.

1 Like

Thank you @clairecarouge Sure, that makes sense, I fully understand. I will keep posting issues as I find them, so they may be resolved in the beta release, or I may receive some suggestions. Feel free to decide your priorities; I will also try to find workarounds myself.

8 posts were split to a new topic: Running AM3 with COSP2

Hi, just a side question: is there a plan for ACCESS-NRI to dedicate some effort to finding the most cost-efficient setup for running AM3 N96? Including the CPU decomposition, IO server, and model executable compilation options, like what @Paul.Gregory did for rAM3?

Hi @qinggangg,

I’ve done this for AM3 n512, and I can help with activating the IO server for n96. Finding the most cost-efficient decomposition requires a lot of trial and error; I think it is safe to keep using the default configuration (already tested by many) unless you see the model running very inefficiently.

Hi @paocorrales thank you. That sounds good. I will leave the decomposition as default then.

How did you activate the IO server? Does it help to speed up the model output?

When you activate the IO server you allocate processors to the task of reading/writing files. This way the model writes the output files while the integration continues (instead of having to stop the integration to write the output). You will see a reduction in the wall time needed.

The main things you need to do are to use 2 or more OpenMP threads and define ios_nproc:

  • rose-app.conf
+MAIN_IOS_NPROC=32    # number of cores for IO
+MAIN_OMPTHR_ATM=2 # two OpenMP threads
  • app/um/rose-app.conf
[namelist:io_control]
io_alltoall_readflds=.true.
io_external_control=.false.
io_filesystem_profile=0
+io_timing=1
l_postp=.true.
print_memory_info=.false.
+print_runtime_info=.true.  # Only for diagnostics

[namelist:ioscntl]
ios_acquire_model_prsts=.true.
ios_as_concurrency=20
ios_async_levs_per_pack=76
ios_backoff_interval=1000
ios_buffer_size=1500
ios_concurrency=40
ios_concurrency_max_mem=40
ios_decomp_model=0
ios_enable_mpiio=.false.
+ios_interleave=.false.
ios_local_ro_files=.true.
ios_lock_meter=.false.
ios_no_barrier_fileops=.true.
ios_offset=0
ios_print_start_time=.false.
ios_relaytoslaves=.false.
ios_serialise_mpi_calls=.false.
+ios_spacing=24                 # ideally one per node
+ios_tasks_per_server=4         # MAIN_IOS_NPROC / ios_tasks_per_server = n -> write n files in parallel
ios_thread_0_calls_mpi=.false.
ios_timeout=120
ios_together_end=.false.
+ios_unit_alloc_policy=5
ios_use_async_dump=.false.
ios_use_async_stash=.false.
ios_use_helpers=.false.
+ios_verbosity=5

This is more or less what we are using for n512 but it’s going to change according to the final config.

As a first step, activating the IO server will save a good amount of walltime/SUs. Then we can refine the configuration (number of processors, tasks per server, etc.), which depends on the output.
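To make the ios_tasks_per_server comment above concrete, using the example values from the snippet (assuming MAIN_IOS_NPROC divides evenly by ios_tasks_per_server):

```shell
# example values from the n512 snippet above
MAIN_IOS_NPROC=32
ios_tasks_per_server=4
# number of output files the IO server can write in parallel
echo $(( MAIN_IOS_NPROC / ios_tasks_per_server ))   # prints 8
```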

1 Like

Thank you @paocorrales I’ll have a look.