Please reply to this topic if you have feedback on the ACCESS-AM3 Alpha. We are primarily looking for feedback on the usability of the build system, configuration, and documentation. We are also happy to receive science-related feedback, which we will address later in the release process. Feedback can point out problems you encountered or highlight what worked well.
If your feedback is substantial, please open an issue on the configuration repository (see this post if you do not yet have access to this repository).
If you're not sure, reply here and your query can be moved to a GitHub issue if required.
Thanks again for the work. I was able to run the original suite and one with modifications. I spotted a few issues.
The jobs atmos_main and netcdf_conversion fail irregularly, roughly once every few simulated years. For example, see /home/563/qg8515/scratch/cylc-run/access-am3-configs/log/job/19911101T0000Z/atmos_main/01 and /home/563/qg8515/scratch/cylc-run/am3-plus4k/log/job/19860601T0000Z/netcdf_conversion/01. The log files do not provide much information. I am not sure whether it is a Gadi problem or whether I am outputting too many variables. In any case, the jobs succeed after being rerun.
Unfortunately, the failed jobs do not resubmit themselves, so I have to babysit them and trigger a rerun after they fail. They also do not send an email notification about the failure. I thought I could set execution retry delays to 10*PT1M in /home/563/qg8515/roses/access-am3-configs/site/nci_gadi.rc, but it does not work.
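For what it's worth, in Cylc 7 suites the retry delays have to sit under the task's `[[[job]]]` section, and email notification is controlled separately under `[[[events]]]`. A sketch only, assuming the suite runs on Cylc 7 and the failing task is named atmos_main (the family/task names in nci_gadi.rc may differ):

```ini
# Sketch, Cylc 7 syntax. Exact task/family names depend on the suite.
[runtime]
    [[atmos_main]]
        [[[job]]]
            # retries after an execution failure
            execution retry delays = 10*PT1M
            # retries when PBS submission itself fails (a separate setting)
            submission retry delays = 3*PT5M
        [[[events]]]
            mail events = failed, retry
```

If the setting is placed at the wrong nesting level, or overridden later in the included site file, Cylc silently ignores it, which might explain why it had no effect.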
I tried to set EXPT_AEROSOLS='aeroclim' in rose-suite.conf to run with climatological aerosols. It again failed without much information (just a segmentation fault). @clairecarouge already helped to look into it, but it is still unresolved. The suite is here: /home/563/qg8515/roses/am3-climaerosol, and the log is here: /home/563/qg8515/scratch/cylc-run/am3-climaerosol/log/job/19820101T0000Z/atmos_main/01. I am looking into it. If you have any ideas, I'm happy to implement them.
FATAL: container creation failed: mount /proc/self/fd/10->/opt/nci/singularity/3.11.3/var/singularity/mnt/session/overlay-images/0 error: while mounting image /proc/self/fd/10: failed to find loop device: could not attach image file to loop device: failed to attach loop device: transient error, please retry: resource temporarily unavailable
It has happened to me a few times today. I don't think there is a problem with the suite; it can be annoying, but if you trigger the job again it should work.
I may be wrong, but I think that setting applies when the job fails to be submitted to the queue, not when the process itself fails.
The container creation failed error is something I have experienced with all UM suites I've run on Gadi. It seems to be a persistently transient (and annoying! Scott can confirm) error, but not an issue with the individual suite itself.
If you manually re-trigger the job it should usually then run to success. (In my experience this can even sometimes take a few tries).
`cylc mon` is a way to monitor and trigger tasks from a terminal (if you have shut down your GUI or don't want to view the suite in the GUI).
Containers like the xp65 environments use "loop devices" to load. Depending on which node you land on, there can be a limited number of these loop devices available: some nodes have a couple of hundred, some have only 12.
Talking to the NCI folks, the loop devices should get automatically created by the container, so it shouldn't matter how many are listed before you load xp65. Something to try is to increase the number of CPUs requested, so there are fewer jobs on a single node.
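If you want to check how constrained a particular node is, a quick sketch for counting the loop devices currently visible (assuming a standard Linux /dev layout; the count may grow as containers attach new devices):

```shell
# Count loop devices visible on the current node.
# Prints 0 if none exist yet (they may still be created on demand).
ls /dev/loop[0-9]* 2>/dev/null | wc -l
```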
Hi @MartinDix, I hope you had a nice start to the new year. I think you could be of great help to us in configuring climatological aerosols for AM3, so I am tagging you here (sorry if you are already busy with other duties). Claire pointed me to your post Run with aerosol climatologies, and I assume you managed to run AM3-N96 with climatological aerosols with some changes in aeroclim-new-ancils.
May I ask, would you suggest modifying the alpha release in the same way as you did in the aeroclim-new-ancils branch to run with climatological aerosols? Or is there a simpler way, with the minimum changes necessary to make it run? Under NEW_ANCIL_DIR there are only two folders (n216e and n96e), so would it be more complicated for the high-resolution n512e?
Another question for NRI: as I mentioned to @lachlanswhyborn long ago, the current ancillaries normally extend up to 2014. Could we get NRI support to extend the simulations to 2024/25 for a better comparison with recent observations (e.g. Himawari)?
(It seems I am keeping everyone busy during the holiday season, sorry for that ;)
clairecarouge
(Claire Carouge, ACCESS-NRI Land Modelling Team Lead)
@qinggangg We are planning for a beta release of ACCESS-AM3 in the next 6 months, maybe earlier. There is a lot to do before the beta release, so the release date is still very vague.
We will try to implement as much of the feedback from the alpha-release as possible into the next release. At this point, it is hard to tell when any part of the work will be done. With the holidays, we have yet to meet and decide on prioritisation of the tasks for the beta release.
This means I don't know when we will have time to work on any of this, but we will keep you updated on timelines when they become clearer for us. Keep asking questions, as we might be able to provide temporary solutions.
Thank you @clairecarouge. Sure, that makes sense, and I fully understand. I will keep posting issues as I find them, so they may be resolved in the beta release, or I may receive some suggestions. Feel free to decide your priorities; I will also try to find workarounds myself.
Hi, just a side question: is there a plan for ACCESS-NRI to dedicate some effort to finding the most cost-efficient setup for running AM3 N96, including the CPU decomposition, IO server, and model executable compilation options, like what @Paul.Gregory did for rAM3?
I've done this for AM3 n512, and I can help activate the IO server for n96. Finding the most cost-efficient decomposition requires a lot of trial and error; I think it is safe to keep using the default configuration (already tested by many) unless you see the model running very inefficiently.
When you activate the IO server, you allocate processors to the task of reading/writing files. This way the model writes the output files while the integration continues (instead of stopping the integration to write the output). You will see a reduction in the wall time needed.
The main things you need to do are to use two or more OpenMP threads and to define the number of IO server processors:
rose-app.conf
+MAIN_IOS_NPROC=32 # number of cores for IO
+MAIN_OMPTHR_ATM=2 # two OpenMP threads
app/um/rose-app.conf
[namelist:io_control]
io_alltoall_readflds=.true.
io_external_control=.false.
io_filesystem_profile=0
+io_timing=1
l_postp=.true.
print_memory_info=.false.
+print_runtime_info=.true. # Only for diagnostics
[namelist:ioscntl]
ios_acquire_model_prsts=.true.
ios_as_concurrency=20
ios_async_levs_per_pack=76
ios_backoff_interval=1000
ios_buffer_size=1500
ios_concurrency=40
ios_concurrency_max_mem=40
ios_decomp_model=0
ios_enable_mpiio=.false.
+ios_interleave=.false.
ios_local_ro_files=.true.
ios_lock_meter=.false.
ios_no_barrier_fileops=.true.
ios_offset=0
ios_print_start_time=.false.
ios_relaytoslaves=.false.
ios_serialise_mpi_calls=.false.
+ios_spacing=24 # ideally one IO server per node
+ios_tasks_per_server=4 # MAIN_IOS_NPROC/ios_tasks_per_server = n --> write n files in parallel
ios_thread_0_calls_mpi=.false.
ios_timeout=120
ios_together_end=.false.
+ios_unit_alloc_policy=5
ios_use_async_dump=.false.
ios_use_async_stash=.false.
ios_use_helpers=.false.
+ios_verbosity=5
This is more or less what we are using for n512, but it's going to change according to the final config.
As a first step, activating the IO server will save a good amount of walltime/SUs. Then we can refine the configuration (number of processors, tasks per server, etc.), which depends on the output.
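To make the arithmetic behind the ios_tasks_per_server comment concrete (using the example values above; your final numbers will differ):

```shell
# With 32 IO processors grouped into servers of 4 tasks each,
# the suite can write 32/4 = 8 output files in parallel.
MAIN_IOS_NPROC=32
ios_tasks_per_server=4
echo $((MAIN_IOS_NPROC / ios_tasks_per_server))  # prints 8
```

So increasing MAIN_IOS_NPROC (or decreasing ios_tasks_per_server) raises the number of concurrent write streams, at the cost of taking processors away from the atmosphere integration.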