@Aidan /g/data/p66/jxs599/ESM16/PAYU/Dev/AprilSpinUp*
@manodeep - my understanding was that pr42 was merged back into ‘preindustrial+concentration’ and so is already in ‘20250409-spinup-dev-preindustrial+concentrations’
Indeed, (by default) the relevant config.yaml has ‘queue: normalsr’, which I can only assume makes its way to the PBS directive, although there could be other stuff I guess?
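For reference, a minimal sketch of how that setting appears in a payu config.yaml (values illustrative, not the full config; payu passes the queue through to the PBS directive):

```yaml
# Illustrative payu config.yaml excerpt (not the complete file)
queue: normalsr      # payu passes this through as the PBS -q directive
walltime: 3:00:00    # placeholder value, not the actual run setting
```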
Hi @Aidan. For ESM1.5 you can look at /g/data/p66/txz599/ACCESS-ESM1p5/exp/ESM-ZEC-dn2p0, although this is still using the script-based approach (the same way we ran for CMIP6).
Thanks @spencerwong and @manodeep. I think our priority is on decreasing the walltime, given the length of spinup and control runs.
Thanks @tiloz and @Jhan for sharing those details. Some timing numbers for the different post-processing strategies are available here. These tests were done before the switch to Sapphire Rapids, though the main results were:
- Using an `io_layout` of 1,1 and removing the collation step (the current strategy) added ~5-6 minutes per run compared to the original strategy (which regularly led to collation failures), but guarantees that collation failures cannot happen (see the namelist sketch after this list).
- Using different `io_layout` settings could reduce the walltime by ~2-5 minutes, but would still require collation to be active. We think these settings could reduce the risk of failure compared to the original settings, but weren't able to guarantee that failures wouldn't happen.
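For context, a hedged sketch of where `io_layout` sits in the ocean input.nml (namelist group and values are illustrative, not the settings from these runs):

```fortran
! Illustrative input.nml excerpt (MOM ocean model); values are examples only.
&ocean_model_nml
    layout    = 18, 10   ! processor decomposition for the compute domain
    io_layout = 1, 1     ! 1,1 writes a single combined file per output, so
                         ! no post-run collation step is required
/
```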
Something else that came up in our tests and is worth noting when making comparisons: the ESM1.6 code is slower than ESM1.5. Earlier tests on the CL cores found a ~14 minute slowdown per year in ESM1.6 compared to 1.5, which appeared to come partly from the increased number of ocean tracers.
Kinda expected, at least from my perspective.
@Jhan If you want to run on the cascadelake queue, then you have to make the following changes to the config file (essentially reverting the changes needed to go from cascadelake to sapphirerapids); a sketch of the config.yaml portion follows the list:
- change the queue name to `normal` (rather than `normalsr`)
- remove the three lines beginning with, and including, the `platform` line
- change the UM `ncpus` to 192 with a `16x12` layout (change `atmosphere->ncpus` in `config.yaml`, and change `UM_ATM_NPROCY` and `UM_NPES` in `atmosphere/um_env.yaml`)
- change the ocean cores to 180 with an `18x10` layout (change `ocean->ncpus` in `config.yaml`, and change the ocean layout in `input.nml`)
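A hedged sketch of the config.yaml side of that reversion (illustrative only; the contents of the removed `platform` block aren't reproduced here, and the `UM_*` and ocean-layout edits go in `atmosphere/um_env.yaml` and `input.nml`):

```yaml
# Illustrative config.yaml excerpt for the cascadelake reversion
queue: normal            # was: normalsr
# platform:              # remove the three platform lines entirely
#   ...                  # (sapphirerapids-specific settings)
submodels:
  - name: atmosphere
    ncpus: 192           # paired with a 16x12 UM layout in um_env.yaml
  - name: ocean
    ncpus: 180           # paired with an 18x10 layout in input.nml
```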
@spencerwong Are there any other manifest/checksum-type changes to run the current config on cascadelake?
There shouldn’t be other manifest/checksum changes required to run on CL!
I don’t think it is necessary to do these tests. As far as I can see the runtime is as expected.
The confusion came from an apples and oranges comparison between ESM1.5 and ESM1.6, which are not expected to have similar runtimes due to the overhead of extra tracers from WOMBATlite.
Thanks everyone for clarifying. I initially expected to see a walltime closer to 1h as in @manodeep's test case, but that was probably based on a configuration without the new WOMBAT version? All good.
For info all - a quick and dirty first look at the first 10-20 years of both of the new runs indicates:
- the year 0 water balance issue has been solved (presumably by using the updated land initial conditions)
- global runoff amounts are very similar to the previous CABLE3 test runs and the existing long run.
- surface energy balance (-0.6 W/m2 over land) is similar to the various CABLE3 test runs (ranging from -0.55 to -0.6 W/m2) and, hence, a bit worse than the existing long run (-0.35 W/m2).
- no immediately obvious impacts (e.g. on rainfall, temperature, dust etc.)
It’s still early days but this is mostly promising. It’s still too early to check on ocean temperature/salinity (and hence determine which of the two runs would be better, and/or whether we need a different value for `lprec0`). I am expecting an initial period of cooling because of the change to the solar constant; there may also be a CABLE-related impact to keep an eye out for. Hopefully any salinity signal will emerge sooner.
Sounds promising. Thanks @inh599 for doing a sanity check. If it runs stably over the weekend we should be able to do more analysis next week. For ocean temperature and salinity we probably need a couple of hundred years.
I’ll keep an eye on it over the weekend.
Currently at ~39-40 years, running continuously (in the lprec>0 run).
There does seem to be a slight performance gain. From memory, for CMIP6/ESM1.5 we got ~16 years per day. Taking the timestamps on the output/restart directories as an indicator of a complete model year, ESM1.6 is doing 19-20 years per day.
This may not be an apples-to-apples comparison, but it is an effective throughput comparison. Assuming the walltimes Tilo mentioned still apply, I suspect it is spending less time in the queue. This could also be down to the current uptake of Sapphire Rapids. Significantly, it isn't worse, we can't go back to cascade lakes anyway, and this includes the output conversion, which so far has not failed once across the combined ~75 years.
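A hypothetical sketch of that throughput estimate (the archive layout and the one-restart-per-model-year assumption are mine, not taken from the actual run):

```python
# Hypothetical sketch: estimate model years per wallclock day from the
# modification times of payu restart directories (assumes one restart is
# written per completed model year).
import glob
import os

restarts = sorted(glob.glob("archive/restart*"))  # assumed archive layout
if len(restarts) >= 2:
    mtimes = [os.path.getmtime(d) for d in restarts]
    elapsed_days = (mtimes[-1] - mtimes[0]) / 86400.0
    print(f"~{(len(restarts) - 1) / elapsed_days:.1f} model years per day")
```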
~75 (~70) years in the lprec>0 (lprec=0) runs respectively. No failures. No conversion failures.
The ~75-year TAS looks okay, I guess. The first ten years were bound to be a shock.
No idea about the ocean fields Ian has been looking at
Thanks for keeping an eye on this. TAS adjusts relatively quickly, but ocean temperature will take a lot longer. TAS looks to be ~0.4 K lower, roughly what we would expect from the change to the solar constant.
Oops, I meant screen temp. IDK the answer, sorry. This is the lprec>0 case, BTW.
This is interesting. The lprec=0 case:
It looks like something has gone wrong with the smoother, but it is exactly the same syntax as the other case.
The minutes of today’s spin-up meeting are here. Please correct / amend as required.
Just a quick plot of GPP (annual mean values [PgC/yr]) for the new run (lprec>0):
The mean (~104 PgC/yr) is very similar to our ESM1.5 control run (~106 PgC/yr), but variability is lower in the new spinup (~1.3 PgC/yr standard deviation) than in ESM1.5 (~2.1 PgC/yr standard deviation).
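A hypothetical sketch of how those numbers could be computed (the file pattern, the variable names `gpp` and `area`, and the units are assumptions, not the actual ESM1.6 output conventions):

```python
# Hypothetical sketch: global annual GPP [PgC/yr] and its variability from
# monthly output. Variable names 'gpp' (kgC m-2 s-1) and 'area' (m2) and
# the file pattern are assumptions.
import xarray as xr

SECONDS_PER_YEAR = 365.25 * 24 * 3600.0

ds = xr.open_mfdataset("archive/output*/atmosphere/*.nc")  # assumed layout
flux = ds["gpp"]    # assumed carbon flux in kgC m-2 s-1
area = ds["area"]   # assumed grid-cell area in m2

# Global total in PgC/yr (1 Pg = 1e12 kg), then annual means over time.
total = (flux * area).sum(dim=("lat", "lon")) * SECONDS_PER_YEAR / 1e12
annual = total.groupby("time.year").mean()

print(f"mean = {annual.mean().item():.1f} PgC/yr, "
      f"std  = {annual.std().item():.1f} PgC/yr")
```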