@Aidan /g/data/p66/jxs599/ESM16/PAYU/Dev/AprilSpinUp*
@manodeep - my understanding was that pr42 was merged back into ‘preindustrial+concentration’ and so is already in ‘20250409-spinup-dev-preindustrial+concentrations’
Indeed, (by default) the relevant config.yaml has ‘queue: normalsr’, which I can only assume makes its way to the PBS directive, although there could be other stuff I guess?
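For reference, a minimal sketch of how that setting appears in a payu config.yaml (values illustrative, not the full config; payu passes the queue through to the PBS directive):

```yaml
# Illustrative payu config.yaml excerpt (not the complete file)
queue: normalsr      # payu passes this through as the PBS -q directive
walltime: 3:00:00    # placeholder value, not the actual run setting
```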
Hi @Aidan. For ESM1.5 you can look at /g/data/p66/txz599/ACCESS-ESM1p5/exp/ESM-ZEC-dn2p0, although this is still using the script-based approach (the same way we ran for CMIP6).
Thanks @spencerwong and @manodeep. I think our priority is on decreasing the walltime, given the length of spinup and control runs.
Thanks @tiloz and @Jhan for sharing those details. Some timing numbers for the different post-processing strategies are available here. These tests were done before the switch to Sapphire Rapids, though the main results were:
- Using an `io_layout` of 1,1 and removing the collation step (the current strategy) added ~5-6 minutes per run compared to the original strategy (which regularly led to collation failures), but guarantees that collation failures cannot happen (see the namelist sketch after this list).
- Using different `io_layout` settings could reduce the walltime by ~2-5 minutes, but would still require collation to be active. We think these settings could reduce the risk of failure compared to the original settings, but weren't able to guarantee that failures wouldn't happen.
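For context, a hedged sketch of where `io_layout` sits in the ocean input.nml (namelist group and values are illustrative, not the settings from these runs):

```fortran
! Illustrative input.nml excerpt (MOM ocean model); values are examples only.
&ocean_model_nml
    layout    = 18, 10   ! processor decomposition for the compute domain
    io_layout = 1, 1     ! 1,1 writes a single combined file per output, so
                         ! no post-run collation step is required
/
```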
Something else that came up in our tests and is worth noting when making comparisons: the ESM1.6 code is slower than ESM1.5. Earlier tests on the CL cores found a ~14 minute slowdown per year in ESM1.6 compared to 1.5, which appeared to come partly from the increased number of ocean tracers.
Kinda expected, at least from my perspective.
@Jhan If you want to run on the cascadelake queue, then you have to make the following changes to the config file (essentially reverting the changes needed to go from cascadelake to sapphirerapids); a sketch of the config.yaml portion follows the list:
- change the queue name to `normal` (rather than `normalsr`)
- remove the three lines beginning with, and including, the `platform` line
- change the UM `ncpus` to 192 with a `16x12` layout (change `atmosphere->ncpus` in `config.yaml`, and change `UM_ATM_NPROCY` and `UM_NPES` in `atmosphere/um_env.yaml`)
- change the ocean cores to 180 with an `18x10` layout (change `ocean->ncpus` in `config.yaml`, and change the ocean layout in `input.nml`)
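A hedged sketch of the config.yaml side of that reversion (illustrative only; the contents of the removed `platform` block aren't reproduced here, and the `UM_*` and ocean-layout edits go in `atmosphere/um_env.yaml` and `input.nml`):

```yaml
# Illustrative config.yaml excerpt for the cascadelake reversion
queue: normal            # was: normalsr
# platform:              # remove the three platform lines entirely
#   ...                  # (sapphirerapids-specific settings)
submodels:
  - name: atmosphere
    ncpus: 192           # paired with a 16x12 UM layout in um_env.yaml
  - name: ocean
    ncpus: 180           # paired with an 18x10 layout in input.nml
```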
@spencerwong Are there any other manifest/checksum-type changes to run the current config on cascadelake?
There shouldn’t be other manifest/checksum changes required to run on CL!
I don’t think it is necessary to do these tests. As far as I can see the runtime is as expected.
The confusion came from an apples and oranges comparison between ESM1.5 and ESM1.6, which are not expected to have similar runtimes due to the overhead of extra tracers from WOMBATlite.
Thanks everyone for clarifying. I initially expected to see a walltime closer to 1h as in @manodeep's test case, but that was probably based on a configuration without the new WOMBAT version? All good.
For info all - a quick and dirty first look at the first 10-20 years of both of the new runs indicates:
- the year 0 water balance issue has been solved (presumably by using the updated land initial conditions)
- global runoff amounts are very similar to the previous CABLE3 test runs and the existing long run.
- surface energy balance (-0.6 W/m2 over land) is similar to the various CABLE3 test runs (ranging from -0.55 to -0.6 W/m2) and, hence, a bit worse than the existing long run (-0.35 W/m2).
- no immediately obvious impacts (e.g. on rainfall, temperature, dust etc.)
It’s still early days but this is mostly promising. It’s still too early to check on ocean temperature/salinity (and hence determine which of the two runs would be better, and/or whether we need a different value for `lprec0`). I am expecting an initial period of cooling because of the change to the solar constant; there may also be a CABLE-related impact to keep an eye out for. Hopefully any salinity signal will emerge sooner.
Sounds promising. Thanks @inh599 for doing a sanity check. If it runs stably over the weekend we should be able to do more analysis next week. For ocean temperature and salinity we probably need a couple of hundred years.
I’ll keep an eye on it over the weekend.
Currently at ~39-40 years, running continuously (in the lprec>0 run).
There does seem to be a slight performance gain. From memory, for CMIP6/ESM1.5 we got ~16 years per day. Taking the timestamps on the output/restart directories as an indicator of a complete model year, ESM1.6 is doing 19-20 years per day.
This may not be an apples-to-apples comparison, but it is an effective throughput comparison. Assuming the walltimes Tilo mentioned still apply, I suspect it is spending less time in the queue. This could also be down to the current uptake of Sapphire Rapids. Significantly, it isn't worse, we can't go back to cascade lakes anyway, and this includes the output conversion, which so far has not failed once across the combined ~75 years.
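A hypothetical sketch of that throughput estimate (the archive layout and the one-restart-per-model-year assumption are mine, not taken from the actual run):

```python
# Hypothetical sketch: estimate model years per wallclock day from the
# modification times of payu restart directories (assumes one restart is
# written per completed model year).
import glob
import os

restarts = sorted(glob.glob("archive/restart*"))  # assumed archive layout
if len(restarts) >= 2:
    mtimes = [os.path.getmtime(d) for d in restarts]
    elapsed_days = (mtimes[-1] - mtimes[0]) / 86400.0
    print(f"~{(len(restarts) - 1) / elapsed_days:.1f} model years per day")
```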
~75 (~70) years in the lprec>0 (lprec=0) runs respectively. No failures. No conversion failures.
The ~75-year TAS looks okay, I guess. The first ten years were bound to be a shock.
No idea about the ocean fields Ian has been looking at
Thanks for keeping an eye on this. TAS adjusts relatively quickly, but ocean temperature will take a lot longer. TAS looks to be ~0.4 K lower, roughly what we would expect from the change to the solar constant.
Oops, I meant screen temp. IDK the answer, sorry. This is the lprec>0 case, BTW.
This is interesting. The lprec=0 case:
It looks like something has gone wrong with the smoother, but it is exactly the same syntax as the other case.
The minutes of today’s spin-up meeting are here. Please correct / amend as required.
Just a quick plot of GPP (annual mean values [PgC/yr]) for the new run (lprec>0):
The mean (~104 PgC/yr) is very similar to our ESM1.5 control run (~106 PgC/yr), but variability is lower in the new spinup (~1.3 PgC/yr standard deviation) than in ESM1.5 (~2.1 PgC/yr standard deviation).
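A hypothetical sketch of how those numbers could be computed (the file pattern, the variable names `gpp` and `area`, and the units are assumptions, not the actual ESM1.6 output conventions):

```python
# Hypothetical sketch: global annual GPP [PgC/yr] and its variability from
# monthly output. Variable names 'gpp' (kgC m-2 s-1) and 'area' (m2) and
# the file pattern are assumptions.
import xarray as xr

SECONDS_PER_YEAR = 365.25 * 24 * 3600.0

ds = xr.open_mfdataset("archive/output*/atmosphere/*.nc")  # assumed layout
flux = ds["gpp"]    # assumed carbon flux in kgC m-2 s-1
area = ds["area"]   # assumed grid-cell area in m2

# Global total in PgC/yr (1 Pg = 1e12 kg), then annual means over time.
total = (flux * area).sum(dim=("lat", "lon")) * SECONDS_PER_YEAR / 1e12
annual = total.groupby("time.year").mean()

print(f"mean = {annual.mean().item():.1f} PgC/yr, "
      f"std  = {annual.std().item():.1f} PgC/yr")
```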