Hello.
@qinggangg and I have been unable to scale the rAM3 suite beyond about 50x50 processors.
The UM itself integrates in time successfully. E.g. on 72x70 processors it computes a 6 hour forecast on the 1 km domain (21112 x 2000 points at 0.009 degree resolution) in about 13.3 minutes. The task then hangs and exceeds the available wall time. The default wall time is 30 mins, but I’ve tried up to 2 hours with no luck. A task split across more than 50x50 processors will sometimes work, but 90% of the time it fails due to wall time exceedance.
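To put the figures above in perspective, here is a rough sketch of the decomposition arithmetic. It assumes that "72x70" means 72 processors east-west by 70 north-south and that the grid is split as evenly as possible (ceiling division). Both are assumptions made for illustration and are not taken from the UM source. The function name `decomposition_summary` is hypothetical.

```python
# Figures quoted in the post; nothing here is read from the suite itself.
NX, NY = 21112, 2000          # grid points as stated (assumed east-west, north-south)
PROCS_EW, PROCS_NS = 72, 70   # decomposition of the run that hangs (assumed EW x NS)


def decomposition_summary(nx, ny, px, py):
    """Return total ranks and the largest per-rank subdomain, assuming an even split."""
    ranks = px * py
    max_cols = -(-nx // px)  # ceiling division
    max_rows = -(-ny // py)
    return ranks, max_cols, max_rows


ranks, cols, rows = decomposition_summary(NX, NY, PROCS_EW, PROCS_NS)
print(f"{ranks} ranks, largest subdomain {cols} x {rows} points")
print(f"{NX * NY} points per level to gather for a full-field write")
```

Under these assumptions the 72x70 run uses 5040 ranks, roughly double the 2500 ranks of the 50x50 configuration that still works, so any step that gathers full fields onto a single rank has about twice as many contributors to collect from.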
There is no error message (beyond wall time exceedance). The UM is running stash_gather_field.F90 when PBS terminates it. The final output from umnsa.fort6.pe0 (the rank 0 MPI process) is:
********************************************************************************
Atm_Step: Info: timestep 360 took 23.210 seconds
********************************************************************************
GET_FILENAME: Generated filename:/home/548/pag548/cylc-run/u-dq126/share/cycle/20220226T0000Z/Flagship_ERA5to1km/1km/RAL3P2/um//../ics/umnsaa_da006
GET_FILENAME: (From): /home/548/pag548/cylc-run/u-dq126/share/cycle/20220226T0000Z/Flagship_ERA5to1km/1km/RAL3P2/um//../ics/umnsaa_d%z%N
FILE_MANAGER: Assigned : /home/548/pag548/cylc-run/u-dq126/share/cycle/20220226T0000Z/Flagship_ERA5to1km/1km/RAL3P2/um//../ics/umnsaa_da006
FILE_MANAGER: : Unit : 12 (portio)
DUMPCTL: Opening new file /home/548/pag548/cylc-run/u-dq126/share/cycle/20220226T0000Z/Flagship_ERA5to1km/1km/RAL3P2/um//../ics/umnsaa_da006 on unit 12
OPEN: File /home/548/pag548/cylc-run/u-dq126/share/cycle/20220226T0000Z/Flagship_ERA5to1km/1km/RAL3P2/um//../ics/umnsaa_da006 to be Opened on Unit 12 Exists
OPEN: Claimed 4194304 Bytes for Buffering
OPEN: Buffer Address is 0x18a65a10
IO: Open: /home/548/pag548/cylc-run/u-dq126/share/cycle/20220226T0000Z/Flagship_ERA5to1km/1km/RAL3P2/um//../ics/umnsaa_da006 on unit 12
WRITING UNIFIED MODEL DUMP ON UNIT 12
#####################################
This 83 GB file assigned to unit 12 exists and I can read it, but the task still doesn’t exit correctly.
Given that I/O appears to be the bottleneck, I activated a small I/O server using 12 dedicated processors. The first UM task in my suite is a 12 km outer nest (580 x 780 points at 0.11 degrees, on 30x26 processors). The UM completes 20 timesteps and then the I/O server fails with this error:
-------- IOS ERROR REPORT ---------------
????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 61
? Error from routine: IOS_INIT_MD
? Error message: A target object of 10 is not allowed
? Error from processor: 0
? Error number: 23
????????????????????????????????????????????????????????????????????????????????
This error is thrown in this section of /src/io_services/client/ios_client_queue.F90
ELSE IF (targetObject == file_op_pseudo_unit) THEN
!$OMP CRITICAL(internal_write)
WRITE(IOS_clq_message,'(A,I5,A)') 'A target object of ', &
targetObject,' is not allowed'
!$OMP END CRITICAL(internal_write)
errorFlag=61
CALL IOS_ereport( RoutineName, errorFlag, IOS_clq_message )
file_op_pseudo_unit is the Fortran unit number computed in SUBROUTINE assign_file_unit(filename, f_unit, handler, id, force) in src/io_services/model_api/file_manager.F90.
Here are the pertinent sections of umnsa.fort6.pe0 that reference Unit 10:
FILE_MANAGER: Assigned : pseudo-file for UNIX operations
FILE_MANAGER: : id : io_reserved_unit
FILE_MANAGER: : Unit : 10 (portio)
Running Atmospheric code as pe 0
MPPIO_File_Utils: Initialised file utils using unit 10
FILE_MANAGER: Assigned : Reserved unit for re-initialised stream (/home/548/pag548/cylc-run/u-dq126/share/cycle/20220226T0000Z/Flagship_ERA5to1km/12km/GAL9/um//umnsaa_pb%N.nc)
FILE_MANAGER: : id : usr1
FILE_MANAGER: : Unit : 10 (netcdf)
NCFILE_INIT: Opening new file /home/548/pag548/cylc-run/u-dq126/share/cycle/20220226T0000Z/Flagship_ERA5to1km/12km/GAL9/um//umnsaa_pb000.nc on unit 10
Creating netCDF4 classic model file /home/548/pag548/cylc-run/u-dq126/share/cycle/20220226T0000Z/Flagship_ERA5to1km/12km/GAL9/um//umnsaa_pb000.nc on unit 10
It appears to me that the I/O server is trying to write to a unit number that has already been reserved by the NetCDF output functions.
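For anyone wanting to check their own fort6 output for the same pattern, here is a minimal sketch that scans FILE_MANAGER lines and reports unit numbers assigned under more than one handler. The regular expression is based only on the log lines quoted above, so other UM versions may format them differently. The function name `units_with_multiple_handlers` is hypothetical. Whether reusing a number across handlers (portio vs netcdf) is a genuine conflict inside the UM is not established here; the script only surfaces the reuse.

```python
import re
from collections import defaultdict

# Matches lines such as "FILE_MANAGER: : Unit : 10 (netcdf)" from the quoted log.
UNIT_LINE = re.compile(r"FILE_MANAGER:\s*:\s*Unit\s*:\s*(\d+)\s*\((\w+)\)")


def units_with_multiple_handlers(log_lines):
    """Map each unit number seen under more than one handler to those handlers."""
    handlers = defaultdict(list)
    for line in log_lines:
        match = UNIT_LINE.search(line)
        if match:
            unit, handler = int(match.group(1)), match.group(2)
            if handler not in handlers[unit]:
                handlers[unit].append(handler)
    return {unit: names for unit, names in handlers.items() if len(names) > 1}


# Sample lines copied from the log excerpts in this post.
sample = [
    "FILE_MANAGER: Assigned : pseudo-file for UNIX operations",
    "FILE_MANAGER: : id : io_reserved_unit",
    "FILE_MANAGER: : Unit : 10 (portio)",
    "FILE_MANAGER: : id : usr1",
    "FILE_MANAGER: : Unit : 10 (netcdf)",
    "FILE_MANAGER: : Unit : 12 (portio)",
]
print(units_with_multiple_handlers(sample))
```

To run it against a real log, pass the lines of the file instead of `sample`, e.g. `units_with_multiple_handlers(open("umnsa.fort6.pe0"))`.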
I’ve tried to see if these unit numbers can be overridden with a namelist, but no luck.
So the question is, who here has run the UM with an I/O server? And have you ever had to deal with conflicting unit numbers?
The AUS2000 suite used an I/O server, and these BoM operational suites all use it too:
- ACCESS-S
- BARRA-R2
- NAS
I’ve tried the UM task with ios_unit_alloc_policy = 1, 2 and 3 in the IOSCNTL namelist, i.e.
- Static allocation based on unit number
- Static allocation based on usage order
- Round robin dynamic allocation
And the error remains the same for all three options.
Cheers