ESMF mesh and MOM6 domain masks are inconsistent

OK, now I have progress. The job runs for about a minute. Things begin to happen, but it still fails.

The error from access-om3.err is

FATAL from PE    82: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) =       81         1         0

FATAL from PE    86: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) =      306         1         0

Is this an error in my config files?

Or an issue with the regional meshes generated downstream of the COSIMA mom6 notebook?

I note there are two files in my INPUT_DIR:
access-rom3-nomask-ESMFmesh.nc
access-rom3-ESMFmesh.nc

Yet I only refer to one of them in my input configurations, e.g. from datm_in:

  model_maskfile = "./INPUT/access-rom3-nomask-ESMFmesh.nc"
  model_meshfile = "./INPUT/access-rom3-nomask-ESMFmesh.nc"

Note the inputs in my config.yaml are:

    - /g/data/vk83/configurations/inputs/access-om3/share/meshes/share/2024.09.16/JRA55do-datm-ESMFmesh.nc
    - /g/data/vk83/configurations/inputs/access-om3/share/meshes/share/2024.09.16/JRA55do-drof-ESMFmesh.nc
    - /g/data/vk83/configurations/inputs/access-om3/share/grids/global.1deg/2020.10.22/topog.nc
    - /g/data/vk83/configurations/inputs/access-om3/mom/surface_salt_restoring/global.1deg/2020.05.30/salt_sfc_restore.nc
    - /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/grid.nc
    - /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/kmt.nc
    - /g/data/vk83/configurations/inputs/access-om3/cice/initial_conditions/global.1deg/2023.07.28/iced.1900-01-01-10800.nc
    - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/hgrid.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/vcoord.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/bathymetry.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_tracers.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_eta.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_vel.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_001.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_002.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_003.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_004.nc  
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/grid_spec.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/ocean_mosaic.nc 
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/access-rom3-ESMFmesh.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/access-rom3-nomask-ESMFmesh.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/land_mask.nc 

In nuopc.runconfig, there should be two lines using the version with the mask:


Yep. That’s already set.

    mesh_mask = ./INPUT/access-rom3-ESMFmesh.nc
    mesh_ocn = ./INPUT/access-rom3-ESMFmesh.nc    

I’ll inspect all my mesh/domain .nc files to check the dimensions.

Just bumping this. I still can’t get this configuration to work. If someone is able to upload a working configuration somewhere, it would be helpful to diagnose my problems.

For the record, I had a look at my mesh files. The bathymetry.nc, land_mask.nc and ocean_mask.nc files all have the same dimensions: 140 x 249.

The access-rom3-ESMFmesh.nc file is unstructured, so it has no fixed x and y dimensions.
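For reference, this is roughly how I checked the dimensions (a quick sketch with xarray, run from my input directory; the ESMF mesh file reports its size through dimensions like elementCount rather than nx/ny):

    import xarray as xr

    # Print the dimensions of each mask/mesh file
    for fname in ['bathymetry.nc', 'land_mask.nc', 'ocean_mask.nc', 'access-rom3-ESMFmesh.nc']:
        ds = xr.open_dataset(fname)
        print(fname, dict(ds.sizes))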

The error message:

FATAL from PE 82: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) = 81 1 0 

is generated from the following code in subroutine InitializeRealize in mom_cap.F90

 if (abs(maskmesh(n) - mask(n)) > 0) then
   frmt = "('ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - "//&
          "MOM n, maskMesh(n), mask(n) = ',3(i8,2x))"
   write(err_msg, frmt) n, maskmesh(n), mask(n)

So this error occurs for two values of n (81, 306), where n ranges from 1 to numownedelements; that value is computed using the external ESMF routine ESMF_MeshGet.

In both cases, maskMesh(n) = 1 while mask(n) = 0.
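To pin down where these points sit on the regional grid, a quick check along these lines might help (a sketch only: it assumes the mesh elements are stacked row-major as (NY, NX) = (249, 140), and the paths are placeholders for the files in my INPUT directory):

    import numpy as np
    import xarray as xr

    mesh = xr.open_dataset('INPUT/access-rom3-ESMFmesh.nc')
    ocean = xr.open_dataset('INPUT/ocean_mask.nc')

    # Reshape the 1-D element mask back onto the 2-D regional grid
    mesh_mask = mesh.elementMask.values.reshape(249, 140)
    mom_mask = ocean['mask'].values

    # (j, i) indices where the mediator (mesh) mask and the MOM ocean mask disagree
    jj, ii = np.where(mesh_mask != mom_mask)
    print(list(zip(jj.tolist(), ii.tolist())))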

Based on @mmr0 running on a simpler grid and smaller domain, I decided to reduce the job CPUs to 48. This generated a very different error:

FATAL from PE    11: MOM_domains_init: The product of the two components of layout,   10  10, is not the number of PEs used,    48.

FATAL from PE    15: MOM_domains_init: The product of the two components of layout,   10  10, is not the number of PEs used,    48.
etc

This error is generated from subroutine MOM_domains_init in MOM_domains.F90

    if (layout(1)*layout(2) /= PEs_used .and. (.not. mask_table_exists) ) then
      write(mesg,'("MOM_domains_init: The product of the two components of layout, ", &
            &      2i4,", is not the number of PEs used, ",i5,".")') &
            layout(1), layout(2), PEs_used
      call MOM_error(FATAL, mesg)
    endif

So there are clear rules regarding the components of layout and PE numbers.
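As a quick sanity check of that rule (just the arithmetic, not the actual MOM6 code):

    layout = (10, 10)
    pes_used = 48
    masked_blocks = 2                              # from mask_table.2.10x10

    print(layout[0] * layout[1] == pes_used)       # False -> triggers the FATAL above
    print(layout[0] * layout[1] - masked_blocks)   # 98 PEs expected if the mask table is honoured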

The mask_table file is specified in MOM_layout:

MASKTABLE = mask_table.2.10x10

This is an ASCII file that contains the following:

$ more mask_table.2.10x10 
2
10, 10
5,7
6,7
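(My reading of the FMS mask-table convention: the first line is the number of masked all-land blocks, the second line is the processor layout, and each remaining line gives the (i, j) indices of one block to mask out.)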

The logic of the above Fortran code means the error message can only ever be printed if mask_table_exists is false, i.e. the code appears to be ignoring the presence of the mask_table.2.10x10 file. Which is interesting.

I commented out the MASKTABLE line in MOM_layout but it doesn’t seem to make a difference.

The MOM_domain file does contain the following:

LAYOUT = 10,10    

So I did some digging into how MOM6 reads the input files. I found the logic that reads MOM_domain; it is contained in:

  • subroutines get_param_real, get_param_int etc. in MOM_file_parser.F90
  • subroutine initialize_MOM in MOM.F90
  • subroutine MOM_layout in MOM_domains.F90

But I haven’t yet been able to find any code which reads in datm_in and drof_in.

I would like to try a configuration that enforces 100 CPUs to align with the 10x10 layout, but payu contains the following code:

    # Increase the CPUs to accommodate the cpu-per-node request
    if n_cpus > max_cpus_per_node and (node_increase or node_misalignment):

        # Number of requested nodes
        n_nodes = 1 + (n_cpus - 1) // n_cpus_per_node
        n_cpu_request = max_cpus_per_node * n_nodes
        n_inert_cpus = n_cpu_request - n_cpus

        print('payu: warning: Job request includes {n} unused CPUs.'
              ''.format(n=n_inert_cpus))

        # Increase CPU request to match the effective node request
        n_cpus = max_cpus_per_node * n_nodes

        # Update the ncpus field in the config
        if n_cpus != n_cpus_request:
            print('payu: warning: CPU request increased from {n_req} to {n}'
                  ''.format(n_req=n_cpus_request, n=n_cpus))

So it always bumps the 100 CPUs up to 144.
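Walking through that logic for a 100-CPU request on Gadi's 48-core nodes (a sketch of the arithmetic above, not payu itself):

    n_cpus = 100
    cpus_per_node = 48                               # Gadi normal queue
    n_nodes = 1 + (n_cpus - 1) // cpus_per_node      # = 3
    n_cpu_request = cpus_per_node * n_nodes          # = 144
    print(n_nodes, n_cpu_request)                    # 3 144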

@mmr0 - were you able to get your simple configuration running on 16 CPUs with a 4x4 layout?

Anyways - chasing my tail a bit here and I’m way out of my MOM6 depth (joke!).

I didn’t find much useful information at: Welcome to MOM6’s documentation! — MOM6 0.2a3 documentation

I can try running some of the test cases here: Getting started · NOAA-GFDL/MOM6-examples Wiki · GitHub, to see how the LAYOUT, CPUs and domain decomposition work.

Hi @Paul.Gregory - I haven’t forgotten! I am going to go through the instructions over the next couple of days to see if I can replicate your error.

@mmr0 has placed her configuration setup here

@Aidan has some instructions here on how to upload your configuration to GitHub – any chance that you can please try uploading so we can see your files?


In the mesh files x and y are stacked. You can use numpy reshape to put them into a regular grid.

e.g. with an xarray dataset which contains the mesh file:

mesh_ds.elementMask.values.reshape(249,140)

arguments to reshape are (NY, NX)

This sounds like the problem - the ocean_mask.nc and access-rom3-ESMFmesh.nc are probably inconsistent.

Try setting

AUTO_MASKTABLE = True in MOM_input instead
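i.e. something like this in MOM_input (a sketch; the comment is roughly the MOM6 parameter description):

    AUTO_MASKTABLE = True   !   [Boolean] default = False
                            ! Turn on automatic mask table generation to eliminate land blocks.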

This is correct behaviour - i.e. on the Gadi normal queue, for core counts greater than one node, a PBS job must request a multiple of a full node (i.e. 48 processors). It doesn't affect how many processors MOM uses - that is set in nuopc.runconfig.


Hi @mmr0. Your configuration setup here GitHub - mmr0/access-om3-configs at tassie-test is missing a MOM_layout file.

Could you add that to the repo please?

I’ve tried to replicate your configuration setup and your notebook (i.e. a 4x4 decomposition of Tasmania at 0.1 resolution), but it fails immediately with an MPI error when I provide my own MOM_layout file.

I’m going to run your config from a clean directory (when it has a MOM_layout file) and see what happens.

Cheers

Note that if runlog: True is set in config.yaml (or not set at all, in which case it defaults to True), then all the necessary configuration files are automatically added to the git repo and checked in when the model is run.

Just sayin’ … use runlog.

Hi Aidan

I’m not sure I understand you. Are you saying that when I run payu run it will automatically create a MOM_layout and add it to the remote repository?

Will it add it to the local copy on disk?

I tried using @mmr0 's config repo.

$ payu clone -b expt -B tassie-test https://github.com/mmr0/access-om3-configs.git access-rom3 
Cloned repository from https://github.com/mmr0/access-om3-configs.git to directory: /home/548/pag548/access-om3/access-rom3
Created and checked out new branch: expt
laboratory path:  /scratch/gb02/pag548/access-om3
binary path:  /scratch/gb02/pag548/access-om3/bin
input path:  /scratch/gb02/pag548/access-om3/input
work path:  /scratch/gb02/pag548/access-om3/work
archive path:  /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
Added archive symlink to /scratch/gb02/pag548/access-om3/archive/access-rom3
Added work symlink to /scratch/gb02/pag548/access-om3/work/access-rom3
To change directory to control directory run:
  cd access-rom3

Then I altered config.yaml and commented out the runlog line, but payu setup fails with

FileNotFoundError: [Errno 2] No such file or directory: '/home/548/pag548/access-om3/access-rom3/MOM_layout'

Oh no, sorry for not being clear. If @mmr0 had runlog turned on then it would automatically add the correct files to the repo when the model is run.

This is one of the reasons it’s a good idea.

I have recently added to the instructions to direct everyone to turn on runlog:
In config.yaml we want to change runlog: false to runlog: true

This addition was after @mmr0 and @Paul.Gregory had already gone through the instructions (sorry!) but it would be great if you could turn it on now.


@Paul.Gregory I went through the instructions and can get the model to run. The one thing I did differently to you is that I used the more recent (non-dev) executable and the dev-1deg_jra55do_iaf branch. However, it seems unlikely to me that this is the issue, so if you have time, I would like to delve a bit more into @Anton’s suggestions above.

If you haven’t already, can you please double check that you have deleted these lines from config.yaml:

 - /g/data/vk83/configurations/inputs/access-om3/share/meshes/global.1deg/2024.01.25/access-om2-1deg-ESMFmesh.nc
 - /g/data/vk83/configurations/inputs/access-om3/share/meshes/global.1deg/2024.01.25/access-om2-1deg-nomask-ESMFmesh.nc

@anton, my understanding is that we don’t use ocean_mask.nc at model run time, but we use it along with hgrid.nc to create the mesh. Would the issue be that we are using an incompatible ocean_mask and hgrid.nc when generating the meshes?

@Paul.Gregory if the fresh version doesn’t work, would it be hard to try running with @mmr0’s hgrid and meshes, to see if we can rule these out as the issue?

Both files are used; annoyingly, it’s a situation where the same information is captured in two files.

ocean_mask.nc is used by MOM, and the mask in the ESMFmesh.nc is used by the mediator. They should be the same. The MOM “cap” checks that the two masks are the same, which is where the error is coming from.

As this is a regional domain, we might expect ocean everywhere, so all values of the mask should be 1?

You could plot the mask in the ESMF mesh using something like:

mesh_ds.elementMask.values.reshape(249,140)

and then compare it to the ocean_mask.nc file; it should be the same everywhere.

OK, so I copied @mmr0’s configuration:

a 4x4 decomposition on a domain measuring 70x125 at 0.1 resolution,

to another directory and regenerated the default domain:

a 10x10 decomposition on a 140x249 domain at 0.5 resolution.

During the domain decomposition, the notebook generated the following:

Running GFDL's FRE Tools. The following information is all printed by the FRE tools themselves
NOTE from make_solo_mosaic: there are 0 contacts (align-contact)
congradulation: You have successfully run make_solo_mosaic
OUTPUT FROM MAKE SOLO MOSAIC:
CompletedProcess(args='/g/data/ik11/mom6_tools/tools/make_solo_mosaic/make_solo_mosaic --num_tiles 1 --dir . --mosaic_name ocean_mosaic --tile_file hgrid.nc', returncode=0)
cp: './ocean_mosaic.nc' and 'ocean_mosaic.nc' are the same file
cp: './hgrid.nc' and 'hgrid.nc' are the same file
cp ./hgrid.nc hgrid.nc 
NOTE from make_coupler_mosaic: the ocean land/sea mask will be determined by field depth from file bathymetry.nc
mosaic_file is grid_spec.nc
***** Congratulation! You have successfully run make_quick_mosaic
OUTPUT FROM QUICK MOSAIC:
CompletedProcess(args='/g/data/ik11/mom6_tools/tools/make_quick_mosaic/make_quick_mosaic --input_mosaic ocean_mosaic.nc --mosaic_name grid_spec --ocean_topog bathymetry.nc', returncode=0)
===>NOTE from check_mask: when layout is specified, min_pe and max_pe is set to layout(1)*layout(2)=100
===>NOTE from check_mask: Below is the list of command line arguments.
grid_file = ocean_mosaic.nc
topog_file = bathymetry.nc
min_pe = 100
max_pe = 100
layout = 10, 10
halo = 4
sea_level = 0
show_valid_only is not set
nobc = 0
===>NOTE from check_mask: End of command line arguments.
===>NOTE from check_mask: the grid file is version 2 (mosaic grid) grid which contains field gridfiles
==>NOTE from get_boundary_type: x_boundary_type is solid_walls
==>NOTE from get_boundary_type: y_boundary_type is solid_walls
==>NOTE from check_mask: Checking for possible masking:
==>NOTE from check_mask: Assume 4 halo rows
==>NOTE from check_mask: Total domain size is 140, 249
_______________________________________________________________________
NOTE from check_mask: The following is for using model source code with version older than siena_201207,
Possible setting to mask out all-land points region, for use in coupler_nml
Total number of domains = 100
Number of tasks (excluded all-land region) to be used is 98
Number of regions to be masked out = 2
The layout is 10, 10
Masked and used tasks, 1: used, 0: masked
1111111111
1111111111
1111111111
1111001111
1111111111
1111111111
1111111111
1111111111
1111111111
1111111111
 domain decomposition
  14  14  14  14  14  14  14  14  14  14
  25  25  25  25  25  25  25  25  25  24
 used=98, masked=2, layout=10,10
 To chose this mask layout please put the following lines in ocean_model_nml and/or ice_model_nml
 nmask = 2
layout = 10, 10
mask_list = 5,7,6,7
_______________________________________________________________________
NOTE from check_mask: The following is for using model source code with version siena_201207 or newer,
                      specify ocean_model_nml/ice_model_nml/atmos_model_nml/land_model/nml 
                      variable mask_table with the mask_table created here.
                      Also specify the layout variable in each namelist using corresponding layout
***** Congratulation! You have successfully run check_mask
OUTPUT FROM CHECK MASK:

I then wrote the following code to check the values of ocean_mask.nc and the ESMF mesh files.

from pathlib import Path

import xarray as xr
import matplotlib.pyplot as plt

input_dir = Path('/scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced')

# Load mask files
ocean_mask = xr.open_dataset(f'{input_dir}/ocean_mask.nc')
ESMF_mesh = xr.open_dataset(f'{input_dir}/access-rom3-ESMFmesh.nc')
ESMF_nomask_mesh = xr.open_dataset(f'{input_dir}/access-rom3-nomask-ESMFmesh.nc')

# Reconstruct the ESMF mesh mask as a 2-D array (elements are stacked as (NY, NX))
ESMF_mask = xr.DataArray(ESMF_mesh.elementMask.values.reshape(249, 140),
                         dims=['ny', 'nx'],
                         coords={'ny': ocean_mask.ny,
                                 'nx': ocean_mask.nx})

# Plot both masks and their difference side by side
fig, ax = plt.subplots(1, 3, figsize=(15, 4.5))
ocean_mask.mask.plot(ax=ax[0])
ESMF_mask.plot(ax=ax[1])
delta = ocean_mask.mask - ESMF_mask
delta.plot(ax=ax[2])
fig.suptitle('Delta b/w ocean_mask and ESMF_mask')
plt.tight_layout()

The two masks are identical.

The delta min/max value is zero.

Additionally, the elementMask in access-rom3-nomask-ESMFmesh.nc is 1.0 everywhere.

The payu run task fails with the same error:

FATAL from PE    86: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) =      306         1         0

FATAL from PE    82: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) =       81         1         0

Is this related to the output of check_mask from the notebook earlier?

Total number of domains = 100
Number of tasks (excluded all-land region) to be used is 98
Number of regions to be masked out = 2
...
 domain decomposition
  14  14  14  14  14  14  14  14  14  14
  25  25  25  25  25  25  25  25  25  24
 used=98, masked=2, layout=10,10

i.e. there are two regions in the domain that don’t align somehow?

Do I need to follow the advice in check_mask output?

 To chose this mask layout please put the following lines in ocean_model_nml and/or ice_model_nml
 nmask = 2
layout = 10, 10
mask_list = 5,7,6,7

EDIT: I’m guessing not, as this corresponds to the contents of mask_table.2.10x10?

2
10, 10
5,7
6,7

OK - that’s interesting! That seems like a bug. My hunch would be to try without any masked blocks in the mask_table.

But we use the AUTO_MASKTABLE option without trouble, so I am not sure.

The notebooks run a few processes that are not actually needed with the NUOPC coupler - I think the mask_table file is something we are able to ignore.

Thanks @anton.

So to remove the masked blocks in mask_table, do I remove the last two lines in the file?
So I change

2
10, 10
5,7
6,7

to

2
10,10

?

Or do I just clear all contents from the mask_table file?

BTW if this doesn’t work (and I note that you’re not too optimistic :wink: ) I’m happy to try and re-compile my own MOM6 executable with debugging flags and try to run debug the MPI process.

I’ve never debugged MPI, but I have lots of experience using gdb and idb with Fortran, so I’ll be able to make good progress once I know how to attach a debugger to the MPI processes.

I’ve also decided to start reading the MOM6 docs from the beginning at Welcome to MOM6’s documentation! — MOM6 0.2a3 documentation

This probably works, but I am not sure.

We set AUTO_MASKTABLE = True in MOM_input and don’t have a MOM_layout file. With a MOM_layout file, presumably you could set it there instead?

We might need a MOM person!

OK, trying to add AUTO_MASKTABLE = True to MOM_input or MOM_layout, with or without the MASKTABLE variable set, produces the following payu error:

ValueError: OCN_modelio pio_root exceeds available PEs (max: 0) in nuopc.runconfig.

Which refers to this section of nuopc.runconfig

OCN_modelio::
     diro = ./log
     logfile = ocn.log
     pio_async_interface = .false. #not used
     pio_netcdf_format = 64bit_offset #not used
     pio_numiotasks = -99 #not used
     pio_rearranger = 2 #not used
     pio_root = 1 #not used
     pio_stride = 48 #not used
     pio_typename = netcdf #not used, set in input.nml
::

I’ll keep going with compiling MOM6 with debug flags and then attaching a debugger to it.

I might start a separate hive thread.