Need help: Restart of Wilma's 1/4 MOM ACCESS-CM2 coupled run crashed

Hi there,

Wilma’s previous run was based on accessdev, and I am now trying to continue the run from year 121 on gadi.

The job fails, and I get the following error message in job.err:

Image              PC                Routine            Line        Source
fms_ACCESS-CM.x    0000000001B3A296  Unknown               Unknown  Unknown
fms_ACCESS-CM.x    00000000017D79C1  mpp_mod_mp_mpp_er          58  mpp_util_mpi.inc
fms_ACCESS-CM.x    0000000001505FE2  diag_util_mod_mp_        2258  diag_util.F90
fms_ACCESS-CM.x    00000000014DAD9E  diag_manager_mod_        3059  diag_manager.F90
fms_ACCESS-CM.x    00000000014CAD03  diag_manager_mod_        1757  diag_manager.F90
fms_ACCESS-CM.x    00000000014D5410  diag_manager_mod_        1392  diag_manager.F90
fms_ACCESS-CM.x    0000000000445D59  ocean_util_mod_mp         983  ocean_util.F90
fms_ACCESS-CM.x    00000000006D15C6  ocean_sbc_mod_mp_        5275  ocean_sbc.F90
fms_ACCESS-CM.x    00000000006BCD41  ocean_sbc_mod_mp_        4198  ocean_sbc.F90
fms_ACCESS-CM.x    000000000044E098  ocean_model_mod_m        1587  ocean_model.F90
fms_ACCESS-CM.x    00000000004270AB  MAIN__                    471  ocean_solo.F90
fms_ACCESS-CM.x    000000000040F4E2  Unknown               Unknown  Unknown
libc-2.28.so       000014B2C55E6D85  __libc_start_main     Unknown  Unknown
fms_ACCESS-CM.x    000000000040F3EE  Unknown               Unknown  Unknown
Image              PC                Routine            Line        Source
fms_ACCESS-CM.x    0000000001B3A296  Unknown               Unknown  Unknown
fms_ACCESS-CM.x    00000000017D79C1  mpp_mod_mp_mpp_er          58  mpp_util_mpi.inc
fms_ACCESS-CM.x    0000000001505FE2  diag_util_mod_mp_        2258  diag_util.F90
fms_ACCESS-CM.x    00000000014DAD9E  diag_manager_mod_        3059  diag_manager.F90
fms_ACCESS-CM.x    00000000014CAD03  diag_manager_mod_        1757  diag_manager.F90
fms_ACCESS-CM.x    00000000014D5410  diag_manager_mod_        1392  diag_manager.F90
fms_ACCESS-CM.x    0000000000445D59  ocean_util_mod_mp         983  ocean_util.F90
fms_ACCESS-CM.x    00000000006D15C6  ocean_sbc_mod_mp_        5275  ocean_sbc.F90
fms_ACCESS-CM.x    00000000006BCD41  ocean_sbc_mod_mp_        4198  ocean_sbc.F90
fms_ACCESS-CM.x    000000000044E098  ocean_model_mod_m        1587  ocean_model.F90
fms_ACCESS-CM.x    00000000004270AB  MAIN__                    471  ocean_solo.F90
fms_ACCESS-CM.x    000000000040F4E2  Unknown               Unknown  Unknown
libc-2.28.so       0000149F587C6D85  __libc_start_main     Unknown  Unknown

The traceback passes through diag_manager.F90 and diag_util.F90 before reaching mpp_error, so the problem may be in the diag_table?
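Since the traceback ends in an mpp error routine, the actual message is probably printed elsewhere in job.err. A minimal sketch for pulling it out, assuming the fatal message lines contain the word "FATAL" (possibly with a "from PE n:" prefix); that format is an assumption about the log, not something shown above, and the function names are made up for illustration:

```python
import re
from collections import Counter

# Assumption: fatal errors show up in job.err as lines containing "FATAL".
FATAL_RE = re.compile(r"FATAL.*")
# Assumption: each rank may prefix the message with "from PE <n>:".
PE_PREFIX_RE = re.compile(r"from PE\s*\d+\s*:?\s*")


def fatal_messages(lines):
    """Count distinct FATAL messages, ignoring which PE reported them."""
    counts = Counter()
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            # Drop the per-rank prefix so identical messages are grouped.
            message = PE_PREFIX_RE.sub("", match.group(0)).strip()
            counts[message] += 1
    return counts


def fatal_messages_in_file(path):
    """Same as fatal_messages, reading from a log file such as job.err."""
    with open(path, errors="replace") as handle:
        return fatal_messages(handle)
```

Calling `fatal_messages_in_file` on the job.err of the failed cycle should give a short list of unique messages rather than one copy per rank, which is usually easier to read than the repeated tracebacks.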

The job is running under:
/home/599/ars599/cylc-run/u-db965

and set up at:
/home/599/ars599/roses/u-db965

Any suggestions?

Thanks.

Hi @ars599. If you want ACCESS-NRI support with this, please add the help tag. Also, it will be easier for people to help if they can access your run directory (I get Permission denied).

Thanks Dougie!!

Model outputs under log sometimes have 644 permissions and sometimes 600, which is hard to predict.

d ~/cylc-run/u-db965/log/job/01210101/coupled/01/
total 2468
drwxr-sr-x 2 ars599 p66    4096 Mar  4 16:20 ./
drwxr-sr-x 3 ars599 p66    4096 Mar  1 10:13 ../
-rwxr-xr-x 1 ars599 p66    6258 Mar  1 10:13 job*
-rw-r--r-- 1 ars599 p66     197 Mar  1 10:16 job-activity.log
-rw-r--r-- 1 ars599 p66 1719672 Mar  1 10:16 job.err
-rw-r--r-- 1 ars599 p66  780408 Mar  1 10:16 job.out
-rw-r--r-- 1 ars599 p66     242 Mar  1 10:16 job.status

Other files might also need to be checked, so it would be better for the people helping to ask for specific files rather than asking users to change the permissions on everything. For example, the files under work:

d ~/cylc-run/u-db965/work/01210101/coupled/CPL_RUNDIR/
total 725176
drwxr-sr-x 2 ars599 p66      4096 Mar  1 10:11 ./
drwxr-sr-x 6 ars599 p66      4096 Mar  1 10:14 ../
-rw-r----- 1 ars599 p66   8424712 Mar  1 09:53 a2i.nc
-rw-r----- 1 ars599 p66 622161313 Mar  1 09:53 i2a.nc
lrwxrwxrwx 1 ars599 p66        12 Mar  1 10:11 namcouple -> ../namcouple
-rw-r----- 1 ars599 p66 111974996 Mar  1 09:53 o2i.nc

It might cause security issues for people like me who are unfamiliar with Linux.
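One way to see what others can and cannot read, without changing any permissions, is to list the files that lack the world-read bit. A minimal sketch using only the Python standard library; the function name is made up for illustration, and it only inspects the directory it is given:

```python
import os
import stat


def unreadable_by_others(directory):
    """Return names of regular files in `directory` without the world-read bit."""
    hidden = []
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            # Skip directories and symlinks; only report real files here.
            if entry.is_file(follow_symlinks=False):
                mode = entry.stat(follow_symlinks=False).st_mode
                if not mode & stat.S_IROTH:
                    hidden.append(entry.name)
    return hidden
```

Run against a directory such as CPL_RUNDIR, this would be expected to report files listed as `-rw-r-----` and skip those listed as `-rw-r--r--`. Note that parent directories also need to be traversable for anyone else to reach the files at all.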

Or would you suggest copying those to a public folder?

Topic tags can be set when composing or editing the topic title. @Aidan has written a nice description of how to use them here. To tag this topic with the help tag:

  1. click on the pencil icon to the right of the topic title
  2. add “help” to the “optional tags” box

Or would you suggest copying those to a public folder?

Yeah, that might be easiest, e.g. /scratch/public?

@ars599 I hope you don’t mind, but I’ve changed the tag to help, since this specific tag triggers the support process on our end.

Could you copy all relevant logs and the diag_table to somewhere public?
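A rough sketch of how that copy could be done so that only the shared copies become world-readable and the originals keep their permissions. The function name and the choice of destination directory are assumptions for illustration; which files count as relevant is up to the user:

```python
import os
import shutil


def copy_for_sharing(paths, dest_dir):
    """Copy files into dest_dir and make only the copies world-readable."""
    os.makedirs(dest_dir, exist_ok=True)
    # Others need read and execute on the directory to list and open files.
    os.chmod(dest_dir, 0o755)
    copied = []
    for path in paths:
        # copy2 keeps timestamps; it returns the path of the new file.
        target = shutil.copy2(path, dest_dir)
        os.chmod(target, 0o644)
        copied.append(target)
    return copied
```

For example, passing the paths of job.err, job.out and the diag_table together with a new subdirectory under a public location would leave everything in the run directory untouched.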

@ars599, do you still want help with this? If you managed to resolve the issue yourself, could you provide details of the resolution here to help others in the future?

Dear Dougie,

Sorry for the late reply. I had a family emergency and flew back to Taiwan two weeks ago, and I have only just returned. Could you please remove this topic? The issue hasn’t been solved, but I have other issues to fix first, so I might repost this later :slight_smile:

Regards,

Arnold

No worries @ars599. Hope all is well.

We don’t remove topics, but I’ll remove the help tag so that it doesn’t get flagged for ACCESS-NRI support. Please reply here if you find you want help again with this in the future.