Running ACCESS-rAM3 via bash script - can't find executable

Hi all,

I’m running ensemble members for ACCESS-rAM3 and have been developing a bash script to automate the process. The problem is that when the script executes rose suite-run, the suite fails at the um-recon job, and job.err reports um-recon: command not found. I am aware this is an issue with the executable. I’ve built my own executable, stored at /g/data/gb02/jt0319/ACCESS_rAM3/executables, and linked to it in rose-suite.conf. This works perfectly when I execute rose suite-run from my terminal. It only fails when rose suite-run is called from the bash script, which is below:

#!/bin/bash

set -euo pipefail

# Usage: bash run_ram3_ensemble_manual.sh <member_range>
# Example: bash run_ram3_ensemble_manual.sh 1,8

if [ $# -lt 1 ]; then
    echo "Usage: $0 <member_range>"
    exit 1
fi

MEMBER_RANGE="$1"
MEMBER_START=$(echo "$MEMBER_RANGE" | cut -d',' -f1)
MEMBER_END=$(echo "$MEMBER_RANGE" | cut -d',' -f2)

WORK_DIR="/home/272/jt0319/Code_2/Python/Reading_and_comparing_ensembles"
IC_DIR="/g/data/ng72/jt0319/ram3_start_dumps"
BASE_IC_TEMPLATE="qwqg00.reduced.2019121200.T+0"
BASE_IC_FULL="$IC_DIR/$BASE_IC_TEMPLATE"
BASE_IC_ORIG="$BASE_IC_FULL-original"

ROSES_DIR="$HOME/roses/ram3_global"
OUTPUT_FOLDER="SE_Australia"
SUITE_NAME="ram3_global"
PYTHON_BIN="${PYTHON_BIN:-python3}"
AMPLITUDE="1e-1"

#export ROSE_SITE_NAME=nci-gadi

cd "$WORK_DIR"

for member in $(seq "$MEMBER_START" "$MEMBER_END"); do
    echo "=== Member $member ==="

    # restore original IC
    cp "$BASE_IC_ORIG" "$BASE_IC_FULL"

    # load pythonlib if available (silent failure ok)
    module use ~access/modules >/dev/null 2>&1 || true
    module load pythonlib/umfile_utils/access_cm2 >/dev/null 2>&1 || true

    # perturb IC
    "$PYTHON_BIN" perturbIC.py -s "$member" -a "$AMPLITUDE" "$BASE_IC_FULL"

    # start suite in background and continue immediately; name it with the member id
    cd "$ROSES_DIR"
    echo "Starting rose suite-run with name=${SUITE_NAME}_$member (background)"
    rose suite-run --no-gcontrol --name="${SUITE_NAME}_$member" &
    cd - >/dev/null

    echo "Started member $member (suite name: ${SUITE_NAME}_$member)."

    # give the suite time to read the IC file before starting the next member
    sleep 300 # 5 minutes; adjust as needed based on how long the suite takes to start and read the IC
done

echo "All members started. Use the scanner script to collect outputs when runs finish."

How can I make the suite, when launched from the bash script, find the executables the way it does when I run it from the terminal?

@mlipson

Hi Joel,

It sounds like the script has a different environment than your normal terminal. Can you clarify how you’re running this script, i.e. is it submitted through PBS, or are you running it directly on a login node?
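One way to check for an environment difference is to capture the environment in both contexts and compare the two. A minimal sketch, assuming nothing about the suite itself: snapshot_env is a hypothetical helper (not part of rose or cylc) and the output file names are only examples.

```shell
# Hypothetical helper: record the parts of the environment that most often
# differ between an interactive terminal and a script, followed by the
# full sorted environment.
snapshot_env() {
    out="$1"
    {
        echo "PATH=$PATH"
        echo "rose=$(command -v rose 2>/dev/null || echo 'not found')"
        echo "--- full environment ---"
        env | sort
    } > "$out"
}

# 1. In the terminal where `rose suite-run` works (after sourcing this file):
#      snapshot_env "$HOME/env_terminal.txt"
# 2. Inside the bash script, just before `rose suite-run`:
#      snapshot_env "$HOME/env_script.txt"
# 3. Compare the two snapshots:
#      diff "$HOME/env_terminal.txt" "$HOME/env_script.txt"
```

Any difference in PATH or in loaded modules between the two files would be a reasonable first place to look.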

At its core, though, it sounds like the suite doesn’t have access to your /g/data/gb02/ folder. As the suite will start up as a PBS job, we need to explicitly tell the system which /g/data and /scratch locations it has access to.

Your default project (I assume ng72) will always be added, but you’ll need to also allow gb02 if your files are stored there. So in your rose-suite.conf, try adding:

NCI_STORAGE="scratch/$PROJECT+gdata/$PROJECT+gdata/gb02"
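As a quick sanity check, you could test whether a given path is covered by a storage string of this form. This is only a sketch: storage_covers is a hypothetical helper, it assumes entries of the form gdata/<project> or scratch/<project> joined by "+", and it expects $PROJECT to have been expanded already.

```shell
# Hypothetical helper: succeed if the project directory of a /g/data or
# /scratch path appears in an NCI_STORAGE-style string.
storage_covers() {
    storage="$1"
    path="$2"
    case "$path" in
        /g/data/*)  rest="${path#/g/data/}";  fs="gdata" ;;
        /scratch/*) rest="${path#/scratch/}"; fs="scratch" ;;
        *) return 1 ;;
    esac
    project="${rest%%/*}"          # first path component, e.g. gb02
    case "+$storage+" in
        *"+$fs/$project+"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (assuming ng72 is the default project):
#   storage_covers "scratch/ng72+gdata/ng72" \
#       /g/data/gb02/jt0319/ACCESS_rAM3/executables \
#       || echo "gdata/gb02 is missing from NCI_STORAGE"
```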

A few other comments on your script.

  1. You could move the module use and module load outside the loop, as it should only need to be done once.

  2. You’re copying the “original” IC file over the IC file used by the suite (BASE_IC_FULL) each time you run the suite, then perturbing it. So be aware your runs will not be reproducible, as that information is overwritten. Perhaps that’s not such a big issue, as you’re just perturbing ICs infinitesimally and randomly? But in that case, could you do this perturbation in place without copying, as long as you don’t touch the “original”? [I think it might be an issue, see below]

Other comments…

It might be better to use scratch rather than gdata for your perturbed ICs. Then storage is not such an issue (although files on scratch are time limited), and scratch might be faster to access. I’m not sure on that last point.

If you do want to keep a copy of each perturbed IC, you could give it a unique filename (with $member) and then update your rose-suite.conf programmatically to point to that particular IC, i.e. update:

dm_ic_file="path_to_member ic"

To do this you’d need a regex tool like sed, which can be a bit tricky, but AI will be able to help with the exact form of the sed command. The important thing is to keep the YYYYmmdd format in the dm_ic_file definition in your configuration.
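As a starting point, the substitution might look something like the sketch below. set_dm_ic_file is a hypothetical helper, and it assumes the setting sits on a single line that starts with dm_ic_file= and that the new value contains no "|", "&" or backslash characters (all special in the sed expression). The value is passed through verbatim, so the date part should be written in whatever form your existing dm_ic_file definition uses.

```shell
# Hypothetical helper: replace the whole dm_ic_file line in a rose-suite.conf.
# Writes to a temporary file first, then moves it over the original.
set_dm_ic_file() {
    conf="$1"
    new_value="$2"
    tmp="$conf.tmp.$$"
    # "|" is used as the sed delimiter because the value is a path containing "/"
    sed "s|^dm_ic_file=.*|dm_ic_file=\"$new_value\"|" "$conf" > "$tmp" \
        && mv "$tmp" "$conf"
}

# Example (the path is illustrative only):
#   set_dm_ic_file "$HOME/roses/ram3_global/rose-suite.conf" "/path/to/member_3_ic"
```

It is worth running this on a copy of rose-suite.conf first and checking the result with diff before using it on the real file.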

I think it’s much safer this way, as otherwise you can’t be certain which IC each suite used. For example, if a suite gets stuck in a queue or has a delayed start, multiple suites might end up using the same IC, because your copy command had already updated the file in the meantime.

You could also then decouple your perturbation workflow from your suite run workflow, i.e. do as many IC perturbations as you need, then afterwards run your suite-run workflow pointing to these different IC files with sed.
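A rough sketch of the perturbation half of that decoupled workflow is below. The naming scheme (<base>.member_NNN) and both function names are hypothetical; the perturbIC.py flags, the 1e-1 amplitude and the "-original" suffix are taken from the script above. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
# Hypothetical naming scheme: one IC file per member, e.g. <base>.member_003.
# Assumes the member number is a plain integer without leading zeros.
member_ic_path() {
    printf '%s.member_%03d\n' "$1" "$2"
}

# Phase 1: create every perturbed IC up front, each under its own name.
# Prints the commands instead of running them unless DRY_RUN=0.
perturb_members() {
    base_ic="$1"
    start="$2"
    end="$3"
    m="$start"
    while [ "$m" -le "$end" ]; do
        target=$(member_ic_path "$base_ic" "$m")
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "cp $base_ic-original $target"
            echo "python3 perturbIC.py -s $m -a 1e-1 $target"
        else
            cp "$base_ic-original" "$target"
            python3 perturbIC.py -s "$m" -a 1e-1 "$target"
        fi
        m=$((m + 1))
    done
}

# Example: preview the commands for members 1 to 8
#   perturb_members "$IC_DIR/$BASE_IC_TEMPLATE" 1 8
```

Phase 2 would then be a separate loop that points dm_ic_file at member_ic_path for each member before calling rose suite-run, so no suite ever reads a file that a later member is about to overwrite.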

Hey Mat,

Thanks for your insightful comments, I’ll have a look at the NCI_STORAGE options I’ve selected currently.

On your comment that copying the original IC file and overwriting it makes the runs non-reproducible: each member has a random number seed that is saved, so users can use the same seed to perturb the original file and get the same results. It is reproducible in this way. This is not clear from the bash script I have shown here, because it happens inside the perturbIC script that others like Martin have written.

I understand your concern about multiple runs using the same IC file, though. I had originally put in a timer between each simulation start, but as you suggest that may not be reliable. I’ll have a think about your suggestion here. My concern is the storage, given each IC file is 50 GB.

Are there any suggestions concerning running multiple members each with large ic files?

Ah, that’s a good idea to use a seed that can regenerate the perturbation.

But I’d still use a unique IC filename for each member and update your rose-suite.conf with sed to point to that unique file. You can still delete the files happily afterwards, with the understanding you can recreate them with the seed.

Otherwise I can’t see a good way to run many suites in parallel and ensure they are using their proper ICs.

Even if you get up to 80 members, that is “only” 4 TB (80 × 50 GB) on scratch, which should be ok for the day or so you’ll need to get things up and running across the ensemble. Worth checking with your project storage coordinator.

Actually, on second thought, running the entire ensemble at once is probably a bad idea, as you have more storage requirements than just ICs. I think you’ll need to split it up into mini ensembles, ensure you’re only outputting the critical information, and probably also process the output into NetCDF and store it on gdata while the suite runs. Otherwise your scratch will fill up quickly. I can help with this when I see you on Friday.
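To give a rough idea of the splitting, the sketch below prints member ranges in the same "start,end" form the script above takes as its argument. batch_ranges is a hypothetical helper and the batch size is just an example; with 50 GB per IC, a batch of 8 members needs about 400 GB for ICs alone, before any model output.

```shell
# Hypothetical helper: print "start,end" ranges covering members 1..total
# in batches of at most batch_size members.
batch_ranges() {
    total="$1"
    batch_size="$2"
    start=1
    while [ "$start" -le "$total" ]; do
        end=$((start + batch_size - 1))
        if [ "$end" -gt "$total" ]; then
            end="$total"
        fi
        echo "$start,$end"
        start=$((end + 1))
    done
}

# Example: 80 members in mini ensembles of 8 gives 1,8 then 9,16 and so on.
#   batch_ranges 80 8
```

Each range could then be run once the previous batch has finished and its output has been processed and moved off scratch.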