PBS: job exceeds memory when executing `nccmp` many times

In a PBS job script, I run a number of binary comparisons between netcdf files using the nccmp command (see below). The netcdf files being compared range in size from 15M to 92M. The script executes the comparisons successfully, but the peak memory usage of the job reported by PBS far exceeds the memory used when running nccmp on any single pair of files on its own.

For example, running nccmp on two files of size 92M uses about 230M of memory. Running the script with 52 comparisons uses about 5G of memory, and with 332 comparisons about 30G. (Note: the memory usage quoted here is the peak memory usage reported by PBS, i.e. resources_used.mem.)
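The single-comparison figure can be checked independently of PBS. A minimal sketch using GNU time (assuming /usr/bin/time is GNU time on the node; the file names below are placeholders for one R0/R1 pair):

/usr/bin/time -v nccmp -df more_outputs/example_R0_file.nc more_outputs/example_R1_file.nc
# GNU time's -v output includes "Maximum resident set size (kbytes)" for the nccmp process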

Does anyone have any idea why the peak memory usage scales with the number of nccmp comparisons executed in the script?

Here is the script used to execute binary comparisons:

#!/bin/bash
#PBS -l wd
#PBS -l ncpus=1
#PBS -l mem=32GB
#PBS -l walltime=1:00:00
#PBS -P tm70
#PBS -j oe
#PBS -m e
#PBS -l storage=gdata/hh5+scratch/tm70

module purge
module use /g/data/hh5/public/modules
module load conda

output_dir=more_outputs  # path to netcdf files
R0_files=($output_dir/*_R0_*)
R1_files=($output_dir/*_R1_*)

if [ ${#R0_files[@]} -ne ${#R1_files[@]} ]; then
    echo "Error: number of R0 files unequal to number of R1 files."
    exit 1
fi

for ((i=0; i<${#R0_files[@]}; i++)); do
    echo "nccmp -df ${R0_files[i]} ${R1_files[i]}"
    nccmp -df "${R0_files[i]}" "${R1_files[i]}"
done

To reproduce:

The script and netcdf files used are accessible on NCI: /scratch/public/sb8430/memory-issue.
To run the comparisons, simply submit the script to the PBS scheduler:

qsub compare_files_serial.pbs

There are three directories containing varying numbers of netcdf files: outputs, more_outputs and many_outputs. To run the largest case (332 comparisons), set output_dir=many_outputs in the script, as shown below.
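The only change needed for the larger cases is the output_dir assignment near the top of the script, for example:

output_dir=many_outputs  # path to netcdf files (332 comparisons)

and then resubmit with qsub as above.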


Hi @SeanBryan51

I’ve seen similar behaviour in my PBS jobs. My suspicion is that you’re actually seeing the cumulative memory usage of all processes that have ever run within the job, capped at the requested memory. I believe PBS gathers memory usage from the job’s cgroup; perhaps it keeps a lookup table of pids and their memory usage and periodically sums everything, whether those tasks are still running or not? That being said, NCI kills jobs based on instantaneous memory usage, using a separate mechanism from the one PBS uses to report memory. If your jobs aren’t actually being killed, I suspect you’d be safe to drop the memory request down to 4GB, regardless of how many files you’re comparing.
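If you want to see what the cgroup itself reports, something like the following rough sketch could be run from inside the job. It assumes a cgroup v1 memory controller and that the job's cgroup path is visible in /proc/self/cgroup; the exact layout on the NCI nodes may differ:

# Resolve this process's memory cgroup and read its usage counters (cgroup v1 layout assumed)
cgroup_path=$(awk -F: '$2 == "memory" {print $3}' /proc/self/cgroup)
cat "/sys/fs/cgroup/memory${cgroup_path}/memory.usage_in_bytes"      # current usage (includes page cache charged to the cgroup)
cat "/sys/fs/cgroup/memory${cgroup_path}/memory.max_usage_in_bytes"  # high-water mark since the cgroup was created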

Dale
