Cannot print to file - possible Dask issue in xp65

Hi help staff!

I had some code working in the hh5 env which is now crashing frequently in xp65.

I am printing to file. The code that was working on hh5:
file_name_vpd = pathway + '_' + GCM + '_' + RCM + '_gwl' + chosen_gwl + '_vpd.nc' 
output_file_location = output_dir + GCM + '/' + pathway + '/' + ensemble + '/' + RCM + '/v1-r1/day/' + file_name_vpd
gwl_vpd.to_netcdf(output_file_location, engine='netcdf4')
print(output_file_location)

It works a bit better if I break the files in half and specify threads_per_worker=1, but this of course creates downstream issues and is slower. The code which is “working” in xp65 (basically the same as above, but writing the first and second halves of the dataset separately):

halves = ['1st', '2nd']

for half in halves:
    print(half)
    if half == '1st':
        gwl_vpd_print = gwl_vpd['vpd'][0:2737]
    else: 
        gwl_vpd_print = gwl_vpd['vpd'][2737:]
        
    file_name_vpd = half + '_' + pathway + '_' + GCM + '_' + RCM + '_gwl' + chosen_gwl + '_vpd.nc'
    output_file_location = output_dir + GCM + '/' + pathway + '/' + ensemble + '/' + RCM + '/v1-r1/day/' + file_name_vpd
    gwl_vpd_print.to_netcdf(output_file_location, engine='netcdf4')
        
    print('half done')
    print(output_file_location)

This will often still crash, requiring me to delete the output file and relaunch.

The errors generally relate to file access and run to pages; the final statement is:

OSError: [Errno -101] NetCDF: HDF error: b'/g/data/ia39/ncra/bushfire/vpd/EC-Earth3/ssp370/r1i1p1f1/BARPA-R/v1-r1/day/ssp370_EC-Earth3_BARPA-R_gwl1.5_vpd.nc'

Packages I have imported:

#import all the stuff
from netCDF4 import Dataset
import xarray as xr
import numpy as np
import pandas as pd
from datetime import timedelta
import matplotlib.pyplot as plt
import glob
import sys
sys.path.append("/g/data/mn51/users/nb6195/project/gwls/")
import gwl

Would appreciate any help!

Thanks.

Hi Naomi,

I’ve been unable to test anything on Gadi as ARE is down & it seems my job submissions aren’t working either, but I’ve had a read through & I don’t think gwl_vpd is defined anywhere in the code snippets you provided.

Would you be able to provide (probably when Gadi comes back up, assuming you’ll be affected by the same issues as me) a script that will let me write the same xarray dataset to a file, so I can try to reproduce the issue?

With that said, I suspect this might be another manifestation of this pesky netcdf thread safety issue. I wonder whether specifying engine='h5netcdf' might go some way to fixing that.
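
Something along these lines is what I have in mind - just a minimal sketch, assuming gwl_vpd and output_file_location are as in your snippet and that the h5netcdf package is available in the environment:

# write via the h5py-based h5netcdf backend instead of the netCDF-C based netcdf4 one
gwl_vpd.to_netcdf(output_file_location, engine='h5netcdf')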

Hi Charles,
Thanks for your message. Yes, same ARE/Gadi issue. I have my full notebook on GitHub and can share it with you. When ARE is back online, I will push the most recent code and send you a link.

Thanks for your help!

Quick clarification point:
Sometimes to_netcdf fails because it triggers a load of the data, not because writing that data to file is the problem. Do you get the same error if you try gwl_vpd_print.load(), or does it only occur with gwl_vpd_print.to_netcdf(output_file_location)?
(Having said that, the fact that you’re getting a HDF5 error specifically reduces the chance of the load being the problem)
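
In other words, something like this (same variable names as in your snippet) would help narrow down where the failure happens:

# step 1: force the dask computation into memory without writing anything to disk
gwl_vpd_print = gwl_vpd_print.load()

# step 2: write the already-computed data; if step 1 succeeds but this still fails,
# the problem really is in the netCDF/HDF5 write itself
gwl_vpd_print.to_netcdf(output_file_location, engine='netcdf4')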

Thanks for the suggestion. I haven’t tried gwl_vpd_print.load() but could give it a go once ARE is back up.

The full code is here: nb-bom/VPD: Create vapour pressure deficit projections, in the VPD Functions.ipynb notebook.

I’ve requested to join kj66, when I’m added I’ll try to reproduce the issue.

Hi Naomi, please could you create a minimal working example with just the lines of code needed to reproduce the failure? I’m trying to run the VPD Functions.ipynb notebook in the provided GitHub repo, which looks to be the right one, but in e.g. this cell:

#read in RCM files
var1 = 'tasmax'

# ddir = f"/g/data/ia39/australian-climate-service/release/CORDEX/output-Adjust/{CMIP}/bias-adjusted-input/AUST-05i/{AGENCY}/{GCM}/{pathway}/{ensemble}/{RCM}/v1-r1/day/{var1}/v20241216"
# infiles1=glob.glob(ddir+f'/{var1}_AUST-05i_{GCM}_{pathway}_{ensemble}_{AGENCY}_{RCM}_v1-r1_day_*.nc')
# tasmax_master_ds = xr.open_mfdataset(infiles1)

ddir = f"/g/data/kj66/CORDEX/output/{CMIP}/DD/AUST-05i/{AGENCY}/{GCM}"
infiles1a=glob.glob(ddir+f'/historical/{ensemble}/{RCM}/v1-r1/day/{var1}/v20241216//{var1}_AUST-05i_{GCM}_historical_{ensemble}_{AGENCY}_{RCM}_v1-r1_day_*.nc')
infiles1b=glob.glob(ddir+f'/{pathway}/{ensemble}/{RCM}/v1-r1/day/{var1}/v20241216/{var1}_AUST-05i_{GCM}_{pathway}_{ensemble}_{AGENCY}_{RCM}_v1-r1_day_*.nc')
tasmax_master_ds = xr.open_mfdataset(infiles1a + infiles1b)

I’m not finding any files in the infiles1a or infiles1b globs. If I comment out that block and uncomment the above block, I get some files that I can open - but I assume these will be the wrong files to reproduce the error.

I’m still waiting on mn51 membership in order to run the functions from the gwl package.

EDIT: Sorry, ignore that, didn’t have the correct projects added.

Assuming I’m trying to reproduce the failure correctly, I’m able to run the following code fine now, using conda/analysis3-25.05 on a large ARE instance:

import dask
from dask.distributed import Client, wait
from dask import delayed

client = Client(n_workers=7, threads_per_worker=1) 

#import all the stuff
from netCDF4 import Dataset
import xarray as xr
import numpy as np
import pandas as pd
from datetime import timedelta
import matplotlib.pyplot as plt
import glob
import sys
sys.path.append("/g/data/mn51/users/nb6195/project/gwls/")
import gwl

#function to compute VPD from tasmax and rh datasets
#input: are datasets of rh and tasmax
#output: datasets of vpd and monthly_mean_vpd

def vpd_calc(ds_rh, ds_tasmax):
#    vpd = (1 - ds_rh/100) * 0.61094 * np.exp((17.652 * ds_tasmax)/(243.04 + ds_tasmax)) #originally used function
    vpd = (1 - ds_rh/100) * 6.1094 * np.exp((17.625 * ds_tasmax)/(243.04 + ds_tasmax)) #the formula that Blair uses, from http://www.bom.gov.au/research/publications/cawcrreports/CTR_024.pdf
    monthly_mean_vpd = vpd.groupby('time.month').mean('time', keep_attrs=True)
    
    vpd.attrs = {
        'long_name': 'Daily maximum vapour pressure deficit computed from tasmax and hursmin',
        'standard_name': 'vpd',
        'units': 'hPa',
#        'regrid_method': 'bilinear'
    }
    ds_vpd = xr.Dataset({'vpd' : vpd})
    ds_rh.close()
    ds_tasmax.close()
    
    monthly_mean_vpd.attrs = {
        'long_name': 'Monthly mean vapour pressure deficit computed from tasmax and hursmin',
        'standard_name': 'monthly_mean_vpd',
        'units': 'hPa',
    }
    ds_monthly_mean_vpd = xr.Dataset({'monthly_mean_vpd' : monthly_mean_vpd})
    return ds_vpd, ds_monthly_mean_vpd

#Set parameters
CMIP='CMIP6'
AGENCY = 'BOM' 
RCM = 'BARPA-R'
GCM = 'MPI-ESM1-2-HR' 
ensemble = 'r1i1p1f1' #BOM done, no CSIRO
pathway = 'ssp370'

output_dir = '/scratch/tm70/ct1163/naomi/VPD/vpd/'
output_dir_mm = '/scratch/tm70/ct1163/naomi/VPD/vpd/monthly_mean/'

#read in RCM files
var1 = 'tasmax'

ddir = f"/g/data/kj66/CORDEX/output/{CMIP}/DD/AUST-05i/{AGENCY}/{GCM}"
infiles1a=glob.glob(ddir+f'/historical/{ensemble}/{RCM}/v1-r1/day/{var1}/v20241216//{var1}_AUST-05i_{GCM}_historical_{ensemble}_{AGENCY}_{RCM}_v1-r1_day_*.nc')
infiles1b=glob.glob(ddir+f'/{pathway}/{ensemble}/{RCM}/v1-r1/day/{var1}/v20241216/{var1}_AUST-05i_{GCM}_{pathway}_{ensemble}_{AGENCY}_{RCM}_v1-r1_day_*.nc')
tasmax_master_ds = xr.open_mfdataset(infiles1a + infiles1b)

var2 = 'hursmin'

ddir2 = f"/g/data/kj66/CORDEX/output/{CMIP}/DD/AUST-05i/{AGENCY}/{GCM}"
infiles2a=glob.glob(ddir+f'/historical/{ensemble}/{RCM}/v1-r1/day/{var2}/v20241216/{var2}_AUST-05i_{GCM}_historical_{ensemble}_{AGENCY}_{RCM}_v1-r1_day_*.nc')
infiles2b=glob.glob(ddir+f'/{pathway}/{ensemble}/{RCM}/v1-r1/day/{var2}/v20241216/{var2}_AUST-05i_{GCM}_{pathway}_{ensemble}_{AGENCY}_{RCM}_v1-r1_day_*.nc')
hursmin_master_ds = xr.open_mfdataset(infiles2a + infiles2b)

#Extract time period corresponding to the chosen GWL for tasmax and rh
chosen_gwl = '1.5'

gwl_tasmax = gwl.get_GWL_timeslice(tasmax_master_ds,CMIP,GCM,ensemble,pathway,GWL=chosen_gwl)[var1]
gwl_rh = gwl.get_GWL_timeslice(hursmin_master_ds,CMIP,GCM,ensemble,pathway,GWL=chosen_gwl)[var2]

gwl_vpd, monthly_mean_vpd = vpd_calc(gwl_rh, gwl_tasmax)


file_name_mean = 'gwl' + chosen_gwl + '_monthly_mean_vpd_' + GCM + '_' + RCM + '_' + pathway + '_' + ensemble + '.nc' 
output_file_location = output_dir_mm + file_name_mean
monthly_mean_vpd.to_netcdf(output_file_location, engine='netcdf4')
print(output_file_location)
!ncdump -h /scratch/tm70/ct1163/naomi/VPD/vpd/monthly_mean/gwl1.5_monthly_mean_vpd_MPI-ESM1-2-HR_BARPA-R_ssp370_r1i1p1f1.nc

netcdf gwl1.5_monthly_mean_vpd_MPI-ESM1-2-HR_BARPA-R_ssp370_r1i1p1f1 {
dimensions:
	lat = 691 ;
	lon = 886 ;
	month = 12 ;
variables:
	double lat(lat) ;
		lat:_FillValue = NaN ;
		lat:units = "degrees_north" ;
		lat:axis = "Y" ;
		lat:bounds = "lat_bnds" ;
		lat:long_name = "latitude" ;
		lat:standard_name = "latitude" ;
	double lon(lon) ;
		lon:long_name = "longitude" ;
		lon:_FillValue = NaN ;
		lon:axis = "X" ;
		lon:standard_name = "longitude" ;
		lon:bounds = "lon_bnds" ;
		lon:units = "degrees_east" ;
	int64 month(month) ;
	float monthly_mean_vpd(month, lat, lon) ;
		monthly_mean_vpd:_FillValue = NaNf ;
		monthly_mean_vpd:long_name = "Monthly mean vapour pressure deficit computed from tasmax and hursmin" ;
		monthly_mean_vpd:standard_name = "monthly_mean_vpd" ;
		monthly_mean_vpd:units = "hPa" ;
}

I’ll leave it running on a loop for a while & see if I can reproduce the crash.

In the meantime, did you happen to be using a different environment version or a smaller ARE instance?
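
(By “running it on a loop” I just mean repeatedly rewriting the same output - roughly this kind of thing, reusing monthly_mean_vpd and output_file_location from the code above:)

import os

# rewrite the same file many times to try to trigger the intermittent HDF error
for i in range(50):
    if os.path.exists(output_file_location):
        os.remove(output_file_location)
    monthly_mean_vpd.to_netcdf(output_file_location, engine='netcdf4')
    print(f'write {i} succeeded')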

Sorry, just saw your messages. Yes, that points to a different directory, where the times in the files are not aligned.

I’m loading the module conda/analysis3. I’ve been running the “print half” version today with surprisingly few crashes - usually they are more frequent - only one crash from about 15 launches creating the data in a piecewise fashion.

I just tried the code which prints the full file and got the same error quite quickly.

Interesting - I’ve been running the code sample I posted above continuously for over an hour without any crashes.

Could you post a screenshot of your ARE session config? I’m wondering what’s different between our setups which would be causing this.

I’m surprised it’s been running that long! Or did you mean you keep launching and printing more files? I’ll get a screenshot now.

That’s how I launch. All other fields not shown are empty.

Ah - I think I know the problem.

Under Modules, could you also add openmpi - so that the field contains “openmpi conda/analysis3” - and let me know if that fixes it?
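
If it helps to double-check from inside the notebook which modules the session actually picked up, something like this should work (LOADEDMODULES is set by the module system; this is just an illustrative check, not part of your workflow):

import os

# the environment-modules system lists loaded modules in LOADEDMODULES
loaded = os.environ.get('LOADEDMODULES', '')
print(loaded)
print('openmpi loaded:', 'openmpi' in loaded)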

It worked! Thanks so much for your patience and help 🙂
