Xarray changes in handling ERA5 and ERA5-Land data

Hi all. I’ve been working with ERA5 and ERA5-Land data recently, and I’ve come across a discrepancy in how this data is handled between older and newer versions of Xarray. Since Xarray version 2024.3.0, the dtype of ERA5 and ERA5-Land data returned by xarray.open_dataset() has changed from np.float32 to np.float64. This is due to changes in the Xarray function that chooses the floating-point dtype for decoded data. On NCI, ERA5 and ERA5-Land data is stored as int16 values with scale and offset factors. In earlier versions of Xarray, this function would act on the int16 type and return the fields as float32, since float32 can exactly represent all integers up to 24 bits. In the later versions, it returns the fields in the same type as the scale factor, which is float64. This leads to small discrepancies at the limit of float32 precision that compound when the data is used as initial conditions in simulations. Below is the difference in the ERA5 skt variable as returned by Xarray 2023.12.0 and 2024.5.0 for the AUS2200 domain:
era5grib_skt
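As a rough illustration of why the two decoding paths disagree, here is a minimal NumPy-only sketch. The scale_factor and add_offset values are made up for the example (they are not the real ERA5 packing parameters), and the arithmetic is a simplified stand-in for what Xarray does internally rather than a copy of it.

```python
import numpy as np

# Hypothetical packing parameters, loosely resembling a temperature in kelvin.
# These are NOT the real ERA5 values; they are assumptions for illustration.
scale_factor = np.float64(0.0017)
add_offset = np.float64(271.3)

# Stand-in for the int16 values stored on disk.
packed = np.arange(-32000, 32001, dtype=np.int16)

# Decode entirely in float64 (the dtype newer Xarray versions return).
decoded_f64 = packed.astype(np.float64) * scale_factor + add_offset

# Decode entirely in float32 (the dtype older Xarray versions returned).
decoded_f32 = (
    packed.astype(np.float32) * np.float32(scale_factor) + np.float32(add_offset)
)

# Relative difference between the two results, evaluated in float64.
rel_diff = np.abs(decoded_f64 - decoded_f32.astype(np.float64)) / np.abs(decoded_f64)

print(decoded_f32.dtype, decoded_f64.dtype)
print("max relative difference:", rel_diff.max())
```

The differences are nonzero but sit near the limit of float32 precision, which is the same kind of noise described in this post.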
At this stage, the difference is unstructured noise with a relative magnitude of around 10^-6 (~0.0002 K), which is what you’d expect. Every field converted from ERA5 would exhibit something like this. When passed through the reconfiguration program to create the initial conditions for the model, some structure begins to emerge:
um_recon_skt
I then looked at what effect this had on the model by running 12 hours of AUS2200 with both sets of initial conditions. The results are interesting. The animation below shows the difference in air temperature at 1.5 m as the model progresses.
tmp_output
This plot shows an absolute difference. By the end of the 12-hour run, a significant difference is visible across much of the domain. I also took a look at the rainfall amounts, and the change does appear to have shifted the rainfall by a small amount.
rain_output

The affected versions of Xarray are present in conda/analysis3-24.01 and later, so anyone who’s using ERA5 or ERA5-Land to initialise models (including via era5grib) should consider sticking to earlier versions of the analysis3 environments if they’re looking to reproduce or continue earlier model runs. Our AUS2200 runs have been using conda/analysis3-23.01, so they’re consistent in that regard. I’m currently working on an upgrade for era5grib, and I’ll attempt to reverse engineer the type conversion behaviour of the older Xarray versions as a compatibility mode, so that this kind of discrepancy doesn’t show up when the new version is installed.


Good find, @dale.roberts. I think you’ve found the source of my confusion over float32/float64 when loading short variables from BRAN2020 netCDF files. See “confirm (??) and fix zarr collection precision (float32 vs float64)”, Issue #13 in Thomas-Moore-Creative/Climatology-generator-demo on GitHub.


Just to follow up on this, I’ve put together a notebook, xarray_float32_read.ipynb (on GitHub), which essentially has Xarray open the data in its native dtype (int16) and does the conversion to float32 manually. I’ve confirmed that the dataset output by this function is bitwise identical to the output of xr.open_dataset() from Xarray < 2024.3.0.
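For anyone who wants the general idea without opening the notebook, here is a rough sketch of a manual int16 to float32 conversion. It is not the notebook's code: the helper name decode_packed_as_float32 is hypothetical, the sample values are made up, and the order of operations and rounding are assumptions, so I wouldn't expect it to be bitwise identical to older Xarray without checking.

```python
import numpy as np


def decode_packed_as_float32(raw, scale_factor, add_offset, fill_value=None):
    """Hypothetical helper: unpack integer data to float32.

    Assumes the usual CF-style packing, value = raw * scale_factor + add_offset,
    with an optional fill value marking missing data.
    """
    raw = np.asarray(raw)
    data = raw.astype(np.float32)
    if fill_value is not None:
        # Mask missing values before scaling, so they stay NaN.
        data[raw == fill_value] = np.nan
    # Cast the factors to float32 explicitly so every step stays in float32.
    data *= np.float32(scale_factor)
    data += np.float32(add_offset)
    return data


# Made-up int16 sample and packing parameters (not real ERA5 values).
raw = np.array([-32767, -100, 0, 100, 32766], dtype=np.int16)
skt = decode_packed_as_float32(
    raw, scale_factor=0.0017, add_offset=271.3, fill_value=-32767
)
print(skt.dtype, skt)
```

With Xarray, the raw integers could be obtained by passing mask_and_scale=False to xr.open_dataset(), in which case scale_factor, add_offset and _FillValue (where the file defines them) should be left in each variable's attrs.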
