Thanks for these efforts @jemmajeffree et al.
FYI for all - if you want a good, basic overview of Xarray’s “groupby” from Deepak spend 37 minutes on this > https://youtu.be/92-QU37W9WI?si=a1XgLxqsjygjPyfu
I think this needs to be posted on the xarray repo?
I admit I don’t know what flox is… But the example above with groupby giving NaNs seems very counter-intuitive to me (as did the initial post by @hrsdawson)!!
edit: I actually missed reading the “I’m planning on raising the bug with the xarray…” bit by @jemmajeffree; nice!
Well, if groupby somehow results in the computations following it being done at float32 instead of float64, that might explain things! For datasets where the variation is close to float32 precision, adding/subtracting/etc. gives nonsense (including variances that are negative) and thus NaN when the sqrt is taken.
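A toy illustration of that failure mode (made-up salinity-like numbers, not flox's actual code path): the one-pass variance formula E[x²] − (E[x])² subtracts two numbers of order mean², so in float32 nearly all significant digits cancel when the spread is tiny compared to the mean.

```python
import numpy as np

# Salinity-like data: tiny spread (~1e-4) around a large mean (~35), in float32.
x = np.float32(35.0) + np.tile(np.float32([1e-4, -1e-4]), 500)

# One-pass formula E[x^2] - E[x]^2: both terms are ~1225, so subtracting them
# in float32 cancels nearly all significant digits. The result is garbage at
# the ~1e-4 level (it can even come out negative, giving NaN under sqrt),
# while the true variance is ~1e-8.
naive_var = np.mean(x * x) - np.mean(x) ** 2

# Two-pass formula: subtract the mean first, so the intermediate numbers
# stay small and float32 has plenty of precision left.
two_pass_var = np.mean((x - x.mean()) ** 2)

print(naive_var, two_pass_var)  # true variance is ~1e-8
```

With less symmetric data the naive result readily goes negative, which is exactly the negative-variance-then-NaN behaviour described above.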
I’m just lurking but I am curious (and a bit anxious). Was an issue posted on GitHub about this in the end? Could it be related to issues such as bottleneck : Wrong mean for float32 array · Issue #1346 · pydata/xarray · GitHub? If using Dask toggles a precision switch that silently returns wrong outputs this is a pretty big deal, right?
Hi Benoit and others.
I haven’t raised the issue yet, but I’ve identified the lines causing the problem and am partway through implementing a fix. It’s not toggling a precision switch; it’s just a different numerical implementation of standard deviation that gets noisy when the standard deviation is tiny compared to the mean (e.g. deep-ocean salinity).
Because I had to finish something for a deadline yesterday, I only got a chance to dive into the code this morning. The discrepancy comes down to flox using these lines with numpy/loaded data and these lines with dask arrays and lazy data; the numpy version starts by subtracting an offset from the array so that there isn’t such a huge difference between the magnitudes of the mean and the standard deviation. I’m in the process of implementing the same step in the dask implementation of flox. It’s working for the single-dimension case, but I still need to generalise it to work with any number of dimensions. Once I’ve got this done (hopefully Monday) I’ll post the issue and fix on GitHub.
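A sketch of the offset idea (my own simplified version, not flox's actual implementation): variance is unchanged by subtracting any constant c, so a one-pass formula applied to x − c, with c chosen near the data (e.g. the first element of the group), keeps both terms small and avoids the float32 cancellation while remaining a single pass, which is what a chunked dask computation needs.

```python
import numpy as np

def one_pass_var(x, c):
    """One-pass variance using the shift identity Var(x) = E[(x-c)^2] - (E[x-c])^2.

    Mathematically the answer is the same for any constant c, but numerically
    a shift near the data keeps the intermediate terms small, so float32
    cancellation is avoided.
    """
    d = x - c
    return np.mean(d * d) - np.mean(d) ** 2

# Salinity-like float32 data: tiny spread around a large mean.
x = np.float32(35.0) + np.tile(np.float32([1e-4, -1e-4]), 500)

bad = one_pass_var(x, np.float32(0.0))  # no shift: terms are ~1225, digits cancel
good = one_pass_var(x, x[0])            # shift by a value from the group itself

print(bad, good)  # true variance is ~1e-8
```

The shifted version recovers the correct ~1e-8 variance in float32, which is why porting that offset step into the dask path should fix the discrepancy.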
I’ve now raised the issue on the flox repo:
I didn’t quite manage to generalise to multiple dimensions; I’m hoping someone will give me a hand with that.
For the moment: subtract the mean before taking standard deviations with groupby, if the mean is big compared to the expected standard deviation.
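For anyone wanting to apply that workaround, here's a minimal sketch with made-up salinity-like data (variable names and values are illustrative only). Since the standard deviation is shift-invariant, subtracting the (large) mean before the grouped std sidesteps the precision loss:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical float32 salinity series: large mean (~35), tiny variability (~1e-4).
time = pd.date_range("2000-01-01", periods=365, freq="D")
salinity = xr.DataArray(
    np.float32(35.0) + np.float32(1e-4) * np.sin(np.arange(365, dtype=np.float32)),
    coords={"time": time},
    dims="time",
    name="salinity",
)

# Potentially problematic when mean >> std: grouped std straight off the data.
std_direct = salinity.groupby("time.month").std()

# Workaround: std is unchanged by subtracting a constant, so remove the large
# mean first and take the grouped std of the anomalies instead.
anomaly = salinity - salinity.mean()
std_workaround = anomaly.groupby("time.month").std()
```

Subtracting the per-group means (`gb - gb.mean()` with `gb = salinity.groupby("time.month")`) works too; any shift that brings the values near zero is enough to keep float32 happy.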