Thanks for these efforts @jemmajeffree et al.
FYI for all - if you want a good, basic overview of Xarray’s “groupby” from Deepak spend 37 minutes on this > https://youtu.be/92-QU37W9WI?si=a1XgLxqsjygjPyfu
I think this needs to be posted on the xarray repo?
I admit I don’t know what flox is… But the example above with groupby giving NaNs seems very counter-intuitive to me (as did the initial post by @hrsdawson)!!
edit: I actually missed reading the “I’m planning on raising the bug with the xarray…” bit by @jemmajeffree; nice!
Well, if groupby somehow results in the computations following it being done at float32 instead of float64, that might explain things! For datasets where the variation is close to float32 precision, adding/subtracting/etc. gives nonsense (including variances that are negative) and thus NaN when the sqrt is taken.
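A toy illustration of that failure mode (made-up salinity-like numbers, not flox's actual code path): the one-pass variance formula E[x²] − (E[x])² subtracts two numbers of order mean², so in float32 nearly all significant digits cancel when the spread is tiny compared to the mean.

```python
import numpy as np

# Salinity-like data: tiny spread (~1e-4) around a large mean (~35), in float32.
x = np.float32(35.0) + np.tile(np.float32([1e-4, -1e-4]), 500)

# One-pass formula E[x^2] - E[x]^2: both terms are ~1225, so subtracting them
# in float32 cancels nearly all significant digits. The result is garbage at
# the ~1e-4 level (it can even come out negative, giving NaN under sqrt),
# while the true variance is ~1e-8.
naive_var = np.mean(x * x) - np.mean(x) ** 2

# Two-pass formula: subtract the mean first, so the intermediate numbers
# stay small and float32 has plenty of precision left.
two_pass_var = np.mean((x - x.mean()) ** 2)

print(naive_var, two_pass_var)  # true variance is ~1e-8
```

With less symmetric data the naive result readily goes negative, which is exactly the negative-variance-then-NaN behaviour described above.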
I’m just lurking but I am curious (and a bit anxious). Was an issue posted on GitHub about this in the end? Could it be related to issues such as bottleneck : Wrong mean for float32 array · Issue #1346 · pydata/xarray · GitHub? If using Dask toggles a precision switch that silently returns wrong outputs this is a pretty big deal, right?
Hi Benoit and others.
I haven’t raised the issue yet, but I’ve identified the lines causing the problem and am partway through implementing a fix. It’s not toggling a precision switch; it’s just a different numerical implementation of standard deviation that gets noisy when the standard deviation is tiny compared to the mean (e.g. deep-ocean salinity).
Because I had to finish something for a deadline yesterday, I only got a chance to dive into the code this morning. The discrepancy comes down to flox using these lines with numpy/loaded data and these lines with dask arrays and lazy data; the numpy version starts by subtracting an offset from the array so that there isn’t such a huge difference between the magnitudes of the mean and the standard deviation. I’m in the process of implementing the same step in the dask implementation of flox. It’s working for the single-dimension case, but I still need to generalise it to work with any number of dimensions. Once I’ve got this done (hopefully Monday) I’ll post the issue and fix on GitHub.
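A sketch of the offset idea (my own simplified version, not flox's actual implementation): variance is unchanged by subtracting any constant c, so a one-pass formula applied to x − c, with c chosen near the data (e.g. the first element of the group), keeps both terms small and avoids the float32 cancellation while remaining a single pass, which is what a chunked dask computation needs.

```python
import numpy as np

def one_pass_var(x, c):
    """One-pass variance using the shift identity Var(x) = E[(x-c)^2] - (E[x-c])^2.

    Mathematically the answer is the same for any constant c, but numerically
    a shift near the data keeps the intermediate terms small, so float32
    cancellation is avoided.
    """
    d = x - c
    return np.mean(d * d) - np.mean(d) ** 2

# Salinity-like float32 data: tiny spread around a large mean.
x = np.float32(35.0) + np.tile(np.float32([1e-4, -1e-4]), 500)

bad = one_pass_var(x, np.float32(0.0))  # no shift: terms are ~1225, digits cancel
good = one_pass_var(x, x[0])            # shift by a value from the group itself

print(bad, good)  # true variance is ~1e-8
```

The shifted version recovers the correct ~1e-8 variance in float32, which is why porting that offset step into the dask path should fix the discrepancy.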
I’ve now raised the issue on the flox repo:
I didn’t quite manage to generalise to multiple dimensions; I’m hoping someone will give me a hand with that.
For the moment: subtract the mean before taking standard deviations with groupby, if the mean is big compared to the expected standard deviation.
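For anyone wanting to apply that workaround, here's a minimal sketch with made-up salinity-like data (variable names and values are illustrative only). Since the standard deviation is shift-invariant, subtracting the (large) mean before the grouped std sidesteps the precision loss:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical float32 salinity series: large mean (~35), tiny variability (~1e-4).
time = pd.date_range("2000-01-01", periods=365, freq="D")
salinity = xr.DataArray(
    np.float32(35.0) + np.float32(1e-4) * np.sin(np.arange(365, dtype=np.float32)),
    coords={"time": time},
    dims="time",
    name="salinity",
)

# Potentially problematic when mean >> std: grouped std straight off the data.
std_direct = salinity.groupby("time.month").std()

# Workaround: std is unchanged by subtracting a constant, so remove the large
# mean first and take the grouped std of the anomalies instead.
anomaly = salinity - salinity.mean()
std_workaround = anomaly.groupby("time.month").std()
```

Subtracting the per-group means (`gb - gb.mean()` with `gb = salinity.groupby("time.month")`) works too; any shift that brings the values near zero is enough to keep float32 happy.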