I’ve been trying to shave some time off a couple of computations, and have noticed that the behaviour of some xarray functions is inconsistent across different computing platforms, and inconsistent with what I’d expect. I’m hoping someone here can clarify what’s going on.
I’ve been using the following code as my test case; it’s similar to my actual code, where I need to subset and then calculate something (on repeat, so it’s worth optimising properly):
```python
import numpy as np
import xarray as xr
import time

# Fill up memory with an unreasonably large dataset (10**8 float64 values, ~0.8 GB)
print('Data Generation')
%time data = xr.DataArray(np.random.normal(0, 1, (10**4, 10**4)))

# Mask selecting every 10th row (negligible time to build)
mask = np.zeros(10**4)
mask[::10] = 1
mask_xr = xr.DataArray(mask)

# Test a variety of functions
print('\nxr.DataArray.where')
%time small_data = data.where(mask_xr)
print('\nxr.DataArray.where (drop=True)')
%time small_data = data.where(mask_xr, drop=True)
print('\nxr.DataArray.shift')
%time small_data = data.shift(dim_0=10)
print('\nxr.DataArray.sel')
%time small_data = data.sel(dim_0=mask_xr.astype(bool))
print('\nnp.log (full dataset)')
%time np.log(data)
print('\nnp.log (small dataset)')
%time np.log(small_data)
```
Results are below - all times in seconds:
| Function | gadi – Broadwell (2 cores, 9 GB RAM) | My Mac – M2 (10 cores, 32 GB RAM) | casper – Skylake, I think (2 cores, 9.5 GB RAM) |
|---|---|---|---|
| xr.DataArray.where | 0.3 | 0.15 | 1-6 |
| xr.DataArray.where (drop=True) | 0.3 | 0.15 | 1-6 |
| xr.DataArray.shift | 0.6 | 0.25 | 10-12 |
| xr.DataArray.sel | 0.03 | 0.01 | 0.04-0.06 |
| np.log (full data) | 2 | 0.6 | 4.7 |
| np.log (small data) | 0.2 | 0.06 | 0.6 |
I’m finding that xr.DataArray.where and xr.DataArray.shift are much slower than I expected, compared to the other two. I would have assumed that these returned views of the DataArray rather than copies, but the timings suggest otherwise. I can’t find anything in the docs’ function descriptions to suggest that where or shift should behave differently to sel, memory-wise. If it were a memory issue, I’d also expect where to be faster with drop=True than with drop=False, but it isn’t.
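
As a sanity check on the view-vs-copy assumption, this is how I’d probe it directly (just a sketch using np.shares_memory and nbytes, reusing the data and mask_xr defined in the test code above, rather than relying on timings):

```python
import numpy as np

# Check (my assumption of a reasonable test) whether each operation's result
# shares memory with the original array, and how large each result actually is.
for name, result in [
    ('where', data.where(mask_xr)),
    ('where (drop=True)', data.where(mask_xr, drop=True)),
    ('shift', data.shift(dim_0=10)),
    ('sel', data.sel(dim_0=mask_xr.astype(bool))),
]:
    shares = np.shares_memory(np.asarray(data), np.asarray(result))
    print(f'{name}: shares memory with data = {shares}, '
          f'result size = {result.nbytes / 1e9:.2f} GB')
```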
Additionally, I’m perplexed by just how much the relative speeds vary between machines. On my Mac and on gadi (Broadwell), np.log is about four times slower than the various subsetting operations. On NCAR’s casper (a Skylake node, I think), np.log is instead faster than them, by up to an order of magnitude. As a bonus, the gadi timings were all 10 times slower in the second ARE session I started, and I’ve got no idea why.
Does anyone know what these functions and computers are doing under the hood that causes the difference? I assume the variation between machines has something to do with how RAM is handled, rather than np.log being a bad benchmark; I was finding the same differences with my original problem, which involves matrix multiplication.
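
In case it helps narrow that down, one rough comparison I’d try (a sketch of my reasoning, not something I’ve run on gadi or casper) is timing a plain NumPy copy against np.log on the same array: the copy is close to pure memory traffic, so how the two compare on each machine should hint at whether RAM bandwidth or compute is the limit.

```python
import time
import numpy as np

# Roughly the same ~0.8 GB array as in the test case above
arr = np.random.normal(0, 1, (10**4, 10**4))

t0 = time.perf_counter()
_ = arr.copy()      # close to pure memory traffic: read + write ~0.8 GB each
t1 = time.perf_counter()
_ = np.log(arr)     # same memory traffic, plus the transcendental evaluation
t2 = time.perf_counter()

print(f'copy:   {t1 - t0:.2f} s')
print(f'np.log: {t2 - t1:.2f} s')
```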
Furthermore, if anyone has access to another supercomputer or has a different chip in their local computer, I’d be interested to see their output from the same test code.
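
In case %time is a barrier (it needs IPython or Jupyter), here’s the same test as a plain script using time.perf_counter; it should be equivalent to the cell above, though I haven’t checked that the two report identical timings.

```python
import time
import numpy as np
import xarray as xr

def timed(label, func):
    """Run func once and print the wall time, as a crude stand-in for %time."""
    t0 = time.perf_counter()
    result = func()
    print(f'{label}: {time.perf_counter() - t0:.3f} s')
    return result

data = timed('data generation',
             lambda: xr.DataArray(np.random.normal(0, 1, (10**4, 10**4))))

mask = np.zeros(10**4)
mask[::10] = 1
mask_xr = xr.DataArray(mask)

timed('xr.DataArray.where', lambda: data.where(mask_xr))
timed('xr.DataArray.where (drop=True)', lambda: data.where(mask_xr, drop=True))
timed('xr.DataArray.shift', lambda: data.shift(dim_0=10))
small_data = timed('xr.DataArray.sel', lambda: data.sel(dim_0=mask_xr.astype(bool)))
timed('np.log (full dataset)', lambda: np.log(data))
timed('np.log (small dataset)', lambda: np.log(small_data))
```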