Inconsistent (and confusing) relative speeds of various xarray operations

I’ve been trying to shave some time off a couple of computations, and have noticed that the behaviour of some xarray functions is inconsistent across different computing platforms, and inconsistent with what I’d expect. I’m hoping someone here can clarify what’s going on.

I’ve been using the following code as my test case. It is similar to my actual code, where I need to subset and then calculate something (on repeat, so it’s worth optimising properly):

import numpy as np
import xarray as xr
import time

# Fill up memory with a large dataset (10**4 x 10**4 float64 values, about 0.8 GB)
print('Data Generation')
%time data = xr.DataArray(np.random.normal(0,1,(10**4,10**4)))

#negligible time
mask = np.zeros(10**4)
mask[::10] = 1
mask_xr = xr.DataArray(mask)

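# Test a variety of functions
print('\nxr.DataArray.where')
%time small_data = data.where(mask_xr)

print('\nxr.DataArray.where (drop = True)')
%time small_data = data.where(mask_xr,drop=True)

print('\nxr.DataArray.shift')
%time small_data = data.shift(dim_0 = 10)

# small_data from this step (every 10th row) is what the last np.log test uses
print('\nxr.DataArray.sel')
%time small_data = data.sel(dim_0 = mask_xr.astype(bool))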

print('\nnp.log (full dataset)')
%time np.log(data)

print('\nnp.log (small dataset)')
%time np.log(small_data)

Results are below - all times in seconds:

| Operation | gadi – broadwell (2 cores, 9GB RAM) | My mac – M2 (10 cores, 32GB RAM) | casper – I think skylake (2 cores, 9.5GB RAM) |
| --- | --- | --- | --- |
| xr.DataArray.where | 0.3 | 0.15 | 1-6 |
| xr.DataArray.where (drop = True) | 0.3 | 0.15 | 1-6 |
| xr.DataArray.shift | 0.6 | 0.25 | 10-12 |
| xr.sel | 0.03 | 0.01 | 0.04-0.06 |
| np.log (full data) | 2 | 0.6 | 4.7 |
| np.log (small data) | 0.2 | 0.06 | 0.6 |

I’m finding that xr.DataArray.where and xr.DataArray.shift are much slower than I expected, compared to the other two. I would have assumed that these returned views of the DataArray, rather than copies, but the speed suggests otherwise. I can’t find anything in the docs’ function descriptions to suggest that where or shift should behave differently to sel, memory-wise. If it were a memory issue, I’d also expect xr.DataArray.where to be faster with drop=True than with drop=False, but it isn’t.
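One way to probe the view-versus-copy question is to ask NumPy directly whether a result shares memory with its input. The sketch below uses plain NumPy on a small stand-in array rather than the xarray calls from the post. It assumes that xarray operations on NumPy-backed data inherit these semantics, which is worth confirming on your own version with np.shares_memory(data.values, result.values).

```python
import numpy as np

# Small stand-in for the 10**4 x 10**4 array in the post
a = np.random.normal(0, 1, (1000, 100))
mask = np.zeros(1000, dtype=bool)
mask[::10] = True

sliced = a[::10]                             # basic slicing: a view, nothing copied
masked = a[mask]                             # boolean indexing: copies the kept rows only
filled = np.where(mask[:, None], a, np.nan)  # full-size new array, NaN where mask is False

results = {"slice": sliced, "boolean mask": masked, "np.where": filled}
for name, out in results.items():
    shares = np.shares_memory(a, out)
    print(f"{name:>12}: shares memory = {shares}, output size = {out.nbytes} bytes")
```

If the same pattern holds for the xarray calls, a masking operation that keeps the original shape has to write a whole new array, while a boolean selection only writes the tenth of the rows it keeps.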

Additionally, I’m perplexed by just how much the relative speeds vary on different machines. On my mac and on gadi-broadwell, np.log is about four times slower than the various subsetting operations. On NCAR’s casper (I think a skylake node), np.log is faster, by up to an order of magnitude. As a bonus, the gadi stats were all 10 times slower on the second ARE session I started, and I’ve got no idea why.
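Because a single %time call gives one sample, it can be hard to tell a genuinely slow operation from a noisy machine. A minimal sketch of a repeat-timing helper follows. The name time_repeated and the stand-in workload are hypothetical, and it uses only the standard library so it also runs outside IPython.

```python
import statistics
import time

def time_repeated(func, repeats=5):
    """Call func() `repeats` times and summarise the wall-clock times in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }

# Stand-in workload; swap in something like `lambda: data.where(mask_xr)`
stats = time_repeated(lambda: sum(range(100_000)))
print({name: round(seconds, 5) for name, seconds in stats.items()})
```

A wide gap between min and max across repeats points towards contention or memory placement rather than the operation itself.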

Does anyone know what these functions and machines are doing under the hood to cause the difference? I assume the differences between machines are something to do with how RAM is handled, rather than np.log being a bad benchmark, as I was finding the same differences with my original problem involving matrix multiplication.

Furthermore, if anyone has access to another supercomputer or has a different chip in their local computer, I’d be interested to see their output from the same test code.

Hi @jemmajeffree. Welcome to the world of software benchmarking on production systems. There are so many factors involved in this that it’s hard to know where to start.

I think you have the expected result when it comes to the M2 vs. Gadi Broadwell. Core-for-core the M2 is far and away faster than anything else on that list, so it should perform pretty well on maths-heavy workflows like log. I’m surprised it’s only twice as fast as Gadi when it comes to the more memory-intensive operations.

The large variance in timing on casper leads me to believe you might have either a data locality or a resource contention issue. On Gadi, the scheduler does its best to organise jobs such that all the compute resources are as topologically close together as possible. If casper isn’t guaranteeing NUMA, CPU or even node locality for each of those two processes, then variable performance is to be expected.

How well does casper keep jobs isolated from one another? This is another thing NCI takes very seriously. On Gadi, whatever resources you request are dedicated to your jobs by way of Linux cgroups. This isn’t the default behaviour for all schedulers.

I guess what I’m trying to get at here is that you could have two systems that are identical in hardware, but the various network, storage, OS, scheduler and other design decisions made along the way can drastically change performance, as can things like the current workload on the system. I challenge you to try and get a consistent storage benchmark result on any of the /g/data filesystems! There is a reason NCI does all of their benchmarking before systems go live.

If you are keen to try other architectures, Gadi has 4 available (technically 5, but that’s not what the A100 GPU nodes are for): Broadwell, Skylake, Cascade Lake and Sapphire Rapids, accessible through the various queues. See here: Queue Structure - NCI Help - Opus - NCI Confluence.

ETA: Oh, wait, this is a single-core job. Disregard the part about node locality, but everything else stands. It is possible to allocate memory on the ‘wrong’ DIMMs. Performance drops (relatively) drastically if the OS has allocated memory on the DIMMs belonging to the ‘other’ CPU on a dual-socket node.
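A quick way to see how a session is confined on each system is to compare the CPUs the machine has with the CPUs the process is allowed to run on. This is a rough sketch using only the standard library. The helper name cpu_summary is hypothetical, os.sched_getaffinity exists only on some platforms (Linux, not macOS), and it is an assumption that the scheduler expresses the allocation as a CPU set rather than a time quota, which this check would not show.

```python
import os

def cpu_summary():
    """Report the machine's CPU count and how many this process may run on."""
    total = os.cpu_count()
    if hasattr(os, "sched_getaffinity"):
        # Set of CPU ids the current process (pid 0 = self) is restricted to
        allowed = len(os.sched_getaffinity(0))
    else:
        allowed = None  # not available on this platform
    return {"total": total, "allowed": allowed}

info = cpu_summary()
print(info)
```

If the allowed count matches the cores requested, the job is pinned to its own CPUs. If it matches the whole node, other jobs may be competing for the same cores.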
