Python garbage collection seems to be failing me?

I am trying to write a notebook about handling large data and found an interesting quirk …

I have one dataset with very small chunks. I combine the chunks using xarray's chunk function and then calculate a mean. When I do this multiple times consecutively (or with different chunk combinations), it fills up the RAM on my ARE instance, and that memory is never released automatically. Eventually the kernel crashes, even though the data filling the RAM is not explicitly held in any variables. Adding lots of gc.collect() calls fixes the problem, but that is definitely not what I expected to have to do.
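
Roughly, the pattern looks like this (a minimal sketch only; the file name, variable name and chunk sizes are placeholders, not what the notebook actually uses, and each computation lives in its own notebook cell):

```python
# Minimal sketch of the pattern described above; file name, variable name and
# chunk sizes are placeholders, not the ones used in the notebook.
import gc
import xarray as xr

ds = xr.open_dataset("example_small_chunks.nc", chunks={})  # lazy, dask-backed

# Re-chunk into one (sensible) layout and reduce ...
ds["tas"].chunk({"time": 1000, "lat": 90, "lon": 180}).mean("time").compute()

# ... then again with a different layout in the next cell. Repeating this a few
# times fills the RAM of the ARE instance unless the collector is invoked explicitly:
gc.collect()

ds["tas"].chunk({"time": 500, "lat": 180, "lon": 360}).mean("time").compute()
gc.collect()
```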

The notebook is here: training-day-2024-find-analyse-data/intake/Large_data.ipynb at main · ACCESS-NRI/training-day-2024-find-analyse-data · GitHub

[Screenshots in the original post show the dataset, the new chunks, and the plot with the large garbage-collection count.]

When I do the last two steps with different combinations of (sensible) chunk sizes, it crashes the kernel. See the notebook linked above, which will crash without the gc.collect() calls …

Yep, I think this is a pretty common problem.

Could the Out[] object in IPython/Jupyter be to blame? Although the results aren't specifically kept by name, they still exist in that object (e.g. Out[30] in your screenshot), so the garbage collector can't reclaim them.
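
One way to check would be to see how much the output cache is actually holding, something like this (a rough sketch, run in its own cell; it assumes the cached results expose nbytes, as numpy/xarray objects do):

```python
import sys

# Tally up what IPython's output cache is keeping alive.
for n, obj in Out.items():
    size = getattr(obj, "nbytes", sys.getsizeof(obj))  # fall back for non-array objects
    print(f"Out[{n}]: {type(obj).__name__}, ~{size / 1e6:.1f} MB")
```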

Your gc.collect() calls may be forcing lots of intermediate data to be cleaned up, which gives you enough room to perform the subsequent calculations. I wonder if you could run %reset -f out in between calculations (instead of gc.collect()), or print the results of the chunking step, or plot them directly, so that nothing large ends up in the output cache?
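
For example (just a sketch; `rechunked` stands in for whatever your chunking step returns):

```python
# Option 1: clear IPython's output cache between the heavy calculations,
# instead of (or as well as) gc.collect().
%reset -f out

# Option 2: keep big results out of Out[] in the first place, by printing or
# plotting them rather than leaving them as the bare last expression of a cell.
print(rechunked)                  # prints the repr; nothing is cached
rechunked.mean("time").plot()     # only the matplotlib artist lands in Out[]
```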