I am trying to write a notebook about handling large data and found an interesting quirk …
I have one dataset with very small chunks. I combine the chunks using xarray's chunk function and then calculate a mean. When I do this multiple times consecutively (or in different combinations), it fills up the RAM on my ARE instance, and that memory is not released automatically. Eventually the kernel just crashes, even though the stuff filling RAM is not explicitly held in variables or anything. Adding lots of gc.collect() calls fixes the problem, but that is definitely not something I expected to have to do.
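A minimal sketch of the pattern I mean, using a synthetic dataset (the variable name, dimensions, and chunk sizes here are made up; the real ones are in the notebook):

```python
import gc

import numpy as np
import xarray as xr

# Synthetic stand-in for the real dataset: one variable on a
# (time, lat, lon) grid, opened with very small chunks.
ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.random.rand(120, 30, 40))}
).chunk({"time": 1, "lat": 10, "lon": 10})

# Combine the small chunks into larger ones with xarray's chunk(),
# then reduce to a mean over time.
rechunked = ds.chunk({"time": 60, "lat": 30, "lon": 40})
time_mean = rechunked["temp"].mean("time").compute()

# When these two steps are repeated in a notebook, intermediates seem
# to linger and RAM keeps growing; an explicit collection after each
# pass releases them.
gc.collect()
```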
The notebook is here: training-day-2024-find-analyse-data/intake/Large_data.ipynb at main · ACCESS-NRI/training-day-2024-find-analyse-data · GitHub
The dataset: (screenshot)
New chunks: (screenshot)
Plot, with large garbage-collection count: (screenshot)
When I do the last two steps with different combinations of (sensible) chunk sizes, it crashes the kernel. (screenshot)
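The "different combinations" case looks roughly like this loop (chunk sizes here are illustrative, not the notebook's actual values); without the gc.collect() at the end of each iteration, memory grows across iterations:

```python
import gc

import numpy as np
import xarray as xr

# Synthetic dataset with very small initial chunks (stand-in for the
# real one in the notebook).
ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.random.rand(120, 30, 40))}
).chunk({"time": 1})

shapes = []
for time_chunk in (12, 30, 60):  # several "sensible" chunk sizes
    mean = ds["temp"].chunk({"time": time_chunk}).mean("time").compute()
    shapes.append(mean.shape)
    # Drop the intermediate result and force a collection before the
    # next combination; without this, each pass leaves memory behind
    # and the kernel eventually dies.
    del mean
    gc.collect()
```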
Again, the notebook (linked above) will crash without the gc.collect() calls …