What is intake-virtual-icechunk?
intake-virtual-icechunk is a Python package for building and reading Icechunk-backed catalogues from existing intake-esm datastores.
The goal is pretty simple: take an existing intake-esm datastore, build an Icechunk-backed store from it, and keep the same user-facing catalogue experience. In other words, using one of these catalogues should feel the same as using an intake-esm catalogue: the same style of search, selection, and opening data into xarray, without users needing to care about Icechunk-specific plumbing.
What are we releasing?
ACCESS-NRI has released v0.2.0 of intake-virtual-icechunk.
Together, the v0.1.0 and v0.2.0 releases move the package toward being faster, more reliable, and easier to use for building Icechunk-backed catalogues from existing intake-esm datastores.
What landed in v0.1.0?
The main milestone in v0.1.0 was that the package should now work end-to-end for both:
- local filesystem stores (Gadi)
- S3 / S3-compatible object stores (Pawsey)
That release also improved Ceph-backed test coverage and bumped the minimum zarr version to support rectilinear chunk grids.
What’s new in v0.2.0?
- The headline functional change in v0.2.0 is support for ingesting intake-esm datastores and reserialising them as an Icechunk store without virtualisation. That means the package can now handle a broader range of source datastores, including cases where a fully virtual workflow is not possible due to serialisation limitations.
Why a new package?
Performance
- Current intake-esm catalogues often touch many more files than is strictly necessary to extract a subset of data for analysis. On Gadi, this can be a major and confusing performance limitation. For example, opening grid information files can take up to a minute in intake-esm, as xarray needs to touch all matching grid files in order to open just the first. In intake-virtual-icechunk, this takes less than a second, as all this information is held within the catalog, not computed on the fly.
- No concatenation necessary: because an Icechunk store backs the catalog, all concatenation operations are performed at build time, not read time. For a dataset backed by 500 netCDF files, this reduces the typical time to open (not even load) the dataset from around 3 minutes to about 1.5 seconds.
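To make the read-time cost concrete, here is a minimal sketch of the kind of concatenation that has to happen when a dataset is spread across many files. It uses plain xarray with small in-memory datasets standing in for netCDF files. The variable name `tas`, the helper `make_yearly_piece`, and the five-year span are made up for illustration; nothing here is taken from the package itself.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_yearly_piece(year: int) -> xr.Dataset:
    """Hypothetical stand-in for one netCDF file holding a year of monthly data."""
    time = pd.date_range(f"{year}-01-01", periods=12, freq="MS")
    values = np.full(12, float(year))
    return xr.Dataset({"tas": ("time", values)}, coords={"time": time})


# Read-time approach: every piece is opened, then stitched together along time.
pieces = [make_yearly_piece(year) for year in range(2000, 2005)]
combined = xr.concat(pieces, dim="time")

# Five pieces of twelve monthly steps each end up on one time axis.
print(combined.sizes["time"])
```

With a store built ahead of time, this stitching step has already been done once at build time, so a reader would only ever see the equivalent of `combined`.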
Ergonomics
- Although intake_virtual_icechunk retains the same API as intake_esm, combining all datasets into a single Icechunk store greatly reduces the amount of work necessary to obtain an xarray dataset. Filtering for time ranges can now be done on the dataset objects directly, without having to worry about opening more files than necessary.
- Dataset attributes are included in the datastore by default. If an attribute was written into a dataset, it will appear in the catalog, letting you search it.
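As a rough illustration of filtering a time range on the dataset object itself, the sketch below uses standard xarray label-based selection. The dataset is synthetic and built in memory, since this post does not show how a dataset is opened from an intake-virtual-icechunk catalogue; the variable name `tas` and the dates are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a dataset opened from a catalogue:
# three years of monthly values on a single time axis.
time = pd.date_range("2000-01-01", periods=36, freq="MS")
ds = xr.Dataset(
    {"tas": ("time", np.arange(36, dtype=float))},
    coords={"time": time},
)

# Select one calendar year by label; both ends of the slice are inclusive.
subset = ds.sel(time=slice("2001-01", "2001-12"))

print(subset.sizes["time"])
```

Because the selection is expressed on the time coordinate, there is no need to work out in advance which source files cover the period of interest.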
Reliability
- The same Icechunk store backs both the catalog and the data within it. In intake-esm, if a file is moved, deleted, or renamed, the catalog can ‘go stale’ and break in confusing ways. In intake-virtual-icechunk, if data is moved or deleted, the catalog will tell you what has happened.
- Catalog metadata is computed on the fly, so every time you ask for variables or variable_cell_methods, you get exactly what is in the dataset.
Future Proofing
- Intake Virtual Icechunk uses the latest and most robust data tooling developed by the Pangeo and PyData communities.
- Platform Agnostic: Icechunk supports file systems and all major object store interfaces. This means that catalogues built with this package can be stored on disk on Gadi, or in Acacia on Pawsey, and used in the same way regardless of the storage mechanism.
- Zarr Based: Icechunk implements a transactional storage layer for zarr. By transforming an intake-esm datastore to an intake-virtual-icechunk store, the underlying NetCDF dataset can be readily streamed around the planet, without having to reserialise the data. inode explosions are avoided, and alternative executors such as cubed can be used instead of Dask.
- For those interested in interactive dataset exploration and distribution, a sister package (intake-virtual-icechunk-ts) is also under development. It is designed to read these catalogues in the browser and stream the data they contain for interactive exploration.
How should users think about it?
- This is not a whole new analysis interface.
- Icechunk-backed catalogues built with intake-virtual-icechunk should be used in essentially the same way as an intake-esm datastore.
- The storage backend changes, but the user-facing catalog workflow should stay familiar.
Currently, no Icechunk-backed catalogues are in the ACCESS-NRI Intake Catalog, and we will not make any default transitions until the new technology is fully mature. In the meantime, we will post here as we virtualise catalogues and make them publicly available.
Useful links
• v0.2.0 release notes: Release v0.2.0 · ACCESS-NRI/intake-virtual-icechunk · GitHub
• v0.1.0 release notes: Release v0.1.0 · ACCESS-NRI/intake-virtual-icechunk · GitHub
• Issues: Issues · ACCESS-NRI/intake-virtual-icechunk · GitHub