Icechunk: Earthmover open-sources their ArrayLake backend

Earthmover.io are open sourcing Icechunk, the back-end software for their ArrayLake offering:

This is a seriously impressive bit of software that adds transactions and versioning to Zarr datastores.

I’m hoping Pawsey HPC engineers are having a serious look at this for their warm-tier object storage cluster, Acacia. And given the hope that NCI will increasingly include on-premises object store capacity, I hope NCI HPC engineers are getting across this too. @Aidan, what is your view on the best ways to socialise this across the community?

Find some like-minded folks around here and look into use cases? I know @anton and @MartinDix are very keen on improving versioning of important data, like model inputs.

Proof of concept demos?

Get someone who is familiar with it to give a presentation?

We might get a list of key people together and then ask the Earthmover folks to think about tailoring a virtual talk on how to use Icechunk with an on-prem object store?

:eyes:

I had a chat with Ryan today; we’re setting up a trial account, which I’ll use to explore Pawsey object storage. I’ve struggled with the Rust dependency on the Docker/Singularity images I use on Pawsey. I know it’s not that hard, but it’s just another thing to add to my Python challenges. I think Earthmover would jump at working with anyone with Gadi and object storage experience.

As another thing, has anyone looked at Arkouda? Pangeo Showcase: "Arkouda as an XArray backend for HPC!"

Just watched that and they’re interested in testers on HPC. I’m pretty comfy on Pawsey now, but only have limited experience on Gadi with Python tooling (so if anyone wants to explore and hand-hold with me, that’d be awesome).

This is awesome.

I’m currently in an email chat with Ryan about setting up a virtual showcase for NCI / Pawsey / Australian folks in late January or early February.

That sounds great!

Hey @mdsumner et al

Ryan would love to have a chat with core folks before he gives a wider showcase.

He’d like to speak to data users (and those who care about them) about what the current pain points and problems are. He’d like to ask us some questions to shape his presentation.

Can we target this preliminary chat for late January? Who should be on it? I’m happy to help coordinate and organise.

I’m keen. Anton Steketee, Lenneke Jong, Ben Raymond come to mind.

Pinging @anton and @lmjong in case they’re interested in contributing.

Is anyone using Icechunk and/or VirtualiZarr on Gadi? I’ve used them on Pawsey and I’m trying to get a Docker image working with Singularity.

If there’s an existing or better approach I’d be happy to try it! Thanks

Right now, the conda/analysis3 environment is pinned to zarr 2. I’m currently working to fix that and hoping to have it working ASAP. VirtualiZarr and Icechunk have a hard dependency on zarr v3, and it doesn’t look like Icechunk has any other Python dependencies, so we should be able to add it to that environment.
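A quick way to check whether a given environment already meets the zarr v3 requirement is to inspect the installed version. The sketch below uses only the standard library. The helper names `major_version` and `zarr_supports_icechunk` are made up for illustration, and the "major version 3 or later" threshold is simply taken from the dependency described above.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.18.3'."""
    return int(version.split(".", 1)[0])


def zarr_supports_icechunk(installed: Optional[str] = None) -> bool:
    """Return True if the zarr version is 3 or later.

    If no version string is passed, look up the zarr package installed in
    the current environment; a missing package counts as unsupported.
    """
    if installed is None:
        try:
            installed = metadata.version("zarr")
        except metadata.PackageNotFoundError:
            return False
    return major_version(installed) >= 3
```

Calling `zarr_supports_icechunk()` with no argument inside an activated environment reports on whatever zarr that environment provides.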

This stuff has finally all percolated to the top of my todo list, so hopefully there’ll be some work on it from my end over the next few weeks.

My experience from talking to Joe and a little bit of prodding at the library is that icechunk should work fine on a POSIX filesystem, despite being designed for object storage first. Apparently there is a fairly unlikely but possible race condition, but practically it should be unimportant.

If you want to create a more barebones environment in a different/new singularity container, I have a template kicking about I use for some personal development stuff which might also be helpful & I can share.

Sorry for the rambling response - let me know which (if any) parts are useful & I’ll do my best to help!

Hey @mdsumner

I got icechunk working on Gadi with the local filesystem storage API late last week using pixi.

Happy to share the environment setup & notebook stuff with you - it’s currently in a hidden forum post whilst we iron out the kinks in what will potentially become the new custom python environment guide.

I can also just pass you the environment definition to play with - might save some steps for you.
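For anyone who wants a feel for the local filesystem route before the environment guide is out, here is a minimal sketch. It is not the notebook mentioned above. It assumes a recent Icechunk Python release together with zarr v3 (exact call signatures have shifted between Icechunk versions, so treat them as assumptions), and the function name, array name and commit message are purely illustrative.

```python
def demo_local_repo(path):
    """Create an Icechunk repo on a POSIX path, commit one array, read it back.

    Imports are inside the function so the sketch can be pasted into an
    environment that does not have icechunk/zarr installed yet.
    """
    import icechunk
    import zarr

    # Point Icechunk at a directory instead of an object store.
    storage = icechunk.local_filesystem_storage(path)
    repo = icechunk.Repository.create(storage)

    # Writes happen inside a session and only become visible on commit.
    session = repo.writable_session("main")
    root = zarr.group(store=session.store)
    arr = root.create_array("temperature", shape=(4,), dtype="f8", chunks=(2,))
    arr[:] = [1.0, 2.0, 3.0, 4.0]
    session.commit("add temperature array")

    # A fresh read-only session sees the committed state of the branch.
    ro = repo.readonly_session(branch="main")
    values = zarr.open_group(store=ro.store, mode="r")["temperature"][:].tolist()

    # The branch history, newest first, including the commit made above.
    messages = [snap.message for snap in repo.ancestry(branch="main")]
    return values, messages
```

Pointing `path` at an empty scratch directory is enough to try it; nothing here is specific to Gadi or Pawsey.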

Great, that would be awesome! I missed your first reply; since then I’ve run the virtualization pretty hard on Bluelink on gdata and figured out a lot of the issues I had. Frankly, I think ThreadPoolExecutor in VirtualiZarr is a dead end, but ProcessPoolExecutor and other parallel approaches work fine.
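To make the executor swap concrete, here is a rough standard-library sketch of the pattern: the per-file work is a plain top-level function, and the pool type is a parameter that defaults to `ProcessPoolExecutor`. It does not use VirtualiZarr's own API. `summarise_file` is a hypothetical stand-in for whatever opens one file and builds its references, and `map_files` is an illustrative name.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Iterable, List, Type


def summarise_file(path: str) -> dict:
    """Hypothetical per-file worker; a real one would build references."""
    filename = path.rsplit("/", 1)[-1]
    return {"path": path, "name_length": len(filename)}


def map_files(
    paths: Iterable[str],
    worker: Callable[[str], dict] = summarise_file,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
    max_workers: int = 4,
) -> List[dict]:
    """Run `worker` over `paths` in a pool, returning results in input order."""
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(worker, paths))
```

The worker has to be defined at module level so a process pool can pickle it, and on platforms that start workers with spawn the call to `map_files` belongs under an `if __name__ == "__main__":` guard.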

I have much stronger hooks into the GDAL multidim API and it works really well, even with the gdata references remapped to THREDDS, so I have the start of my “holy grail” to provide access across languages. Next I need to explore Icechunk and actually establish some virtual stores (getting back into this later in the week). :ok_hand:

I put this up to show connecting to a (non-object-store) kerchunk/Parquet store remapped to THREDDS:

I’ve learnt a lot more since then, so there’s a big update coming.
