Christmas present: pycdo - Use CDO from python

To people reading the hive on the 23rd of December: Hi to both of you! To the rest, I guess you can open this present after the break.

In an attempt to teach myself python and also create something useful for me and others, I made the pycdo python package. It’s a wrapper that lets you run CDO operations from python using (hopefully) pythonic syntax.

For those who don’t know, Climate Data Operators (CDO) is a command line tool for working with climate data. It’s fast, efficient and very convenient, especially for relatively simple operations like computing means or regridding. The problem with CDO is that, being a command line utility, it’s hard to program with unless you’re very comfortable with bash.

With pycdo you can construct CDO commands and then execute them using python. So, say you wanted to compute the monthly climatology of Southern Hemisphere geopotential height… you would do

from pycdo import cdo
(
    cdo("geopotential.nc")
    .sellonlatbox(0, 360, -90, 0)
    .ymonmean()
    .execute()
)

And that would execute something like this in the terminal

cdo -ymonmean -sellonlatbox,0,360,-90,0 geopotential.nc tempfile.nc
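For the curious, the core of that translation can be sketched with a small builder class: each method records an operator and returns the object itself, and building the command reverses the list to match CDO’s right-to-left operator order. This is a hypothetical, simplified illustration, not pycdo’s actual implementation:

```python
class CDOChain:
    """Sketch of pycdo-style chaining (hypothetical, simplified)."""

    def __init__(self, infile):
        self.infile = infile
        self.ops = []

    def _add(self, name, *args):
        self.ops.append("-" + ",".join([name] + [str(a) for a in args]))
        return self  # returning self is what makes chaining possible

    def sellonlatbox(self, *args):
        return self._add("sellonlatbox", *args)

    def ymonmean(self):
        return self._add("ymonmean")

    def command(self, output):
        # Operators chained last must appear first on the command line.
        return " ".join(["cdo"] + list(reversed(self.ops))
                        + [self.infile, output])

cmd = (CDOChain("geopotential.nc")
       .sellonlatbox(0, 360, -90, 0)
       .ymonmean()
       .command("out.nc"))
print(cmd)
# cdo -ymonmean -sellonlatbox,0,360,-90,0 geopotential.nc out.nc
```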

By default pycdo saves the result in a temporary file that is deleted once it’s no longer accessible. That makes it easy to work with NetCDF files as if they were normal variables, without needing to think of a name for every single one of them.
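That “deleted once no longer accessible” behaviour can be mimicked with the standard library alone; the following sketch (using `tempfile` and `weakref.finalize`, purely illustrative and not pycdo’s actual code) removes the file once the last reference to the result object is gone:

```python
import os
import tempfile
import weakref

class TempResult:
    """A path to a temporary file that is deleted when this object
    becomes unreachable (illustrative sketch, not pycdo's code)."""

    def __init__(self, suffix=".nc"):
        fd, self.path = tempfile.mkstemp(suffix=suffix)
        os.close(fd)
        # Run os.remove(path) when this object is garbage-collected.
        self._finalizer = weakref.finalize(self, os.remove, self.path)

result = TempResult()
saved_path = result.path
print(os.path.exists(saved_path))  # the file exists while `result` is alive
del result                         # dropping the last reference deletes it
```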

A neat feature is that pycdo supports caching. You can enable the cache:

from pycdo import cdo_cache
cdo_cache.set("data/cache")

And then every operation will only need to be run once, even across python sessions. pycdo will check that the inputs and the command are the same, and will fetch the cached output instead of running the command again. This can save you a lot of time when iterating on an analysis!
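A cache like this only needs a stable key derived from the command and its inputs. Here is a minimal sketch of one way such a key could be built, hashing the command string together with the input files’ contents (hypothetical; pycdo’s actual scheme may differ):

```python
import hashlib

def cache_key(command, input_paths):
    """Hash the command string plus the input files' contents, so the
    same command on the same inputs always maps to the same key."""
    h = hashlib.sha256()
    h.update(command.encode())
    for path in sorted(input_paths):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
    return h.hexdigest()
```

A cached result is then just a file stored under that key: if the key already exists in the cache directory, the stored output can be returned instead of rerunning cdo.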

I’m still working on improving the documentation (and understanding how python documentation even works). But the package works! You can install it from PyPI

pip install pycdo

It’s got no dependencies outside the standard library, so it should be compatible with analysis3. The only system dependency is cdo itself, which you can load on Gadi with

module load cdo

Let me know if you find this useful. If you find bugs, you can open an issue in the repo or just send me a PM here.

Hi @eliocamp,

Thank you for this package!
Just out of curiosity, what is the difference between your pycdo and the conda-forge python-cdo (documentation here)?

Haha, you’re not the first to ask. I might need to add a section to the readme!

My implementation has some neat features not available in that version, the biggest one being the chaining syntax.

For example, this is how chaining works in that package

cdo.seltimestep('1/10', input='-selvar,u10,v10 '+infile, output=outfile)

Instead of the pythonic selvar("u10", "v10"), you need to use the command line syntax -selvar,u10,v10 as a string. So you lose syntax checking, autocomplete, and all the goodies of writing code instead of strings, and you also need to keep two slightly different syntaxes in your head. Plus, the order of operations is reversed.

The equivalent using pycdo would be this

(
    cdo("infile")
    .selvar("u10", "v10")
    .seltimestep("1/10")
    .execute(output=outfile)
)

Which, IMHO, is much more readable. Each step is its own element in the chain and is written in order. Plus, autocomplete saves you from typos.

Other features that I think are not available in that package are caching and global options that apply to all operators.

One thing that package does have is an extremely robust logging system; pycdo doesn’t have one, at least not yet. It also doesn’t have xarray integration: pycdo just returns paths to files, while the cdo bindings package can load the result directly into an xarray object. Personally, I see this as a feature, since it keeps the package much simpler, with fewer dependencies, but YMMV.

So who’s not on the hive ( and Gadi ) over Christmas!?! :christmas_tree: :santa_claus: :nerd_face:

Thank you @eliocamp for sharing your open source presents! :wrapped_gift:

Thank you for your answer!

I haven’t used CDO much myself, but the differences you describe look pretty relevant.
Well done!

Looks very clean @eliocamp. I recall trying chaining directly in cdo some years ago (6+), in an attempt to avoid creating lots of intermediate files. It worked well for small datasets, but I found the memory usage increased with larger data, like 0.1 degree ocean outputs, such that it was not a practical option.

Has this improved in the intervening time?

I couldn’t say, honestly. I’ve never had memory issues with cdo, but I’ve also never worked with enormous data. I’ve never done any memory profiling either.

As I understand it, for most operations the basic building block is the 2D field; at most it loads one full 2D field at a time (roughly speaking). But when chaining, maybe it needs to hold whole 4D intermediate fields in RAM, leading to increased memory use. That’s speculation on my part, though!

If that’s a problem, then maybe it would be useful to add a “no chains” mode to pycdo that automatically runs execute after each step. You could still write the same relatively clean chain syntax in python, but translate it to single operations if your particular data and disk-vs-RAM trade-off required it. And because pycdo automatically deletes temporary files once they are no longer reachable, it would be totally transparent to the user.
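In case it helps to picture it, here is a rough sketch of that translation: each chained operator becomes its own cdo invocation whose output feeds the next step. Everything here is hypothetical (function and file names invented for illustration), and the sketch only builds command strings, it doesn’t execute anything:

```python
def stepwise_commands(infile, ops, outfile="result.nc"):
    """Turn a chain of operators into one cdo command per step,
    threading each step's output into the next (illustrative only)."""
    commands = []
    current = infile
    for i, op in enumerate(ops):
        # The last step writes the final output; earlier steps write to
        # stand-ins for pycdo's auto-managed temporary files.
        out = outfile if i == len(ops) - 1 else f"tmp_step{i}.nc"
        commands.append(f"cdo -{op} {current} {out}")
        current = out
    return commands

cmds = stepwise_commands("geopotential.nc",
                         ["sellonlatbox,0,360,-90,0", "ymonmean"])
for c in cmds:
    print(c)
# cdo -sellonlatbox,0,360,-90,0 geopotential.nc tmp_step0.nc
# cdo -ymonmean tmp_step0.nc result.nc
```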

Would that be helpful?

If you’ve got examples of datasets and operations that I could use for profiling and testing, I’d be happy to investigate that too.

Sounds like a good approach if it becomes an issue.

For reasons like this, and to achieve better parallelisation, most people with compute-heavy workloads have moved to xarray+dask.

Sorry, I’m not in that game anymore, so I have nothing I can easily provide. It was just an observation from previous experience.