I need to download some CMIP runs that are not currently on gadi — 3D ocean temperature from DCPP hindcast experiments — and am wondering what the best pipeline might be.
The hindcast experiment setup is 10–40 ensemble members, run for 10 years, from ~60 initialisations, which I estimate at about 2.5 TB per model. My current plan is to stage the data on scratch, post-process it to extract the 20°C isotherm data I’m after, and then delete the 3D field.
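For the post-processing step, one way to reduce each 3D temperature field to a 2D isotherm-depth field is per-column linear interpolation with `xarray.apply_ufunc`. This is a sketch on synthetic data (the dimension names `time`/`lev`/`lat`/`lon` and the linear profile are assumptions; real model output will need its own coordinate names and masking of land columns):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a thetao (sea water potential temperature) field:
# temperature decreases linearly with depth, so the 20 degC isotherm sits
# at a known depth and the result is easy to check.
lev = np.array([5.0, 50.0, 100.0, 200.0, 400.0])  # depth (m)
profile = 28.0 - 0.06 * lev                        # degC at each level
thetao = xr.DataArray(
    np.broadcast_to(profile[None, :, None, None], (2, 5, 3, 4)).copy(),
    dims=("time", "lev", "lat", "lon"),
    coords={"lev": lev},
    name="thetao",
)

def _z_of_isotherm(column, z, target=20.0):
    # np.interp needs increasing x, so reverse the cooling-with-depth profile
    return np.interp(target, column[::-1], z[::-1])

# Vectorise the 1D interpolation over every (time, lat, lon) water column
z20 = xr.apply_ufunc(
    _z_of_isotherm,
    thetao,
    input_core_dims=[["lev"]],
    kwargs={"z": lev},
    vectorize=True,
)
# z20 has dims (time, lat, lon): the depth of the 20 degC isotherm is all
# you keep before deleting the full 3D field.
```

With the synthetic profile above the isotherm lands at 200 + (20 − 16)/(22 − 16) × (100 − 200) ≈ 133.3 m, which makes the sketch easy to sanity-check before pointing it at real data.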
Should I just use wget from the copyq (gadi’s data-mover queue)? If so, does anyone have recommendations for which node to download from? Other suggestions or advice from previous experience would be welcome, especially if there’s an existing framework supported by NCI.
In general, the easiest way to get CMIP6 data is to contact the NCI help desk and ask them to download it. Given the size and relative obscurity of what we’re talking about, though, it’s probably unlikely they would take that on.
In theory, you can do this via the CMIP THREDDS server without downloading: e.g. you could get URLs and open them directly with xarray’s open_dataset. In practice, it’s possibly more trouble than it’s worth. (Note that ARE jobs are the only ones with both compute resources and internet access.) I haven’t tried this, but there is a write-up at Search and Load CMIP6 Data via ESGF / OPeNDAP — Pangeo Gallery documentation
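Following the Pangeo write-up, the remote approach is roughly: query the ESGF RESTful search API for file records, pull out the OPeNDAP endpoints, and open one lazily with xarray. A sketch below — the index node URL and facet values are assumptions to adjust, and the network calls are commented out because they only work from a session with internet access:

```python
# Build the parameters for an ESGF RESTful file search returning JSON.
ESGF_SEARCH = "https://esgf-node.llnl.gov/esg-search/search"  # any index node

def esgf_file_query(**facets):
    """Base parameters for an ESGF file search, plus caller-supplied facets."""
    params = {
        "type": "File",
        "format": "application/solr+json",
        "distrib": "true",   # search all federated index nodes, not just this one
        "limit": 20,
    }
    params.update(facets)
    return params

params = esgf_file_query(
    project="CMIP6",
    experiment_id="dcppA-hindcast",
    table_id="Omon",
    variable_id="thetao",   # 3D sea water potential temperature
    source_id="CanESM5",    # example model only; swap in yours
)

# On a node with internet access (e.g. an ARE session), something like:
# import requests
# import xarray as xr
# docs = requests.get(ESGF_SEARCH, params=params, timeout=60).json()["response"]["docs"]
# urls = [u.split("|")[0].replace(".html", "")
#         for d in docs for u in d["url"] if u.endswith("OPENDAP")]
# ds = xr.open_dataset(urls[0], chunks={"time": 12})  # lazy: nothing downloaded yet
```

Because open_dataset over OPeNDAP is lazy, you could in principle compute the isotherm extraction server-side-ish, pulling only the bytes you touch rather than the full 2.5 TB.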
Lastly, the approach you suggest sounds good. It might be worth asking the helpdesk which node they normally download from.
The process is broadly documented here https://opus.nci.org.au/spaces/CMIP/pages/26287991/Data+Download+Request
But that method has limitations in a few places.
What I do is still use Clef, even though its database is no longer updated, as it can query ESGF for us, which is very helpful for getting the required dataset names!
That’s as far as I got, since I don’t know what the variable name for 3D ocean temperature is (`to` does not seem to be a variable in DCPP), but you obviously found what you were looking for, so apologies I can’t be more explicit. You may need to further limit search results model by model.
Anyway, Clef will tell you what it thinks we already have locally (this may be out of date, so checking against Intake is the next step). For what it thinks is missing, you grab the output list of dataset names (e.g. CMIP6.DCPP.CCCma.CanESM5.dcppA-hindcast.s2015-r1i1p2f1.Omon.tob.gn.v20190429) and put them in an email to help@nci.org.au for the attention of Syazwan.
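The dataset names Clef prints follow the CMIP6 data reference syntax, so you can split them into facets programmatically, both for assembling the help-desk email and for cross-checking against an Intake catalogue. A sketch (the facet names follow the CMIP6 DRS; the commented catalogue lookup is an assumption, so check what the current catalogue on gadi is called):

```python
# Facet order in a CMIP6 dataset ID, per the data reference syntax (DRS)
FACETS = ("mip_era", "activity_id", "institution_id", "source_id",
          "experiment_id", "member_id", "table_id", "variable_id",
          "grid_label", "version")

def parse_dataset_id(dsid: str) -> dict:
    """Split a dot-separated CMIP6 dataset ID into named facets."""
    parts = dsid.split(".")
    if len(parts) != len(FACETS):
        raise ValueError(f"unexpected dataset id: {dsid}")
    return dict(zip(FACETS, parts))

facets = parse_dataset_id(
    "CMIP6.DCPP.CCCma.CanESM5.dcppA-hindcast.s2015-r1i1p2f1"
    ".Omon.tob.gn.v20190429"
)

# On gadi you could then search an intake-esm catalogue for the same facets,
# e.g. (catalogue and sub-catalogue names here are assumptions):
# import intake
# cat = intake.cat.access_nri["cmip6_fs38"]
# hits = cat.search(source_id=facets["source_id"],
#                   experiment_id=facets["experiment_id"],
#                   variable_id=facets["variable_id"])
```

Anything that turns up in the catalogue search can be dropped from the email; anything that doesn’t is what you ask NCI (or synda) for.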
Now, if you were asking for a few small variables, Syazwan would go away and download the data into the central system, update the Intake catalogues, and everyone would win. However, because you’re potentially asking for a lot of data here, you may not have so much luck with the central request. Be aware that if you download it yourself, you’ll be doing so in an “unmanaged” fashion, i.e. the data may become outdated and won’t be findable in the NCI Intake catalogue, so there is quite some risk here.
If you have to download it yourself, it’s worth using the synda tool. Paola and I used to both have machines set up to help with this sort of use case, but obviously it’s not a service either of us is offering anymore, so… best wishes, I hope NCI are willing to download this data for you! If they’re not, I would strongly suggest looking at the remote-via-OPeNDAP option suggested above: it may be a pain, but it’ll save a lot of storage space and download tribulation.
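For what it’s worth, synda drives its downloads from a “selection file” of search facets. A sketch of what that might look like for this use case (facet names assumed to match the ESGF ones; check the synda documentation for the current syntax before relying on it):

```
project="CMIP6"
experiment_id="dcppA-hindcast"
table_id="Omon"
variable_id="thetao"
source_id="CanESM5"
```

You would then run something like `synda install -s selection.txt`, and synda handles queuing, retries, and checksums, which is exactly the tedium you don’t want to hand-roll with wget at this volume.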