Vertically integrated OHC with ACCESS-OM2-01 -- how to handle a lot of 3D data

Hi! First of all apologies if something like this was posted already. I did a bit of a search on Slack and here and couldn’t find anything directly related that could be a solution.

I am calculating global OHC using output from the 0.1 degree model. I need to do this for 4 experiments, for 40 years, and it uses temperature and dzt (monthly averages, 3D grid). You can see my code here: OHC.ipynb · GitHub.

So far, it takes me ~25 min to get OHC for one year using 28 nodes. I don't usually use the cookbook with large datasets, but in this case the data is spread across different groups and tracing it all down would be a bit inconvenient. Is there a way to optimise this with the cookbook, or chunking, etc.?
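
For context, the calculation is roughly of this form (a minimal sketch only; the constants and dimension names below are illustrative, not copied from my notebook):

# Typical MOM5 constants; check against the values used in the model configuration
RHO0 = 1035.0   # reference density, kg m^-3
CP = 3992.1     # specific heat capacity, J kg^-1 K^-1

def vertically_integrated_ohc(temp, dzt, depth_dim="st_ocean"):
    """Vertically integrated OHC (J m^-2) from monthly temperature and the
    time-varying cell thickness dzt (m) on the same grid.
    Check whether temp is in degC or K in the output you are using."""
    return (RHO0 * CP * temp * dzt).sum(depth_dim)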

Thank you very much in advance!

There’s a scalar variable called total_ocean_heat in the ocean_scalar.nc files. Pretty sure that’s what you want?

[rmh561@gadi-login-08 seasonal_sea_level]$ ncdump -h /g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output932/ocean/ocean_scalar.nc
netcdf ocean_scalar {
...
        double total_ocean_heat(time, scalar_axis) ;
                total_ocean_heat:long_name = "Total heat in the liquid ocean referenced to 0degC" ;
                total_ocean_heat:units = "Joule/1e25" ;
                total_ocean_heat:valid_range = 0., 1.e+20 ;
                total_ocean_heat:missing_value = -1.e+20 ;
                total_ocean_heat:_FillValue = -1.e+20 ;
                total_ocean_heat:cell_methods = "time: mean" ;
                total_ocean_heat:time_avg_info = "average_T1,average_T2,average_DT" ;
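
If that scalar is enough, it's cheap to read; something like the following should work (the glob pattern just follows the path above, and the 1e25 rescaling assumes that's what the units string means):

import xarray as xr

# Open the scalar files across all output directories of the experiment
scalar = xr.open_mfdataset(
    "/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output*/ocean/ocean_scalar.nc",
    combine="by_coords",
)

# "Joule/1e25" presumably means the stored values are total heat divided by 1e25
total_heat_J = scalar["total_ocean_heat"].squeeze() * 1e25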

I want spatial maps of OHC :confused:

Oh ok sorry. So you want vertically-integrated OHC.

Given you're doing this over a large number of files, one approach would be to use NCO tools and a bunch of different PBS jobs? Edit: I see your comment about the files being in funny places. Sorry, I'm not sure off the top of my head.

Yeah, I guess for now I just have to resign myself to this taking a couple of days :frowning:

My end goal is to plot OHC averaged over different decades… If I calculated mean temperature and dzt and then calculated OHC for that decade from those means, would it be the same as calculating OHC from the monthly data and then averaging?

In other words, can I do the time average before the vertical integration? Or will I be introducing an error by time averaging cell height before the vertical integral?

You will introduce an error that way because you won’t capture temporal correlations between dzt and temp. Whether these errors are significant depends on what you’re trying to do with the results. For most applications (i.e. not trying to close budgets) I’d think it is probably fine and it sounds like a good approach (note that you’re already missing sub-monthly correlations between these anyway). But I’d suggest testing it on a subset of the data first.

No, it's not the same. But the temporal variations in dzt are small, so I expect the error would be small. You could do a test over a couple of years, time averaging before and after, to check the size of the error.
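
For example, something like this on a year or two of data (temp and dzt assumed already loaded as monthly DataArrays; constants and dimension names are assumptions):

rho0, cp = 1035.0, 3992.1   # reference density and specific heat; check the model's values

# Vertically integrate every month, then time average
ohc_monthly = (rho0 * cp * temp * dzt).sum("st_ocean")
integrate_then_avg = ohc_monthly.mean("time")

# Time average first, then vertically integrate
avg_then_integrate = (rho0 * cp * temp.mean("time") * dzt.mean("time")).sum("st_ocean")

# Size of the error from dropping the temp-dzt correlations
rel_err = abs(integrate_then_avg - avg_then_integrate) / abs(integrate_then_avg)
print(float(rel_err.max()))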

Well, @JuliaN, from your notebook it seems like you want to compute the vertically integrated OHC for every year. But your notebook just does a for loop over the years… That's not at all efficient, because each year has to wait for the previous year to finish before it starts. You can parallelise this using xarray somehow, but I'd need to play around to figure out how.

This notebook might be useful… but you might need to “filter out” quite a bit.

https://github.com/navidcy/IntrinsicOceanicLFVariabilityUOHC/blob/master/prepare-raw-data.ipynb

  • First of all, in that notebook I'm dealing with both SSH and 3D temp data; ignore the SSH.
  • Second, I'm also regridding to a 1-degree lat-lon grid; ignore the regridding.
  • But lastly, I do compute the UOHC as an integral over some depth and save monthly output – that might be of interest.

I think this:

import xarray as xr  # temp_010_RYF is a temperature DataArray loaded earlier in the notebook

output_folder = "output/temp-010deg-RYF/"

# split the DataArray into one Dataset per model year
years, das = zip(*temp_010_RYF.groupby("time.year"))
ds = [d.to_dataset(name="temp") for d in das]
paths = [output_folder + "temp-{}.nc".format(y) for y in years]

# save_mfdataset writes all the per-year files together rather than one at a time
xr.save_mfdataset(ds, paths)

will write the netCDF files in parallel (it won't wait for each year to finish before starting work on the next).

Note, by the way, that what's needed here is most likely just xarray+dask manipulation: choosing appropriate chunks and saving intermediate output files. It doesn't really have to do with the cookbook (as the message at the top of the thread might imply).
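
For example, something along these lines when opening the data (chunk sizes here are only illustrative and would need tuning; temp_files is a placeholder for the list of monthly temperature files):

import xarray as xr

# Aim for chunks of order ~100 MB each; one time step per chunk is a
# reasonable starting point for monthly 3D fields at 0.1 degree
temp = xr.open_mfdataset(
    temp_files,
    chunks={"time": 1, "st_ocean": 7},
    combine="by_coords",
    parallel=True,
)["temp"]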

(Suggestion: perhaps rename the thread to something a bit more descriptive of the situation?)

Why write individual files for each year anyway? Just resample to yearly and do the sum directly on the resampled data.
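
Something like this (dimension names assumed, with temp and dzt the monthly fields):

# Yearly means first, then the (much cheaper) vertical integral on those
temp_yr = temp.resample(time="1YS").mean()
dzt_yr = dzt.resample(time="1YS").mean()
ohc_yr = (1035.0 * 3992.1 * temp_yr * dzt_yr).sum("st_ocean")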

Thanks everyone! I don't want to resample to yearly, I want the original monthly averages. I'm just doing it year by year because it can't handle the entire decade at once, fair enough.

Thanks Navid! I’ll look into that notebook

Well, I am not arguing that what I did was in any way the best solution! It was just a solution that worked in a reasonable amount of time! :slight_smile:

It is the best so far!

Well, I did bang my head on the (dask) wall many times until I converged to what I did.
I'm glad you found it helpful. @angus-g was instrumental in getting me out of my troubles back then…

This page has some useful stuff that you might be able to use for your embarrassingly parallel case: Embarrassingly parallel Workloads — Dask Examples documentation
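
For example, a bare-bones version of that pattern here would be (compute_ohc_for_year is a hypothetical function that loads one year of temp/dzt, does the vertical integral and writes the result out):

import dask

def compute_ohc_for_year(year):
    # Placeholder: open one year of temp and dzt, compute the vertically
    # integrated OHC, and save it to its own netCDF file
    ...

# One lazy task per year; dask then runs them in parallel on the cluster
tasks = [dask.delayed(compute_ohc_for_year)(year) for year in range(2150, 2190)]  # placeholder 40-year range
dask.compute(*tasks)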
