Newest `conda env` kernels silently change model output analysis results

I was quite alarmed to see this issue, which reports that some model analysis results silently change when moving to the newest conda analysis env!

I'm bringing it to everyone's attention in case other people have been getting puzzling results and struggling to figure out where they are coming from!

If you have seen similar issues, please post on that issue – we are trying to sort it out.


Others have definitely had this same issue with the notebook tagged above. @polinash @hrsdawson @Wilton_Aguiar

Not sure about other recipes though.


I would encourage people to please report these things sooner rather than later.

Others might be silently struggling to interpret the weird results they have been getting…


I’m one of the silently struggling :sweat:


First, I started getting memory allocation errors when I re-ran the code on a slightly longer dataset. So I switched to the newer conda env (conda-23.10); the memory allocation issue went away, but the plots of cross-contour transports didn’t make any sense.
Upon seeing this post, I re-ran the code with older envs (23.01 and earlier): the plotting looks alright, but the notebook can’t handle the longer dataset I need and keeps killing workers, etc.

In my case, I need the newer condas to handle my dataset and the older condas to compute the transport correctly. In the current situation, it seems like I can’t have both… :melting_face:

Slight sideways comment here:

A python package can have tests written for its methods.

Is there any best practice for “writing tests” against an important collection of notebooks which a community uses as “working” or “operational” code? Or is this kind of forum discussion, plus counting on a careful eye, the only and best approach?

Those silent failures are the scary ones… :fearful:

Best practice would be a set of tests that does the same thing as our notebooks, created by a completely independent team using different software tools. That’s an impractical approach for anything but the most critical software.

A possible approach we could take is regression testing: save a reference set of results from runs of the notebooks, and then compare against them before and after a change is made.
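A minimal sketch of what that could look like, assuming a hypothetical reference file saved with a known-good environment (the paths, variable choice and tolerance below are placeholders, not an actual setup):

```python
# Regression-test sketch (hypothetical paths and tolerance): compare a result
# recomputed with the environment under test against a saved reference.
import xarray as xr

REFERENCE = "reference_results/cross_contour_transport.nc"  # saved with a known-good env
CURRENT = "current_results/cross_contour_transport.nc"      # recomputed with the env under test


def test_cross_contour_transport_unchanged():
    reference = xr.open_dataset(REFERENCE)
    current = xr.open_dataset(CURRENT)
    # Fail loudly if the results drift beyond floating-point noise.
    xr.testing.assert_allclose(current, reference, rtol=1e-6)
```

Run with pytest against each new conda/analysis3 release, a test like this would flag exactly the kind of silent change reported above.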

For us, the issues arise because we don’t use a fixed set of dependencies (i.e. python packages) and instead want to keep up to date with the latest releases. So a user might use any version of conda/analysis3, and we hope that the results don’t change between versions. If the design of some dependency has changed, then the results might change, or worse, a bug might have been introduced and the results change for the wrong reason.

Full disclosure: in a former (pre-science) life I was a professional software tester (Y2K basically funded my Masters!)

The ‘correct’ way to do this would be to document test cases for each function, with expected outputs for a given set of inputs, as a ‘test script’. (The word ‘script’ here is ambiguous: it means a document rather than a code script.)

The design of each function’s test cases should focus on cases close to sensible numeric limits (we called these ‘boundary conditions’, but again, that term is ambiguous in ocean modelling). So, for example, does a function behave sensibly with a salinity close to zero, compared to a negative salinity, which is physically impossible (except in the old HadGEM model…)?

Obviously these test cases can all be implemented in an executable test script, but there’s value in just sitting down with pen and paper and thinking about what the sensible use cases of a function are first.
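As a purely illustrative example of what such boundary-condition tests could look like once written down (the function here is a made-up placeholder, not something from the recipes):

```python
# Boundary-condition tests, pytest style. `potential_density` is a toy
# placeholder; the point is probing inputs near the physical limits.
import pytest


def potential_density(salinity, temperature):
    """Toy stand-in for the function under test."""
    if salinity < 0:
        raise ValueError("salinity cannot be negative")
    return 1000.0 + 0.8 * salinity - 0.2 * temperature


def test_salinity_near_zero_gives_finite_result():
    # Close to the physical lower limit: should still return something sensible.
    assert potential_density(0.0, 10.0) > 0


def test_negative_salinity_fails_loudly():
    # Physically impossible input: better to raise than to return nonsense.
    with pytest.raises(ValueError):
        potential_density(-1.0, 10.0)
```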

In an ideal world we should be doing this as standard (and encouraging students to), so every jupyter notebook would start with a cell outlining the test cases. But, it is extra work…


I resonate with @Thomas-Moore’s comment.

But cosima-recipes is not a python package but rather a collection of notebooks. These notebooks use python packages which, if tested properly, should catch changes in behaviour; and if a change is intentional, they should issue a deprecation warning (e.g. “method your_fav_method() will behave differently from version X.Y.Z”) or something similar.

Regression tests (which @anton suggests) are a way to catch issues like that, and we could discuss implementing some that would run automatically once a week on the HPC.

I also think that it’s a very good idea to try to convey to people the notion of “testing the boundaries” of a method/function they write (comment by @willrhobbs). This is an extremely useful concept. I hadn’t really thought about it in the formal way @willrhobbs discussed it. I don’t think we should enforce this for the notebooks, since that would make the barrier for newcomers contributing to the recipes even higher. But it’s such a useful concept that one should at least keep it in the back of their mind.

I so often see code that is not general enough and whose limitations are neither documented nor asserted. For example, someone writes a method/function that works only for a very particular case and will fail if things are slightly different. Then someone else, who naively sees that such a method/function exists, uses it for their own case and gets nonsense results.

def compute_zonal_mean(dataarray):
    return dataarray.mean('xt_ocean')

might suggest that this function computes the zonal mean. But in reality, it computes the zonal mean only for data arrays that have xt_ocean as a dimension, and it also assumes that xt_ocean runs along constant latitude values. This function will silently give wrong results if used in the Arctic and will fail if used with MOM6 or MITgcm output.
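One way to make those limitations explicit, as a sketch only (the check and error message are just one possible choice, and the constant-latitude assumption still can’t be verified from the DataArray alone, so it goes in the docstring):

```python
import xarray as xr


def compute_zonal_mean(dataarray: xr.DataArray) -> xr.DataArray:
    """Mean along the 'xt_ocean' dimension.

    Assumes 'xt_ocean' runs along lines of constant latitude, so this is a
    true zonal mean only where that holds (e.g. not in the tripolar Arctic
    region, and not for MOM6 or MITgcm output with different dimension names).
    """
    if "xt_ocean" not in dataarray.dims:
        raise ValueError(
            f"expected dimension 'xt_ocean', got dimensions {tuple(dataarray.dims)}"
        )
    return dataarray.mean("xt_ocean")
```

At least the missing-dimension case now fails loudly instead of silently, and the docstring warns about the assumption that can’t be checked automatically.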

I’d like to touch on these issues at the July 1st Workshop that we are organising.