Strange error using xp65

Hi,

I’ve decided today to be adventurous and use the latest conda environ (25.04) in xp65 instead of hh5, and my notebooks that ran with no problems are now giving weird errors (multiple of them). Here is a minimal example:

import cmocean as cm
import dask.distributed as dsk
import gcm_filters
import glob
import gsw
import intake
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from xgcm import Grid
import scipy.stats as st

import warnings # ignore these warnings
warnings.filterwarnings("ignore", category = FutureWarning)
warnings.filterwarnings("ignore", category = UserWarning)
warnings.filterwarnings("ignore", category = RuntimeWarning)

import logging
logging.getLogger("flox").setLevel(logging.WARNING)

import os
os.chdir('/home/561/jn8053/g_x77/Ross_gyre_colab')

catalog = intake.cat.access_nri
exp = "01deg_jra55v140_iaf_cycle3"

data = catalog[exp].search(variable = 'tx_trans_int_z', frequency = "1mon").to_dask().sel(xu_ocean = slice(-230, -80))
psi = data['tx_trans_int_z'].sel(yt_ocean = slice(None,-50)).cumsum('yt_ocean')/(1035*1e6)
g_str = (-psi.where(psi<-12).sel(xu_ocean = slice(-180, -140))).chunk({'xu_ocean':-1, 'yt_ocean':-1}).quantile(.95, ['xu_ocean', 'yt_ocean']).load()

g_str = g_str.rolling(time=12, center=True).mean('time')

And these are the errors. I’ve never seen them before:

I have switched back to hh5 which works fine and don’t really have the time to troubleshoot this, but thought I’d flag.


Thanks Julia

If you still have the errors, can you copy and paste the text here rather than a screenshot?

The screenshot of the first is just some warnings, but my memory when you showed me was some esoteric BDB error ??

The second error looked like something odd with the access telemetry - @CharlesTurner @tmcadam

Yeah, this is a telemetry issue. @JuliaN is this being triggered by the

g_str = g_str.rolling(time=12, center=True).mean('time')

line?

If so I’ll get a fix in right away, should be relatively simple.

A word of caution re. analysis3-25.04 - Intake ESM mysteriously started triggering segmentation faults in the 25.04 environment, apparently due to a dependency change. We’ve got a bugfix in, but still haven’t released the new fixed version. Hopefully this should be out in the next couple days.

For now, I’d recommend dropping to the analysis3-25.02 environment, in which:
a. Intake-ESM appears to be stable.
b. Telemetry isn’t enabled, so no AST parsing errors.

It happened throughout the code, not just the g_str line, and I had tried to switch back to 24.07 kernel but the error still came up. But maybe it was one of those times you just have to close the ARE sesh and start again

That’s weird, I wouldn’t expect to see the error in previous versions. I’ll do some more digging.

I’ve worked out the source of the bug, so I’ll have the fix released later today - I’ll update once it’s applied to the environment.

Hi @JuliaN,

the issue is now fixed & I’ve modified the relevant packages such that there should be no more issues of this type.


Hi Charles, sorry it took so long to reply! I reverted to using hh5 because I had some stuff to get done. I am trying again today with xp65 and I’m getting this same error across all of the scripts that I’ve tried to run. I don’t really see what triggers it; it happens in different cells. I have tried the following kernels: conda/analysis3, conda/analysis3-unstable, Python 3, conda/analysis3-25.04, conda/analysis3.25-02, conda/analysis3-24.04, and a few others I can’t remember. I am not using intake.

I’ll go back to hh5 now but will try to be faster at responding to this troubleshooting. I don’t think my specific scripts are needed - anything using xarray seems to trigger it. But if you do need, I can copy my scripts somewhere public. But like I said… it is happening in all the ones I’ve tried.

Let me copy the error properly:

Error in callback <function capture_registered_calls at 0x1533b6fe3ba0> (for pre_run_cell), with arguments args (<ExecutionInfo object at 15320592f990, raw_cell="# Mask the variables
velocity_tendency_masked = ds.." store_history=True silent=False shell_futures=True cell_id=9718df99-3b4e-473c-a0c3-e56098514fc4>,),kwargs {}:

UPDATE: @jemmajeffree has run one of my scripts and it worked for her. It must be me then.

I’ve been unable to reproduce the bug with some generic xarray use cases. Currently waiting on membership of x77 - I’ll investigate further then.

However, the fix I mentioned above was applied in the analysis3-25.04 environment, so I’m a little confused.

Either way, there are some potential issues with communication between workers in the 25.04 environment, so I would recommend updating to 25.05.

Sorry Charles, I think this is a me issue because Jemma was able to run my scripts no problem. I’ve put an example in /scratch/x77/jn8053/ASC_Momentum_test.ipynb. I get the error in cell 8.

Further updates. It’s a pretty weird error, and I can’t replicate on my account.

The simplest example we can find is this:

xr.open_mfdataset('/g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/output055/Vertical_coordinate.nc')['Interface']

fails with something along the lines of

 Error in callback <function capture_registered_calls at 0x15048c0ebba0> (for pre_run_cell), with arguments args (<ExecutionInfo object at 1504309ebe10, raw_cell="vertical_edges.load()" store_history=True silent=False shell_futures=True cell_id=4114a1db-5ca2-434c-bafb-55b4130fe512>,),kwargs {}:

while

xr.open_mfdataset('/g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/output055/Vertical_coordinate.nc')

(no ['Interface']) is fine.

There also seems to be something weird happening with whether a .load() is in the same cell or the next cell as xr.open_dataset, looks like same cell is fine, next cell is not, but this appeared somewhat stochastic and we didn’t manage to pin down the factors behind it.

Also strange that I can’t see anything different between how Julia and I were running the notebook, but I couldn’t manage to replicate any of the errors

For context: callback <function capture_registered_calls at 0x15048c0ebba0> (for pre_run_cell) is an IPython event we use to listen for certain intake catalog related events & record usage so we can optimise the datasets we support, etc. You can find its source code here if you’re interested.

The TL;DR is that this code runs when any notebook cell is executed in order to detect catalog related events, so even if you’re not using the catalog it still runs.

It’s designed to be very permissive, & fail silently if anything during its execution fails. What looks to be happening here is that something that should not be able to fail is failing, for some reason. The exception handling I’ve used should catch anything that might occur during normal circumstances, so I can only assume a very weird edge case is going on.
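To make the "permissive, fail silently" idea concrete, here is a minimal sketch of a pre_run_cell hook written in that style. It is not the access-py-telemetry implementation: WATCHED_NAMES, seen, find_watched_names and permissive_pre_run_cell are hypothetical names invented for illustration. The only real API assumed is IPython's events.register("pre_run_cell", ...) and the raw_cell attribute that appears in the error message above.

```python
import ast
from types import SimpleNamespace

# Hypothetical: names we want to notice in a cell (not the real telemetry registry).
WATCHED_NAMES = {"intake"}

# Hypothetical stand-in for "record a usage event somewhere".
seen = []


def find_watched_names(source):
    """Return the watched names referenced in a cell's source, sorted."""
    tree = ast.parse(source)  # raises SyntaxError on cell magics, bad quotes, ...
    return sorted(
        {
            node.id
            for node in ast.walk(tree)
            if isinstance(node, ast.Name) and node.id in WATCHED_NAMES
        }
    )


def permissive_pre_run_cell(info):
    """Pre-run hook: inspect the cell, but never let a failure reach the user."""
    try:
        seen.extend(find_watched_names(info.raw_cell))
    except Exception:
        # Anything that goes wrong during inspection is deliberately swallowed.
        pass


# Register the hook only when actually running inside IPython/Jupyter.
try:
    from IPython import get_ipython

    ip = get_ipython()
except ImportError:
    ip = None

if ip is not None:
    ip.events.register("pre_run_cell", permissive_pre_run_cell)

# Outside a notebook, a stand-in object with a raw_cell attribute is enough to try it.
permissive_pre_run_cell(SimpleNamespace(raw_cell="x = ("))  # unparsable cell, ignored
```

If a hook like this still surfaces an "Error in callback" message, the failure is happening somewhere the try/except cannot reach, for example while the module itself is being imported or because a different copy of the code is being run.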

N.B I’m also unable to access x77, even with the project added to an ARE session, so I wonder whether there’s potentially something up with Gadi - we had some issues earlier today.

I’m unable to reproduce the error in the example you posted @jemmajeffree. With that said, the inverted commas you used aren’t valid syntax: pasting your failing line into a Python interpreter gives me this:

  Cell In[4], line 1
    xr.open_mfdataset(‘/g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/output055/Vertical_coordinate.nc’)['Interface']
                      ^
SyntaxError: invalid character '‘' (U+2018)

I suspect this is more likely than not unrelated & just from copy/pasting the string through some Microsoft Office text field, though.

@JuliaN when you get the chance, could you post a screencap of your ARE session setup for me?

I’m sorry about the inverted commas, my computer was trying to be helpful and overwriting what I typed with what it thought were the prettier ones. I was aiming for normal string characters, will fix now

No worries, I thought that was probably the case.

I ran the line which should fail with fixed apostrophes without any issue though, & didn’t find any issues doing this either:

ds = xr.open_mfdataset('/g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/output055/Vertical_coordinate.nc')['Interface']
# %% New cell
ds.load()

which ran fine in conda/analysis3-25.04.

Interestingly, what you said above, @JuliaN, about failures in conda/analysis3.25-02 makes me suspect the issue is a bit more subtle, as the access-py-telemetry code was never added to that environment.

It’s possible that you have an old version of this package in your ~/.local. Could you try running

 $ tree ~/.local | grep 'access_py_telemetry' 

for me and telling me if you get any results?
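The same question can also be asked from inside the notebook kernel. This is a rough sketch, assuming only the standard library, that reports whether the copy of a module Python would import lives in the per-user site-packages directory under ~/.local. The helper name imported_from_user_site is made up for illustration, and it expects a top-level module name.

```python
import importlib.util
import site
from pathlib import Path


def imported_from_user_site(module_name):
    """Return None if the module is not installed, otherwise True/False
    depending on whether the copy Python would import sits in the
    per-user site-packages directory (e.g. ~/.local/lib/pythonX.Y/site-packages)."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    # Built-in, frozen and namespace modules have no file on disk to compare.
    if spec.origin in (None, "built-in", "frozen"):
        return False
    user_site = Path(site.getusersitepackages()).resolve()
    return user_site in Path(spec.origin).resolve().parents


if __name__ == "__main__":
    # Module name taken from the grep pattern above.
    print(imported_from_user_site("access_py_telemetry"))
```

A result of True inside a conda/analysis3 kernel would suggest that a copy in ~/.local is shadowing whatever the environment ships.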

Thank you Charles and Jemma for all the help! I have tried starting a new ARE session today to continue with this but:

image

I did not forget storage options. gdata/xp65 is listed.

I’ve ssh-ed in to run tree ~/.local | grep 'access_py_telemetry' which gives the following:

image

That is super weird. We had a few strange filesystem issues yesterday, and I suspect this is related.

Yesterday, I was completely unable to access anything in x77 (files in the demo note you provided lived in there) - despite similarly having x77 listed in my storage options. I’m guessing this is the same issue - the files for the conda environment live in xp65, and so that would explain why you can’t load the environment.

Re. the tree ~/.local stuff, that explains the errors! We can fix it with this:

$ rm -r ~/.local/lib/python3.11/site-packages/access_py_telemetry*

In general, I would also recommend blowing away anything in your .local folder - it’s a common source of python issues, as these installs are typically accidental and often conflict with versions in the conda environments:

$ rm -r ~/.local/lib/python*

Good to know! We suspected it was something related to either junk accumulating in my home directory I have no control/knowledge of or some hard-coded thing from the past.

I’ve cleaned up .local, started a new ARE session (this time it found xp65) aaaaaand my scripts work. STOKED. Thank you for the magic.

I wonder if it’s worth writing into tutorials/documentation something about “we recommend you do a spring cleaning of the .local folder”? Maybe it will never be found or picked up by anyone though.

You can get Python to not look in ~/.local (e.g. to test if there are any conflicts) by using the python flag -s or by setting the environment variable PYTHONNOUSERSITE=1. See "Command line and environment" in the Python 3.13.3 documentation.
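To check from within Python whether the user site directory is currently in play, a small sketch along these lines may help. It uses only the standard library (site.ENABLE_USER_SITE, site.getusersitepackages() and sys.flags.no_user_site all exist there); the helper name user_site_entries is hypothetical.

```python
import site
import sys
from pathlib import Path


def user_site_entries(user_site=None):
    """List the top-level entries in the per-user site-packages directory.

    Returns an empty list when the directory does not exist.
    """
    path = Path(user_site or site.getusersitepackages())
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir())


if __name__ == "__main__":
    # False when started with -s or with PYTHONNOUSERSITE set; None when the
    # user site is disabled for security reasons.
    print("user site enabled:", site.ENABLE_USER_SITE)
    print("no_user_site flag:", sys.flags.no_user_site)
    print("user site path:   ", site.getusersitepackages())
    for name in user_site_entries():
        print("  ", name)
```

Running this once normally and once with python -s shows whether anything in ~/.local could be picked up ahead of the conda environment's own packages.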