Cannot kill a suite

Hi Team,

I have a suite that has run successfully, but still seems to think it is running.

rose sgc tells me ‘stopped with ‘succeeded’’

If I try

cylc stop 'u-di850'

I get:

Cannot connect: https://cylc.slf563.jk72.ps.gadi.nci.org.au:43092/set_stop_cleanly?kill_active_tasks=False: HTTPSConnectionPool(host='cylc.slf563.jk72.ps.gadi.nci.org.au', port=43092): Max retries exceeded with url: /set_stop_cleanly?kill_active_tasks=False (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f169ae32e90>: Failed to establish a new connection: [Errno -2] Name or service not known',))

cylc kill 'u-di850' gives the same response.

Anyone know how I can stop this suite so I can re-run it!?

Thanks,
Sonya

Hi Sonya,

I am sure someone else will have a more thorough response, but if you want to run it again in the short term, could you copy the roses/u-di850 directory to roses/u-di850.restart and then run from the new roses directory?

It won’t solve the initial problem but might be a workaround …

Best regards,

Chermelle

Hi Sonya,

If you are still having any trouble with the job not stopping, please email help@nci.org.au and hopefully they can point you in the right direction. Hopefully the problem has resolved itself.

Best regards,

Chermelle


Hi Sonya

I assume you’ve managed to fix this problem now, but in future, here are some commands I use to help kill errant rose suites and Cylc tasks.

Some of these you may already know.

From Rose User Guide: Suite Control

rose suite-stop
rose suite-shutdown
rose suite-clean

If there are still errant individual cylc processes I go after them with combinations of

cylc scan
cylc ping -v ROSE_ID

To find the Linux task/process ID of any cylc processes that remain you can use

ps -fu $USER | grep cylc
ps -ef | grep $USER

and then kill <PID>.
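A minimal sketch of that search, assuming a standard Linux ps and grep. The function name list_cylc_procs is made up for illustration; the '[c]ylc' pattern is just a common trick to stop grep from matching its own command line.

```shell
# List this user's processes whose command line mentions cylc,
# showing the PID, elapsed time and full command for each one.
list_cylc_procs() {
    ps -u "$(id -un)" -o pid,etime,args 2>/dev/null | grep '[c]ylc' \
        || echo "no cylc processes found"
}

list_cylc_procs

# After checking the list, ask a leftover process to stop (SIGTERM):
#   kill <PID>
# and only if it is still listed afterwards, force it (SIGKILL):
#   kill -9 <PID>
```

The kill lines are left as comments on purpose, so nothing is stopped until you have read the list and picked the PID yourself.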

See Cylc scan and cylc kill - Cylc Support - Cylc Workflow Engine for an example

That said, the error message you got looks more like a networking error between your persistent session and ARE/gadi than a cylc issue per se.

Cheers

Paul


Thanks Paul and Chermelle - actually the problem still persists. The rose commands all give me a similar error message as the original. The cylc scan command is telling me:

ERROR: [Errno -2] Name or service not known: cylc.slf563.jk72.ps.gadi.nci.org.au

I had also suspected this was because I’ve somehow stuffed up my persistent sessions - and possibly this confirms that too. I’ve not got a lot of time today to look at it, but I will try to take another look probably next week & let you know if I resolve it! I’ll email nci help as suggested too if I can’t figure it out then!

Thanks for your help so far!

Hi Sonya,

The persistent sessions get killed after about 6 months or so.

I usually just restart mine.

You can check if your persistent session is still running with:

persistent-sessions list

See Persistent Sessions... - NCI Help - Opus - NCI Confluence

I hope that helps.

Best regards,

Chermelle


Thanks Chermelle + Paul - I’ve managed to fix it - it was a persistent session problem. I guess it had been killed during the maintenance they had to do a few weeks ago, and I just didn’t even think of it…
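For anyone who lands here with the same "Name or service not known" error, a quick check is whether the suite host named in the error still resolves. This is only a sketch: it assumes getent is available on the login node, the function name host_resolves is made up for illustration, and the hostname is the one from the error message above, so swap in your own.

```shell
# Succeed if the hostname currently resolves through the system resolver.
host_resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Hostname copied from the error message in this thread; replace with yours.
suite_host="cylc.slf563.jk72.ps.gadi.nci.org.au"

if host_resolves "$suite_host"; then
    echo "$suite_host resolves: the persistent session probably still exists"
else
    echo "$suite_host does not resolve: the persistent session has probably gone"
fi
```

If the host does not resolve, persistent-sessions list should show whether the session is still there.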
