Command line utility: pbs-workbench; your own personal ARE

PS: I will be presenting this today at the Research Software Community Meeting: RSE Announce - #8 by paocorrales

After having a bunch of problems with ARE sessions, I decided to stop using them. But because I still need to spin up a job on a node to run analysis code interactively, I created this small bash tool.

The command

job start

submits a job to the PBS queue that just sits there doing nothing. But then you can ssh into the node and work. It also has a cute monitor system that tells you if the job is running and how much time you've got left, plus quick copy-paste commands to interact with it.
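For reference, here is a minimal sketch of the kind of "do nothing" PBS script a tool like this might submit (the queue, project and resource values below are placeholders, not pbs-workbench's actual script):

```bash
#!/bin/bash
#PBS -q normal
#PBS -P xy12
#PBS -l ncpus=4
#PBS -l mem=16GB
#PBS -l walltime=04:00:00
#PBS -l storage=gdata/xy12+scratch/xy12
#PBS -N workbench

# Keep the node allocated; PBS ends the job when the walltime runs out.
sleep infinity
```

Once the job is running, `qstat -f <jobid>` reports the `exec_host`, which is the node you ssh into.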

It works well with VSCode and its forks, and also with Jupyter notebooks.

The script supports creating “profiles”, so you can create your own “big”, “small”, “bigmem” or whatever other profile you want, then you spin up your job with job start <name of profile>.

I tried to make installation as straightforward as possible. You need to clone my repo and then run the install script.
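Something along these lines (the repository URL and script name below are placeholders; the exact steps are in the repo's README):

```bash
# REPO_URL is a placeholder for the actual repository URL
git clone "$REPO_URL" pbs-workbench
cd pbs-workbench
./install.sh   # exact script name is in the repo's README
```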

Let me know if anyone finds this useful and what other features you would like implemented. Right now the tool only supports one running job that is "global", but I was thinking that maybe it would be better to use a job per project.

4 Likes

Thank you for sharing this tool @eliocamp!

I think being able to quickly spin up a job on a compute node and connect to it through VSCode is very useful for many people in the community.

I created a similar script years ago following this (fairly old) discussion as I was getting frustrated with ARE, and @abhaasgoyal might also have done something similar based off my script.

In general, I see such a feature being highly requested by people in the community, as multiple forum topics confirm: VSCode Extension to run ARE session, Opening the ARE jupyterlab session on VS code, Working with Jupyter notebooks on gadi/ARE via VS Code.

What are the thoughts about creating a VSCode extension (some old details in the link above) to spin up a computing job and connect to it?
We might need to coordinate with NCI too on this, but I would be happy to help.

Cheers
Davide

We talked about this a bit, but I think the idea of having a standard "API" that other tools can hook into could be a good solution. Then anyone can create VSCode extensions, local bash commands, or whatever (does Jupyter Notebook have extensions?) using those standard endpoints.

One of the reasons I like the script running on gadi is that I don’t need to worry about supporting multiple OS and IDEs.

I don’t know exactly what that would look like. Maybe it’s a standard module that you can load and all the interaction is done with ssh commands.
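As a very rough sketch of what that could look like from the client side, everything could go through ssh, so a local extension or script would need nothing beyond an ssh client (the module and command names below are purely hypothetical):

```bash
ssh gadi "module load workbench && workbench start bigmem"   # hypothetical module/command names
ssh gadi "module load workbench && workbench info --json"    # machine-readable state for IDE extensions
```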

1 Like

I think the idea of a standard API is very good. As you mentioned, the API could be used by any IDE extension, or as a CLI from a terminal.

I would say the first step here is to get in contact with NCI, to understand whether there is scope for collaboration and to see how we can help with that, because I think this is something that should be managed (or at least approved) by NCI in the first place.

A good start would be to compile a list of all these custom solutions and requests, to show that the community really needs this.

  1. pbs-workbench (this tool)
  2. The previous center of excellence for extreme weather used to have this tool.
  3. Discussion here, including manual steps.
  4. Requests from researchers:
    1. Working with Jupyter notebooks on gadi/ARE via VS Code.
    2. https://forum.access-hive.org.au/t/opening-the-are-jupyterlab-session-on-vs-code/22

I believe @sam.green creates an interactive job and sshes into it. We could gather "testimonials" of other researchers' custom solutions.

Secondly, describe exactly what is missing from the ARE workflow. Some ideas are:

  1. IDEs limited to RStudio and Jupyter notebooks.
  2. The virtual desktop is very slow compared with native/local GUIs.
  3. Hardcoded modules loaded at spinup that cannot be changed afterwards (that is true for RStudio and Jupyter, not sure about the VDI).
    1. If in the process of data analysis you realise you need a new module, you have to close everything, add the module and spin the whole thing up again.
    2. This makes it hard to run the same code interactively in ARE and via the terminal or in jobs in the queue.
  4. Relies on an external service (what happens if ARE is down?).

Third, specify what "endpoints" we would need.

  1. Start a workbench with particular resources/profile for a particular amount of time.
  2. List running workbenches.
  3. Get information from a workbench:
    1. Resources
    2. Time spent
    3. Time left
    4. Node address
    5. SU cost?
    6. … Something else?
  4. Stop a workbench
  5. Save "profiles" for particular configurations (similar to ARE's ability to save settings)
  6. Modify existing profiles.
  7. …Something else?

For troubleshooting, it might be useful to get easy access to the logs.
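For what it's worth, a fair amount of the per-workbench information above is already exposed by standard PBS commands, which such an API could simply wrap:

```bash
# $jobid is the PBS job ID, e.g. 123456789.gadi-pbs
qstat -u "$USER"    # list your jobs (i.e. running workbenches)
qstat -f "$jobid"   # full info: exec_host (node address), Resource_List.*, resources_used.*
qdel "$jobid"       # stop a workbench
```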

1 Like

niice, Positron here we come

1 Like

Thank you @eliocamp for putting together this very good list of resources to get started.

I will give my opinions on some of the points you made, and then suggest an approach to go forward.

Background

Interactive vs non-interactive jobs

I think creating an interactive job is not necessary if a user wants to open the session in an IDE (JupyterLab, VSCode, Positron, etc.), because they would still need to connect the IDE to the compute node, and they don’t need the job to automatically start a terminal session on the compute node (as the interactive job does).
Also note that an interactive job would terminate immediately if the connected terminal session terminates (even if there is a temporary disconnection). A batch (non-interactive) job, instead, would continue running until the walltime is hit or the job is manually killed.
As such, I think it might be preferable for the default job to be non-interactive. Maybe there could be an optional flag (e.g., -I) to spin up an interactive job if the user only needs a terminal session.
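For concreteness, the two flavours look something like this (resource values and storage flags below are placeholders):

```bash
# Interactive: qsub blocks and drops you into a shell on the compute node;
# the job ends if that terminal session goes away.
qsub -I -q normal -l ncpus=4,mem=16GB,walltime=02:00:00,storage=gdata/xy12

# Batch: the job keeps running until walltime or qdel, and you connect to the node separately.
qsub workbench.sh
```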

IDE Support

“Out of the box” support for multiple IDEs would be good.
About virtual desktops being slow and clunky, I perfectly agree. This was the main reason I stopped using ARE in the first place.
The only real solution for this is using a local IDE to connect to the compute node, and a more streamlined way to do so would be very useful.

Using modules/kernels after spinup

This is not necessarily true. Modules can be loaded/unloaded even within a JupyterLab terminal session, as you would normally do on a login node. The only requirement, of course, is to have included the module's project folder in the storage directives when spinning up the job. This is a general requirement that would persist in any case: compute nodes can only "see" filesystem folders that are added as storage.
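For example, from a terminal inside the JupyterLab session (module names here are just an example, assuming gdata/hh5 was included in the job's storage):

```bash
module use /g/data/hh5/public/modules   # only visible if gdata/hh5 is in the storage directives
module load conda/analysis3
```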

If you are mainly referring to using specific kernels with the notebooks, Jupyter should be able to use kernels even without loading specific modules, as long as it has a kernel spec (kernel.json) to look at. In this sense, you should be able to create a kernel spec from any environment (for example, any conda/analysis3 Python environment) and use it within your JupyterLab session. This could, of course, also be automated so that kernels for the most commonly used environments are automatically detected by JupyterLab. These kernels could then be selected and used without needing to load any module.
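For example, from inside any Python environment that has ipykernel installed (the environment and kernel names here are just examples):

```bash
# Register the environment's kernel spec under ~/.local/share/jupyter/kernels/
python -m ipykernel install --user --name analysis3 --display-name "Python (analysis3)"
```

After that, JupyterLab lists the kernel regardless of which modules are loaded in the session itself.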

API Endpoints/commands

I agree on all the endpoints you listed.
In general, I would group them in “commands” such as:

  • job: Job control (Start, stop, list jobs, etc.)
  • profile: Profile control (Create, delete, edit user profiles, etc.)
  • resource: Resource control (Check resource usage, etc.)
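That is, something along the lines of the following, using the are entry-point name proposed further down (the subcommand names are illustrative):

```bash
are job start bigmem     # job control
are profile edit bigmem  # profile control
are resource usage       # resource control
```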

I also think it would be good to store all the information about a job (including logs) within a folder, similarly to how it’s done with the ~/ondemand folder for ARE.
I think job settings and options could be stored in JSON format, as it would allow them to be easily understood by humans and easily processed by multiple tools (jq, Python, TypeScript, etc.).
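As a purely illustrative sketch, a stored job record in that format could look something like this (all field names and values are made up):

```json
{
  "profile": "bigmem",
  "queue": "hugemem",
  "ncpus": 12,
  "mem": "190GB",
  "walltime": "04:00:00",
  "storage": "gdata/xy12+scratch/xy12",
  "jobid": "123456789.gadi-pbs",
  "log_dir": "~/workbench/logs/123456789"
}
```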

Plan to make this happen

What I am going to do first is contact NCI (@rui.yang) and check whether they would be willing to officially support such an ARE-like API.
Then, based on their involvement, we could start or help with the development.

A few technical ideas on the API:

  • I tried your pbs-workbench tool and checked the source code. I think the functionality and logic are a very good starting point for the API, but being written in bash makes it not very "friendly" to further develop and extend.
    I think a good language choice would be Python, because it strikes a good compromise between performance, scalability and easy/quick development.

  • As much as I like pbs-workbench as a name for the tool, if this ends up being an "officially supported" ARE-like CLI, I think a more appropriate name for it might be arecli or similar. Moreover, I think job as the main entry point might be a bit too generic and could bring confusion. Again, if this tool is supported as an ARE-like CLI, I think the more sensible entry point name could be are. job could be the subcommand for job control (see API endpoints/commands above). So a job could be started by running: are job start.

Ah, I haven't used JupyterLab enough to have run into that issue. My experience is mainly with RStudio server sessions, which don't support adding modules. (Technically, RStudio provides a terminal and you can load modules there, but the R session runs independently, so you can't use those modules from R.)

I still think that the ARE-like service shouldn't be in charge of modules. It should spin up a completely blank session, equivalent to what you'd get on a login node (save for storage requirements), and then the user should manage modules within that session.

The only caveat with this is that if settings are not saved as a regular bash file with PBS directives, then the API for editing those profiles needs to be very robust and user-friendly. The reason I stored them as regular bash files is that I (and presumably other users) would be more or less familiar with the format, so it would be easy to edit them by hand.

(Storing in other formats does open up some new opportunities. I've always wanted to be able to define storage "aliases". So, for example, cmip6 would translate to gdata/oi10+gdata/fs38, or era5 would translate to gdata/rt52. That would make the storage directives much more self-documenting and easier to write. Something like this might be out of scope for this tool.)
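For what it's worth, a tiny sketch of how that expansion could work, using the aliases from the example above (the mechanism itself is just illustrative):

```bash
# Expand a storage alias into the underlying storage directives.
expand_storage() {
  case "$1" in
    cmip6) echo "gdata/oi10+gdata/fs38" ;;
    era5)  echo "gdata/rt52" ;;
    *)     echo "$1" ;;   # pass real storage directives through unchanged
  esac
}

expand_storage cmip6   # -> gdata/oi10+gdata/fs38
```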

As the one who had to write and debug that messy code, I agree. The reason I used bash is that I didn't want to have to deal with any external software not available in a vanilla session. Otherwise I might've written it in R, since that's my comfort language. I'm not sure how an official ARE CLI would work in terms of managing dependencies. Could one load the are-cli module that requires a particular Python version and, at the same time, load a different Python version in the same session?

I agree about the name, although `session` might be a better subcommand than `job`. I used `job` because I'm super lazy and wanted a very short command.