Thank you @eliocamp for putting together this very good list of resources to get started.
I will give my opinions on some of the points you made, and then suggest an approach to go forward.
Background
Interactive vs non-interactive jobs
I don't think an interactive job is necessary when a user wants to open the session in an IDE (JupyterLab, VSCode, Positron, etc.): they would still need to connect the IDE to the compute node, and they don't need the job to automatically start a terminal session on the compute node (which is what an interactive job does).
Also note that an interactive job terminates immediately if the connected terminal session ends (even after a temporary disconnection). A batch (non-interactive) job, instead, keeps running until the walltime is hit or the job is manually killed.
As such, I think it might be preferable for the default job to be non-interactive. Maybe there could be an optional flag (e.g., -I) to spin up an interactive job if the user only needs a terminal session.
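As a minimal sketch of what "batch by default, interactive on request" could look like, here is a hypothetical helper that builds the qsub command. The `-I` and `-l walltime=` flags are real PBS options; the function name and its signature are illustrative only, not part of any existing tool.

```python
def submit_job(script: str, interactive: bool = False,
               walltime: str = "01:00:00") -> list:
    """Build a qsub command line for a job.

    By default the job is a batch job, which keeps running even if the
    user's terminal disconnects; interactive=True requests an interactive
    session instead (qsub -I), which PBS ties to the terminal.
    """
    cmd = ["qsub", "-l", f"walltime={walltime}"]
    if interactive:
        cmd.append("-I")      # interactive: PBS opens a shell on the compute node
    else:
        cmd.append(script)    # batch: PBS runs the script unattended
    return cmd
```

The command list would then be handed to something like `subprocess.run`; the point is only that the interactive/batch choice reduces to a single optional flag.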
IDE Support
“Out of the box” support for multiple IDEs would be good.
About virtual desktops being slow and clunky, I fully agree. This was the main reason I stopped using ARE in the first place.
The only real solution for this is using a local IDE to connect to the compute node, and a more streamlined way to do so would be very useful.
Using modules/kernels after spinup
This is not necessarily true. Modules can be loaded/unloaded even within a JupyterLab terminal session, just as you would on a login node. The only requirement, of course, is to have included the module's project folder in the storage when spinning up the job. This is a general requirement that would persist in any case: compute nodes can only “see” filesystem folders that are added as storage.
If you are mainly referring to using specific kernels with notebooks, Jupyter can use kernels even without loading specific modules, as long as it has a kernel spec (kernel.json) to look at. In this sense, you should be able to create a kernel spec from any environment (for example, any conda/analysis3 Python environment) and use it within your JupyterLab session. This could also be automated, so that kernels for the most commonly used environments are detected by JupyterLab out of the box. These kernels could then be selected and used without needing to load any module.
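To illustrate, here is a sketch of generating such a kernel spec programmatically. The kernel.json fields (`argv`, `display_name`, `language`) are the standard Jupyter kernel spec keys; the helper name, the environment path, and the target directory are placeholders you would adapt to your setup.

```python
import json
from pathlib import Path

def write_kernel_spec(env_python: str, name: str, display_name: str,
                      kernels_dir: str) -> Path:
    """Write a minimal kernel.json so Jupyter can launch the given
    environment's Python as a notebook kernel, without loading any module."""
    spec = {
        # Command Jupyter runs to start the kernel; {connection_file} is
        # substituted by Jupyter itself at launch time.
        "argv": [env_python, "-m", "ipykernel_launcher",
                 "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }
    kernel_dir = Path(kernels_dir) / name
    kernel_dir.mkdir(parents=True, exist_ok=True)
    spec_file = kernel_dir / "kernel.json"
    spec_file.write_text(json.dumps(spec, indent=2))
    return spec_file
```

Pointing `kernels_dir` at a location on Jupyter's kernel search path (e.g. under `~/.local/share/jupyter/kernels`) would make the kernel appear in the JupyterLab launcher automatically.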
API Endpoints/commands
I agree on all the endpoints you listed.
In general, I would group them in “commands” such as:
- job: Job control (start, stop, list jobs, etc.)
- profile: Profile control (create, delete, edit user profiles, etc.)
- resource: Resource control (check resource usage, etc.)
I also think it would be good to store all the information about a job (including logs) within a folder, similarly to how it’s done with the ~/ondemand folder for ARE.
I think job settings and options could be stored in JSON format, as that would make them easily readable by humans and easily processed by multiple tools (jq, Python, TypeScript, etc.).
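A minimal sketch of that layout, assuming a per-job folder (in the spirit of ARE's ~/ondemand directory) holding a settings.json next to the logs. The field names and folder structure are hypothetical, just to show the JSON round trip.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class JobSettings:
    # Illustrative fields; a real schema would mirror the qsub options.
    name: str
    queue: str
    walltime: str
    storage: list

def save_job(settings: JobSettings, jobs_root: str) -> Path:
    """Store a job's settings (and, later, its logs) in its own folder."""
    job_dir = Path(jobs_root) / settings.name
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "settings.json").write_text(
        json.dumps(asdict(settings), indent=2))
    return job_dir

def load_job(jobs_root: str, name: str) -> JobSettings:
    """Read the settings back; any JSON-aware tool (jq, etc.) can do the same."""
    data = json.loads((Path(jobs_root) / name / "settings.json").read_text())
    return JobSettings(**data)
```

Because the file is plain JSON, a shell user could inspect the same data with, say, `jq .walltime settings.json`, which is the interoperability argument above.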
Plan to make this happen
The first thing I am going to do is contact NCI (@rui.yang) and check whether they would be willing to officially support such an ARE-like API.
Then, based on their involvement, we could start or help with the development.
A few technical ideas on the API:
- I tried your pbs-workbench tool and checked the source code. I think the functionality and logic are a very good starting point for the API, but being written in bash makes it not too “friendly” to develop and extend further. I think a good language choice would be Python, because it strikes a good compromise between performance, scalability, and easy/quick development.
- As much as I like pbs-workbench as a name for the tool, if this ends up being an officially supported ARE-like CLI, I think a more appropriate name for it might be arecli or similar. Moreover, I think job as the main entry point might be a bit too generic and could cause confusion. Again, if this tool is supported as an ARE-like CLI, a more sensible entry point name could be are, with job as the subcommand for job control (see API endpoints/commands above). A job could then be started by running: are job start.
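To make the proposed command layout concrete, here is a small Python sketch of an are entry point with the command groups as subcommands, using the standard library's argparse. All command and option names are illustrative, taken from the grouping suggested above, not a final design.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of an 'are' CLI: one subcommand per command group
    (job / profile / resource), with actions nested underneath."""
    parser = argparse.ArgumentParser(
        prog="are", description="ARE-like CLI for PBS jobs (sketch)")
    groups = parser.add_subparsers(dest="group", required=True)

    # job: job control
    job = groups.add_parser("job", help="Job control")
    job_actions = job.add_subparsers(dest="action", required=True)
    start = job_actions.add_parser("start", help="Start a job")
    start.add_argument("-I", "--interactive", action="store_true",
                       help="Spin up an interactive job instead of a batch job")
    job_actions.add_parser("stop", help="Stop a job")
    job_actions.add_parser("list", help="List jobs")

    # profile: profile control
    profile = groups.add_parser("profile", help="Profile control")
    profile_actions = profile.add_subparsers(dest="action", required=True)
    profile_actions.add_parser("create", help="Create a user profile")

    # resource: resource control
    groups.add_parser("resource", help="Resource control")
    return parser
```

With this layout, `are job start` parses to group "job" and action "start", and the optional `-I` flag discussed earlier slots naturally onto the start action.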