SCIDIR: A scientific software distribution repository for bringing reproducible software containers securely to HPCs in Australia

Aidan · 27 October 2023 01:30

Abstract

Extracted from here

Presenting the idea behind SciDir – A scientific software distribution repository for bringing reproducible software containers securely to HPCs in Australia

Steffen Bollmann1, Aswin Narayanan2, Sarah Beecroft3, Greg D’Arcy5, Nigel Ward6, Peter Marendy4

1The University Of Queensland, St Lucia, QLD, Australia
2National Imaging Facility, St Lucia, QLD, Australia
3Pawsey Supercomputing Research Centre, Kensington, WA, Australia
4The Queensland Cyber Infrastructure Foundation (QCIF), St Lucia, QLD, Australia
5AARNet, Chatswood, NSW, Australia
6Australian BioCommons, North Melbourne, Victoria, Australia

Abstract

Situation: The analysis of scientific data requires specialised scientific software and processing pipelines. However, researchers often spend inordinate amounts of time compiling the requisite software and troubleshooting dependency conflicts. Furthermore, research results are often difficult to reproduce due to system dependency differences, even when given the original data and analysis code.

Task: In this project, we aim to develop an open-source, community-oriented project that addresses the issues of accessibility and reproducibility of scientific software. In this session, we would like to present the idea, receive feedback from the community and plan how we can work together toward an implementation.

Action: We are proposing to build on previous work around containers on CVMFS in the Neurodesk project and BioCommons to develop a secure scientific software distribution system. The proposed platform consists of a software container build system, where the scientific community proposes software applications and reference datasets. These artefacts are built, packaged in software containers, and scanned for vulnerabilities before being uploaded to a container registry. The software container metadata is stored in a database for fast and transparent tool discovery. A flexible distribution mechanism will enable this software to be used on various computing endpoints.

Result: Our approach would accelerate progress in all scientific disciplines dealing with the processing of data on high-performance computers. It would enable the flexible processing of scientific data across different computing platforms and the portability of analyses between them.
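The Action paragraph above mentions storing container metadata in a database for tool discovery. As a rough illustration only, the sketch below shows what such a catalogue could look like using Python's built-in sqlite3 module. The table layout, the function names (`open_catalogue`, `register`, `discover`) and the `registry.example.org` URIs are all hypothetical and are not taken from the SciDir proposal.

```python
import sqlite3


def open_catalogue():
    """Create an in-memory catalogue (hypothetical schema, for illustration)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE containers (
            name         TEXT,
            version      TEXT,
            category     TEXT,
            registry_uri TEXT,
            scan_passed  INTEGER,
            PRIMARY KEY (name, version)
        )
        """
    )
    return db


def register(db, name, version, category, registry_uri, scan_passed):
    """Record one built container and the outcome of its vulnerability scan."""
    db.execute(
        "INSERT INTO containers VALUES (?, ?, ?, ?, ?)",
        (name, version, category, registry_uri, int(scan_passed)),
    )


def discover(db, category):
    """Return (name, version, registry_uri) for containers that passed scanning."""
    rows = db.execute(
        "SELECT name, version, registry_uri FROM containers "
        "WHERE category = ? AND scan_passed = 1 ORDER BY name, version",
        (category,),
    )
    return rows.fetchall()
```

In this sketch a container that fails its scan is still recorded but never returned by `discover`, which is one possible way to keep the catalogue transparent while only advertising vetted images.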

Notes

Overview

Steffen Bollmann presented a fascinating and informative description of the Neurodesk software platform:

"A flexible and scalable data analysis environment for reproducible neuroimaging with Neurodesk"

which is part of the Australian Electrophysiology Data Analytics PlaTform (AEDAPT) project.

Focus on containers
Automated vulnerability scanning
Functional correctness
Systematic mechanism to capture metadata for discovery or citation
No production CVMFS deployment in Australia. Neurodesk set one up itself; a nationally supported deployment is needed
Neurodocker: wrapper for Docker
Neurodesk/neurocontainers repo
Cirun provides runners large enough to build large containers
Periodic rebuild for new software
Functional testing often breaks with updates
bio.tools website, backed by ELIXIR
Upload Singularity file to Zenodo and mint a DOI?
Using ARDC Harbor and GHCR (GitHub Container Registry)
SHPC (Singularity Registry HPC) can automatically detect binaries and expose them via modules
containers within containers!

Key technologies

Docker

EasyBuild

EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way.

EESSI

The European Environment for Scientific Software Installations (EESSI, pronounced as “easy”) is a collaboration between different European partners in the HPC community.

The goal of this project is to build a common stack of scientific software installations for HPC systems and beyond, including laptops, personal workstations and cloud infrastructure.

CernVM File System (CVMFS)

The CernVM File System provides a scalable, reliable and low-maintenance software distribution service. It was developed to assist High Energy Physics (HEP) collaborations to deploy software on the worldwide-distributed computing infrastructure used to run data processing applications. CernVM-FS is implemented as a POSIX read-only file system in user space (a FUSE module). Files and directories are hosted on standard web servers and mounted in the universal namespace /cvmfs.

Internally, CernVM-FS uses content-addressable storage and Merkle trees to store file data and metadata. CernVM-FS uses outgoing HTTP connections only, thereby avoiding most of the firewall issues of other network file systems. It transfers data and metadata on demand and verifies data integrity with cryptographic hashes.
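To make the idea of content-addressable storage with a Merkle tree more concrete, here is a minimal Python sketch. It is not CernVM-FS code and does not reflect its on-disk format; the `ContentStore` class, the `store_directory` helper, the listing format and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: objects are keyed by the hash of their bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._objects[digest] = data  # identical content is stored only once
        return digest

    def get(self, digest: str) -> bytes:
        data = self._objects[digest]
        # Re-hash on read so corrupted or tampered content is detected.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"integrity check failed for {digest}")
        return data


def store_directory(store: ContentStore, entries: dict) -> str:
    """Store a directory tree and return the hash of its root listing.

    `entries` maps names to bytes (a file) or to a nested dict (a subdirectory).
    Each listing records the hashes of its children, so the root hash changes
    whenever any file below it changes (the Merkle tree property).
    """
    lines = []
    for name in sorted(entries):
        child = entries[name]
        if isinstance(child, dict):
            lines.append(f"dir {name} {store_directory(store, child)}")
        else:
            lines.append(f"file {name} {store.put(child)}")
    return store.put("\n".join(lines).encode())
```

Because a client in this sketch only needs to trust the root hash, everything fetched beneath it can be checked locally, which is the general reason such a design works well over plain HTTP servers and caches.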
