SCIDIR: A scientific software distribution repository for bringing reproducible software containers securely to HPCs in Australia

Abstract


Presenting the idea behind SciDir – A scientific software distribution repository for bringing reproducible software containers securely to HPCs in Australia

Steffen Bollmann¹, Aswin Narayanan², Sarah Beecroft³, Greg D’Arcy⁵, Nigel Ward⁶, Peter Marendy⁴

¹The University of Queensland, St Lucia, QLD, Australia
²National Imaging Facility, St Lucia, QLD, Australia
³Pawsey Supercomputing Research Centre, Kensington, WA, Australia
⁴The Queensland Cyber Infrastructure Foundation (QCIF), St Lucia, QLD, Australia
⁵AARNet, Chatswood, NSW, Australia
⁶Australian BioCommons, North Melbourne, VIC, Australia

Abstract

Situation: The analysis of scientific data requires specialised scientific software and processing pipelines. However, researchers often spend inordinate amounts of time compiling the requisite software and troubleshooting dependency conflicts. Furthermore, research results are often difficult to reproduce due to system dependency differences, even when given the original data and analysis code.

Task: In this project, we aim to develop an open-source, community-oriented project that addresses the issues of accessibility and reproducibility of scientific software. In this session, we would like to present the idea, receive feedback from the community and plan how we can work together toward an implementation.

Action: We are proposing to build on previous work around containers on CVMFS in the Neurodesk project and BioCommons to develop a secure scientific software distribution system. The proposed platform consists of a software container build system, where the scientific community proposes software applications and reference datasets. These artefacts are built, packaged in software containers, and scanned for vulnerabilities before being uploaded to a container registry. The software container metadata is stored in a database for fast and transparent tool discovery. A flexible distribution mechanism will enable this software to be used on various computing endpoints.

Result: Our approach would accelerate progress in all scientific disciplines dealing with the processing of data on high-performance computers. It would enable the flexible processing of scientific data across different computing platforms and the portability of analyses between them.

Notes

Overview

Steffen Bollmann presented a fascinating and informative description of the Neurodesk software platform:

"A flexible and scalable data analysis environment for reproducible neuroimaging with Neurodesk"

which is part of the Australian Electrophysiology Data Analytics PlaTform (AEDAPT) project.

  • Focus on containers
  • Automated vulnerability scanning
  • Functional correctness
  • Systematic mechanism to capture metadata for discovery or citation
  • No production CVMFS deployment exists in Australia. Neurodesk set up its own, but a nationally supported deployment is needed
  • Neurodocker as a wrapper for Docker
  • Neurodesk/neurocontainers repository
  • Cirun provides runners large enough to build large containers
  • Periodic rebuilds to pick up new software
  • Functional tests often break with updates
  • bio.tools website, backed by ELIXIR
  • Upload the Singularity file to Zenodo and mint a DOI?
  • Using ARDC Harbor and GHCR (GitHub Container Registry)
  • SHPC can automatically detect binaries and expose them via modules
  • Containers within containers!
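
The metadata point above, together with the abstract's mention of a database for tool discovery, could be sketched roughly as follows. This is a minimal illustration only, not the SciDir or Neurodesk design: the table schema, the function names (`create_catalogue`, `register`, `search`) and the `registry.example.org` image URIs are all hypothetical, and an in-memory SQLite database stands in for whatever store a real deployment would use.

```python
import sqlite3


def create_catalogue():
    """Create an in-memory catalogue of containers (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE containers (
            name      TEXT NOT NULL,
            version   TEXT NOT NULL,
            category  TEXT NOT NULL,
            image_uri TEXT NOT NULL,
            doi       TEXT,            -- optional, e.g. minted after archiving
            PRIMARY KEY (name, version)
        )
        """
    )
    return conn


def register(conn, name, version, category, image_uri, doi=None):
    """Record one built container; the transaction commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO containers VALUES (?, ?, ?, ?, ?)",
            (name, version, category, image_uri, doi),
        )


def search(conn, term):
    """Return (name, version, image_uri) rows whose name or category contains term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT name, version, image_uri FROM containers "
        "WHERE name LIKE ? OR category LIKE ? "
        "ORDER BY name, version",
        (pattern, pattern),
    )
    return rows.fetchall()


# Placeholder entries; tool names and registry host are made up for illustration.
catalogue = create_catalogue()
register(catalogue, "exampletool", "1.0.0", "imaging",
         "registry.example.org/exampletool:1.0.0")
register(catalogue, "othertool", "2.3", "genomics",
         "registry.example.org/othertool:2.3")

for row in search(catalogue, "imaging"):
    print(row)
```

Keying each record on name and version keeps every rebuilt release separately discoverable and citable, which is one way to read the "discovery or citation" requirement noted above.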

Key technologies

  • Docker

EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way.

The European Environment for Scientific Software Installations (EESSI, pronounced as “easy”) is a collaboration between different European partners in the HPC community.

The goal of this project is to build a common stack of scientific software installations for HPC systems and beyond, including laptops, personal workstations and cloud infrastructure.

The CernVM File System provides a scalable, reliable and low-maintenance software distribution service. It was developed to assist High Energy Physics (HEP) collaborations to deploy software on the worldwide-distributed computing infrastructure used to run data processing applications. CernVM-FS is implemented as a POSIX read-only file system in user space (a FUSE module). Files and directories are hosted on standard web servers and mounted in the universal namespace /cvmfs.

Internally, CernVM-FS uses content-addressable storage and Merkle trees to store file data and meta-data. CernVM-FS uses outgoing HTTP connections only, thereby avoiding most of the firewall issues of other network file systems. It transfers data and meta-data on demand and verifies data integrity with cryptographic hashes.
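
To make the ideas of content-addressable storage, hash verification and Merkle trees more concrete, here is a small generic sketch. It is not CernVM-FS code and does not reflect its on-disk format or catalogue structure: the `ContentStore` class, the `merkle_root` helper, the simple binary tree layout and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex digest used as the content address (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy content-addressable store: objects are keyed by the hash of their bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Identical content always maps to the same key, so it is stored once.
        key = digest(data)
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        # Integrity check: recompute the hash and compare it with the address.
        if digest(data) != key:
            raise ValueError(f"integrity check failed for object {key}")
        return data


def merkle_root(leaf_hashes):
    """Combine leaf hashes pairwise until a single root hash remains."""
    if not leaf_hashes:
        return digest(b"")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [
            digest((level[i] + level[i + 1]).encode())
            for i in range(0, len(level), 2)
        ]
    return level[0]


store = ContentStore()
keys = [store.put(chunk) for chunk in (b"file one", b"file two", b"file three")]
root = merkle_root(keys)

print("objects stored:", len(keys))
print("root hash:", root)
print("round trip ok:", store.get(keys[0]) == b"file one")
```

Because the root hash depends on every leaf hash, a client that trusts the root can detect a change to any single object, which is the general property that makes this structure attractive for distributing software over plain web servers.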