DMTN-218: The LSST Science Pipelines Build System

  • Kian-Tat Lim

Latest Revision: 2022-06-02

1   Introduction

The LSST Science Pipelines “stack” is a large software system composed of dozens of packages. The build system, based on the Jenkins tool, tests this system, certifies releases, and creates and publishes binary artifacts. This document describes the current state of the system, exposing some of its complexities.

2   Functions

The build system is responsible for several functions, each of which is represented by a pipeline within the Jenkins system:

  • Manually-triggered integration testing (“stack-os-matrix”)
  • Nightly and weekly automated releases (“nightly-release” and “weekly-release”)
  • Nightly “clean build” tests, some with extended testing (“ci_hsc”, “ci_imsim”, and “lsst_distrib”)
  • Release candidate and official release builds (“official-release”)

The release processes, whether nightly, weekly, or official, have multiple subsidiary tasks, some of which are Jenkins pipelines and some of which are individual pipeline stages:

  • Build the software, ensuring it passes its unit tests.
  • Run limited integration tests (by building and testing the “lsst_ci” product, which in turn includes the “pipelines_check” product).
  • Tag the GitHub repositories that make up the software to indicate the release content.
  • Publish source packages, an environment file, and a tag file to the eups.lsst.codes distribution server, allowing others to use “eups distrib install” to install the release.
  • Generate and publish tarball binary packages for Linux and macOS to the eups.lsst.codes distribution server to speed up installations.
  • Generate and publish a Docker container containing the software to hub.docker.com and Google Artifact Registry (and potentially other registries).
  • Execute characterization jobs (“ap_verify” and “verify_drp_metrics”) that generate metrics about the quality of the Alert Production and Data Release Production components of the release and push those metrics to the SQuaSH system.
  • Trigger, via a GitHub Action, building and publication of the JupyterLab container image for the Rubin Science Platform Notebook Aspect.

In addition, there is a separate system that maintains the “shared stack” installation of the Science Pipelines on the developer systems at NCSA.

3   Components

3.1   Jenkins (ci.lsst.codes)

Control (the “jenkins-master” node) is hosted in the Elastic Compute Cloud (EC2) in Amazon Web Services (AWS); this serves the Jenkins UI at https://ci.lsst.codes.

Workers are “agent-ldfc” pods running on the NCSA k8s-devel Kubernetes (K8s) cluster. They make up a StatefulSet called “agent-ldfc”. (These pods may have been configured using Terraform somewhere but almost certainly have had many manual changes not reflected in the original Terraform.) Each pod is composed of a docker-in-docker container (https://github.com/lsst-sqre/docker-dind/), a docker-gc container that cleans up old docker containers and images (https://github.com/lsst-sqre/docker-docker-gc), and the main swarm container (https://github.com/lsst-sqre/docker-jenkins-swarm-client) that actually communicates with the Jenkins central control and executes pipelines. The pods use disk space at /project/jenkins/prod, with each subdirectory allocated as a separate PersistentVolume (PV) in K8s. This space is located on GPFS at NCSA, but it is mounted via NFS from lsst-nfs.ncsa.illinois.edu in order to protect the filesystem from potential problems caused by the use of root in the Jenkins containers.

Workers are also installed on macOS machines located in the Tucson AURA headquarters machine room. These machines are named mac1-6.lsst.cloud. The Jenkins UI is used to start and configure workers on these machines, launching agents via ssh to the “square” account using credentials stored in the “sqre-osx” Jenkins secret.

There is a final “snowflake” worker used for release builds that also runs in EC2.

3.2   eups.lsst.codes

The primary publication location for the eups distribution server is a Simple Storage Service (S3) bucket at AWS. Publication occurs in certain pipelines via the “aws s3 cp” command using credentials stored in the “aws-cmirror-push” Jenkins secret. This command is obtained from the lsstsqre/awscli:latest container, which is in turn built from a Dockerfile in lsst-sqre/docker-awscli by the monthly-triggered sqre/infra/build-awscli Jenkins job.
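
A sketch of the kind of publication step involved, assuming a locally generated “distribution server” directory and a placeholder bucket name (neither reflects the actual pipeline configuration):

    # Sync a locally built "distribution server" directory to the S3 bucket.
    # Bucket name and prefix are illustrative; in the real pipelines the
    # credentials come from the "aws-cmirror-push" Jenkins secret.
    export AWS_ACCESS_KEY_ID=...       # from Jenkins credentials
    export AWS_SECRET_ACCESS_KEY=...   # from Jenkins credentials
    aws s3 cp --recursive ./distserver/ s3://eups-example-bucket/stack/src/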

The data in that bucket is replicated via code in https://github.com/lsst-sqre/terraform-scipipe-publish/tree/master/s3sync to a Persistent Disk filesystem attached to a deployment in Google Kubernetes Engine (GKE) at Google Cloud Platform (GCP).

The GKE deployment runs a vanilla Apache container as specified in https://github.com/lsst-sqre/terraform-scipipe-publish/blob/master/tf/modules/pkgroot/pkgroot-deploy.tf along with an nginx ingress in order to support HTTPS. The result is https://eups.lsst.codes.

3.3   hub.docker.com

The primary publication location for Docker containers is DockerHub at hub.docker.com. Containers are published using credentials stored in the “dockerhub-sqreadmin” Jenkins secret. The same containers are published to Google Artifact Registry at GCP using credentials stored in the “google_archive_registry_sa” Jenkins secret.

3.4   GitHub

Release pipelines tag GitHub repositories to clearly designate what versions of the source code were incorporated into a release build. These tagging operations use lsst-sqre/codekit as well as credentials in the “github-api-token-sqreadmin” Jenkins secret.

3.5   GitHub Actions

GitHub Actions (GHA) workflows in each repository are used to perform simple “lint”-style syntax checking and in certain cases more extensive tests. Because each Science Pipelines package typically depends on many others, and because they frequently change together as well as separately, it is not considered feasible to have per-repository GHA workflows build and test each package.

3.6   LSST-the-Docs

Certain pipelines publish documentation via the LSST-the-Docs system. Credentials for this are in the “ltd-mason-aws” and “ltd-keeper” Jenkins secrets.

3.7   Slack

Most pipelines publish notifications of start, success, and failure to Slack channels using a custom Groovy library that uses the Slack API. Credentials for this are in the “ghslacker” Jenkins secret.

3.8   SQuaSH

Release pipelines measure certain metrics based on applying the Science Pipelines code to known data. These metrics are pushed to a metrics dashboard system known as SQuaSH using the lsst/verify framework. This framework takes credentials for an API endpoint which are stored in the “squash-api-user” Jenkins secret.

3.9   conda-forge

The third-party dependencies (Python and C++) of the Science Pipelines are, to the extent possible, installed in a conda environment via the rubin-env metapackage from the conda-forge channel. conda-forge is used because it has strong policies around maintaining consistency and interoperability of the packages it publishes.
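
For example, a developer can reproduce the third-party dependency environment directly from conda-forge; a minimal sketch (the environment name is arbitrary, and a specific rubin-env version can be pinned if desired):

    # Create and activate a conda environment containing the Science Pipelines
    # third-party dependencies from the conda-forge channel.
    conda create -y -n lsst-scipipe -c conda-forge rubin-env
    conda activate lsst-scipipe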

Matthew Becker takes weekly and official releases of the Science Pipelines and builds them into a single conda-forge package called “stackvana”.

3.10   CernVM-FS

CernVM-FS is a globally-distributed, locally-cached read-only shared POSIX filesystem. CC-IN2P3 takes tagged weekly and official release source packages in the eups distribution server and rebuilds them into a binary “stack” installation in CernVM-FS, including a base rubin-env conda environment and an extended one with additional convenience packages. Singularity container images are also produced and stored in this system. Other artifacts could be similarly published.
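
As an illustration, using this distribution on a machine with the CernVM-FS client configured looks roughly like the following (the repository path and weekly tag are examples and may not match the current layout):

    # Activate a weekly Science Pipelines build published on CernVM-FS.
    source /cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_22/loadLSST.bash
    setup lsst_distrib   # eups: make the release's packages available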

Because it is a shared filesystem, it is easy to ensure that developer systems and batch production worker systems share the same view of the software to be executed. This makes CernVM-FS an attractive software distribution mechanism for user-level applications that do not need the OS-level packaging and isolation that containers provide. Note that while it is not a container registry per se, as mentioned, container images can still be usefully disseminated via CernVM-FS.

3.11   lsst-sqre/ci-scripts

This repo contains four scripts:

  • create_xlinkdocs.sh runs the doxygen build for the entire stack, resulting in doxygen.lsst.codes. It is invoked by lsstswBuild.sh.
  • jenkins_wrapper.sh translates from Jenkins-specified environment variables to script arguments for lsstswBuild.sh. It executes deploy from lsstsw to prepare the build tree and environment.
  • lsstswBuild.sh invokes envconfig from lsstsw to initialize the conda environment and then invokes rebuild to actually perform the build. If successful, it runs the doxygen build using create_xlinkdocs.sh.
  • run_verify_drp_metrics.sh sets up the code in faro and a dataset and then runs a dataset-dependent script to generate metrics by analyzing the results of running pipeline algorithms on that dataset. This is triggered by the “verify_drp_metrics” post-release job in Jenkins.

3.12   lsst/lsstsw

This repo contains code that was originally intended to handle the process of publishing source and binary tarball packages to the eups distribution server. It has since expanded to be a more general-purpose multi-package build tool for the Science Pipelines. Information on it is available at https://developer.lsst.io/stack/lsstsw.html.

The primary scripts here are:

  • deploy, which installs needed code including conda, the rubin-env environment, and the lsst_build tool.
  • rebuild, which uses lsst_build to prepare eups package sources and then build them.
  • publish, which takes an existing eups installation and creates distribution server packages, tag files, and environment listings in a separate directory. This “distribution server” directory is ready to be mirrored to the real Web-hosted distribution server.

Some configuration information for the scripts is contained in etc/settings.cfg.sh. The etc/manifest.remap file must contain the names of all packages that use Git LFS, as they cannot be packaged normally by eups. etc/exclusions.txt is likely vestigial.
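
A typical manual invocation of these scripts, roughly following the developer guide (the product name is an example), is:

    # Bootstrap lsstsw and build a product from source.
    git clone https://github.com/lsst/lsstsw
    cd lsstsw
    ./bin/deploy                # install conda, rubin-env, and lsst_build
    source bin/envconfig        # activate the conda environment
    ./bin/rebuild lsst_distrib  # clone, set up, and build the product tree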

The lsst/versiondb repo is used to maintain records of the versions of packages that have had builds attempted. See the README file in lsst/lsst_build for more information.

3.13   lsst/lsst_build

This repo is used by lsst/lsstsw. It contains Python code to rapidly clone all of the packages needed to build a Science Pipelines product, given the git repository configuration in lsst/repos, check out appropriate git refs in each clone, and then invoke eupspkg to build them if needed.

3.14   lsst/lsst

This repo contains the newinstall.sh and lsstinstall scripts that create the appropriate environment for using eups distrib to install Science Pipelines packages, either from source or from binary tarballs. They install conda and the rubin-env environment, configure an eups “stack” location, and create a script that can be sourced to activate this environment in a shell.
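
A sketch of the end-user installation flow that these scripts enable (the download path, activation script name, and the “w_latest” tag are illustrative):

    # Bootstrap an environment and install a published release from the
    # eups distribution server.
    curl -OL https://raw.githubusercontent.com/lsst/lsst/main/scripts/lsstinstall
    bash lsstinstall
    source loadLSST.bash                           # the activation script created above
    eups distrib install -t w_latest lsst_distrib  # install a tagged release
    setup lsst_distrib                             # eups: make it active in this shell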

3.15   eups, eupspkg, and eups distrib

eups is the package manager used by the Science Pipelines. It enables flexible combinations of versions of packages, including under-development versions. Some information about it is available at https://developer.lsst.io/stack/eups-tutorial.html

eupspkg is the tool within eups that builds source and binary packages. It has extensive documentation in a docstring within https://github.com/RobertLuptonTheGood/eups/blob/master/python/eups/distrib/eupspkg.py. Note that there are two kinds of source packages: “git” and “package”. “git” packages merely refer to a particular repo and so use much less space on the distribution server but somewhat more space on the installing client. “package” packages include a complete copy of the source code, so they use much more space on the distribution server but less space on the client.

eups distrib is an independent module within eups that handles interactions with a distribution server that provides source and/or binary packages. There are several types, but we currently use only the eupspkg variety, as specified in https://eups.lsst.codes/stack/src/config.txt. Note that the binary tarball servers also have similar configuration files, such as https://eups.lsst.codes/stack/osx/10.9/conda-system/miniconda3-py38_4.9.2-2.0.0/config.txt.

3.16   sconsUtils

sconsUtils is the library of code used with the scons build tool that customizes it for Science Pipelines use. It standardizes handling of C++ and Python code as well as documentation, tests, and eups packaging information. In addition to package dependencies from eups table files, it also uses special ups/*.cfg files to track dependency information, particularly for C++. (However, dependency information for C++-accessible shared libraries in the rubin-env conda environment is obtained from sconsUtils/configs, not from ups directories.)
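
For an individual package checkout, the build that sconsUtils drives is invoked roughly as follows (the package name is an example):

    # Build and test a single Science Pipelines package against an existing stack.
    cd afw          # example package clone
    setup -k -r .   # eups: set up this package, keeping other current setups
    scons -j 4      # sconsUtils: compile C++, build Python wrappers, run unit tests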

4   Docker Containers

Several containers are published via the build system.

4.1   newinstall

The “newinstall” container contains the conda environment used for the Science Pipelines. Since this environment changes much less frequently than the Science Pipelines code, it saves time and space to have it as a base container. This container is built by the “sqre/infra/build-newinstall” job, which is triggered on updates to the “lsst/lsst” GitHub repository or manually whenever desired. Typically it would be triggered when a new build of the rubin-env conda environment becomes available that might fix a (temporary) problem in a previous container build.

Note that the build-newinstall job builds the version of the rubin-env environment that is specified in etc/scipipe/build-matrix.yaml, not the default in newinstall itself. The container is pushed with a tag containing that version, as well as a “latest” tag (typically enabled).
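
The publication step amounts to standard Docker tagging and pushing; a sketch, with an illustrative image name and rubin-env version:

    # Build the base container and push it with a rubin-env version tag and,
    # typically, a "latest" tag.
    docker build -t lsstsqre/newinstall:5.0.0 .
    docker tag lsstsqre/newinstall:5.0.0 lsstsqre/newinstall:latest
    docker push lsstsqre/newinstall:5.0.0
    docker push lsstsqre/newinstall:latest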

4.2   centos

The “centos” container contains the LSST Science Pipelines code in “minimized” form. The lsst-sqre/docker-tarballs Dockerfile is used to install a “stack” from binary tarballs and then to strip out debugging symbols, test code, documentation in HTML and XML form, and C++ source code. The “shebangtron” script that fixes “#!” lines in Python scripts is also executed.
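
The kind of pruning performed is sketched below; the install prefix and rules are illustrative and do not reproduce the actual Dockerfile:

    # Shrink an installed stack: drop debug symbols, test code, and HTML/XML
    # documentation (the real Dockerfile also removes C++ source code).
    STACK=/opt/lsst/software/stack
    find "$STACK" -name '*.so' -exec strip --strip-debug {} +
    find "$STACK" -type d \( -name tests -o -name doc \) -prune -exec rm -rf {} +
    find "$STACK" \( -name '*.html' -o -name '*.xml' \) -delete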

4.3   sciplat-lab

Jenkins used to build the sciplat-lab containers used by the Rubin Science Platform directly, but it now merely triggers a certain GitHub Action using the “github-api-token-sqreadmin” credentials.

5   Jenkins Pipelines

Most of these pipelines use complex Groovy scripts to describe their stages and steps. One technique used frequently is to place the main activity of the stage within a “run()” function, write a dynamic Dockerfile, build a Docker container from it, and then execute the “run()” function within that Docker container. This provides isolation at the cost of some complexity.
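
In shell terms, the pattern is roughly equivalent to the following; the Groovy library does this programmatically, and the base image and stage script named here are hypothetical:

    # Write a Dockerfile on the fly, build an image from it, and run the
    # stage's work inside the resulting container for isolation.
    printf 'FROM centos:7\nRUN useradd -m jenkins\nUSER jenkins\n' > Dockerfile
    docker build -t stage-env .
    docker run --rm -v "$PWD:/work" -w /work stage-env ./run_stage.sh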

Much of the common pipeline code is found in the large library “pipeline/lib/util.groovy”.

5.1   Bootstrap

5.1.1   sqre/seeds/dm-jobs

Most pipelines are written in Groovy and have two components: a “job” component that defines parameters for the pipeline and its triggers, and a “pipeline” component that defines the stages and steps to be executed.

The “seeds” pipeline installs all of the “job” components in the Jenkins configuration, allowing it to be defined by code rather than manual manipulation of the GUI. It must be rerun any time a “job” component is modified. It does not need to be rerun when a “pipeline” component is modified, as those are dynamically loaded from the “main” branch of lsst-dm/jenkins-dm-jobs as each pipeline begins execution.

Typically the seeds pipeline is automatically triggered by updates to the lsst-dm/jenkins-dm-jobs repo.

5.2   Science Pipelines builds

These build pipelines do not publish artifacts, but the extended integration tests run by some of them do publish metrics.

5.2.1   stack-os-matrix

The primary build used by developers; it runs on Linux and macOS. To enable these jobs to run as rapidly as possible, they reuse state from previous builds, including the rubin-env environment. However, this state grows over time, so it is cleaned up periodically.

The stack-os-matrix pipeline, via several layers of library code in pipeline/lib/util.groovy, invokes two layers of scripts in lsst-sqre/ci-scripts (jenkins_wrapper.sh and lsstswBuild.sh). These in turn invoke the lsst/lsstsw build tool (somewhat documented in pipelines.lsst.io), which uses the relatively undocumented lsst/lsst_build tool to invoke eupspkg on each repository. For LSST Science Pipelines packages, eupspkg invokes scons and the sconsUtils library to actually build and test each package.

5.2.2   scipipe/lsst_distrib

Clean build of the main branch of the Science Pipelines and lsst_ci integration tests. The latter is primarily “pipelines_check”, a minimal “aliveness” test; it also forces building and testing of several “obs_*” packages. Since this build installs rubin-env from scratch, it ensures that we are prepared for any dependency updates.

5.2.3   scipipe/ci_hsc

Clean build of the ci_hsc integration tests. Note that Science Pipelines packages that are not used by ci_hsc are not built. For now, “ci_hsc” runs both “ci_hsc_gen2” and “ci_hsc_gen3” tests, although Gen2 will soon be removed.

5.2.4   scipipe/ci_imsim

Clean build of the ci_imsim integration tests. Note that Science Pipelines packages that are not used by ci_imsim are not built.

5.3   Container builds

5.3.1   sqre/infra/build-newinstall

Builds the newinstall container as described above.

5.3.2   sqre/infra/build-sciplatlab

Triggers the GHA to build the RSP container as described above.

5.4   Administrative tasks

5.4.1   sqre/infra/jenkins-node-cleanup

Runs periodically (every 10 minutes) to check the amount of free space in each worker’s workspace. If this falls below the configured threshold (100 GiB default), the contents of the workspace directory will be removed unless a job is actively using it. If the “FORCE_CLEANUP” parameter is specified, all workers’ workspaces will be cleaned unless they have active jobs. If the “FORCE_NODE” parameter is specified and “FORCE_CLEANUP” is not, only that node will be cleaned if it does not have an active job.
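
The core check is conceptually similar to the sketch below; the workspace path and the active-job test are simplified assumptions, and the real job is implemented as a Jenkins pipeline rather than a standalone script:

    # Remove workspace contents when free space drops below the threshold,
    # skipping nodes that have active jobs.
    WORKSPACE_ROOT=/j/ws    # illustrative workspace path
    THRESHOLD_GIB=100
    free_gib=$(( $(df --output=avail -k "$WORKSPACE_ROOT" | tail -n 1) / 1048576 ))
    if [ "$free_gib" -lt "$THRESHOLD_GIB" ] && ! node_has_active_job; then
        # node_has_active_job is a hypothetical stand-in for Jenkins' busy check
        rm -rf "${WORKSPACE_ROOT:?}"/*
    fi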

5.4.2   sqre/infra/clean-locks

Manually triggered when an interrupted build leaves eups lock files behind. In most cases nowadays, eups locking should be disabled, meaning that this job should be unnecessary.

5.5   Release builds

These builds also publish doxygen output to doxygen.lsst.codes.

5.5.1   release/nightly-release

Nightly build (d_YYYY_MM_DD)

5.5.2   release/weekly-release

Weekly build (w_YYYY_WW)

5.5.3   release/official-release

Official release build (vNN)

5.6   Release build components

5.6.1   release/run-rebuild

Runs a complete build, unit tests, and default integration tests on the canonical platform (Linux). The build occurs in a directory that is reused from run to run. This means that the rubin-env environment is typically not identical to what would be newly installed.

5.6.2   release/run-publish

Publishes source packages, the release tag, and an environment file to the eups distribution server. The version number of the rubin-env environment is recorded. This environment file records the packages in rubin-env and any explicit constraints on them, but it does not give exact versions, as it is OS-independent and the exact packages are OS-dependent.

5.6.3   release/tarball

Builds binary tarballs from the source packages, copies them into a local “distribution server” directory, tests that binary installs work correctly, including running a minimal check, and publishes the distribution server directory to the cloud distribution server. The exact packages used for this build are recorded in an environment file on the eups distribution server. Note that these packages may differ from those used in the run-rebuild pipeline above, as newinstall.sh is used to create the environment each time.

Also note that both the “stack” directory in which the packages are installed and the “distribution server” directory are reused, so previously-built packages do not need to be rebuilt.

5.6.4   docker/build-stack

Builds the Science Pipelines Linux container from the binary tarballs, editing the result as described earlier.

5.7   Triggered post-release jobs

5.7.1   sqre/infra/documenteer

Builds and publishes an edition of the pipelines.lsst.io website based on the centos Science Pipelines container.

5.7.2   scipipe/ap_verify

Runs ap_verify code from the centos Science Pipelines container on test datasets, publishing metrics to SQuaSH.

5.7.3   sqre/verify_drp_metrics

Runs faro code from the centos Science Pipelines container on test datasets, publishing metrics to SQuaSH.