DataLad extension for working with Debian packages and package repositories


This software wraps building and disseminating Debian packages in a standard data management task. While there is no shortage of specialized software solutions for Debian package workflows and archives, here, the general purpose data management solution DataLad is used to present all involved steps as a system that tracks inputs, and converts them to outputs, with full capture of actionable provenance information for all respective transformations. Importantly, the system is fully decentralized and whole processes and/or individual steps can be performed by independent collaborators with no required access to a common build or distribution infrastructure.

Features include:

  • Version control of Debian source packages, and/or provenance capture of generating such source packages from upstream sources

  • Building Debian binary packages from source packages (reproducibly) in portable, containerized build environments

  • Maintaining collections of source and binary packages built for a given target distribution release, to provide or maintain access to historical build artifacts similar to https://snapshot.debian.org

  • Generating and updating APT package repositories for particular versions of a package collection

This software only implements the data management and provenance tracking features. Specialized tasks, such as repository generation or building binary packages from source packages, are performed using standard solutions such as reprepro or dpkg.

To get a feel for this machinery, see the walk-through.

Overview

Concepts & Terms

Components

All components are implemented in the form of DataLad datasets, interconnected (via super-subdataset relationships) to express dependencies. For more information about DataLad's super-subdataset relationships, please refer to the DataLad Handbook.

builder dataset: environments (e.g. Singularity images) to build binary packages from source packages using a containerized pipeline for a single target distribution.

package dataset: source and binary Debian packages for a single piece of software specific to a distribution major release. Contains a builder dataset, attached as a subdataset, for the respective distribution.

distribution dataset: collection of package datasets built for a specific distribution release (e.g. Debian 10). Contains a single builder subdataset for the respective distribution release.

archive dataset: Debian package archive for deployment on a webserver as an apt repository. Contains any number of distribution datasets as subdatasets that are used as sources to populate the archive.
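The nesting of these components can be made concrete with a small sketch. It is only an illustration: the function name is hypothetical, and the directory names (distributions/, packages/, builder/) are taken from the layout shown in the walk-through, not from any API of this extension.

```python
from pathlib import PurePosixPath


def archive_subdatasets(distributions):
    """List the datasets nested in an archive dataset, as relative paths.

    `distributions` maps a distribution name to the names of its packages.
    """
    paths = []
    for dist in sorted(distributions):
        dist_path = PurePosixPath("distributions") / dist
        # each distribution dataset carries its own builder subdataset
        paths += [dist_path, dist_path / "builder"]
        for pkg in sorted(distributions[dist]):
            pkg_path = dist_path / "packages" / pkg
            # each package dataset links the distribution's builder again
            paths += [pkg_path, pkg_path / "builder"]
    return [str(p) for p in paths]


for path in archive_subdatasets({"bullseye": ["hello"]}):
    print(path)
```

The builder appears twice because both the distribution dataset and each package dataset register it as a subdataset.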

Walk-through

Let's take a look at the main steps and building blocks to go from source code to a fully functional APT repository that any Debian-based machine can retrieve and install readily executable software packages from.

Some notes upfront:

Whenever we refer to a "Debian (distribution) release" in this walk-through, we more or less consider any release of any Debian-based distribution. The demonstrated features are not strictly limited to Debian proper, but are available for other distributions, such as Ubuntu, too.

This walk-through demos the command line interface of DataLad. However, an analogous Python API is provided too.

Create a distribution dataset

A distribution dataset is a DataLad superdataset that contains all necessary components to build Debian packages for a particular Debian release. These include the packages' source code, and a build environment.

To create a collection of packages, i.e. a distribution, start by creating a distribution dataset.

datalad deb-new-distribution bullseye
cd bullseye

This creates the distribution dataset "bullseye". The generated dataset is still pretty generic, and not yet tailored to a particular Debian release in any way.

Besides the distribution superdataset, a builder subdataset was created too. It contains the subdirectories:

  • envs: hosts generated build environments for any number of target CPU architectures

  • recipes: actionable descriptions for (re)creating a build environment (e.g. a Singularity recipe)

bullseye/
└── builder/
     ├── envs/
     │   └── README.md
     └── recipes/
         └── README.md

Configure and create a package builder

Before we can start building Debian packages, the builder subdataset must be configured to provide one or more build environments for the targeted Debian release and the target CPU architecture. Start by configuring the builder for the specific Debian release, e.g. for "bullseye".

datalad deb-configure-builder --dataset builder dockerbase=debian:bullseye

This creates a Singularity container recipe for a Debian bullseye environment based on a default template. Check the documentation of deb-configure-builder for additional configuration options, for example to enable non-free package sources.

The command deb-bootstrap-builder can now be run to bootstrap the (containerized) build environment based on the recipe just created.

datalad deb-bootstrap-builder --dataset builder

The builder dataset now contains a container image that can be used for building packages. The generated container image is registered in the builder dataset for use by the datalad-container extension and its containers-run command. With Singularity, the entire container acts as a single executable that takes a Debian source package as input and builds Debian binary packages in its environment as output.

With the builder prepared, we can save the resulting state of the builder in the distribution dataset.

datalad save --dataset . --message "Update distribution builder" builder

bullseye/
└── builder/
     ├── envs/
     │   ├── README.md
     │   └── singularity-amd64.sif
     └── recipes/
         ├── README.md
         └── singularity-any

Running datalad status or git status in the distribution dataset now confirms that all components are comprehensively tracked. Inspecting the commits of the two created datasets, and in particular those of the builder dataset, reveals how DataLad captures the exact process of the build environment generation.

Add a package to the distribution

To add a package, start by creating a new package dataset inside of the distribution dataset.

datalad deb-new-package hello

This creates a new package subdataset for a source package with the name hello under the packages subdirectory of the distribution dataset. Inspecting the created dataset, we can see another builder subdataset. In fact, this is the builder dataset of the distribution, linked via DataLad's dataset nesting capability.

This link serves a dual purpose. 1) It records which exact version of the builder was used for building particular versions of a given source package, and 2) it provides a canonical reference for updating to newer versions of the distribution's builder, for example, after a Debian point release.

The package dataset can now be populated with a Debian source package version. In the simplest case, a source package is merely placed into the dataset and the addition is saved. This is what we will do in a second.

However, DataLad can capture more complex operations too, for example, using tools like git-buildpackage to generate a source package from a "debian" packaging branch of an upstream source code repository. An upstream code repository can be attached as a subdataset, at the exact version needed, and git-buildpackage can be executed through datalad run to capture the full detail of the source package generation.

For this walk-through, we download version 2.10 of the hello package from snapshot.debian.org:

cd packages/hello
datalad run -m "Add version 2.10-2 source package" \
  dget -d -u https://snapshot.debian.org/archive/debian/20190513T204548Z/pool/main/h/hello/hello_2.10-2.dsc

The fact that we obtained the source package files via this particular download is recorded by datalad run (run git show to see the record in the commit message).
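The download URL used above follows the usual Debian pool layout underneath a snapshot timestamp. As a hedged illustration, the following sketch rebuilds that URL from its parts. The function names are hypothetical; the pool prefix rule (first letter, or "lib" plus one letter for library packages) is the general Debian archive convention and not something specific to this extension.

```python
def pool_prefix(source_name):
    """Return the pool subdirectory for a source package name."""
    # library packages are grouped under 'liba', 'libb', ...
    if source_name.startswith("lib") and len(source_name) > 3:
        return source_name[:4]
    return source_name[0]


def snapshot_dsc_url(timestamp, source_name, version,
                     component="main", archive="debian"):
    """Compose the snapshot.debian.org URL of a source package's .dsc file.

    `version` is the version as it appears in the file name (no epoch).
    """
    return (
        f"https://snapshot.debian.org/archive/{archive}/{timestamp}"
        f"/pool/{component}/{pool_prefix(source_name)}/{source_name}"
        f"/{source_name}_{version}.dsc"
    )


print(snapshot_dsc_url("20190513T204548Z", "hello", "2.10-2"))
```

The resulting string can be handed to dget inside datalad run, exactly as in the command above.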

Build binary packages

With a Debian source package saved in a package dataset, we have all components necessary for a Debian binary package build. Importantly, we will perform this build in the local context of the package dataset. Although in the walk-through the package dataset is placed inside a clone of the distribution dataset, this particular constellation is not required. Building packages is possible and supported in any (isolated) clone of the package dataset.

To build Debian binary packages we can use DataLad's deb-build-package command parametrized with the source package's DSC filename.

datalad deb-build-package hello_2.10-2.dsc
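The DSC filename encodes the source package name and version, and the build metadata files shown later in this walk-through reuse both. A minimal sketch of that naming convention follows; the helper names are hypothetical and the target architecture is an assumed input, since it is not part of the DSC filename.

```python
def parse_dsc_filename(filename):
    """Split '<name>_<upstream>[-<revision>].dsc' into its three parts."""
    if not filename.endswith(".dsc"):
        raise ValueError(f"not a .dsc file: {filename!r}")
    stem = filename[: -len(".dsc")]
    name, sep, version = stem.partition("_")
    if not sep or not name or not version:
        raise ValueError(f"unexpected file name: {filename!r}")
    # the Debian revision follows the last hyphen; native packages have none
    upstream, sep, revision = version.rpartition("-")
    if not sep:
        upstream, revision = version, None
    return name, upstream, revision


def build_metadata_files(dsc_filename, arch):
    """Names of the .buildinfo and .changes files expected for a build."""
    name, upstream, revision = parse_dsc_filename(dsc_filename)
    version = upstream if revision is None else f"{upstream}-{revision}"
    stem = f"{name}_{version}_{arch}"
    return [f"{stem}.buildinfo", f"{stem}.changes"]


print(parse_dsc_filename("hello_2.10-2.dsc"))
print(build_metadata_files("hello_2.10-2.dsc", "amd64"))
```

Binary package names are deliberately left out, because a single source package can produce several binary packages with different names.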

As with the download before, DataLad will capture the full provenance of the package build. The command will compose a call to datalad containers-run to pass the source package on to the builder in the builder subdataset. Both this builder dataset and the actual Singularity image with the containerized build environment are obtained automatically. This is possible because the package dataset exhaustively captures all information on the source code to build, and on the environment to build it in. Built binary packages, metadata files, and build logs are captured in a new saved package dataset state -- precisely linking the build inputs with the generated artifacts (again check git show for more information).

If desired, deb-build-package can automatically update the builder dataset prior to a build. Otherwise the build is done using whatever builder environment is registered in the dataset, for example, to re-build historical versions of a dataset with the respective historical build environment version.

Updating a package dataset with new versions of the Debian source package, and building binary packages from them is done by simply repeating the respective steps.

Create an archive dataset

With Debian source and binary packages organized in distribution and package datasets, the remaining step is to generate a package archive that APT can use to retrieve and install Debian packages from. A dedicated tool that can do this is reprepro, and it is also the main workhorse here. Applying the previously used patterns of dataset nesting for tracking inputs, and capturing the provenance of tool execution, we will use reprepro to ingest packages from our distribution dataset into an APT package archive.

The first step is to create the archive DataLad dataset:

# this is NOT done inside the distribution dataset
cd ..
datalad deb-new-reprepro-repository apt
cd apt

We give it the name apt, but this is only the name of the directory the dataset is created in.

apt/
├── conf/
│   ├── distributions
│   └── options
├── distributions/
│   └── README
├── README
└── www/

The dataset is pre-populated with some content that largely reflects an organization required by reprepro described elsewhere. Importantly, we have to adjust the file conf/distributions to indicate the component of the APT archive that reprepro shall generate and which packages to accept. A minimal configuration for this demo walk-through could be:

Codename: bullseye
Components: main
Architectures: source amd64

A real-world configuration would be a little more complex, and typically list a key to sign the archive with, etc. Once we have completed the configuration, we can save the archive dataset:

datalad save -m 'Configured archive distributions'
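The conf/distributions file consists of "Field: value" stanzas, so it is easy to sanity-check before saving. The parser below is a minimal sketch under the assumption that stanzas are separated by blank lines, that lines starting with "#" are comments, and that indented lines continue the previous field. It is not part of reprepro or of this extension.

```python
def parse_distributions(text):
    """Parse 'Field: value' stanzas into a list of dictionaries."""
    stanzas, current, last_key = [], {}, None
    for raw in text.splitlines():
        if raw.lstrip().startswith("#"):
            continue  # comment line
        if not raw.strip():
            # a blank line closes the current stanza
            if current:
                stanzas.append(current)
            current, last_key = {}, None
            continue
        if raw[0].isspace() and last_key is not None:
            # continuation of the previous field
            current[last_key] += " " + raw.strip()
            continue
        key, sep, value = raw.partition(":")
        if not sep:
            raise ValueError(f"not a 'Field: value' line: {raw!r}")
        last_key = key.strip()
        current[last_key] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


config = """\
Codename: bullseye
Components: main
Architectures: source amd64
"""
for stanza in parse_distributions(config):
    print(stanza["Codename"], stanza["Architectures"].split())
```

Checking that every stanza has a Codename, Components and Architectures entry catches the most common omissions early.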

Now we are ready to link a distribution to the archive. This will be the source from which Debian packages are incorporated into the archive:

datalad deb-add-distribution ../bullseye bullseye

The deb-add-distribution command takes two mandatory arguments: 1) a source URL for a distribution dataset, and 2) a name to register the distribution under. In a real-world case the source URL will be pointing to some kind of hosting service. Here we obtain it from the root directory of the walk-through demo.

apt/
├── conf/
│   ├── distributions
│   └── options
├── distributions/
│   ├── bullseye/
│   │   ├── builder/
│   │   └── packages/
│   │       └── hello/
│   └── README
├── README
└── www/

As we can see, the archive dataset now links the distribution dataset, and also its package dataset, in a consistent, versioned tree (confirm the clean dataset state with datalad status).

Ingest Debian packages into an archive dataset

With all information tracked in DataLad datasets, we can automatically determine which packages have been added and built in any linked distribution since the last archive update -- without having to operate a separate upload queue. This automatic queue generation and processing is performed by the deb-update-reprepro-repository command.

datalad deb-update-reprepro-repository

Running this command on the archive dataset will fetch any updates to all linked distribution datasets, and perform a diff with respect to the last change recorded for the reprepro output directory www/.

As we can see when running the command, no packages are ingested. That is because when adding the Debian source package and building the binary packages for hello version 2.10, we only saved the outcomes in the respective package dataset. We did not register the package dataset update in the distribution dataset. This missing step is the equivalent of authorizing and accepting a package upload to a distribution in a centralized system.

So although there is an update of a package dataset, it will not be considered for inclusion into the APT archive without formally registering the update in the distribution. This is done by saving the package dataset's state in the distribution dataset:

cd ../bullseye
datalad save -m "Accept hello 2.10 build for amd64"

Rerunning deb-update-reprepro-repository now does detect the package update, automatically discovers the addition of the source package and the recently built binary packages, and ingests them both into the APT archive dataset.

datalad deb-update-reprepro-repository

After reprepro has generated all updates to the archive, DataLad captures all those changes and links all associated inputs and outputs of this process in a clean dataset hierarchy. We can confirm this with datalad status, and git log -2 shows the provenance information for the two internal reprepro runs involved in this APT archive update.

After the update, the working tree content of the archive dataset looks like this:

apt/
├── conf/
│   ├── distributions
│   └── options
├── db/
│   ├── checksums.db
│   ├── contents.cache.db
│   ├── packages.db
│   ├── references.db
│   ├── release.caches.db
│   └── version
├── distributions/
│   ├── bullseye/
│   │   ├── builder/
│   │   └── packages/
│   │       └── hello/
│   │           ├── builder/
│   │           ├── hello_2.10-2_amd64.buildinfo
│   │           ├── hello_2.10-2_amd64.changes
│   │           ├── hello_2.10-2_amd64.deb
│   │           ├── hello_2.10-2.debian.tar.xz
│   │           ├── hello_2.10-2.dsc
│   │           ├── hello_2.10.orig.tar.gz
│   │           ├── hello-dbgsym_2.10-2_amd64.deb
│   │           └── logs/
│   │               └── hello_2.10-2_20220714T073633_amd64.txt
│   └── README
├── README
└── www/
    ├── dists/
    │   └── bullseye/
    │       ├── main/
    │       │   ├── binary-amd64/
    │       │   │   ├── Packages
    │       │   │   ├── Packages.gz
    │       │   │   └── Release
    │       │   └── source/
    │       │       ├── Release
    │       │       └── Sources.gz
    │       └── Release
    └── pool/
        └── main/
            └── h/
                └── hello/
                    ├── hello_2.10-2_amd64.deb
                    ├── hello_2.10-2.debian.tar.xz
                    ├── hello_2.10-2.dsc
                    ├── hello_2.10.orig.tar.gz
                    └── hello-dbgsym_2.10-2_amd64.deb

All added files in the archive dataset are managed by git-annex, meaning only their file identity (checksum) is tracked with Git, not their large content. The files in db/ are required for reprepro to run properly on subsequent updates. A dedicated configuration keeps them in an "unlocked" state for interoperability with reprepro. All other files are technically symlinks into the file content "annex" operated by git-annex.
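To tell the two kinds of files apart, one can inspect whether a path is a symlink into the annex. The check below is a small sketch with a hypothetical helper name; it relies only on the fact that git-annex symlinks point below .git/annex/objects/, and it treats "unlocked" files such as those in db/ as regular files.

```python
import os


def is_annexed(path):
    """Return True if `path` is a symlink into a git-annex object store."""
    if not os.path.islink(path):
        # regular (or unlocked) files are not symlinks
        return False
    target = os.readlink(path).replace(os.sep, "/")
    return ".git/annex/objects/" in target
```

Run against the tree above, files under www/pool/ would be expected to report True and files under db/ False.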

A webserver can expose the www/ directory as a fully functional APT archive. However, www/ is actually a dedicated DataLad (sub)dataset, which can also be cloned to a different location, and updates can be propagated to it via datalad update at any desired interval.

Moreover, the www/ subdataset can also be checked out at any captured archive update state (e.g. its state on a particular date). This makes it possible to provide any snapshot of the entire APT archive in a format that is immediately accessible to any apt client.
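Once www/ is served by a webserver, a client needs matching APT source entries. The sketch below composes such entries in the one-line sources.list style. The function name and the URL https://example.org/apt are placeholders, and the trusted=yes option is only an assumption that fits the unsigned demo archive configured in this walk-through; a signed real-world archive would not use it.

```python
def apt_source_lines(uri, codename, components=("main",),
                     with_sources=True, trusted=False):
    """Compose one-line-style APT source entries for an archive."""
    # [trusted=yes] disables signature checks; only sensible for a local demo
    options = " [trusted=yes]" if trusted else ""
    tail = f"{uri} {codename} {' '.join(components)}"
    lines = [f"deb{options} {tail}"]
    if with_sources:
        # the demo configuration also publishes source packages
        lines.append(f"deb-src{options} {tail}")
    return lines


for line in apt_source_lines("https://example.org/apt", "bullseye", trusted=True):
    print(line)
```

The codename and component values correspond to the Codename and Components fields in conf/distributions.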

In between archive dataset updates, it is not necessary to keep the distribution and package datasets around. To avoid accumulation of disk space demands, these can be dropped:

datalad drop -d . --what all -r distributions

Dropping is a safe operation. DataLad verifies that all file content and the checked-out dataset state remains available from other clones when the local clones are removed. The next run of deb-update-reprepro-repository will re-obtain any necessary datasets automatically.

The archive dataset can now be maintained for as long as desired, by repeating the steps for updating package datasets, registering these updates in their distribution datasets, and running deb-update-reprepro-repository to ingest the updates in the APT archive.

Commands and API

Command line reference

The order of commands follows their logical order of execution in a start-to-end packaging workflow.

datalad deb-new-distribution

Synopsis
datalad deb-new-distribution [-h] [-d DATASET] [-f] [--version] [PATH]
Description

Create a new distribution dataset

A typical distribution dataset contains a 'builder' subdataset with one or more build environments, and a 'packages' subdirectory with one subdataset per Debian package. This command creates the initial structure: A top-level dataset under the provided path and a configured 'builder' subdataset underneath.

Examples

Create a new distribution dataset called 'bullseye' in the current directory:

% datalad deb-new-distribution bullseye
Options
PATH

path where the dataset shall be created, directories will be created as necessary. If no location is provided, a dataset will be created in the location specified by --dataset (if given) or the current working directory. Either way the command will error if the target directory is not empty. Use --force to create a dataset in a non-empty directory. Constraints: value must be a string or Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify a dataset whose configuration to inspect rather than the global (user) settings. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-f, --force

enforce creation of a dataset in a non-empty directory.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad deb-configure-builder

Synopsis
datalad deb-configure-builder [-h] [-d DATASET] [-f] [--template PATH] [--version] [property=value [property=value ...]]
Description

Configure a package build environment

A builder is a (containerized) build environment used to build binary Debian packages from Debian source packages. This command is typically run on the builder dataset in a distribution dataset and configures a builder recipe based on a template and user-specified values for the template's placeholders. The resulting recipe will be placed in the 'recipes/' directory of the builder dataset.

The following directory tree illustrates this. The configured builder takes the form of a Singularity recipe here.

bullseye                     <- distribution dataset
├── builder                  <- builder subdataset
│   ├── envs
│   │   └── README.md
│   └── recipes
│       ├── README.md
│       └── singularity-any  <- builder configuration

Currently supported templates are

Template 'default'

This is a Singularity recipe with the following configuration items:

  • dockerbase (required): name of a Docker base image for the container, e.g. 'debian:bullseye'

  • debian_archive_sections (optional): which sections of the Debian package archive to enable for APT in the build environment. To enable all sections set to 'main contrib non-free'. Default: 'main'
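Because templates use placeholders in Python string formatting syntax, the substitution step can be pictured with plain str.format. The following is a sketch only: the template text is a made-up stand-in, not the template shipped with this extension, and the function name is hypothetical. Only the two configuration items and the default for debian_archive_sections come from the list above.

```python
# default taken from the configuration items listed above
DEFAULTS = {"debian_archive_sections": "main"}

# hypothetical stand-in for a recipe template
TEMPLATE = (
    "Bootstrap: docker\n"
    "From: {dockerbase}\n"
    "\n"
    "%labels\n"
    "    archive_sections {debian_archive_sections}\n"
)


def render_recipe(specs, template=TEMPLATE):
    """Fill a template from 'property=value' strings, as given on the CLI."""
    values = dict(DEFAULTS)
    for spec in specs:
        key, sep, value = spec.partition("=")
        if not sep or not key:
            raise ValueError(f"expected property=value, got {spec!r}")
        values[key] = value
    if "dockerbase" not in values:
        raise ValueError("dockerbase is required")
    return template.format(**values)


print(render_recipe(["dockerbase=debian:bullseye"]))
```

A custom template passed via --template would go through the same kind of substitution, with its own set of placeholders.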

Examples

Configure the default Singularity recipe in the builder subdataset, executed from a distribution superdataset:

% datalad deb-configure-builder -d builder dockerbase=debian:bullseye
Options
property=value

Values to replace placeholders in the specified template.

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

Specify a builder dataset in which an environment will be defined. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-f, --force

enforce creation of a package dataset in a non-empty directory.

--template PATH

Builder recipe template. This is a text file with placeholders in Python string formatting syntax. Constraints: value must be a string or value must be NONE [Default: 'default']

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad deb-bootstrap-builder

Synopsis
datalad deb-bootstrap-builder [-h] [-d DATASET] [--version]
Description

Bootstrap a build environment

This command bootstraps a (containerized) build environment (such as a Singularity container) based on an existing builder configuration (such as a Singularity recipe).

The execution of this command might require administrative privileges and could prompt for a sudo password, for example to build a Singularity image. The resulting bootstrapped build environment will be placed inside of an 'envs/' subdirectory of a 'builder/' dataset.

The following directory tree illustrates this. The configured builder takes the form of a Singularity recipe here.

bullseye                           <- distribution dataset
├── builder                        <- builder subdataset
│   ├── envs
│   │   ├── README.md
│   │   └── singularity-amd64.sif  <- bootstrapped build environment
│   └── recipes
│       ├── README.md
│       └── singularity-any        <- builder configuration

Examples

Bootstrap a configured build environment in a builder subdataset, from a distribution dataset:

% datalad deb-bootstrap-builder -d builder
Options
-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify a builder dataset that contains a build environment configuration. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad deb-new-package

Synopsis
datalad deb-new-package [-h] [-d DATASET] [-f] [--version] NAME
Description

Create a new package dataset inside of a distribution dataset

In its final stage, a typical package dataset contains the source files, built binaries, and builder subdataset for a Debian package. This command creates the initial structure: A package dataset in the 'packages/' subdirectory of a distribution dataset and a 'builder' subdataset underneath it. It should be run in the root of a distribution dataset with a configured and bootstrapped builder, as the distribution's 'builder' subdataset will be registered in the package dataset to be used to build the binaries. To prevent package name clashes within a distribution dataset, it is advisable to use the Debian package's name as the name for the package dataset.

Examples

Create a new package dataset 'hello' in a distribution dataset:

% datalad deb-new-package hello
Options
NAME

name of the package to add to the distribution. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify a distribution dataset to add the package to. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-f, --force

enforce creation of a package dataset in a non-empty directory.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad deb-build-package

Synopsis
datalad deb-build-package [-h] [-d DATASET] [--update-builder] [--version] DSC
Description

Build binary packages

Perform a provenance tracked build of a binary Debian package from a .dsc file in a package dataset. The command relies on a (containerized) build environment within a package's 'builder' subdataset. The 'builder' subdataset can optionally be updated beforehand.

Beyond binary .deb files, this command creates a .changes, a .buildinfo, and a logs/.txt file with build metadata and provenance. All resulting files are placed into the root of the package dataset.

Examples

Build a binary package from a Debian package's source .dsc file:

% datalad deb-build-package hello_2.10-2.dsc
Options
DSC

Specify the .dsc source file to build from. Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

Specify the package dataset of the to-be-built package. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--update-builder

Update the builder subdataset from its origin before package build.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad deb-new-reprepro-repository

Synopsis
datalad deb-new-reprepro-repository [-h] [-d DATASET] [-f] [--version] [PATH]
Description

Create a new (reprepro) package repository dataset

Examples

Options
PATH

path where the dataset shall be created, directories will be created as necessary. If no location is provided, a dataset will be created in the location specified by --dataset (if given) or the current working directory. Either way the command will error if the target directory is not empty. Use --force to create a dataset in a non-empty directory. Constraints: value must be a string or Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify a dataset whose configuration to inspect rather than the global (user) settings. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-f, --force

enforce creation of a dataset in a non-empty directory.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad deb-update-reprepro-repository

Synopsis
datalad deb-update-reprepro-repository [-h] [-d DATASET] [--version] [PATH]
Description

Update a (reprepro) Debian archive repository dataset

Examples

Options
PATH

path to constrain the update to. Constraints: value must be a string or Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify a dataset to update. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad deb-add-distribution

Synopsis
datalad deb-add-distribution [-h] [-d DATASET] [--version] SOURCE NAME
Description

Add a distribution dataset to a Debian archive repository dataset

Examples

Options
SOURCE

URL, DataLad resource identifier, local path or instance of distribution dataset to be added. Constraints: value must be a string

NAME

name to add the distribution dataset under (directory distributions/<name>). Constraints: value must be a string or value must be NONE

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the Debian archive repository dataset to add the distribution to. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Python module reference

The order of commands follows their logical order of execution in a start-to-end packaging workflow.

deb_new_distribution([path, dataset, force])

Create a new distribution dataset

deb_configure_builder(*[, dataset, force, ...])

Configure a package build environment

deb_bootstrap_builder(*[, dataset])

Bootstrap a build environment

deb_new_package(name, *[, dataset, force])

Create a new package dataset inside of a distribution dataset

deb_build_package(dsc, *[, dataset, ...])

Build binary packages

deb_new_reprepro_repository([path, dataset, ...])

Create a new (reprepro) package repository dataset

deb_update_reprepro_repository([path, dataset])

Update a (reprepro) Debian archive repository dataset

deb_add_distribution(source, name, *[, dataset])

Add a distribution dataset to a Debian archive repository dataset
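The Python function names listed here mirror the command line names with hyphens replaced by underscores. The sketch below illustrates that correspondence. Both helper names are hypothetical, and the lookup assumes that DataLad and this extension are installed so that the functions are available from datalad.api.

```python
import importlib


def api_name(cli_command):
    """Translate a CLI command name such as 'deb-new-package' to its Python name."""
    return cli_command.replace("-", "_")


def call(cli_command, *args, **kwargs):
    """Look up the Python counterpart of a CLI command and call it."""
    # imported lazily, so this module loads even without DataLad installed
    api = importlib.import_module("datalad.api")
    return getattr(api, api_name(cli_command))(*args, **kwargs)


print(api_name("deb-update-reprepro-repository"))

# Example (needs DataLad, this extension, and a distribution dataset in ./bullseye):
#   call("deb-new-package", "hello", dataset="bullseye")
```

Arguments follow the signatures above: positional arguments of the command line become leading parameters, and options such as --dataset become keyword arguments.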