DataLad extension for working with Debian packages and package repositories
This software wraps the building and dissemination of Debian packages into a standard data management task. While there is no shortage of specialized software solutions for Debian package workflows and archives, here the general-purpose data management solution DataLad is used to present all involved steps as a system that tracks inputs and converts them to outputs, with full capture of actionable provenance information for each transformation. Importantly, the system is fully decentralized: whole processes and/or individual steps can be performed by independent collaborators without requiring access to a common build or distribution infrastructure.
Features include:
Version control of Debian source packages, and/or provenance capture of generating such source packages from upstream sources
Building Debian binary packages from source packages (reproducibly) in portable, containerized build environments
Maintaining collections of source and binary packages built for a given target distribution release, to provide or maintain access to historical build artifacts similar to https://snapshot.debian.org
Generating and updating APT package repositories for particular versions of a package collection
This software only implements the data management and provenance tracking features. Specialized tasks, such as repository generation or building binary packages from source packages, are performed by standard tools such as reprepro or dpkg.
To get a feel for this machinery, see the walk-through. If that sparks your interest, continue with the walk-through on collaborative package distribution for a more conceptual illustration.
Overview
Concepts & Terms
Components
All components are implemented in the form of DataLad datasets, interconnected (via super-subdataset relationships) to express dependencies. For more information about DataLad's super-subdataset relationships, please refer to the DataLad Handbook.
builder dataset: environments (e.g. Singularity images) to build binary packages from source packages using a containerized pipeline for a single target distribution.
package dataset: source and binary Debian packages for a single piece of software specific to a distribution major release. Contains a builder dataset, attached as a subdataset, for the respective distribution.
distribution dataset: collection of package datasets built for a specific distribution release (e.g. Debian 10). Contains a single builder subdataset for the respective distribution release.
archive dataset: Debian package archive for deployment on a webserver as an apt repository. Contains any number of distribution datasets as subdatasets that are used as sources to populate the archive.
Walk-through
Let's take a look at the main steps and building blocks to go from source code to a fully functional APT repository that any Debian-based machine can retrieve and install readily executable software packages from.
Some notes upfront:
Whenever we refer to a "Debian (distribution) release" in this walk-through, we more or less consider any release of any Debian-based distribution. The demonstrated features are not strictly limited to Debian proper, but are available for other distributions, such as Ubuntu, too.
This walk-through demos the command line interface of DataLad. However, an analogous Python API is provided too.
Create a distribution dataset
A distribution dataset is a DataLad superdataset that contains all necessary components to build Debian packages for a particular Debian release. These include the packages' source code, and a build environment.
To create a collection of packages, i.e. a distribution, start by creating a distribution dataset.
datalad deb-new-distribution bullseye
cd bullseye
This creates the distribution dataset "bullseye". The generated dataset is still pretty generic, and not yet tailored to any particular Debian release.
Besides the distribution superdataset, a builder subdataset was created too. It contains two subdirectories:
envs: hosts generated build environments for any number of target CPU architectures
recipes: actionable descriptions for (re)creating a build environment (e.g. a Singularity recipe)
bullseye/
└── builder/
├── envs/
│ └── README.md
└── recipes/
└── README.md
Configure and create a package builder
Before we can start building Debian packages, the builder subdataset must be configured to provide one or more build environments for the targeted Debian release and the target CPU architecture. Start by configuring the builder for the specific Debian release, e.g. for "bullseye".
datalad deb-configure-builder --dataset builder dockerbase=debian:bullseye
This creates a Singularity container recipe for a Debian bullseye environment based on a default template. Check the documentation of deb-configure-builder for additional configuration options, for example to enable non-free package sources.
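For instance, additional archive sections can be enabled via the debian_archive_sections template property (documented in the deb-configure-builder command reference below):
datalad deb-configure-builder --dataset builder \
    dockerbase=debian:bullseye \
    'debian_archive_sections=main contrib non-free'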
The command deb-bootstrap-builder can now be run to bootstrap the (containerized) build environment based on the recipe just created.
datalad deb-bootstrap-builder --dataset builder
The builder dataset now contains a container image that can be used for building packages. The generated container image is registered in the builder dataset for use by the datalad-container extension and its containers-run command. Using Singularity, the entire container acts as a single executable that takes a Debian source package as input and builds Debian binary packages in its environment as output.
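As a sketch of what such an invocation could look like (the registered container name "buildenv" is an assumption here; the deb-build-package command shown later composes the actual call automatically):
# hypothetical direct use of the registered build environment
datalad containers-run --dataset builder --name buildenv <source-package.dsc>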
With the builder prepared, we can save the resulting state of the builder in the distribution dataset.
datalad save --dataset . --message "Update distribution builder" builder
bullseye/
└── builder/
├── envs/
│ ├── README.md
│ └── singularity-amd64.sif
└── recipes/
├── README.md
└── singularity-any
Running datalad status or git status in the distribution dataset now confirms that all components are comprehensively tracked. Inspecting the commits of the two created datasets, and in particular those of the builder dataset, reveals how DataLad captures the exact process of the build environment generation.
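For example, run in the distribution dataset root:
datalad status
# inspect the provenance records of the build environment generation
git -C builder log --oneline -n 3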
Add a package to the distribution
To add a package, start by creating a new package dataset inside of the distribution dataset.
datalad deb-new-package hello
This creates a new package subdataset for a source package with the name hello under the packages subdirectory of the distribution dataset.
Inspecting the created dataset, we can see another builder subdataset. In fact, this is the builder dataset of the distribution, linked via DataLad's dataset nesting capability.
This link serves a dual purpose. 1) It records which exact version of the builder was used for building particular versions of a given source package, and 2) it provides a canonical reference for updating to newer versions of the distribution's builder, for example, after a Debian point release.
The package dataset can now be populated with a Debian source package version. In the simplest case, a source package is merely placed into the dataset and the addition is saved. This is what we will do in a second.
However, DataLad can capture more complex operations too, for example, using tools like git-buildpackage to generate a source package from a "debian" packaging branch of an upstream source code repository. An upstream code repository can be attached as a subdataset, at the exact version needed, and git-buildpackage can be executed through datalad run to capture the full detail of the source package generation.
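A minimal sketch of that pattern, assuming a packaging repository has already been attached as a subdataset at ./upstream (the path, branch name, and gbp options are assumptions):
# provenance-tracked source package generation with git-buildpackage
datalad run -m "Generate source package with git-buildpackage" \
    "cd upstream && gbp buildpackage -S -us -uc --git-debian-branch=debian"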
For this walk-through, we download version 2.10 of the hello package from snapshot.debian.org:
cd packages/hello
datalad run -m "Add version 2.10-2 source package" \
dget -d -u https://snapshot.debian.org/archive/debian/20190513T204548Z/pool/main/h/hello/hello_2.10-2.dsc
The fact that we obtained the source package files via this particular download is recorded by datalad run (run git show to see the record in the commit message).
Build binary packages
With a Debian source package saved in a package dataset, we have all components necessary for a Debian binary package build. Importantly, we will perform this build in the local context of the package dataset. Although in the walk-through the package dataset is placed inside a clone of the distribution dataset, this particular constellation is not required. Building packages is possible and supported in any (isolated) clone of the package dataset.
To build Debian binary packages we can use DataLad's deb-build-package command, parametrized with the source package's DSC filename.
datalad deb-build-package hello_2.10-2.dsc
As with the download before, DataLad will capture the full provenance of the package build. The command will compose a call to datalad containers-run to pass the source package on to the builder in the builder subdataset. Both this builder dataset and the actual Singularity image with the containerized build environment are automatically obtained. This is possible because the package dataset exhaustively captures all information on the source code to build, and on the environment to build it in. Built binary packages, metadata files, and build logs are captured in a new saved package dataset state -- precisely linking the build inputs with the generated artifacts (again check git show for more information).
If desired, deb-build-package can automatically update the builder dataset prior to a build. Otherwise the build is done using whatever builder environment is registered in the dataset, making it possible, for example, to re-build historical package versions with the respective historical build environment version.
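This behavior is controlled by the --update-builder flag (see the command reference below):
# refresh the registered builder from its origin before building
datalad deb-build-package --update-builder hello_2.10-2.dsc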
Updating a package dataset with new versions of the Debian source package, and building binary packages from them is done by simply repeating the respective steps.
Create an archive dataset
With Debian source and binary packages organized in distribution and package datasets, the remaining step is to generate a package archive that APT can use to retrieve and install Debian packages from. A dedicated tool that can do this is reprepro, and it is also the main workhorse here. Applying the previously used patterns of dataset nesting to track inputs, and capturing the provenance of tool execution, we will use reprepro to ingest packages from our distribution dataset into an APT package archive.
The first step is to create the archive DataLad dataset:
# this is NOT done inside the distribution dataset
cd ..
datalad deb-new-reprepro-repository apt
cd apt
We give it the name apt, but this is only the name of the directory the dataset is created in.
apt/
├── conf/
│ ├── distributions
│ └── options
├── distributions/
│ └── README
├── README
└── www/
The dataset is pre-populated with some content that largely reflects an organization required by reprepro described elsewhere. Importantly, we have to adjust the file conf/distributions to indicate the component of the APT archive that reprepro shall generate and which packages to accept. A minimal configuration for this demo walk-through could be:
Codename: bullseye
Components: main
Architectures: source amd64
A real-world configuration would be a little more complex, and typically list a key to sign the archive with, etc. Once we have completed the configuration, we can save the archive dataset:
datalad save -m 'Configured archive distributions'
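The same configuration can also be written non-interactively, as the collaborative walk-through later in this document does:
cat << EOT > conf/distributions
Codename: bullseye
Components: main
Architectures: source amd64
EOT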
Now we are ready to link a distribution to the archive. This will be the source from which Debian packages are incorporated into the archive:
datalad deb-add-distribution ../bullseye bullseye
The deb-add-distribution command takes two mandatory arguments: 1) a source URL for a distribution dataset, and 2) a name to register the distribution under. In a real-world case the source URL will be pointing to some kind of hosting service. Here we obtain it from the root directory of the walk-through demo.
apt/
├── conf/
│ ├── distributions
│ └── options
├── distributions/
│ ├── bullseye/
│ │ ├── builder/
│ │ └── packages/
│ │ └── hello/
│ └── README
├── README
└── www/
As we can see, the archive dataset now links the distribution dataset, and also its package dataset, in a consistent, versioned tree (confirm the clean dataset state with datalad status).
Ingest Debian package into an archive dataset
With all information tracked in DataLad datasets, we can automatically determine which packages have been added and built in any linked distribution since the last archive update -- without having to operate a separate upload queue. This automatic queue generation and processing is performed by the deb-update-reprepro-repository command.
datalad deb-update-reprepro-repository
Running this command on the archive dataset will fetch any updates to all linked distribution datasets, and perform a diff with respect to the last change recorded for the reprepro output directory www/.
As we can see when running the command, no packages are ingested. That is because when adding the Debian source package and building the binary packages for hello version 2.10, we only saved the outcomes in the respective package dataset. We did not register the package dataset update in the distribution dataset. This missing step is the equivalent of authorizing and accepting a package upload to a distribution in a centralized system.
So although there is an update of a package dataset, it will not be considered for inclusion into the APT archive without formally registering the update in the distribution. This is done by saving the package dataset's state in the distribution dataset:
cd ../bullseye
datalad save -m "Accept hello 2.10 build for amd64"
Rerunning deb-update-reprepro-repository now does detect the package update, automatically discovers the addition of the source package and the recently built binary packages, and ingests them both into the APT archive dataset.
datalad deb-update-reprepro-repository
After reprepro has generated all updates to the archive, DataLad captures all those changes and links all associated inputs and outputs of this process in a clean dataset hierarchy. We can confirm this with datalad status, and git log -2 shows the provenance information for the two internal reprepro runs involved in this APT archive update.
After the update, the working tree content of the archive dataset looks like this:
apt/
├── conf/
│ ├── distributions
│ └── options
├── db/
│ ├── checksums.db
│ ├── contents.cache.db
│ ├── packages.db
│ ├── references.db
│ ├── release.caches.db
│ └── version
├── distributions/
│ ├── bullseye/
│ │ ├── builder/
│ │ └── packages/
│ │ └── hello/
│ │ ├── builder/
│ │ ├── hello_2.10-2_amd64.buildinfo
│ │ ├── hello_2.10-2_amd64.changes
│ │ ├── hello_2.10-2_amd64.deb
│ │ ├── hello_2.10-2.debian.tar.xz
│ │ ├── hello_2.10-2.dsc
│ │ ├── hello_2.10.orig.tar.gz
│ │ ├── hello-dbgsym_2.10-2_amd64.deb
│ │ └── logs/
│ │ └── hello_2.10-2_20220714T073633_amd64.txt
│ └── README
├── README
└── www/
├── dists/
│ └── bullseye/
│ ├── main/
│ │ ├── binary-amd64/
│ │ │ ├── Packages
│ │ │ ├── Packages.gz
│ │ │ └── Release
│ │ └── source/
│ │ ├── Release
│ │ └── Sources.gz
│ └── Release
└── pool/
└── main/
└── h/
└── hello/
├── hello_2.10-2_amd64.deb
├── hello_2.10-2.debian.tar.xz
├── hello_2.10-2.dsc
├── hello_2.10.orig.tar.gz
└── hello-dbgsym_2.10-2_amd64.deb
All added files in the archive dataset are managed by git-annex, meaning only their file identity (checksum) is tracked with Git, not their large content. The files in db/ are required for reprepro to run properly on subsequent updates. A dedicated configuration keeps them in an "unlocked" state for interoperability with reprepro. All other files are technically symlinks into the file content "annex" operated by git-annex.
A webserver can expose the www/ directory as a fully functional APT archive. However, www/ is actually a dedicated DataLad (sub)dataset, which can also be cloned to a different location, and updates can be propagated to it via datalad update at any desired interval.
Moreover, the www/ subdataset can also be checked out at any captured archive update state (e.g. its state on a particular date). This makes it possible to provide any snapshot of the entire APT archive in a format that is immediately accessible to any apt client.
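A minimal sketch of such a deployment, assuming the archive lives at apt/ and a hypothetical webserver document root of /var/www/apt:
# serve an independent clone of the repository content
datalad clone apt/www /var/www/apt
# later: propagate archive updates to the deployment
datalad update -d /var/www/apt --how merge
# or pin the deployment to a historical snapshot state
# (<commit> is a placeholder for any recorded state of interest)
git -C /var/www/apt checkout <commit>
datalad -C /var/www/apt get .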
In between archive dataset updates, it is not necessary to keep the distribution and package datasets around. To avoid accumulating disk space demands, they can be dropped:
datalad drop -d . --what all -r distributions
Dropping is a safe operation. DataLad verifies that all file content and the checked-out dataset state remain available from other clones when the local clones are removed. The next run of deb-update-reprepro-repository will re-obtain any necessary datasets automatically.
The archive dataset can now be maintained for as long as desired, by repeating the steps for updating package datasets, registering these updates in their distribution datasets, and running deb-update-reprepro-repository to ingest the updates into the APT archive.
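On a client machine, the exposed www/ directory is consumed like any other APT repository. A hypothetical client-side configuration (the URL is an assumption, and [trusted=yes] is only acceptable because this demo archive is unsigned):
# register the demo archive as a package source
echo 'deb [trusted=yes] http://example.com/apt bullseye main' | \
    sudo tee /etc/apt/sources.list.d/demo.list
sudo apt update
sudo apt install hello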
Collaborative package distribution workflow
This part of the document is best read after taking a look at the basic walk-through, in which many key components are explained -- information that will not be repeated here. Instead, here we focus on the nature of associated workspaces and roles involved in the collaborative maintenance and distribution of Debian packages.
Despite the different focus, the basic outcome will be the same as for the previous walk-through: a Debian source package, built for a target Debian release, and distributed via a reprepro-based APT archive.
As a quick reminder, several semantically different types of DataLad datasets are used for package building and distribution/archive maintenance with datalad-debian:
builder dataset: DataLad dataset that tracks one or more (containerized) build environments, set up for use with datalad containers-run (provided by the DataLad containers extension package).
package dataset: DataLad dataset that tracks versions of a single Debian source package, and the binary packages built for each version across all target architectures. This dataset also tracks the builder dataset with the environment used to build binary packages.
distribution dataset: DataLad dataset that tracks any number of package datasets (one for each Debian source package included in that particular distribution) and a builder dataset with the build environment(s) used to build binary packages for the distribution.
APT-repository dataset: DataLad dataset that tracks the content of an APT repository that would typically be exposed via a web-server (i.e., crucially the dists/ and pool/ directories). For technical or legal reasons (number of files, split free and non-free components, etc.) this dataset may optionally have additional subdatasets.
archive dataset: DataLad dataset that tracks an APT-repository dataset, and any number of distribution datasets used to populate the APT-repository (presently with the reprepro tool).
Roles
Any action associated with building and distributing Debian packages with this system involves accessing, modifying, and publishing (pushing) DataLad datasets. The individual roles people can have are associated with different sets of required access permissions.
package maintainer: Updates a package dataset with new (backported) versions of a source package. Needs read/write access to a (particular) package dataset.
distribution maintainer: Reviews and accepts/rejects new or updated package datasets (updated by a package maintainer or a package builder) by incorporating them into a distribution dataset. Needs read access to any package dataset and write access to a (particular) distribution dataset.
package builder: Maintains a (containerized) build environment for a distribution in a builder dataset and updates package datasets with built binary Debian packages. Needs read/write access to a builder dataset and package datasets.
archive maintainer: Adds distribution datasets to an archive dataset, and populates the APT-repository dataset from distribution updates. Needs read access to distribution datasets and package datasets, plus write access to archive dataset and APT-repository dataset.
mirror operator: Deploys (updates of) an APT-repository dataset on a particular (additional) server infrastructure. Needs read access to the APT-repository dataset.
It is important to point out that most write access permissions mentioned above do not necessarily demand actual write access to a service infrastructure. Given the decentralized nature of Git and DataLad, any contribution can also be handled via a pull-request-like approach that merely requires a notification of an authorized entity with read access to the modification deposited elsewhere.
Storage and hosting
The decentralized data management approach offers great flexibility regarding choices for storing files and datasets, and environments suitable for performing updates. For small efforts, even the simplistic approach of using a monolithic directory (as shown in the initial walk-through) may be good enough. However, larger operations involving multiple people and different levels of trust benefit from a stricter compartmentalization.
APT-repository dataset checkout: This location is typically exposed via a web-server and is the APT-client-facing end of the system. In addition to being used by APT for its normal operation, it can also be (git/datalad) cloned from (including historic versions), and is capable of hosting and providing previous versions of any repository content in a debian-snapshot-like fashion. Only archive maintainers need write access, while read access is typically "public" or anonymous.
Hosting of archive/APT-repository datasets: Especially for large archives, update operations of these two datasets can be expensive, and generally a full (nested) checkout is required for any modification. Consequently, it typically makes sense to maintain a long-lived checkout location for them (possibly in a way that allows for directly exposing the checkout of the APT-repository subdataset via a web server).
Hosting of distribution datasets: These need to have read access for all maintainer roles, and write access for distribution maintainers only. Larger efforts might benefit from a proper Git hosting solution in order to be able to process incoming package update requests via a pull-request or issue tracking system.
Hosting of package datasets: Package maintainers need to deposit source and binary package files in a way that enables package builders and archive maintainers to retrieve them. Any DataLad-supported sibling type can be used, including git-annex aware systems like GIN or RIA-stores, or dual hosting approaches with a Git-hoster for the dataset and a (cloud) storage solution for file content.
Hosting of builder datasets: The requirements for hosting are largely identical to those of package datasets, except that the build environments generally should be accessible to any package maintainer too.
Workspaces for particular actions
The following sections show concrete examples of workflows suitable for collaboratively working with this system. Given the available flexibility, these cannot be considered one-size-fits-all, and there will be hints at alternative approaches that may be better suited for some setups.
Importantly, we will focus on a "clean-desk" approach that minimizes the number of long-lived locations that imply maintenance and coordination costs.
Create a distribution dataset with a package builder
We start with standard preparation steps of a distribution dataset: create, configure and bootstrap builder, save.
datalad deb-new-distribution bullseye
datalad -C bullseye deb-configure-builder --dataset builder dockerbase=debian:bullseye
datalad -C bullseye deb-bootstrap-builder --dataset builder
datalad -C bullseye save -d . --message "Update distribution builder" builder
We are not planning to keep the just-created datasets around in this location, but rather push them to services or locations that are more appropriate for collaboration or archiving. Hence we need to inform the dataset annexes that it is not worth tracking this location. This is not needed for the distribution dataset itself, as it has no annex.
datalad -C bullseye/ foreach-dataset --subdatasets-only git annex dead here
We will place the two datasets, distribution and builder, in two different locations, with different access permissions matching the differences in target audiences.
For the sake of keeping this example working as a self-contained copy/paste demo, we are only using "RIA"-type DataLad siblings. However, this is not a requirement and alternatives will be pointed out below.
The distribution dataset is put in a place suitable for collaboration. The create-sibling-ria call below creates a dataset store and places a dataset clone in it, configured for group-shared access (here using the dialout group, simply because it likely is a group that a user trying out the demo is already part of; in a real-world deployment this might be bullseye-maintainers). More typical would be to use create-sibling-gitlab or create-sibling-github (or similar) to establish a project on a proper Git hosting service that also provides an issue tracker and pull-request management support to streamline collaborative maintenance.
datalad -C bullseye/ create-sibling-ria -s origin --new-store-ok \
--shared group --group dialout --alias dist-bullseye \
ria+file:///tmp/wt/gitlab
The builder dataset is tracking large-ish build environment images, and needs a place to push this file content to. Moreover, it likely makes sense to limit push access to a particular group of people. For this demo, we simply use a different RIA store, with a different group setting (floppy is again a random-but-likely-existing choice).
datalad -C bullseye/builder/ create-sibling-ria -s origin --new-store-ok \
--shared group --group floppy --alias builder-bullseye \
ria+file:///tmp/wt/internal
With the remote sibling created, we can push the datasets (recursively, i.e., both together), and drop them entirely from our workspace.
datalad -C bullseye/ push -r --to origin
datalad drop --what all -r -d bullseye
drop checks that nothing unrecoverable is left before wiping out the repositories. Cleaning the workspace completely ensures that any and all content is placed in proper hosting/archive solutions.
Create an archive dataset
An archive dataset has different hosting demands. The reprepro tool essentially requires the full work tree to be present at all times. For large APT archives even a clone may take considerable time. Hence we create the archive dataset in the location where it would/could live semi-persistently.
We add our distribution dataset from the collaboration-focused dataset store (GitLab placeholder). The two need not live on the same machine; any source URL that DataLad supports is suitable. The distribution dataset clone inside the archive dataset need not stay there permanently, but can be dropped and re-obtained as needed.
datalad deb-new-reprepro-repository archive
datalad deb-add-distribution -d archive/ ria+file:///tmp/wt/gitlab#~dist-bullseye bullseye
# minimalistic reprepro config
cat << EOT > archive/conf/distributions
Codename: bullseye
Components: main
Architectures: source amd64
EOT
datalad save -d archive
Add a package to a distribution
With the archive dataset ready, we need to start populating the distribution dataset. Importantly, this need not be done in the existing clone inside the archive dataset, but can be performed in an ephemeral workspace.
We make a temporary clone, and add a package dataset to it.
datalad clone ria+file:///tmp/wt/gitlab#~dist-bullseye dist
datalad deb-new-package -d dist demo
This new package dataset is another item that a group of package maintainers could collaborate on, hence we put this on "GitLab" too.
datalad -C dist/packages/demo/ create-sibling-ria -s origin --alias pkg-demo \
ria+file:///tmp/wt/gitlab
When a distribution maintainer needs to pull an update, we want them to know about this "upstream" location, hence we register it as the subdataset URL.
datalad subdatasets -d dist \
--set-property url "$(git -C dist/packages/demo remote get-url origin)" \
dist/packages/demo
As before, this initial location of the newly created package dataset is of no relevance, so we tell the dataset to forget about it, push everything to the respective hosting (incl. the update of the distribution dataset with the addition), and clean the entire workspace.
git -C dist/packages/demo annex dead here
datalad -C dist push --to origin -r
datalad drop --what all -r -d dist
Update a package
Updating a package dataset only requires access to the particular package dataset to be updated, and can, again, be done in an ephemeral workspace. Here we clone via SSH to indicate that this could be performed anywhere.
datalad clone --reckless ephemeral ria+ssh://localhost/tmp/wt/gitlab#~pkg-demo pkg
A key task of updating a package dataset is adding a new source package version. This can involve arbitrary procedures. Here we simply download a ready-made source package from Debian. Alternatively, a source package could be generated via git-buildpackage from a linked packaging repo, or something equivalent.
datalad -C pkg run \
-m "Add version 2.10-2 source package" \
dget -d -u \
https://snapshot.debian.org/archive/debian/20190513T204548Z/pool/main/h/hello/hello_2.10-2.dsc
Using datalad run automatically tracks the associated provenance and saves the outcome, so we can, again, push the result and clean the entire workspace.
datalad -C pkg push
datalad drop --what all -r -d pkg
Update a package in a distribution
A package maintainer updating a package dataset does not automatically alter the state of the package in the context of a particular distribution. Package maintainers need to inform the distribution maintainers about their intention to update a package (e.g., via a filed issue, or a post-update hook trigger, etc.). Once the to-be-updated package is known, a distribution maintainer can perform the update in an ephemeral workspace.
We make a temporary clone of the distribution dataset, obtain the respective package dataset, and update it from the upstream location on record (or a new one that was communicated by some external means).
datalad clone --reckless ephemeral ria+file:///tmp/wt/gitlab#~dist-bullseye dist
datalad -C dist get -n packages/demo/
datalad update -d dist -r --how reset packages/demo/
This changes the recorded state of the package dataset within the distribution dataset, equivalent to an update of a versioned link.
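To verify what the distribution dataset now records, the subdataset pointer can be inspected (a quick sketch):
# show the package dataset commit the distribution now points at
git -C dist submodule status packages/demo
# confirm there are no unsaved modifications left behind
datalad status -d dist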
It is likely advisable to not rely on the upstream location of the package dataset being persistent. A distribution maintainer can hence push the package dataset to some trusted, internal infrastructure too, in order to make the distribution effort self-sufficient. For this demo, we push to the internal RIA store, but again, any DataLad sibling type would work in principle.
datalad -C dist/packages/demo create-sibling-ria -s internal \
--existing reconfigure --shared group --group floppy \
--alias pkg-demo \
ria+file:///tmp/wt/internal
datalad -C dist/packages/demo push --to internal
All that remains to be done is to also push the distribution dataset back to "GitLab" and clean the workspace.
datalad -C dist push
datalad drop --what all -r -d dist
Build binary packages for a distribution
Building binary packages from source packages can be done by package maintainers and only requires the package dataset, because it also links the builder dataset for a distribution. With the provenance tracking provided by DataLad, distribution maintainers could even programmatically verify that a particular binary package was actually built with the correct environment, and even whether such a build is reproducible. However, builds are also often done by automated systems.
Such a system needs to perform the following steps, in a temporary workspace: first, clone the package dataset (here we take it from the trusted internal storage solution).
datalad clone --reckless ephemeral ria+ssh://localhost/tmp/wt/internal#~pkg-demo pkg
Once the binary packages are built, we want to push them to the internal storage too. We configure a publication dependency to make this happen automatically later on.
datalad -C pkg siblings configure -s origin --publish-depends internal-storage
Now the package can be built. In some cases it may be desirable to automatically update the build environment prior to building. Here we use the exact builder version linked to the package dataset.
datalad -C pkg/ deb-build-package hello_2.10-2.dsc
Once the build succeeded, the outcome can be pushed and the workspace cleaned up.
datalad -C pkg push
# https://github.com/psychoinformatics-de/datalad-debian/issues/118
sudo rm -rf pkg/builder/cache
datalad drop --what all -r -d pkg
Update a package with additional builds in a distribution
Updating a package dataset within a distribution dataset because additional binary packages were built by a build maintainer is similar to an update due to a new source package added by a package maintainer. Again, this is possible in a temporary workspace.
datalad clone --reckless ephemeral ria+ssh://localhost/tmp/wt/gitlab#~dist-bullseye dist
The main difference is that we instruct DataLad to retrieve the respective package dataset not from the "upstream" location, but from internal storage.
# configure to only retrieve package datasets from trusted storage
# must not be ria+file:// due to https://github.com/datalad/datalad/issues/6948
DATALAD_GET_SUBDATASET__SOURCE__CANDIDATE__100internal='ria+ssh://localhost/tmp/wt/internal#{id}' \
datalad -C dist get -n packages/demo/
This change makes sure that we need not worry about unapproved upstream modifications showing up at this stage.
Now we can update, push, and clean up as usual, and end with an empty workspace.
datalad update -d dist -r --how reset packages/demo/
datalad -C dist push
datalad drop --what all -r -d dist
Ingest package updates into an archive dataset
We can also use the same trick to only pull package datasets from internal storage when updating the archive dataset:
DATALAD_GET_SUBDATASET__SOURCE__CANDIDATE__100internal='ria+file:///tmp/wt/internal#{id}' \
datalad -C archive deb-update-reprepro-repository
As explained in the initial walk-through, this step automatically detects changes in the linked distributions and ingests them into the archive. Any and all distribution datasets could be dropped again afterwards to save on storage demands.
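Mirroring the clean-up from the basic walk-through, this could look like:
# reclaim space; content remains recoverable from the hosting locations
datalad -C archive drop -d . --what all -r distributions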
Recreate (new) archive dataset from scratch
The order in which the steps above were presented is not strictly defined. Most components can be (re)created at a later or different point in time.
Here is an example of how an archive dataset can be created and populated from scratch, given a distribution dataset.
# create new archive
datalad deb-new-reprepro-repository archive
# pull distribution dataset from collab space
datalad deb-add-distribution -d archive/ \
ria+file:///tmp/wt/gitlab#~dist-bullseye bullseye
# configure reprepro as needed
cat << EOT > archive/conf/distributions
Codename: bullseye
Components: main
Architectures: source amd64
EOT
datalad save -d archive
# configure distribution dataset clone to always pull its package
# (sub)datasets from internal storage
datalad configuration -d archive/distributions/bullseye \
--scope local \
set 'datalad.get.subdataset-source-candidate-100internal=ria+file:///tmp/wt/internal#{id}'
# ingest all packages
datalad -C archive deb-update-reprepro-repository
Commands and API
Command line reference
The order of commands follows their logical order of execution in a start-to-end packaging workflow.
datalad deb-new-distribution
Synopsis
datalad deb-new-distribution [-h] [-d DATASET] [-f] [--version] [PATH]
Description
Create a new distribution dataset
A typical distribution dataset contains a 'builder' subdataset with one or more build environments, and a package subdirectory with one subdataset per Debian package. This command creates the initial structure: A top-level dataset under the provided path and a configured 'builder' subdataset underneath.
Examples
Create a new distribution dataset called 'bullseye' in the current directory:
% datalad deb-new-distribution bullseye
Options
PATH
path where the dataset shall be created, directories will be created as necessary. If no location is provided, a dataset will be created in the location specified by --dataset (if given) or the current working directory. Either way the command will error if the target directory is not empty. Use --force to create a dataset in a non-empty directory. Constraints: value must be a string or Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
specify a dataset whose configuration to inspect rather than the global (user) settings. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
-f, --force
enforce creation of a dataset in a non-empty directory.
--version
show the module and its version which provides the command
datalad deb-configure-builder
Synopsis
datalad deb-configure-builder [-h] [-d DATASET] [-f] [--template PATH] [--version] [property=value [property=value ...]]
Description
Configure a package build environment
A builder is a (containerized) build environment used to build binary Debian packages from Debian source packages. This command is typically run on the builder dataset in a distribution dataset and configures a builder recipe based on a template and user-specified values for the template's placeholders. The resulting recipe will be placed in the 'recipes/' directory of the builder dataset.
The following directory tree illustrates this. The configured builder takes the form of a Singularity recipe here.
bullseye/                     <- distribution dataset
├── builder                   <- builder subdataset
│   ├── envs
│   │   └── README.md
│   ├── recipes
│   │   ├── README.md
│   │   └── singularity-any   <- builder configuration
│   ├── init                  <- additional builder content
│   │   ├── README.md
│   │   ├── finalize/         <- post-processing executables
│   │   └── ...
Currently supported templates are:
Template 'default'
This is a Singularity recipe with the following configuration items:
dockerbase (required): name of a Docker base image for the container, i.e. 'debian:bullseye'
debian_archive_sections (optional): which sections of the Debian package archive to enable for APT in the build environment. To enable all sections set to 'main contrib non-free'. Default: 'main'
Any files placed in init/ will be copied into the build environment when it is being bootstrapped, right after the base operating system was installed. This can be used to, for example, configure additional APT sources, by placing a sources.list file into init/etc/apt/sources.list.d/..., and a corresponding GPG key into init/usr/share/keyrings/....
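A hypothetical example of that pattern (file and key names are assumptions):
# stage an additional APT source and its keyring in the builder dataset
mkdir -p init/etc/apt/sources.list.d init/usr/share/keyrings
echo 'deb [signed-by=/usr/share/keyrings/extra.gpg] https://example.org/debian bullseye main' \
    > init/etc/apt/sources.list.d/extra.list
cp /path/to/extra.gpg init/usr/share/keyrings/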
Any executables placed into init/finalize/ will be executed at the very end of the bootstrapping process. A finalizer (script) could be used to adjust file permissions, or make arbitrary other modifications without having to adjust the environment recipe directly.
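A hypothetical finalizer script (its name and the adjustment it makes are assumptions):
#!/bin/sh
# e.g. init/finalize/01-perms, executed at the very end of bootstrapping
chmod -R go+rX /usr/local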
Examples
Configure the default Singularity recipe in the builder subdataset, executed from a distribution superdataset:
% datalad deb-configure-builder -d builder dockerbase=debian:bullseye
Options
property=value
Values to replace placeholders in the specified template.
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
Specify a builder dataset in which an environment will be defined. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
-f, --force
enforce creation of a package dataset in a non-empty directory.
--template PATH
Builder recipe template. This is a text file with placeholders in Python string formatting syntax. Constraints: value must be a string or value must be NONE [Default: 'default']
--version
show the module and its version which provides the command
datalad deb-bootstrap-builder
Synopsis
datalad deb-bootstrap-builder [-h] [-d DATASET] [--version]
Description
Bootstrap a build environment
This command bootstraps a (containerized) build environment (such as a Singularity container) based on an existing builder configuration (such as a Singularity recipe).
The execution of this command might require administrative privileges and could prompt for a sudo password, for example to build a Singularity image. The resulting bootstrapped build environment will be placed inside an 'envs/' subdirectory of a 'builder/' dataset.
The following directory tree illustrates this. The configured builder takes the form of a Singularity recipe here.
bullseye                      <- distribution dataset
├── builder                   <- builder subdataset
│   ├── envs
│   │   ├── README.md
│   │   └── singularity-amd64.sif   <- bootstrapped build environment
│   └── recipes
│       ├── README.md
│       └── singularity-any         <- builder configuration
Examples
Bootstrap a configured build environment in a builder subdataset, from a distribution dataset:
% datalad deb-bootstrap-builder -d builder
Options
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
specify a builder dataset that contains a build environment configuration. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
--version
show the module and its version which provides the command
datalad deb-new-package
Synopsis
datalad deb-new-package [-h] [-d DATASET] [-f] [--version] NAME
Description
Create a new package dataset inside of a distribution dataset
In its final stage, a typical package dataset contains the source files, built binaries, and builder subdataset for a Debian package. This command creates the initial structure: a package dataset in the 'packages/' subdirectory of a distribution dataset and a 'builder' subdataset underneath it. It should be run in the root of a distribution dataset with a configured and bootstrapped builder, as the distribution's 'builder' subdataset will be registered in the package dataset to be used to build the binaries. To prevent package name clashes within a distribution dataset, it is advisable to use the Debian package's name as the name for the package dataset.
Examples
Create a new package dataset 'hello' in a distribution dataset:
% datalad deb-new-package hello
Options
NAME
name of the package to add to the distribution. Constraints: value must be a string or value must be NONE
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
specify a distribution dataset to add the package to. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
-f, --force
enforce creation of a package dataset in a non-empty directory.
--version
show the module and its version which provides the command
datalad deb-build-package
Synopsis
datalad deb-build-package [-h] [-d DATASET] [--update-builder] [--version] DSC
Description
Build binary packages
Perform a provenance tracked build of a binary Debian package from a .dsc file in a package dataset. The command relies on a (containerized) build environment within a package's 'builder' subdataset. The 'builder' subdataset can optionally be updated beforehand.
Beyond binary .deb files, this command creates a .changes file, a .buildinfo file, and a logs/*.txt build log with build metadata and provenance. All resulting files are placed into the root of the package dataset.
Examples
Build a binary package from a Debian package's source .dsc file:
% datalad deb-build-package hello_2.10-2.dsc
Options
DSC
Specify the .dsc source file to build from. Constraints: value must be a string or value must be NONE
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
Specify the package dataset of the to-be-built package. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
--update-builder
Update the builder subdataset from its origin before package build.
--version
show the module and its version which provides the command
datalad deb-new-reprepro-repository
Synopsis
datalad deb-new-reprepro-repository [-h] [-d DATASET] [-f] [--version] [PATH]
Description
Create a new (reprepro) package repository dataset
Examples
Options
PATH
path where the dataset shall be created, directories will be created as necessary. If no location is provided, a dataset will be created in the location specified by --dataset (if given) or the current working directory. Either way the command will error if the target directory is not empty. Use --force to create a dataset in a non-empty directory. Constraints: value must be a string or Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
specify a dataset whose configuration to inspect rather than the global (user) settings. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
-f, --force
enforce creation of a dataset in a non-empty directory.
--version
show the module and its version which provides the command
datalad deb-update-reprepro-repository
Synopsis
datalad deb-update-reprepro-repository [-h] [-d DATASET] [--version] [PATH]
Description
Update a (reprepro) Debian archive repository dataset
Examples
Options
PATH
path to constrain the update to. Constraints: value must be a string or Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
specify a dataset to update. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
--version
show the module and its version which provides the command
datalad deb-add-distribution
Synopsis
datalad deb-add-distribution [-h] [-d DATASET] [--version] SOURCE NAME
Description
Add a distribution dataset to a Debian archive repository dataset
Examples
Options
SOURCE
URL, DataLad resource identifier, local path or instance of distribution dataset to be added. Constraints: value must be a string
NAME
name to add the distribution dataset under (directory distributions/<name>). The name should equal the codename of a configured distribution in the archive. If multiple distribution datasets shall target the same distribution, their name can append a '-<flavor-label>' suffix to the distribution codename. Constraints: value must be a string or value must be NONE
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
specify the Debian archive repository dataset to add the distribution to. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
--version
show the module and its version which provides the command
Python module reference
The order of commands follows their logical order of execution in a start-to-end packaging workflow.
The same functionality is available via the DataLad Python API; command names use underscores instead of dashes:
deb_new_distribution: Create a new distribution dataset
deb_configure_builder: Configure a package build environment
deb_bootstrap_builder: Bootstrap a build environment
deb_new_package: Create a new package dataset inside of a distribution dataset
deb_build_package: Build binary packages
deb_new_reprepro_repository: Create a new (reprepro) package repository dataset
deb_update_reprepro_repository: Update a (reprepro) Debian archive repository dataset
deb_add_distribution: Add a distribution dataset to a Debian archive repository dataset