[gentoo-dev] [pre-GLEP] Gentoo binary package container format
Michał Górny
2018-11-17 11:21:40 UTC
Hi,

Here's a pre-GLEP draft based on the earlier discussion on gentoo-
portage-dev mailing list. The specification uses GLEP form as it
provides for cleanly specifying the motivation and rationale.

(Note: the number assignment is not official; I just took the next number
to satisfy the glep converter script.)

Also available via HTTPS:

rst: https://dev.gentoo.org/~mgorny/tmp/glep-0078.rst
html: https://dev.gentoo.org/~mgorny/tmp/glep-0078.html

---
GLEP: 78
Title: Gentoo binary package container format
Author: Michał Górny <***@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-16
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are listed. Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed. The rationale for
various design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: a header-oriented compressed
.tar format (used to hold package files) and a trailer-oriented custom
XPAK format (used to hold metadata) [#MAN-XPAK]_. The format has already
been extended incompatibly twice.

The first time, support for storing multiple successive builds of
a binary package for a single ebuild version was added. This feature
relies on appending an additional hyphen, followed by an integer, to
the package filename. It is disabled by default (preserving backwards
compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats was
added. When a format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
the compression used. For backwards compatibility, Portage still
defaults to using bzip2; the compression program can be switched using
the ``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies. In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


Problems with the current binary package format
-----------------------------------------------

The following problems were identified with the package format currently
in use:

1. **The packages rely on a custom binary archive format to store
metadata.** It is entirely Gentoo-invented and requires dedicated
tooling to work with it. In fact, the reference implementation
in Portage does not even include a CLI tool to work with tbz2
packages; an unofficial implementation is provided as part
of the portage-utils toolkit [#PORTAGE-UTILS]_.

2. **The format relies on an obscure compressor feature of ignoring
trailing garbage.** While this behavior is traditionally implemented
by many compressors, the original reasons for it have long since become
irrelevant and it is not surprising that new compressors do not
support it. In particular, Portage already hit this problem twice:
once when users replaced bzip2 with the parallel-capable pbzip2
implementation [#PBZIP2]_, and the second time when support for the zstd
compressor was added [#ZSTD]_.

3. **Placing metadata at the end of file makes partial fetches
complex.** While it is technically possible to obtain package
metadata remotely without fetching the whole package, it usually
requires 2-3 HTTP requests and a rather complex driver. By
comparison, if the metadata were placed at the beginning of the file,
an early-terminated pipeline with a single fetch request would suffice.

4. **Extending the format with OpenPGP signatures is non-trivial.**
Depending on the implementation details, it requires either fetching
an additional detached signature, breaking backwards compatibility, or
introducing more custom logic to reassemble OpenPGP packets.

5. **Metadata is not compressed.** This is not a significant problem;
it is listed only for completeness.
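
To illustrate problem 3, here is a rough sketch of how a reader has to
locate the XPAK segment starting from the end of the file. The layout used
(``XPAKPACK`` header, ``XPAKSTOP`` footer, then a 4-byte big-endian segment
length and ``STOP``) is my reading of xpak(5) and should be treated as an
assumption; the index and data blobs are opaque placeholders, and
``build_fake_tbz2`` and ``locate_xpak`` are hypothetical helper names.

```python
import struct


def build_fake_tbz2(tarball: bytes, index: bytes, data: bytes) -> bytes:
    """Assemble a synthetic tbz2-style blob: payload + XPAK segment + trailer."""
    segment = (
        b"XPAKPACK"
        + struct.pack(">II", len(index), len(data))
        + index
        + data
        + b"XPAKSTOP"
    )
    # Assumed trailer: 4-byte big-endian segment length, then "STOP".
    return tarball + segment + struct.pack(">I", len(segment)) + b"STOP"


def locate_xpak(blob: bytes):
    """Return (start, end) offsets of the XPAK segment, or None if absent."""
    # Step 1: the last 16 bytes (remotely, a first range request).
    trailer = blob[-16:]
    if trailer[-4:] != b"STOP" or trailer[:8] != b"XPAKSTOP":
        return None
    (segment_len,) = struct.unpack(">I", trailer[8:12])
    end = len(blob) - 8          # segment ends before the length + "STOP"
    start = end - segment_len
    # Step 2: the segment header (remotely, a second range request).
    if start < 0 or blob[start:start + 8] != b"XPAKPACK":
        return None
    return start, end
```

Over HTTP each step maps to a separate range request, and a further one is
needed to fetch the segment itself, which is where the "2-3 requests"
estimate comes from.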


Goals for a new container format
--------------------------------

The following goals have been set for a replacement format:

1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.

2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** In particular, it is unacceptable
to create new binary formats.

3. **The file format should provide for partial fetching of binary
packages.** It should be possible to easily fetch and read
the package metadata without having to download the whole package.

4. **The file format must provide support for OpenPGP signatures.**
Preferably, it should use standard OpenPGP message formats.

5. **The file format must allow for efficient metadata updates.**
In particular, it should be possible to update the metadata without
having to recompress package files.

6. **The file format should account for easy recognition both through
filename and through contents.** Preferably, it should have distinct
features making it possible to detect it via file(1).

7. **The file format should allow for metadata compression.**

8. **The file format should make future extensions easily possible
without breaking backwards compatibility.**


Specification
=============

The container format
--------------------

The gpkg package container is an uncompressed .tar archive whose filename
uses the ``.gpkg.tar`` suffix. This archive contains the following members,
in order:

1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).

2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
(optional).

3. The metadata archive ``metadata.tar${comp}``, optionally compressed
(required).

4. A signature for the filesystem image archive:
``image.tar${comp}.sig`` (optional).

5. The filesystem image archive ``image.tar${comp}``, optionally
compressed (required).

It is recommended that the relative order of the archive members be
preserved. However, implementations must support archives with members
out of order.

The container may be extended with additional members in the future.
The implementations should ignore unrecognized members and preserve
them across package updates.
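
As a minimal sketch of the layout above, the container can be assembled
with Python's ``tarfile`` module. Several things here are assumptions and
not part of the specification: the volume label is encoded as a GNU volume
header (type flag ``V``), the inner archives are left uncompressed and
unsigned, and ``build_gpkg``, ``tar_bytes`` and the identifier ``foo-1.0``
are made up for illustration.

```python
import io
import tarfile


def tar_bytes(files: dict, top_dir: str) -> bytes:
    """Pack {relative_path: bytes} under top_dir into an uncompressed tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tf:
        for path, data in sorted(files.items()):
            info = tarfile.TarInfo(f"{top_dir}/{path}")
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_gpkg(package_id: str, metadata: dict, image: dict) -> bytes:
    """Write label, metadata archive and image archive, in that order."""
    members = [
        ("metadata.tar", tar_bytes(metadata, "metadata")),
        ("image.tar", tar_bytes(image, "image")),
    ]
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as tf:
        label = tarfile.TarInfo(f"gpkg: {package_id}")
        label.type = b"V"  # assumption: GNU-style volume header
        tf.addfile(label)
        for name, payload in members:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return out.getvalue()
```

Listing the result with ``tarfile`` or ``tar -t`` shows the members in the
order they were written.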


The volume label
----------------

The volume label provides an easy way for users to identify the binary
package without dedicated tooling or specific format knowledge.

The implementations should include a volume label consisting of the fixed
string ``gpkg:``, followed by a single space, followed by the full package
identifier. However, the implementations must not rely on the volume
label being present or attempt to parse its value when it is.

Furthermore, since the volume label is included in the .tar archive
as the first member, it provides a magic string at a fixed location
that can be used by tools such as file(1) to easily distinguish Gentoo
binary packages from regular .tar archives.
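
A file(1)-style check could look roughly like the sketch below. It assumes
the label is written as a GNU volume header, i.e. the first 512-byte tar
header carries the label in its name field (offset 0) and ``V`` in its type
flag (offset 156); the specification itself does not spell this out, and
both function names are hypothetical.

```python
import tarfile


def looks_like_gpkg(first_block: bytes) -> bool:
    """Check the first tar header block for a 'gpkg: ' volume label."""
    if len(first_block) < 512:
        return False
    name = first_block[:100]         # tar header: member name field
    typeflag = first_block[156:157]  # tar header: type flag
    return typeflag == b"V" and name.startswith(b"gpkg: ")


def make_label_header(package_id: str) -> bytes:
    """Build a 512-byte volume header block, for demonstration only."""
    label = tarfile.TarInfo(f"gpkg: {package_id}")
    label.type = b"V"
    return label.tobuf(format=tarfile.GNU_FORMAT)
```

Because the label is optional, a negative result only means "no label
found", not "not a gpkg".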


The metadata archive
--------------------

The metadata archive stores the package metadata needed for the package
manager to process it. The archive should be included at the beginning
of the binary package in order to make it possible to read it out of
a partially fetched binary package, and to avoid fetching the remaining
part of the package if not necessary.

The archive contains a single directory called ``metadata``. In this
directory, the individual metadata keys are stored as files. The exact
keys and metadata format are outside the scope of this specification.
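
One possible way to map a key/value dictionary onto this layout is
sketched here. The key names used in any example (``CATEGORY``, ``SLOT``)
and the UTF-8 text encoding are assumptions, since the keys and their
format are explicitly out of scope; ``pack_metadata`` and
``unpack_metadata`` are hypothetical names.

```python
import io
import tarfile

PREFIX = "metadata/"


def pack_metadata(keys: dict) -> bytes:
    """Store each key as a file below the 'metadata' directory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tf:
        for key, value in sorted(keys.items()):
            data = value.encode("utf-8")
            info = tarfile.TarInfo(PREFIX + key)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack_metadata(archive: bytes) -> dict:
    """Read the files below 'metadata' back into a dictionary."""
    keys = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tf:
        for member in tf:
            if member.isfile() and member.name.startswith(PREFIX):
                data = tf.extractfile(member).read()
                keys[member.name[len(PREFIX):]] = data.decode("utf-8")
    return keys
```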

The package manager may need to modify the package metadata. In this
case, it should replace the metadata archive without having to alter
other package members.

The metadata archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.


The image archive
-----------------

The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.

The archive contains a single directory called ``image``. Inside this
directory, all package files are stored in filesystem layout, relative
to the root directory.

The image archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.
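
Reading the image archive then amounts to stripping the leading ``image``
component from each member path. The sketch below only yields regular
files and skips anything outside ``image/`` or containing ``..``; how a
real implementation treats symlinks, devices, ownership and so on is not
covered, and ``iter_image_files`` is a hypothetical name.

```python
import io
import posixpath
import tarfile


def iter_image_files(image_tar: bytes):
    """Yield (path relative to the root, contents) for regular files."""
    with tarfile.open(fileobj=io.BytesIO(image_tar), mode="r:") as tf:
        for member in tf:
            parts = posixpath.normpath(member.name).split("/")
            if len(parts) < 2 or parts[0] != "image" or ".." in parts:
                continue  # not below image/, or a suspicious path
            if member.isfile():
                yield "/".join(parts[1:]), tf.extractfile(member).read()
```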


Archive member compression
--------------------------

The archive members outlined above support optional compression using
one of the compressed file formats supported by the package manager.
The exact list of compression types is outside the scope of this
specification.

The implementations must support archive members being uncompressed,
and must support using different compression types for different files.

When compressing an archive member, the member filename should be
suffixed using the standard suffix for the particular compressed file
type (e.g. ``.bz2`` for bzip2 format).
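
A reader could derive the compression type from the member name along
these lines. The suffix table is only an example of what a package manager
might support (the actual list is out of scope here), and
``split_member_name`` is a hypothetical helper.

```python
# Example suffix table; the set of supported formats is an assumption.
COMPRESSION_BY_SUFFIX = {
    "": None,  # uncompressed member
    ".gz": "gzip",
    ".bz2": "bzip2",
    ".xz": "xz",
    ".lz": "lzip",
    ".zst": "zstd",
}


def split_member_name(name: str):
    """Return (role, compression) for metadata.tar${comp} / image.tar${comp}.

    Returns None for anything else, including .sig files and unknown
    suffixes.
    """
    for role in ("metadata", "image"):
        prefix = role + ".tar"
        if name.startswith(prefix):
            suffix = name[len(prefix):]
            if suffix in COMPRESSION_BY_SUFFIX:
                return role, COMPRESSION_BY_SUFFIX[suffix]
    return None
```

Since each member is resolved independently, a package with a compressed
image and an uncompressed metadata archive needs no special handling.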


OpenPGP member signatures
-------------------------

The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages.

If the signatures are expected and the archive member is unsigned, the
package manager must reject processing it. If the signature does not
verify, the package manager must reject processing the corresponding
archive member. In particular, it must not attempt decompressing
compressed members in those circumstances.

If the implementation needs to manipulate archive members, it must
either create a new signature or discard the existing signature.

The signatures are created as binary detached OpenPGP signature files,
with filename corresponding to the member filename with ``.sig`` suffix
appended.
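
The acceptance rules above can be condensed into a small decision
function. This is only a sketch of the policy: ``verify`` is a
hypothetical callback standing in for real OpenPGP verification, which is
deliberately not implemented here.

```python
from typing import Callable, Optional


def may_process(
    member: bytes,
    signature: Optional[bytes],
    signatures_required: bool,
    verify: Callable[[bytes, bytes], bool],
) -> bool:
    """Decide whether a member may be processed (e.g. decompressed)."""
    if signature is None:
        # Unsigned member: acceptable only if signatures are not expected.
        return not signatures_required
    # A signature that is present must verify, whether required or not.
    return verify(member, signature)
```

The check has to run on the member exactly as stored, before any attempt
to decompress it.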


Rationale
=========

Nested archive format
---------------------

The basic problem in designing the new format was how to embed multiple
data streams (metadata, image) into a single file. Traditionally, this
has been done by combining two non-conflicting file formats. However,
while such a solution is clever, it suffers in terms of transparency.

Therefore, it has been established that the new format should really
consist of a single archive format, with all necessary data
transparently accessible inside the file. Consequently, it has been
debated how different parts of binary package data should be stored
inside that archive.

The proposal to continue storing image data as top-level data
in the package format, and to store metadata as a special directory
in that structure, has been discarded as a case of in-band signalling.

Finally, the proposal has been shaped to store different kinds of data
as nested archives in the outer binary package container. Besides
providing a clean way of accessing different kinds of information, it
makes it possible to add separate OpenPGP signatures to them.


Inner vs. outer compression
---------------------------

One of the points in the new format debate was whether the binary
package as a whole should be compressed or only its individual
members. The first option may seem an obvious choice, especially
given that with a larger data set, the compression may proceed more
effectively. However, it has a single strong disadvantage: compression
prevents random access and manipulation of the binary package members.

While for the purpose of reading binary packages, the problem could be
circumvented through convenient member ordering and avoiding disjoint
reads of the binary package, metadata updates would either require
recompressing the whole package (which could be really time consuming
with large packages) or applying complex techniques such as splitting
the compressed archive into multiple compressed streams.

This considered, the simplest solution is to apply compression to
the individual package members, while leaving the container format
uncompressed. It provides fast random access to the individual members,
as well as capability of updating them without the necessity of
recompressing other files in the container.
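
A metadata update under this design might be sketched as follows: the
container is rewritten member by member, the metadata archive is swapped,
and every other member is copied byte for byte without being decompressed.
``replace_member`` is a hypothetical helper; it also drops a stale
``.sig`` for the replaced member, one of the two options the specification
allows.

```python
import io
import tarfile


def replace_member(container: bytes, name_prefix: str,
                   new_name: str, new_payload: bytes) -> bytes:
    """Copy the container, swapping members whose name starts with name_prefix."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(container), mode="r:") as src, \
            tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as dst:
        replaced = False
        for member in src:
            if member.name.startswith(name_prefix):
                if not replaced:
                    info = tarfile.TarInfo(new_name)
                    info.size = len(new_payload)
                    dst.addfile(info, io.BytesIO(new_payload))
                    replaced = True
                continue  # old member and its old signature are discarded
            # Everything else is copied verbatim, still compressed.
            payload = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, payload)
    return out.getvalue()
```

The cost is proportional to the container size in I/O only; no compressor
runs on the image archive.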

This also makes it possible to easily protect compressed files using
standard OpenPGP detached signature format. All this combined,
the package manager may perform partial fetch of binary package, verify
the signature of its metadata member and process it without having to
fetch the potentially-large image part.
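
The partial fetch can be approximated with ``tarfile`` in streaming mode
(``r|``), which reads strictly sequentially and never seeks backwards. In
this sketch ``CountingReader`` is a hypothetical stand-in for an HTTP
response body; it only counts how many bytes were pulled from the
underlying stream. Signature verification is left out.

```python
import io
import tarfile


class CountingReader:
    """Wrap a byte stream and count how many bytes were actually read."""

    def __init__(self, raw):
        self.raw = raw
        self.consumed = 0

    def read(self, size=-1):
        chunk = self.raw.read(size)
        self.consumed += len(chunk)
        return chunk


def read_metadata_member(stream):
    """Return (name, payload) of the first metadata.tar* member, or None."""
    with tarfile.open(fileobj=stream, mode="r|") as tf:
        for member in tf:
            name = member.name
            if (member.isfile() and name.startswith("metadata.tar")
                    and not name.endswith(".sig")):
                # Stop here; the image archive that follows is never read.
                return name, tf.extractfile(member).read()
    return None
```

With the metadata archive near the front, only the first buffered chunk or
two of the container are consumed, regardless of how large the image is.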


Container and archive formats
-----------------------------

During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.

Firstly, the ZIP format has been proposed as the only commonly supported
format supporting adding files from stdin (i.e. making it possible to
pipe the inner archives straight into the container without using
temporary files). However, this format has been clearly rejected
as both not being present in the system set, and being trailer-based
and therefore unusable without having to fetch the whole file.

Secondly, the ar and cpio formats were considered. The former is used
by Debian and its derivatives for binary packages; the latter is used
by Red Hat derivatives. Both formats have the advantage of having less
historical baggage than .tar, and having less overhead. However, both
are also rather obscure (especially given that ar is actually provided
by GNU binutils rather than as a stand-alone archiver), are considered
obsolete by POSIX, and have file size limitations smaller than .tar.

All that considered, it has been decided that there is no purpose
in using a second archive format in the specification unless it has
a significant advantage over .tar. Therefore, .tar has also been used
as the outer package format, even though it has larger overhead than
other formats (mostly due to padding).
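
To put a number on that overhead: each member costs one 512-byte header
plus padding of its payload to a multiple of 512 bytes, and the archive
ends with two zero blocks. The sketch below additionally assumes the
archive is padded to a 10240-byte record, which is what GNU tar and
Python's ``tarfile`` do by default; ``actual_size`` cross-checks the
formula against ``tarfile`` itself. Both function names are hypothetical.

```python
import io
import tarfile

BLOCK = 512     # tar block size
RECORD = 10240  # default record size (20 blocks), an assumption


def tar_container_size(member_sizes, record=RECORD):
    """Predict the size of a tar holding members of the given sizes."""
    total = 0
    for size in member_sizes:
        padded = -(-size // BLOCK) * BLOCK  # round payload up to a block
        total += BLOCK + padded             # one header block per member
    total += 2 * BLOCK                      # end-of-archive marker
    return -(-total // record) * record     # pad to a full record


def actual_size(member_sizes):
    """Measure what tarfile really writes for the same member sizes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tf:
        for i, size in enumerate(member_sizes):
            info = tarfile.TarInfo(f"member{i}")
            info.size = size
            tf.addfile(info, io.BytesIO(b"\0" * size))
    return len(buf.getvalue())
```

For a container with a handful of members the overhead is therefore in
the low kilobytes, independent of the payload size.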


Member ordering
---------------

The member ordering is explicitly specified in order to provide for
trivially reading metadata from partially fetched archives.
By requiring the metadata archive to be stored before the image archive,
the package manager may stop fetching after reading it and save
bandwidth and/or space.


Detached OpenPGP signatures
---------------------------

Detached OpenPGP signatures provide authenticity checks
for binary packages. Covering the complete members with signatures
provides for trivial verification of all metadata and image contents
respectively, without having to invent custom mechanisms for combining
them. Covering the compressed archives helps to prevent zipbomb
attacks. Covering the individual members rather than the whole package
provides for verification of partially fetched binary packages.


Backwards Compatibility
=======================

The format does not preserve backwards compatibility with the tbz2
packages. It has been established that preserving compatibility with
the old format was impossible without making the new format even worse
than the old one was.

For example, adding any visible members to the tarball would cause
them to be installed to the filesystem by old Portage versions. Working
around this would require hacks that run counter to the goal of using
a simple and transparent package format.


Reference Implementation
========================

A proof-of-concept implementation of a binary package format converter
is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
create packages in the new format for early inspection.


References
==========

.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
packages
(https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)

.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
written in C
(https://packages.gentoo.org/packages/app-portage/portage-utils)

.. [#PBZIP2] PBZIP2 - a parallel implementation of the bzip2
block-sorting file compressor
(https://launchpad.net/pbzip2)

.. [#ZSTD] Zstandard - Real-time data compression algorithm
(https://facebook.github.io/zstd/)

.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
to gpkg binpkg format
(https://github.com/mgorny/xpak2gpkg)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
--
Best regards,
Michał Górny
Roy Bamford
2018-11-17 14:05:41 UTC
Post by Michał Górny
Hi,
Here's a pre-GLEP draft based on the earlier discussion on gentoo-
portage-dev mailing list. The specification uses GLEP form as it
provides for cleanly specifying the motivation and rationale.
[snip glep proposal]
--
Best regards,
Michał Górny
Team,

One of the attractions of the existing format is that
tar xf /path/to/tarball -C /mnt/gentoo
works to fix things like glibc being removed and other
missing essential portage components.

In effect, each binary package can be treated as a
single package stage3 when a user needs a get out of jail
free card.

Does this proposal allow for installing the payload without
the use of the Gentoo package manager from some random
distro being used as a rescue media?
--
Regards,

Roy Bamford
(Neddyseagoon) a member of
elections
gentoo-ops
forum-mods
Rich Freeman
2018-11-17 14:17:09 UTC
Post by Roy Bamford
Does this proposal allow for installing the payload without
the use of the Gentoo package manager from some random
distro being used as a rescue media?
Yes, it is a tarball of tarballs. There would be an extra step, but a
vanilla tarball containing the files to be extracted could be
extracted as long as you have tar and the appropriate decompressor
(not specified and could change, but I imagine it will remain bzip2
for now).
--
Rich
Michał Górny
2018-11-17 21:53:55 UTC
Post by Roy Bamford
Team,
One of the attractions of the existing format is that
tar xf /path/to/tarball -C /mnt/gentoo
works to fix things like glibc being removed and other
missing essential portage components.
In effect, each binary package can be treated as a
single package stage3 when a user needs a get out of jail
free card.
Does this proposal allow for installing the payload without
the use of the Gentoo package manager from some random
distro being used as a rescue media?
Yes, and it can also be done via one-liner, though it's going to be more
complex than before, e.g.:

tar -xOf mypackage-1.gpkg.tar mypackage-1/image.tar.lz |
tar --lzip -x -C /mnt/gentoo --strip-components 1

I wouldn't recommend using it, though; it is better to unpack the package
normally and inspect the contents first.
--
Best regards,
Michał Górny
Fabian Groffen
2018-11-18 09:16:44 UTC
Post by Michał Górny
Problems with the current binary package format
-----------------------------------------------
The following problems were identified with the package format currently
in use:
1. **The packages rely on custom binary archive format to store
metadata.** It is entirely Gentoo invented, and requires dedicated
tooling to work with it. In fact, the reference implementation
in Portage does not even include a CLI tool to work with tbz2
packages; an unofficial implementation is provided as part
of portage-utils toolkit [#PORTAGE-UTILS]_.
I think you should rewrite this section to the argument that the
metadata is hard to edit, and that there is only one tool to do so
(except a python interface from Portage?).
On a separate note, I don't think portage-utils can be considered
"unofficial"; it is an official Gentoo project as far as I am aware.
Post by Michał Górny
2. **The format relies on obscure compressor feature of ignoring
trailing garbage**. While this behavior is traditionally implemented
by many compressors, the original reasons for it have become long
irrelevant and it is not surprising that new compressors do not
support it. In particular, Portage already hit this problem twice:
once when users replaced bzip2 with parallel-capable pbzip2
implementation [#PBZIP2]_, and the second time when support for zstd
compressor was added [#ZSTD]_.
I think this is actually the result of a rather opportunistic
implementation. The fault is that we chose to use an extension that
suggests the file is a regular compressed tarball.
When one detects that a file is xpak padded, it is trivial to feed the
decompressor just the relevant part of the datastream. The format
itself isn't bad, and doesn't rely on obscure behaviour.
Post by Michał Górny
3. **Placing metadata at the end of file makes partial fetches
complex.** While it is technically possible to obtain package
metadata remotely without fetching the whole package, it usually
requires e.g. 2-3 HTTP requests with rather complex driver. For
comparison, if metadata was placed at the beginning of the file,
early-terminated pipeline with a single fetch request would suffice.
I think this point needs some quantification of why it is so
important.
I may be wrong, but the average binpkg is small, <1MiB, and bigger
packages are <50MiB.
So what is the gain here? A "few" MiBs saved for what operation
exactly? I say "few" because I know that for some users this is not
just a blip before it's downloaded. So if this is possible to achieve,
in what scenarios is it going to be used (and how often)?
Post by Michał Górny
4. **Extending the format with OpenPGP signatures is non-trivial.**
Depending on the implementation details, it either requires fetching
additional detached signature, breaking backwards compatibility or
introducing more custom logic to reassemble OpenPGP packets.
I think one could add an extra key to the xpak that holds a gpg sig or
something. Perhaps this point is better phrased as that current binpkgs
don't have any validation options defined.
Post by Michał Górny
5. **Metadata is not compressed.** This is not a significant problem,
it is just listed for completeness.
Goals for a new container format
--------------------------------
1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.
2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** In particular, it is unacceptable
to create new binary formats.
I take this as your personal opinion. I don't quite get why it is
unacceptable to create a new binary format though. In particular when
you're looking for efficiency, such format could serve your purposes.
As long as it's clearly defined, I don't see the problem with a binary
format either.
Could you add why it is you think binary formats are unacceptable here?
Post by Michał Górny
3. **The file format should provide for partial fetching of binary
packages.** It should be possible to easily fetch and read
the package metadata without having to download the whole package.
Like above, what is the use-case here? Why would you want this? I
think I'm missing something here.
Post by Michał Górny
4. **The file format must provide support for OpenPGP signatures.**
Preferably, it should use standard OpenPGP message formats.
5. **The file format must allow for efficient metadata updates.**
In particular, it should be possible to update the metadata without
having to recompress package files.
6. **The file format should account for easy recognition both through
filename and through contents.** Preferably, it should have distinct
features making it possible to detect it via file(1).
7. **The file format should allow for metadata compression.**
8. **The file format should make future extensions easily possible
without breaking backwards compatibility.**
--
Fabian Groffen
Gentoo on a different level
Michał Górny
2018-11-18 09:38:51 UTC
Post by Fabian Groffen
Post by Michał Górny
Problems with the current binary package format
-----------------------------------------------
The following problems were identified with the package format currently
in use:
1. **The packages rely on custom binary archive format to store
metadata.** It is entirely Gentoo invented, and requires dedicated
tooling to work with it. In fact, the reference implementation
in Portage does not even include a CLI tool to work with tbz2
packages; an unofficial implementation is provided as part
of portage-utils toolkit [#PORTAGE-UTILS]_.
I think you should rewrite this section to the argument that the
metadata is hard to edit, and that there is only one tool to do so
(except a python interface from Portage?).
On a separate note, I don't think portage-utils can be considered
"unofficial", it is a Gentoo official project as far as I am aware.
In this context, Portage is 'official'. Portage-utils is a project
that's developed entirely separately from Portage and doesn't use
Portage APIs but instead reinvents everything. As such, it is easy for
the two to go out of sync. Or for one of them to have bugs that
the other one doesn't have (say, with endianness).
Post by Fabian Groffen
Post by Michał Górny
2. **The format relies on obscure compressor feature of ignoring
trailing garbage**. While this behavior is traditionally implemented
by many compressors, the original reasons for it have become long
irrelevant and it is not surprising that new compressors do not
support it. In particular, Portage already hit this problem twice:
once when users replaced bzip2 with parallel-capable pbzip2
implementation [#PBZIP2]_, and the second time when support for zstd
compressor was added [#ZSTD]_.
I think this is actually the result of a rather opportunistic
implementation. The fault is that we chose to use an extension that
suggests the file is a regular compressed tarball.
When one detects that a file is xpak padded, it is trivial to feed the
decompressor just the relevant part of the datastream. The format
itself isn't bad, and doesn't rely on obscure behaviour.
Except if you don't have the proper tools installed. In which case
the 'opportunistic' behavior made it possible to extract the contents
without special tools... except when it actually happens not to work
anymore. Roy's reply indicates that there is actually interest in this
design feature.
Post by Fabian Groffen
Post by Michał Górny
3. **Placing metadata at the end of file makes partial fetches
complex.** While it is technically possible to obtain package
metadata remotely without fetching the whole package, it usually
requires e.g. 2-3 HTTP requests with rather complex driver. For
comparison, if metadata was placed at the beginning of the file,
early-terminated pipeline with a single fetch request would suffice.
I think this point needs to be quantified somewhat why it is so
important.
I may be wrong, but the average binpkg is small, <1MiB, bigger packages
are <50MiB.
So what is the gain to be saved here? A "few" MiBs for what operation
exactly? I say "few" because I know for some users this is actually not
just a blip before it's downloaded. So if this is possible to achieve,
in what scenarios is this going to be used (and is this often?).
Last I checked, Gentoo aimed to support more users than the 'majority'
of people with high-throughput Internet access. If there's no cost
in doing things better, why not do them better?
Post by Fabian Groffen
Post by Michał Górny
4. **Extending the format with OpenPGP signatures is non-trivial.**
Depending on the implementation details, it either requires fetching
additional detached signature, breaking backwards compatibility or
introducing more custom logic to reassemble OpenPGP packets.
I think one could add an extra key to the xpak that holds a gpg sig or
something. Perhaps this point is better phrased as that current binpkgs
don't have any validation options defined.
...which extra key would mean that the two disjoint implementations
in use would need more custom code that extracts the signature,
reconstructs signed data for verification and verifies it. Or, in other
words, that user needs even more custom tooling to manually verify
the package he just fetched.
Post by Fabian Groffen
Post by Michał Górny
5. **Metadata is not compressed.** This is not a significant problem,
it is just listed for completeness.
Goals for a new container format
--------------------------------
1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.
2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** In particular, it is unacceptable
to create new binary formats.
I take this as your personal opinion. I don't quite get why it is
unacceptable to create a new binary format though. In particular when
you're looking for efficiency, such format could serve your purposes.
As long as it's clearly defined, I don't see the problem with a binary
format either.
Could you add why it is you think binary formats are unacceptable here?
Because custom binary formats require specialized tooling, and are
a royal PITA when the user wants to do something that the author of
the specialized tooling just happened not to think worthwhile, or when
the tooling is not available for some reason. And before you ask really
silly questions, yes, I did fight binary packages with a hex editor
at some point.

The most trivial case is an attempted recovery of a broken system.
If you don't have Portage working and don't have portage-utils
installed, do you really prefer a custom format which will require you
to fetch and compile special tools? Or one that can be processed
with tools you're quite likely to have on every system, like tar?
Post by Fabian Groffen
Post by Michał Górny
3. **The file format should provide for partial fetching of binary
packages.** It should be possible to easily fetch and read
the package metadata without having to download the whole package.
Like above, what is the use-case here? Why would you want this? I
think I'm missing something here.
Does this harm anything? Even if there's little real use for this, is
there any harm in supporting it? Are we supposed to do things the other
way around with no benefit just because you don't see any real use for
it?
Post by Fabian Groffen
Post by Michał Górny
4. **The file format must provide support for OpenPGP signatures.**
Preferably, it should use standard OpenPGP message formats.
5. **The file format must allow for efficient metadata updates.**
In particular, it should be possible to update the metadata without
having to recompress package files.
6. **The file format should account for easy recognition both through
filename and through contents.** Preferably, it should have distinct
features making it possible to detect it via file(1).
7. **The file format should allow for metadata compression.**
8. **The file format should make future extensions easily possible
without breaking backwards compatibility.**
--
Best regards,
Michał Górny
Fabian Groffen
2018-11-18 11:00:48 UTC
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
Problems with the current binary package format
-----------------------------------------------
The following problems were identified with the package format currently
in use:
1. **The packages rely on custom binary archive format to store
metadata.** It is entirely Gentoo invented, and requires dedicated
tooling to work with it. In fact, the reference implementation
in Portage does not even include a CLI tool to work with tbz2
packages; an unofficial implementation is provided as part
of portage-utils toolkit [#PORTAGE-UTILS]_.
I think you should rewrite this section to the argument that the
metadata is hard to edit, and that there is only one tool to do so
(except a python interface from Portage?).
On a separate note, I don't think portage-utils can be considered
"unofficial", it is a Gentoo official project as far as I am aware.
In this context, Portage is 'official'. Portage-utils is a project
that's developed entirely separately from Portage and doesn't use
Portage APIs but instead reinvents everything. As such, it is easy for
the two to go out of sync. Or for one of them to have bugs that
the other one doesn't have (say, with endianness).
I'm not sure if it's actually true, I was under the impression the same
author(s) worked on the Portage as well as portage-utils code. Anyway,
aren't quickpkg and emerge enough from a user's perspective?
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
2. **The format relies on obscure compressor feature of ignoring
trailing garbage**. While this behavior is traditionally implemented
by many compressors, the original reasons for it have become long
irrelevant and it is not surprising that new compressors do not
once when users replaced bzip2 with parallel-capable pbzip2
implementation [#PBZIP2]_, and the second time when support for zstd
compressor was added [#ZSTD]_.
I think this is actually the result of a rather opportunistic
implementation. The fault is that we chose to use an extension that
suggests the file is a regular compressed tarball.
When one detects that a file is xpak padded, it is trivial to feed the
decompressor just the relevant part of the datastream. The format
itself isn't bad, and doesn't rely on obscure behaviour.
Except if you don't have the proper tools installed. In which case
the 'opportunistic' behavior made it possible to extract the contents
without special tools... except when it actually happens not to work
anymore. Roy's reply indicates that there is actually interest in this
design feature.
Your point is that the format is broken (== relies on obscure compressor
feature). My point is that the format simply requires a special tool.
The fact that we prefer to use existing tools doesn't imply in any way
that the format is broken to me.
I think you should rewrite your point to mention that you don't want to
use a tool that doesn't exist in @system (?) to unpack a binpkg. My
guess is that you could use some head/tail magic in a script if the
trailing block is upsetting the decompressor.
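
The "head/tail magic" mentioned above can be sketched in Python. This is only an illustration under assumptions: it takes the trailer layout to be the one described in xpak(5), where the file ends with a 4-byte big-endian length of the XPAK blob followed by the literal `STOP`, and the blob itself is delimited by `XPAKPACK` and `XPAKSTOP`. The function name `split_tbz2` is hypothetical, and the sample bytes are synthetic rather than a real package.

```python
import struct


def split_tbz2(blob: bytes):
    """Split a tbz2-style blob into (compressed_archive, xpak_blob).

    Assumed layout: <archive><xpak><4-byte big-endian len(xpak)>"STOP"
    """
    if len(blob) < 8 or blob[-4:] != b"STOP":
        raise ValueError("no tbz2 trailer found")
    (xpak_len,) = struct.unpack(">I", blob[-8:-4])
    xpak_start = len(blob) - 8 - xpak_len
    if xpak_start < 0:
        raise ValueError("XPAK length exceeds file size")
    xpak = blob[xpak_start:len(blob) - 8]
    if not (xpak.startswith(b"XPAKPACK") and xpak.endswith(b"XPAKSTOP")):
        raise ValueError("XPAK markers not found")
    # Everything before the XPAK blob is the compressed tarball, which can
    # now be fed to any decompressor without trailing data.
    return blob[:xpak_start], xpak


# Synthetic example: an empty XPAK (index and data lengths both zero).
archive = b"compressed-tar-bytes"
xpak = b"XPAKPACK" + struct.pack(">II", 0, 0) + b"XPAKSTOP"
sample = archive + xpak + struct.pack(">I", len(xpak)) + b"STOP"
```

Calling `split_tbz2(sample)` returns the archive bytes and the XPAK blob separately; only the first part would be passed on to the decompressor.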

I'm not saying this may look ugly, I'm just saying that your point seems
biased.
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
3. **Placing metadata at the end of file makes partial fetches
complex.** While it is technically possible to obtain package
metadata remotely without fetching the whole package, it usually
requires e.g. 2-3 HTTP requests with rather complex driver. For
comparison, if metadata was placed at the beginning of the file,
early-terminated pipeline with a single fetch request would suffice.
I think this point needs to be quantified somewhat why it is so
important.
I may be wrong, but the average binpkg is small, <1MiB, bigger packages
are <50MiB.
So what is the gain to be saved here? A "few" MiBs for what operation
exactly? I say "few" because I know for some users this is actually not
just a blip before it's downloaded. So if this is possible to achieve,
in what scenarios is this going to be used (and is this often?).
Last I checked, Gentoo aimed to support more users than the 'majority'
of people with high-throughput Internet access. If there's no cost
in doing things better, why not do them better?
You didn't address the critical question, but instead just repeated what
I said.
So again, why do you need to read just the metadata?
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
4. **Extending the format with OpenPGP signatures is non-trivial.**
Depending on the implementation details, it either requires fetching
additional detached signature, breaking backwards compatibility or
introducing more custom logic to reassemble OpenPGP packets.
I think one could add an extra key to the xpak that holds a gpg sig or
something. Perhaps this point is better phrased as that current binpkgs
don't have any validation options defined.
...which extra key would mean that the two disjoint implementations
in use would need more custom code that extracts the signature,
reconstructs signed data for verification and verifies it. Or, in other
words, that user needs even more custom tooling to manually verify
the package he just fetched.
I don't see your point. If you define what the package format looks
like, you just need to implement that. There is no point in having a
binpkg format that Portage doesn't implement properly. Portage is
well-equipped to implement any of the approaches. A user should use
Portage to install a package. A poweruser could use a separate tool for
a scenario where he/she's in charge of keeping things sane. Relevancy?

I just don't agree that extending the format is non-trivial. You seem
to have no arguments other than adding "custom logic", which is what you
eventually also do in the reference implementation of your new approach.
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
5. **Metadata is not compressed.** This is not a significant problem,
it is just listed for completeness.
Goals for a new container format
--------------------------------
1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.
2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** In particular, it is unacceptable
to create new binary formats.
I take this as your personal opinion. I don't quite get why it is
unacceptable to create a new binary format though. In particular when
you're looking for efficiency, such format could serve your purposes.
As long as it's clearly defined, I don't see the problem with a binary
format either.
Could you add why it is you think binary formats are unacceptable here?
Because custom binary formats require specialized tooling, and are
a royal PITA when the user wants to do something that the author of
specialized tooling just happened not to think worthwhile, or when
the tooling is not available for some reason. And before you ask really
silly questions, yes, I did fight binary packages over hex editor
at some point.
Which I still don't understand, to be frank. I think even Portage
exposes python APIs to get to the data.
Post by Michał Górny
The most trivial case is an attempted recovery of a broken system.
If you don't have Portage working and don't have portage-utils
installed, do you really prefer a custom format which will require you
to fetch and compile special tools? Or is one that can be processed
with tools you're quite likely to have on every system, like tar?
Well, I think the idea behind the original binpkg format was to use tar
directly on the files in emergency scenarios like these...
The assumption was bzip2 decompressor and tar being available.
I think it is an example of how you add something, while still allowing
to fallback on existing tools.
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
3. **The file format should provide for partial fetching of binary
packages.** It should be possible to easily fetch and read
the package metadata without having to download the whole package.
Like above, what is the use-case here? Why would you want this? I
think I'm missing something here.
Does this harm anything? Even if there's little real use for this, is
there any harm in supporting it? Are we supposed to do things the other
way around with no benefit just because you don't see any real use for
it?
Well, you make a huge point out of it. And if it isn't used, then why
bother so much about it. Then it just looks like you want to use it as
an argument to get rid of something you just don't like.

In my opinion you better just say "hey I would like to implement this
binpkg format, because I think it would be easier to support with
minimal tools since it doesn't have custom features". I would have
nothing against that. Simple and elegant is nice, you don't need to
invent arguments for that, in my opinion.

Fabian
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
4. **The file format must provide support for OpenPGP signatures.**
Preferably, it should use standard OpenPGP message formats.
5. **The file format must allow for efficient metadata updates.**
In particular, it should be possible to update the metadata without
having to recompress package files.
6. **The file format should account for easy recognition both through
filename and through contents.** Preferably, it should have distinct
features making it possible to detect it via file(1).
7. **The file format should allow for metadata compression.**
8. **The file format should make future extensions easily possible
without breaking backwards compatibility.**
--
Best regards,
Michał Górny
--
Fabian Groffen
Gentoo on a different level
Kent Fredric
2018-11-19 20:46:45 UTC
On Sun, 18 Nov 2018 12:00:48 +0100
Post by Fabian Groffen
Your point is that the format is broken (== relies on obscure compressor
feature). My point is that the format simply requires a special tool.
The fact that we prefer to use existing tools doesn't imply in any way
that the format is broken to me.
I think you should rewrite your point to mention that you don't want to
use a tool that doesn't exist in @system (?) to unpack a binpkg. My
guess is that you could use some head/tail magic in a script if the
trailing block is upsetting the decompressor.
The existing design to the best of my understanding poses problems when
it comes to adding new features, as the dependency on a "special tool"
becomes the bottleneck, as in order to add the new feature, the special
tool has to be adjusted to handle it, and potentially introduce serious
incompatible changes.

The alternative proposal stated in this pre-GLEP seems infinitely more
extensible, which means more room for 3rd-parties to add their own
features, while retaining basic portage interop.

For instance, I think a "nice" feature that could be added one day
would be the ability for the automated package builder to bundle:

- The ebuild that was used to build it
- All the eclasses that were used by the ebuild
- All the sources and patches that were used

thereby creating a fat bin/src hybrid, potentially allowing
rebuilding the exact same package with minor changes, independently of
portage repository changes.

And this may be useful for people who don't want the option set in the
binary build, but otherwise want the exact same material in a different
configuration.

In terms of user-friendliness, this could empower Gentoo in new ways,
in ways that compete with existing binary distributions wherein
upstreams publish .deb files for people to "just install".

Presently, the amount of additional hand-holding required (namely:
install this overlay, make sure you sync it right, etc, etc, etc) makes
it a little too "hands on" for some.

Now, I'm not saying Gentoo *should* do exactly this, but I like that
this approach gives us the *potential* to do this, and as a result,
some downstream derivatives of Gentoo may be motivated to do something
like this, providing usable stand-alone bin-packages which interop nicely
with standard Gentoo installations, while also working nicely with
downstream customizations.

Achieving this as it is requires downstream to develop their own
format, which is likely not going to work with standard Gentoo installs.
Michał Górny
2018-11-21 09:33:18 UTC
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
Problems with the current binary package format
-----------------------------------------------
The following problems were identified with the package format currently
1. **The packages rely on custom binary archive format to store
metadata.** It is entirely Gentoo invented, and requires dedicated
tooling to work with it. In fact, the reference implementation
in Portage does not even include a CLI tool to work with tbz2
packages; an unofficial implementation is provided as part
of portage-utils toolkit [#PORTAGE-UTILS]_.
I think you should rewrite this section to the argument that the
metadata is hard to edit, and that there is only one tool to do so
(except a python interface from Portage?).
On a separate note, I don't think portage-utils can be considered
"unofficial", it is a Gentoo official project as far as I am aware.
In this context, Portage is 'official'. Portage-utils is a project
that's developed entirely separately from Portage and doesn't use
Portage APIs but instead reinvents everything. As such, it is easy for
the two to go out of sync. Or for one of them to have bugs that
the other one doesn't have (say, with endianness).
I'm not sure if it's actually true, I was under the impression the same
author(s) worked on the Portage as well as portage-utils code. Anyway,
aren't quickpkg and emerge enough from a user's perspective?
Gentoo users have a wide perspective. Assuming that you can think of
all things the users need and you don't need to care beyond that
is plain wrong and results in Windows.
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
2. **The format relies on obscure compressor feature of ignoring
trailing garbage**. While this behavior is traditionally implemented
by many compressors, the original reasons for it have become long
irrelevant and it is not surprising that new compressors do not
once when users replaced bzip2 with parallel-capable pbzip2
implementation [#PBZIP2]_, and the second time when support for zstd
compressor was added [#ZSTD]_.
I think this is actually the result of a rather opportunistic
implementation. The fault is that we chose to use an extension that
suggests the file is a regular compressed tarball.
When one detects that a file is xpak padded, it is trivial to feed the
decompressor just the relevant part of the datastream. The format
itself isn't bad, and doesn't rely on obscure behaviour.
Except if you don't have the proper tools installed. In which case
the 'opportunistic' behavior made it possible to extract the contents
without special tools... except when it actually happens not to work
anymore. Roy's reply indicates that there is actually interest in this
design feature.
Your point is that the format is broken (== relies on obscure compressor
feature). My point is that the format simply requires a special tool.
The fact that we prefer to use existing tools doesn't imply in any way
that the format is broken to me.
I think you should rewrite your point to mention that you don't want to
use a tool that doesn't exist in @system (?) to unpack a binpkg. My
guess is that you could use some head/tail magic in a script if the
trailing block is upsetting the decompressor.
I'm not saying this may look ugly, I'm just saying that your point seems
biased.
I've spent a significant effort rewriting those points to make it clear
what the problem is, and separating it from other changes 'worth doing
while we're changing stuff'. Hope that satisfies your nitpicking.
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
3. **Placing metadata at the end of file makes partial fetches
complex.** While it is technically possible to obtain package
metadata remotely without fetching the whole package, it usually
requires e.g. 2-3 HTTP requests with rather complex driver. For
comparison, if metadata was placed at the beginning of the file,
early-terminated pipeline with a single fetch request would suffice.
I think this point needs to be quantified somewhat why it is so
important.
I may be wrong, but the average binpkg is small, <1MiB, bigger packages
are <50MiB.
So what is the gain to be saved here? A "few" MiBs for what operation
exactly? I say "few" because I know for some users this is actually not
just a blip before it's downloaded. So if this is possible to achieve,
in what scenarios is this going to be used (and is this often?).
Last I checked, Gentoo aimed to support more users than the 'majority'
of people with high-throughput Internet access. If there's no cost
in doing things better, why not do them better?
You didn't address the critical question, but instead just repeated what
I said.
So again, why do you need to read just the metadata?
The original idea was to provide the ability of indexing remote packages
without having a server-side cache available (or up-to-date). In order
to do that, the package manager would need to fetch the metadata of all
packages (but there's no necessity in fetching the whole packages).
However, that's merely a possible future idea. It's not worth debating
today.
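
To illustrate why leading metadata allows an early-terminated fetch, here is a minimal Python sketch using the standard `tarfile` module in stream mode. The member names `metadata` and `image`, the helper names, and the `CountingReader` class (which stands in for a network stream) are all hypothetical; the point is only that the reader stops pulling bytes once the first member has been read.

```python
import io
import tarfile


class CountingReader:
    """Stand-in for a network stream; counts how many bytes were pulled."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)
        self.consumed = 0

    def read(self, size=-1):
        chunk = self._buf.read(size)
        self.consumed += len(chunk)
        return chunk


def build_package(metadata: bytes, payload: bytes) -> bytes:
    """Build an uncompressed tar with the metadata member placed first."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as tar:
        for name, data in (("metadata", metadata), ("image", payload)):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return out.getvalue()


def read_leading_metadata(stream) -> bytes:
    """Read only the first member of a tar stream, then stop."""
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        first = tar.next()
        return tar.extractfile(first).read()


package = build_package(b"CATEGORY=app-misc\n", b"\0" * (1024 * 1024))
reader = CountingReader(package)
metadata = read_leading_metadata(reader)
```

After this runs, `reader.consumed` is well below `len(package)`, because the payload member is never read. With trailing metadata, the same result would need a seek or additional range requests.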

Today I really understood the point of avoiding premature optimization.
Even if the change is practically zero-cost and harmless (as it's simply
reordering files), it's going to cost you a lot of time because someone
will keep nitpicking on it, even though any other order will not change
anything.
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
4. **Extending the format with OpenPGP signatures is non-trivial.**
Depending on the implementation details, it either requires fetching
additional detached signature, breaking backwards compatibility or
introducing more custom logic to reassemble OpenPGP packets.
I think one could add an extra key to the xpak that holds a gpg sig or
something. Perhaps this point is better phrased as that current binpkgs
don't have any validation options defined.
...which extra key would mean that the two disjoint implementations
in use would need more custom code that extracts the signature,
reconstructs signed data for verification and verifies it. Or, in other
words, that user needs even more custom tooling to manually verify
the package he just fetched.
I don't see your point. If you define what the package format looks
like, you just need to implement that. There is no point in having a
binpkg format that Portage doesn't implement properly. Portage is
well-equipped to implement any of the approaches. A user should use
Portage to install a package. A poweruser could use a separate tool for
a scenario where he/she's in charge of keeping things sane. Relevancy?
I just don't agree that extending the format is non-trivial. You seem
to have no arguments other than adding "custom logic", which is what you
eventually also do in the reference implementation of your new approach.
The difference is that my format is transparent. You file(1) it, you
see a .tar archive. You extract the archive, you see subarchives
and .sig which are widely recognized. You don't have to read the spec,
you don't have to get special tools. If you ever verified detached
signature, you know how to proceed. If you didn't, you'll learn
something you can reuse.
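
The transparency argument can be made concrete with a short Python sketch. It checks for the `ustar` magic that tar headers carry at byte offset 257 (one of the signatures file(1) recognizes) and then lists the members with the standard `tarfile` module. The member names used here (`metadata`, `image`, and their `.sig` companions) and the helper names are hypothetical placeholders, not the names defined by the specification.

```python
import io
import tarfile


def looks_like_tar(blob: bytes) -> bool:
    """Check for the 'ustar' magic at offset 257 of the first header."""
    return blob[257:262] == b"ustar"


def list_members(blob: bytes):
    """Return the member names of a tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        return tar.getnames()


def build_demo_package() -> bytes:
    """Build a synthetic archive with sub-archives and detached signatures."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as tar:
        for name in ("metadata", "metadata.sig", "image", "image.sig"):
            data = b"placeholder"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return out.getvalue()


demo = build_demo_package()
```

Here `looks_like_tar(demo)` is true and `list_members(demo)` returns the four names in the order they were added, using only standard tooling.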

Now, implementing signatures on top of XPAK is more effort, and yields
something that is more fragile and in the end doesn't benefit anyone.
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
5. **Metadata is not compressed.** This is not a significant problem,
it is just listed for completeness.
Goals for a new container format
--------------------------------
1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.
2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** In particular, it is unacceptable
to create new binary formats.
I take this as your personal opinion. I don't quite get why it is
unacceptable to create a new binary format though. In particular when
you're looking for efficiency, such format could serve your purposes.
As long as it's clearly defined, I don't see the problem with a binary
format either.
Could you add why it is you think binary formats are unacceptable here?
Because custom binary formats require specialized tooling, and are
a royal PITA when the user wants to do something that the author of
specialized tooling just happened not to think worthwhile, or when
the tooling is not available for some reason. And before you ask really
silly questions, yes, I did fight binary packages over hex editor
at some point.
Which I still don't understand, to be frank. I think even Portage
exposes python APIs to get to the data.
Compare the time needed to make a trivial (but unforeseen) change
on a format that's transparent vs a format that requires you to learn
its spec and/or API, write a program and debug it.
Post by Fabian Groffen
Post by Michał Górny
The most trivial case is an attempted recovery of a broken system.
If you don't have Portage working and don't have portage-utils
installed, do you really prefer a custom format which will require you
to fetch and compile special tools? Or is one that can be processed
with tools you're quite likely to have on every system, like tar?
Well, I think the idea behind the original binpkg format was to use tar
directly on the files in emergency scenarios like these...
The assumption was bzip2 decompressor and tar being available.
I think it is an example of how you add something, while still allowing
to fallback on existing tools.
Except progress in compressors has made it work less and less reliably.
It's mostly an example how to be *clever*. However, being clever
usually doesn't pay off in the long term, compared to doing things *in a
simple way*.
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
3. **The file format should provide for partial fetching of binary
packages.** It should be possible to easily fetch and read
the package metadata without having to download the whole package.
Like above, what is the use-case here? Why would you want this? I
think I'm missing something here.
Does this harm anything? Even if there's little real use for this, is
there any harm in supporting it? Are we supposed to do things the other
way around with no benefit just because you don't see any real use for
it?
Well, you make a huge point out of it. And if it isn't used, then why
bother so much about it. Then it just looks like you want to use it as
an argument to get rid of something you just don't like.
In my opinion you better just say "hey I would like to implement this
binpkg format, because I think it would be easier to support with
minimal tools since it doesn't have custom features". I would have
nothing against that. Simple and elegant is nice, you don't need to
invent arguments for that, in my opinion.
The spec is now more focused on that.
Post by Fabian Groffen
Fabian
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
4. **The file format must provide support for OpenPGP signatures.**
Preferably, it should use standard OpenPGP message formats.
5. **The file format must allow for efficient metadata updates.**
In particular, it should be possible to update the metadata without
having to recompress package files.
6. **The file format should account for easy recognition both through
filename and through contents.** Preferably, it should have distinct
features making it possible to detect it via file(1).
7. **The file format should allow for metadata compression.**
8. **The file format should make future extensions easily possible
without breaking backwards compatibility.**
--
Best regards,
Michał Górny
--
Best regards,
Michał Górny
Fabian Groffen
2018-11-21 10:45:54 UTC
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
2. **The format relies on obscure compressor feature of ignoring
trailing garbage**. While this behavior is traditionally implemented
by many compressors, the original reasons for it have become long
irrelevant and it is not surprising that new compressors do not
once when users replaced bzip2 with parallel-capable pbzip2
implementation [#PBZIP2]_, and the second time when support for zstd
compressor was added [#ZSTD]_.
I think this is actually the result of a rather opportunistic
implementation. The fault is that we chose to use an extension that
suggests the file is a regular compressed tarball.
When one detects that a file is xpak padded, it is trivial to feed the
decompressor just the relevant part of the datastream. The format
itself isn't bad, and doesn't rely on obscure behaviour.
Except if you don't have the proper tools installed. In which case
the 'opportunistic' behavior made it possible to extract the contents
without special tools... except when it actually happens not to work
anymore. Roy's reply indicates that there is actually interest in this
design feature.
Your point is that the format is broken (== relies on obscure compressor
feature). My point is that the format simply requires a special tool.
The fact that we prefer to use existing tools doesn't imply in any way
that the format is broken to me.
I think you should rewrite your point to mention that you don't want to
use a tool that doesn't exist in @system (?) to unpack a binpkg. My
guess is that you could use some head/tail magic in a script if the
trailing block is upsetting the decompressor.
I'm not saying this may look ugly, I'm just saying that your point seems
biased.
I've spent a significant effort rewriting those points to make it clear
what the problem is, and separating it from other changes 'worth doing
while we're changing stuff'. Hope that satisfies your nitpicking.
Yes it does, thank you.
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
3. **Placing metadata at the end of file makes partial fetches
complex.** While it is technically possible to obtain package
metadata remotely without fetching the whole package, it usually
requires e.g. 2-3 HTTP requests with rather complex driver. For
comparison, if metadata was placed at the beginning of the file,
early-terminated pipeline with a single fetch request would suffice.
I think this point needs to be quantified somewhat why it is so
important.
I may be wrong, but the average binpkg is small, <1MiB, bigger packages
are <50MiB.
So what is the gain to be saved here? A "few" MiBs for what operation
exactly? I say "few" because I know for some users this is actually not
just a blip before it's downloaded. So if this is possible to achieve,
in what scenarios is this going to be used (and is this often?).
Last I checked, Gentoo aimed to support more users than the 'majority'
of people with high-throughput Internet access. If there's no cost
in doing things better, why not do them better?
You didn't address the critical question, but instead just repeated what
I said.
So again, why do you need to read just the metadata?
The original idea was to provide the ability of indexing remote packages
without having a server-side cache available (or up-to-date). In order
to do that, the package manager would need to fetch the metadata of all
packages (but there's no necessity in fetching the whole packages).
However, that's merely a possible future idea. It's not worth debating
today.
Today I really understood the point of avoiding premature optimization.
Even if the change is practically zero-cost and harmless (as it's simply
reordering files), it's going to cost you a lot of time because someone
will keep nitpicking on it, even though any other order will not change
anything.
Perhaps next time don't put as much emphasis on it. I can see now what
you aim for, but it simply raises more questions and concerns to me than
it resolves. There is nothing wrong with putting in such future
possibility though, if easily possible and not colliding with anything
else.
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
4. **Extending the format with OpenPGP signatures is non-trivial.**
Depending on the implementation details, it either requires fetching
additional detached signature, breaking backwards compatibility or
introducing more custom logic to reassemble OpenPGP packets.
I think one could add an extra key to the xpak that holds a gpg sig or
something. Perhaps this point is better phrased as that current binpkgs
don't have any validation options defined.
...which extra key would mean that the two disjoint implementations
in use would need more custom code that extracts the signature,
reconstructs signed data for verification and verifies it. Or, in other
words, that user needs even more custom tooling to manually verify
the package he just fetched.
I don't see your point. If you define what the package format looks
like, you just need to implement that. There is no point in having a
binpkg format that Portage doesn't implement properly. Portage is
well-equipped to implement any of the approaches. A user should use
Portage to install a package. A poweruser could use a separate tool for
a scenario where he/she's in charge of keeping things sane. Relevancy?
I just don't agree that extending the format is non-trivial. You seem
to have no arguments other than adding "custom logic", which is what you
eventually also do in the reference implementation of your new approach.
The difference is that my format is transparent. You file(1) it, you
see a .tar archive. You extract the archive, you see subarchives
and .sig which are widely recognized. You don't have to read the spec,
you don't have to get special tools. If you ever verified detached
signature, you know how to proceed. If you didn't, you'll learn
something you can reuse.
Totally agree.
Post by Michał Górny
Now, implementing signatures on top of XPAK is more effort, and yields
something that is more fragile and in the end doesn't benefit anyone.
I agree this would be more effort, and it'd get complicated in some aspects.
Whether no one benefits from it depends a bit on whether XPAK could
potentially give you performance boosts or memory/storage savings.
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
5. **Metadata is not compressed.** This is not a significant problem,
it is just listed for completeness.
Goals for a new container format
--------------------------------
1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.
2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** In particular, it is unacceptable
to create new binary formats.
I take this as your personal opinion. I don't quite get why it is
unacceptable to create a new binary format though. In particular when
you're looking for efficiency, such format could serve your purposes.
As long as it's clearly defined, I don't see the problem with a binary
format either.
Could you add why it is you think binary formats are unacceptable here?
Because custom binary formats require specialized tooling, and are
a royal PITA when the user wants to do something that the author of
specialized tooling just happened not to think worthwhile, or when
the tooling is not available for some reason. And before you ask really
silly questions, yes, I did fight binary packages over hex editor
at some point.
Which I still don't understand, to be frank. I think even Portage
exposes python APIs to get to the data.
Compare the time needed to make a trivial (but unforeseen) change
on a format that's transparent vs a format that requires you to learn
its spec and/or API, write a program and debug it.
I was under the impression you could unpack a tbz2 into data and xpak,
then unpack both, modify the contents with an editor or whatever, and
then pack the whole stuff back into a tbz2 again. This can be done
worst case scenario by emerge -k <pkg>, modifying the vdb and quickpkg
<pkg> afterwards.
I know that with portage-utils you can do this easily with the qtbz2 and
qxpak commands. No need to do anything with a hex editor, or know
anything about how it's done.
Obvious advantage of your approach is that you don't need q* tools, but
can use tar instead. The editing is as trivial though. In your case
you need a special procedure to reconstruct the binpkg should you want
to keep your special properties (label, order) which equates to q* tools
somewhat.
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
The most trivial case is an attempted recovery of a broken system.
If you don't have Portage working and don't have portage-utils
installed, do you really prefer a custom format which will require you
to fetch and compile special tools? Or is one that can be processed
with tools you're quite likely to have on every system, like tar?
Well, I think the idea behind the original binpkg format was to use tar
directly on the files in emergency scenarios like these...
The assumption was bzip2 decompressor and tar being available.
I think it is an example of how you add something, while still allowing
to fallback on existing tools.
Except progress in compressors has made it work less and less reliably.
It's mostly an example how to be *clever*. However, being clever
usually doesn't pay off in the long term, compared to doing things *in a
simple way*.
We agree it is hackish, and we agree we can do without. You simply
exaggerate the problem, IMO, which mostly isn't there, because it works
fine today. It can also be solved today using shell tools.

% head -c `grep -abo 'XPAKPACK' $EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2 | sed 's/:.*$//'` $EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2 | tar -jxf -

results in no warnings/errors from bzip about trailing garbage, possible
thanks to the spec being smart enough about this.

Not having to do this, when under stress and pressure to restore a
system to get it back into production, is a plus. Though, in that
scenario the trailing garbage warning wouldn't have been that bad
either.
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
3. **The file format should provide for partial fetching of binary
packages.** It should be possible to easily fetch and read
the package metadata without having to download the whole package.
Like above, what is the use-case here? Why would you want this? I
think I'm missing something here.
Does this harm anything? Even if there's little real use for this, is
there any harm in supporting it? Are we supposed to do things the other
way around with no benefit just because you don't see any real use for
it?
Well, you make a huge point out of it. And if it isn't used, then why
bother so much about it. Then it just looks like you want to use it as
an argument to get rid of something you just don't like.
In my opinion you better just say "hey I would like to implement this
binpkg format, because I think it would be easier to support with
minimal tools since it doesn't have custom features". I would have
nothing against that. Simple and elegant is nice, you don't need to
invent arguments for that, in my opinion.
The spec is now more focused on that.
Thank you, much appreciated.

Fabian
--
Fabian Groffen
Gentoo on a different level
Michał Górny
2018-11-21 11:20:32 UTC
Permalink
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
5. **Metadata is not compressed.** This is not a significant problem,
it is just listed for completeness.
Goals for a new container format
--------------------------------
1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.
2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** In particular, it is unacceptable
to create new binary formats.
I take this as your personal opinion. I don't quite get why it is
unacceptable to create a new binary format though. In particular when
you're looking for efficiency, such format could serve your purposes.
As long as it's clearly defined, I don't see the problem with a binary
format either.
Could you add why it is you think binary formats are unacceptable here?
Because custom binary formats require specialized tooling, and are
a royal PITA when the user wants to do something that the author of
specialized tooling just happened not to think worthwhile, or when
the tooling is not available for some reason. And before you ask really
silly questions, yes, I did fight binary packages over hex editor
at some point.
Which I still don't understand, to be frank. I think even Portage
exposes python APIs to get to the data.
Compare the time needed to make a trivial (but unforeseen) change
on a format that's transparent vs a format that requires you to learn
its spec and/or API, write a program and debug it.
I was under the impression you could unpack a tbz2 into data and xpak,
then unpack both, modify the contents with an editor or whatever, and
then pack the whole stuff back into a tbz2 again. This can be done
worst case scenario by emerge -k <pkg>, modifying the vdb and quickpkg
<pkg> afterwards.
In the described example, the whole necessity of modifying the binary
package arises from it being broken, therefore unsuitable for
'emerge -k'.
Post by Fabian Groffen
I know that with portage-utils you can do this easily with the qtbz2 and
qxpak commands. No need to do anything with a hex editor, or know
anything about how it's done.
Actually, you need to:

a. know that portage-utils has the appropriate tools (it's non-obvious),

b. know how to use portage-utils.

This is non-obvious. It took me a while to figure out that I need to
use qtbz2 before using qxpak (why would it work only on split data when
the format is explicitly written to be used on top of compressed
archive?!).
Post by Fabian Groffen
Obvious advantage of your approach is that you don't need q* tools, but
can use tar instead. The editing is as trivial though. In your case
you need a special procedure to reconstruct the binpkg should you want
to keep your special properties (label, order) which equates to q* tools
somewhat.
Except you don't need to keep them. The spec is quite explicit that
they're optimizations and that the package must work even if they're
lost as a part of editing exercise.
Post by Fabian Groffen
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
The most trivial case is an attempted recovery of a broken system.
If you don't have Portage working and don't have portage-utils
installed, do you really prefer a custom format which will require you
to fetch and compile special tools? Or is one that can be processed
with tools you're quite likely to have on every system, like tar?
Well, I think the idea behind the original binpkg format was to use tar
directly on the files in emergency scenarios like these...
The assumption was bzip2 decompressor and tar being available.
I think it is an example of how you add something, while still allowing
to fallback on existing tools.
Except progress in compressors has made it work less and less reliably.
It's mostly an example how to be *clever*. However, being clever
usually doesn't pay off in the long term, compared to doing things *in a
simple way*.
We agree it is hackish, and we agree we can do without. You simply
exaggerate the problem, IMO, which mostly isn't there, because it works
fine today. It can also be solved today using shell tools.
% head -c `grep -abo 'XPAKPACK' $EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2 | sed 's/:.*$//'` $EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2 | tar -jxf -
results in no warnings/errors from bzip about trailing garbage, possible
thanks to the spec being smart enough about this.
Well, you aren't going to call that simple, are you? Plus, I think your
solution would fail if bzip2 output just happened to contain 'XPAKPACK'
string. Not saying it's likely to happen but relying on fixed strings
not happening accidentally is not good design.
--
Best regards,
Michał Górny
Andrey Utkin
2018-11-26 21:13:53 UTC
Permalink
Post by Fabian Groffen
We agree it is hackish, and we agree we can do without. You simply
exaggerate the problem, IMO, which mostly isn't there, because it works
fine today. It can also be solved today using shell tools.
I am sad that you don't see it as a productivity impediment that the
user is required to know the custom tooling to do even such a trivial
non-standard action as manual extraction.

Maybe I will make myself look bad by admitting this, but I'm not meeting
your expectations. I have used Gentoo for ~11 years, and for about one
year I have been using my private binpkgs distributed to all my machines
(i.e. I have read the binary package guide a fair number of times, but I
stopped rereading it once my needs were satisfied). When in need, I
still reached for trusty tar, and I did not even know the names of the
special tools (a toolchain?) qtbz2 and qxpak.

Just a few days ago I messed with binpkgs for investigation purposes. I
just wanted to extract a few somewhere (definitely not into the system
root), and read a core dump with GDB, asking it to use those extracted
files for debug symbols.

Of course I used `tar xaf`, because what I know is that it's honest tbz2
just with metadata appended.

# tar xaf boost-1.65.0.tbz2

bzip2: (stdin): trailing garbage after EOF ignored

Exit code is 0.
But the notice is annoying (on subconscious level), because Silence Is
Golden - "when a program has nothing interesting or surprising to say,
it should shut up".
Post by Fabian Groffen
% head -c `grep -abo 'XPAKPACK' $EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2 | sed 's/:.*$//'` $EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2 | tar -jxf -
results in no warnings/errors from bzip about trailing garbage, possible
thanks to the spec being smart enough about this.
Thanks, this is a very concise **custom tool** to handle current binpkg
format.
Post by Fabian Groffen
Not having to do this, when under stress and pressure to restore a
system to get it back into production, is a plus. Though, in that
scenario the trailing garbage warning wouldn't have been that bad
either.
When under stress and pressure, the irrelevant warning is not bad?
I am sure it is really bad for the operator's attention.
Fabian Groffen
2018-11-27 08:32:38 UTC
Permalink
Post by Andrey Utkin
Post by Fabian Groffen
We agree it is hackish, and we agree we can do without. You simply
exaggerate the problem, IMO, which mostly isn't there, because it works
fine today. It can also be solved today using shell tools.
I am sad that you don't see it as a productivity impediment that the
user is required to know the custom tooling to do even such a trivial
non-standard action as manual extraction.
Huh? tar -jxf doesn't do the trick for you?
Post by Andrey Utkin
Maybe I will make myself look bad by admitting this, but I'm not meeting
your expectations. I have used Gentoo for ~11 years, and for about one
year I have been using my private binpkgs distributed to all my machines
(i.e. I have read the binary package guide a fair number of times, but I
stopped rereading it once my needs were satisfied). When in need, I
still reached for trusty tar, and I did not even know the names of the
special tools (a toolchain?) qtbz2 and qxpak.
Just a few days ago I messed with binpkgs for investigation purposes. I
just wanted to extract a few somewhere (definitely not into the system
root), and read a core dump with GDB, asking it to use those extracted
files for debug symbols.
Of course I used `tar xaf`, because what I know is that it's honest tbz2
just with metadata appended.
# tar xaf boost-1.65.0.tbz2
bzip2: (stdin): trailing garbage after EOF ignored
Exit code is 0.
But the notice is annoying (on subconscious level), because Silence Is
Golden - "when a program has nothing interesting or surprising to say,
it should shut up".
You seem to contradict yourself. You didn't know the tools, yet you say
you needed to, to unpack the files. But you show here you just unpacked
the files without said knowledge.
Post by Andrey Utkin
Post by Fabian Groffen
% head -c `grep -abo 'XPAKPACK' $EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2 | sed 's/:.*$//'` $EPREFIX/usr/portage/packages/sys-apps/sed-4.5.tbz2 | tar -jxf -
results in no warnings/errors from bzip about trailing garbage, possible
thanks to the spec being smart enough about this.
Thanks, this is a very concise **custom tool** to handle current binpkg
format.
As is tar followed by tar. The obvious advantage of the latter is that
you don't get a warning which could trigger you into thinking something
is wrong. So, in my opinion, that is a better way of doing it compared
to the current way.
Post by Andrey Utkin
Post by Fabian Groffen
Not having to do this, when under stress and pressure to restore a
system to get it back into production, is a plus. Though, in that
scenario the trailing garbage warning wouldn't have been that bad
either.
When under stress and pressure, the irrelevant warning is not bad?
I am sure it is really bad for the operator's attention.
I've been using Gentoo binpkgs for a long while; I think it was
something like ~14 years ago that I used them extensively. Perhaps I'm
an exception, but back then I already knew there was an extra bit
attached to the tars, as did all my colleagues around me. The fact it
comes up now (as a surprise?) maybe means the knowledge has gone. So
good thing we're replacing it with something easier to infer from
inspecting it.

Fabian
--
Fabian Groffen
Gentoo on a different level
Roy Bamford
2018-11-18 11:04:28 UTC
Permalink
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
Problems with the current binary package format
[snip]
Post by Michał Górny
Post by Fabian Groffen
Post by Michał Górny
2. **The format relies on obscure compressor feature of ignoring
trailing garbage**. While this behavior is traditionally implemented
by many compressors, the original reasons for it have become long
irrelevant and it is not surprising that new compressors do not
support it. In particular, Portage already hit this problem
once when users replaced bzip2 with parallel-capable pbzip2
implementation [#PBZIP2]_, and the second time when support for zstd
compressor was added [#ZSTD]_.
I think this is actually the result of a rather opportunistic
implementation. The fault is that we chose to use an extension that
suggests the file is a regular compressed tarball.
When one detects that a file is xpak padded, it is trivial to feed the
decompressor just the relevant part of the datastream. The format
itself isn't bad, and doesn't rely on obscure behaviour.
Except if you don't have the proper tools installed. In which case
the 'opportunistic' behavior made it possible to extract the contents
without special tools... except when it actually happens not to work
anymore. Roy's reply indicates that there is actually interest in this
design feature.
[snip]

Team,

I used to post something like https://wiki.gentoo.org/wiki/Fix_My_Gentoo
with a link to Patrick's binhost on the forums every three or four months.
It made it worth writing that wiki page anyway.

We still get users removing elements of their toolchain or glibc from time
to time. The requirement that I didn't express very well is that it shall
be possible to install binary packages without the use of any
Gentoo-specific tooling.

The current tarball of tarballs proposal would satisfy that requirement.

It's unlikely that a custom binary format would. Of course, this being
Gentoo, someone would write a run-anywhere script that did the
unpicking. We already have deb2targz and rpm2targz. We have the
opportunity to design out binpkg2targz before it exists.
--
Regards,

Roy Bamford
(Neddyseagoon) a member of
elections
gentoo-ops
forum-mods
Michał Górny
2018-11-19 18:35:04 UTC
Permalink
Hi,
Post by Michał Górny
Here's a pre-GLEP draft based on the earlier discussion on gentoo-
portage-dev mailing list. The specification uses GLEP form as it
provides for cleanly specifying the motivation and rationale.
Changes in -r1: took the feedback into account and restructured
the motivation to point out the advantages of the existing format
and to focus on the two real issues: non-transparency and the
deficiencies of the possible OpenPGP implementations. Also added
a section on why there's no explicit version number.
Post by Michał Górny
rst: https://dev.gentoo.org/~mgorny/tmp/glep-0078.rst
html: https://dev.gentoo.org/~mgorny/tmp/glep-0078.html
---
GLEP: 9999
Title: Gentoo binary package container format
Author: Michał Górny <***@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-16
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are explained. Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed. The rationale for
the design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: header-oriented compressed .tar
format (used to hold package files) and trailer-oriented custom XPAK
format (used to hold metadata) [#MAN-XPAK]_. The format has already
been extended incompatibly twice.

The first time, support for storing multiple successive builds of
a binary package for a single ebuild version was added. This feature
relies on appending an additional hyphen followed by an integer to
the package filename. It is disabled by default (preserving backwards
compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats was
added. When a format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
the compression used. For backwards compatibility, Portage still
defaults to bzip2; the compression program can be switched using
the ``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies. In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.**
While this might seem unnecessary, it makes it easier for the user
to transfer binary packages without having to be concerned about
finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed
tarballs, most of the time.** With notable exceptions of historical
versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
can be extracted using regular tar utility with a compressor
implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed
without decompressing package contents.** This includes
the possibility of rewriting it (e.g. as a result of package moves)
without the necessity of repacking the files.


Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant
design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use
of binary-encoded file offsets and field lengths.** As such, it is
non-trivial to read or edit without specialized tools. Such tools
are currently implemented separately from the package manager,
as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature of
ignoring trailing garbage in compressed files**. While this is
implemented consistently in most compressors, this feature
is not really part of any specification but rather traditional
behavior. Given that the original reasons for this no longer apply,
new compressor implementations are likely to miss support for this.
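
To illustrate what "binary-encoded file offsets" means in practice,
here is a minimal Python sketch that splits a tbz2 file at the XPAK
boundary by reading the trailer instead of searching for a marker
string. It assumes the layout described in xpak(5) [#MAN-XPAK]_, where
the file ends with a 4-byte big-endian length of the XPAK segment
followed by the literal ``STOP``. The function name and the synthetic
sample data are made up for illustration; this is not Portage's
implementation.

```python
import struct


def split_tbz2(blob: bytes) -> tuple[bytes, bytes]:
    """Split a tbz2/XPAK blob into (compressed tarball, XPAK segment).

    Assumed layout (per xpak(5)): <tarball><xpak><4-byte BE length>"STOP",
    where the length counts the bytes of the XPAK segment only.
    """
    if len(blob) < 8 or blob[-4:] != b"STOP":
        raise ValueError("no XPAK trailer found")
    (xpak_len,) = struct.unpack(">I", blob[-8:-4])
    xpak_start = len(blob) - 8 - xpak_len
    if xpak_start < 0:
        raise ValueError("XPAK length exceeds file size")
    xpak = blob[xpak_start:-8]
    if not (xpak.startswith(b"XPAKPACK") and xpak.endswith(b"XPAKSTOP")):
        raise ValueError("XPAK segment markers missing")
    return blob[:xpak_start], xpak


# Synthetic sample: the "tarball" deliberately contains the string
# XPAKPACK to show that a trailer-based split is not confused by it.
tarball = b"compressed data with XPAKPACK inside"
xpak = b"XPAKPACK" + struct.pack(">II", 0, 0) + b"XPAKSTOP"
blob = tarball + xpak + struct.pack(">I", len(xpak)) + b"STOP"

data, meta = split_tbz2(blob)
```

Even this short version requires knowing the trailer layout up front,
which is the transparency problem described above.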

Both of the issues make the format hard to use without dedicated tools,
or when the tools misbehave. This impacts the following scenarios:

A. **Using binary packages for system recovery.** In case of serious
breakage, it is really preferable that the format depends on as few
tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package
manager facilities.**

C. **Modifying binary packages in ways not predicted by the package
manager authors.** A real-life example of this is working around
broken ``pkg_*`` phases which prevent the package from being
installed.


OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in which the current format could
be extended to support OpenPGP signatures, and each of them has its own
distinct problem:

1. **Adding a detached signature.** This option is non-intrusive but
causes the format to no longer be contained in a single file.

2. **Wrapping the package in OpenPGP message format.** This would use
a standard format and make verification and unpacking relatively
easy. However, it would break backwards compatibility and add
explicit dependency on OpenPGP implementation in order to unpack
the package.

3. **Adding OpenPGP signature as extra XPAK member.** This is
the clever solution. It implies strengthening the dependency
on custom tooling, now additionally necessary to extract
the signature and reconstruct the original file to accommodate
verification.


Goals for a new container format
--------------------------------

All of the above considered, the new format should combine
the advantages of the existing format and at the same time address its
deficiencies whenever possible. Furthermore, since a format replacement
is taking place it is worthwhile to consider additional goals that could
be satisfied with little change.

The following obligatory goals have been set for a replacement format:

1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.

2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** The format should be transparent
enough to let user inspect and manipulate it without special tooling
or detailed knowledge.

3. **The file format must provide support for OpenPGP signatures.**
Preferably, it should use standard OpenPGP message formats.

4. **The file format must allow for efficient metadata updates.**
In particular, it should be possible to update the metadata without
having to recompress package files.

Additionally, the following optional goals have been noted:

A. **The file format should account for easy recognition both through
filename and through contents.** Preferably, it should have distinct
features making it possible to detect it via file(1).

B. **The file format should provide for partial fetching of binary
packages.** It should be possible to easily fetch and read
the package metadata without having to download the whole package.

C. **The file format should allow for metadata compression.**

D. **The file format should make future extensions easily possible
without breaking backwards compatibility.**


Specification
=============

The container format
--------------------

The gpkg package container is an uncompressed .tar archive whose filename
uses the ``.gpkg.tar`` suffix. This archive contains the following
members, in order:

1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).

2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
(optional).

3. The metadata archive ``metadata.tar${comp}``, optionally compressed
(required).

4. A signature for the filesystem image archive:
``image.tar${comp}.sig`` (optional).

5. The filesystem image archive ``image.tar${comp}``, optionally
compressed (required).

It is recommended that relative order of the archive members is
preserved. However, implementations must support archives with members
out of order.

The container may be extended with additional members in the future.
The implementations should ignore unrecognized members and preserve
them across package updates.
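
As an illustration of the layout above, the following Python sketch
assembles an unsigned container in memory using only the standard
``tarfile`` module. The helper names, the package identifier and
the ``SLOT`` metadata key are hypothetical, and writing the volume
label as a member with the GNU tar ``V`` type flag is an assumption
about how the label would be encoded.

```python
import io
import tarfile


def make_tar(files, mode="w"):
    """Pack a {path: bytes} mapping into an in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for path, content in files.items():
            info = tarfile.TarInfo(path)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def build_gpkg(package_id, metadata, image, comp="bz2"):
    """Assemble the outer, uncompressed container in the recommended order."""
    members = {
        f"metadata.tar.{comp}": make_tar(metadata, f"w:{comp}"),
        f"image.tar.{comp}": make_tar(image, f"w:{comp}"),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as outer:
        label = tarfile.TarInfo(f"gpkg: {package_id}")
        label.type = b"V"  # assumption: GNU tar volume header type flag
        outer.addfile(label)
        for name, payload in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            outer.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Hypothetical package contents, for illustration only.
package = build_gpkg(
    "app-misc/example-1.0",
    metadata={"metadata/SLOT": b"0\n"},
    image={"image/usr/bin/example": b"#!/bin/sh\n"},
)
with tarfile.open(fileobj=io.BytesIO(package)) as outer:
    names = outer.getnames()
```

Listing the result with plain ``tar -tf`` should show the same three
members, which is the point of basing the container on a common format.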


The volume label
----------------

The volume label provides an easy way for users to identify the binary
package without dedicated tooling or specific format knowledge.

The implementations should include a volume label consisting of fixed
string ``gpkg:``, followed by a single space, followed by full package
identifier. However, the implementations must not rely on the volume
label being present or attempt to parse its value when it is.

Furthermore, since the volume label is included in the .tar archive
as the first member, it provides a magic string at a fixed location
that can be used by tools such as file(1) to easily distinguish Gentoo
binary packages from regular .tar archives.
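
A rough Python sketch of such a content-based check follows. It looks
only at the first 512-byte header block, relying on the standard tar
header layout (name field at offset 0, type flag at offset 156).
The ``V`` type flag is an assumption about how the label is written,
and the function names are hypothetical. A check of this kind is meant
for identification tools; as stated above, the package manager itself
must not rely on the label.

```python
import io
import tarfile


def looks_like_gpkg(first_block: bytes) -> bool:
    """Heuristic check on the first 512-byte tar header block."""
    return (
        len(first_block) >= 512
        and first_block.startswith(b"gpkg: ")  # name field starts at offset 0
        and first_block[156:157] == b"V"       # type flag byte at offset 156
    )


def sample_archive(with_label: bool) -> bytes:
    """Build a tiny archive, optionally starting with a volume label."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        if with_label:
            label = tarfile.TarInfo("gpkg: app-misc/example-1.0")
            label.type = b"V"  # assumption: GNU volume header type flag
            tar.addfile(label)
        tar.addfile(tarfile.TarInfo("metadata.tar"))  # empty placeholder member
    return buf.getvalue()


labelled = looks_like_gpkg(sample_archive(True)[:512])
plain = looks_like_gpkg(sample_archive(False)[:512])
```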


The metadata archive
--------------------

The metadata archive stores the package metadata needed for the package
manager to process it. The archive should be included at the beginning
of the binary package in order to make it possible to read it out of
partially fetched binary package, and to avoid fetching the remaining
part of the package if not necessary.

The archive contains a single directory called ``metadata``. In this
directory, the individual metadata keys are stored as files. The exact
keys and metadata format are outside the scope of this specification.

The package manager may need to modify the package metadata. In this
case, it should replace the metadata archive without having to alter
other package members.

The metadata archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.
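
The partial-fetch idea can be sketched in Python by reading
the container as a forward-only stream and stopping once the metadata
archive has been read. The ``CountingReader`` class stands in for
a network download, the metadata keys are invented, and random bytes
stand in for a large image archive; none of this is prescribed by
the specification.

```python
import io
import os
import tarfile


class CountingReader(io.BytesIO):
    """BytesIO that records how many bytes were actually read."""

    consumed = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.consumed += len(chunk)
        return chunk


def make_tar(files, mode="w"):
    """Pack a {path: bytes} mapping into an in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for path, content in files.items():
            info = tarfile.TarInfo(path)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def read_metadata(stream):
    """Return {key: value} from the first metadata archive in the stream."""
    with tarfile.open(fileobj=stream, mode="r|") as outer:  # forward-only
        for member in outer:
            name = member.name
            if name.startswith("metadata.tar") and not name.endswith(".sig"):
                payload = outer.extractfile(member).read()
                with tarfile.open(fileobj=io.BytesIO(payload), mode="r:*") as inner:
                    return {
                        m.name.split("/", 1)[1]: inner.extractfile(m).read()
                        for m in inner
                        if m.isfile()
                    }
    raise ValueError("no metadata archive found")


metadata = make_tar(
    {"metadata/SLOT": b"0\n", "metadata/KEYWORDS": b"amd64\n"}, "w:bz2"
)
package = make_tar({
    "metadata.tar.bz2": metadata,
    "image.tar": os.urandom(256 * 1024),  # stand-in for a large image archive
})
reader = CountingReader(package)
meta = read_metadata(reader)
```

In this sketch ``reader.consumed`` stays well below ``len(package)``,
because reading stops before the image member is reached.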


The image archive
-----------------

The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.

The archive contains a single directory called ``image``. Inside this
directory, all package files are stored in filesystem layout, relative
to the root directory.

The image archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.


Archive member compression
--------------------------

The archive members outlined above support optional compression using
one of the compressed file formats supported by the package manager.
The exact list of compression types is outside the scope of this
specification.

The implementations must support archive members being uncompressed,
and must support using different compression types for different files.

When compressing an archive member, the member filename should be
suffixed using the standard suffix for the particular compressed file
type (e.g. ``.bz2`` for bzip2 format).
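
For illustration, the Python sketch below opens nested members that
use different compression types, leaving detection to ``tarfile``'s
transparent ``r:*`` mode. It covers only the formats that module
detects on its own, whereas the real list is up to the package manager;
the member contents and the helper names are hypothetical.

```python
import io
import tarfile


def make_tar(files, mode="w"):
    """Pack a {path: bytes} mapping into an in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for path, content in files.items():
            info = tarfile.TarInfo(path)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def open_member(outer, name):
    """Open a nested archive, letting tarfile detect the compression."""
    payload = outer.extractfile(name).read()
    return tarfile.open(fileobj=io.BytesIO(payload), mode="r:*")


# Uncompressed metadata next to an xz-compressed image in one container.
package = make_tar({
    "metadata.tar": make_tar({"metadata/SLOT": b"0\n"}),
    "image.tar.xz": make_tar({"image/usr/bin/example": b"#!/bin/sh\n"}, "w:xz"),
})

listing = {}
with tarfile.open(fileobj=io.BytesIO(package)) as outer:
    for name in outer.getnames():
        with open_member(outer, name) as inner:
            listing[name] = inner.getnames()
```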


OpenPGP member signatures
-------------------------

The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages.

If the signatures are expected and the archive member is unsigned, the
package manager must reject processing it. If the signature does not
verify, the package manager must reject processing the corresponding
archive member. In particular, it must not attempt decompressing
compressed members in those circumstances.

If the implementation needs to manipulate archive members, it must
either create a new signature or discard the existing signature.

The signatures are created as binary detached OpenPGP signature files,
with filename corresponding to the member filename with ``.sig`` suffix
appended.
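
The acceptance rules above can be sketched as a small Python function.
Actual OpenPGP verification is deliberately left out: ``verify`` is
a caller-supplied, hypothetical callable (for instance a wrapper around
an external OpenPGP tool), and the stub used here is not real
cryptography. The exception and function names are likewise made up.

```python
class RejectedMember(Exception):
    """Raised when policy forbids processing an archive member."""


def check_member(members, name, require_signed, verify):
    """Return the member payload only if the signature policy allows it.

    members: {member name: raw bytes} of the outer container
    verify:  hypothetical callable(payload, signature) -> bool
    """
    payload = members[name]
    signature = members.get(name + ".sig")
    if signature is None:
        if require_signed:
            raise RejectedMember(f"{name}: unsigned, but signatures are required")
        return payload
    if not verify(payload, signature):
        raise RejectedMember(f"{name}: signature does not verify")
    # Only now may the caller go on to decompress the payload.
    return payload


def stub_verify(payload, signature):
    """Stand-in verifier for the example; NOT real signature checking."""
    return signature == b"good"


members = {
    "metadata.tar.bz2": b"metadata bytes",
    "metadata.tar.bz2.sig": b"good",
    "image.tar.bz2": b"image bytes",  # no signature present
}
accepted = check_member(members, "metadata.tar.bz2", True, stub_verify)
```

With ``require_signed`` set, asking for ``image.tar.bz2`` in this
example raises ``RejectedMember`` before any decompression is attempted.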


Rationale
=========

Nested archive format
---------------------

The basic problem in designing the new format was how to embed multiple
data streams (metadata, image) into a single file. Traditionally, this
has been done by combining two non-conflicting file formats. However,
while such a solution is clever, it suffers in terms of transparency.

Therefore, it has been established that the new format should really
consist of a single archive format, with all necessary data
transparently accessible inside the file. Consequently, it has been
debated how different parts of binary package data should be stored
inside that archive.

The proposal to continue storing image data as top-level data
in the package format, and store metadata as special directory in that
structure has been discarded as a case of in-band signalling.

Finally, the proposal has been shaped to store different kinds of data
as nested archives in the outer binary package container. Besides
providing a clean way of accessing different kinds of information, it
makes it possible to add separate OpenPGP signatures to them.


Inner vs. outer compression
---------------------------

One of the points in the new format debate was whether the binary
package as a whole should be compressed vs. compressing individual
members. The first option may seem an obvious choice, especially
given that with a larger data set, the compression may proceed more
effectively. However, it has a single strong disadvantage: compression
prevents random access and manipulation of the binary package members.

While for the purpose of reading binary packages, the problem could be
circumvented through convenient member ordering and avoiding disjoint
reads of the binary package, metadata updates would either require
recompressing the whole package (which could be really time consuming
with large packages) or applying complex techniques such as splitting
the compressed archive into multiple compressed streams.

This considered, the simplest solution is to apply compression to
the individual package members, while leaving the container format
uncompressed. It provides fast random access to the individual members,
as well as capability of updating them without the necessity of
recompressing other files in the container.
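
A minimal Python sketch of such an update follows. It rewrites
the outer container with a new metadata archive and copies every other
member byte for byte, so the image archive is never decompressed.
The helper names are hypothetical, the signature is a placeholder
string, and dropping the stale metadata signature corresponds to
the "discard the existing signature" option in the specification.

```python
import io
import tarfile


def make_tar(files, mode="w"):
    """Pack a {path: bytes} mapping into an in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for path, content in files.items():
            info = tarfile.TarInfo(path)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def replace_metadata(package, new_metadata, name="metadata.tar.bz2"):
    """Return a new container with the metadata archive replaced."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(package)) as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src:
            if member.name == name + ".sig":
                continue  # the old signature no longer matches: discard it
            if member.name == name:
                info = tarfile.TarInfo(name)
                info.size = len(new_metadata)
                dst.addfile(info, io.BytesIO(new_metadata))
            else:
                # Copied verbatim, without touching the compressed payload.
                dst.addfile(member, src.extractfile(member))
    return out.getvalue()


image = make_tar({"image/usr/bin/example": b"#!/bin/sh\n"}, "w:bz2")
package = make_tar({
    "metadata.tar.bz2.sig": b"placeholder signature",
    "metadata.tar.bz2": make_tar({"metadata/SLOT": b"0\n"}, "w:bz2"),
    "image.tar.bz2": image,
})
new_meta = make_tar({"metadata/SLOT": b"1\n"}, "w:bz2")
updated = replace_metadata(package, new_meta)

with tarfile.open(fileobj=io.BytesIO(updated)) as outer:
    names = outer.getnames()
    image_after = outer.extractfile("image.tar.bz2").read()
    meta_after = outer.extractfile("metadata.tar.bz2").read()
```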

This also makes it possible to easily protect compressed files using
standard OpenPGP detached signature format. All this combined,
the package manager may perform partial fetch of binary package, verify
the signature of its metadata member and process it without having to
fetch the potentially-large image part.


Container and archive formats
-----------------------------

During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.

Firstly, the ZIP format has been proposed as the only commonly supported
format supporting adding files from stdin (i.e. making it possible to
pipe the inner archives straight into the container without using
temporary files). However, this format has been clearly rejected
as both not being present in the system set, and being trailer-based
and therefore unusable without having to fetch the whole file.

Secondly, the ar and cpio formats were considered. The former is used
by Debian and its derivative binary packages; the latter is used by Red
Hat derivatives. Both formats have the advantage of having less
historical baggage than .tar, and having less overhead. However, both
are also rather obscure (especially given that ar is actually provided
by GNU binutils rather than as a stand-alone archiver), considered
obsolete by POSIX and both have file size limitations smaller than .tar.

All that considered, it has been decided that there is no purpose
in using a second archive format in the specification unless it has
significant advantage over .tar. Therefore, .tar has also been used
as outer package format, even though it has larger overhead than other
formats (mostly due to padding).


Member ordering
---------------

The member ordering is explicitly specified in order to provide for
trivially reading metadata from partially fetched archives.
By requiring the metadata archive to be stored before the image archive,
the package manager may stop fetching after reading it and save
bandwidth and/or space.


Detached OpenPGP signatures
---------------------------

The use of detached OpenPGP signatures is to provide authenticity checks
for binary packages. Covering the complete members with signatures
provides for trivial verification of all metadata and image contents
respectively, without having to invent custom mechanisms for combining
them. Covering the compressed archives helps to prevent zipbomb
attacks. Covering the individual members rather than the whole package
provides for verification of partially fetched binary packages.


Format versioning
-----------------

It has been requested that an explicit version identifier is added
into the binary package containers in order to account for possible
incompatible changes in the format. However, such an explicit notion
does not seem necessary.

Firstly, the format is meant to be extensible while preserving backwards
compatibility. If a backwards-incompatible change needs to be done,
and that change does not make the packages implicitly incompatible
by design, the incompatibility can be easily forced e.g. via renaming
the metadata archive to ``metadata-v2.tar*``.

Secondly, the only really clean place for such a version would be
an additional file which would unnecessarily grow the uncompressed
tarball. The label is non-obligatory and user-oriented, and as such
cannot be used to carry information significant to the package manager.

Finally, such a version number can be added into the metadata archive
which needs to be processed by the package manager to extract all
significant binary package information.


Backwards Compatibility
=======================

The format does not preserve backwards compatibility with the tbz2
packages. It has been established that preserving compatibility with
the old format was impossible without making the new format even worse
than the old one was.

For example, adding any visible members to the tarball would cause
them to be installed to the filesystem by old Portage versions. Working
around this would require some kind of awful hacks that would oppose
the goal of using simple and transparent package format.


Reference Implementation
========================

The proof-of-concept implementation of binary package format converter
is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
create packages in the new format for early inspection.


References
==========

.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
packages
(https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)

.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
written in C
(https://packages.gentoo.org/packages/app-portage/portage-utils)

.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
to gpkg binpkg format
(https://github.com/mgorny/xpak2gpkg)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
--
Best regards,
Michał Górny
Roy Bamford
2018-11-19 19:21:06 UTC
Post by Michał Górny
Hi,
Post by Michał Górny
Here's a pre-GLEP draft based on the earlier discussion on gentoo-
portage-dev mailing list. The specification uses GLEP form as it
provides for cleanly specifying the motivation and rationale.
Changes in -r1: took into account the feedback and restructured
the motivation into pointing out advantages of the existing format,
and focusing on the two real issues of non-transparency and OpenPGP
implementations deficiencies. Also added a section on why there's no
explicit version number.
Post by Michał Górny
rst: https://dev.gentoo.org/~mgorny/tmp/glep-0078.rst
html: https://dev.gentoo.org/~mgorny/tmp/glep-0078.html
[snip]

Team,

Looks good to me. I can manually unpick the binpackage with tar.
Choose, if I will check the signatures or not, then spray files all
over my broken Gentoo with tar in the same way as I do now.

Implementation detail question.
It appears that all members must be signed, or none of them since

"The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages."

Or can the user specify that only some elements need to be signed?

Is it a problem if not all elements are signed with the same key?
That could happen if one person makes a binpackage and someone
else updates the metadata.
Post by Michał Górny
--
Best regards,
Michał Górny
--
Regards,

Roy Bamford
(Neddyseagoon) a member of
elections
gentoo-ops
forum-mods
Rich Freeman
2018-11-19 19:33:17 UTC
Post by Roy Bamford
"The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages."
Or can the user specify that only some elements need to be signed?
Is it a problem if not all elements are signed with the same key?
That could happen if one person makes a binpackage and someone
else updates the metadata.
IMO this is going a bit into PM details for a GLEP that is about
container formats.

Presumably any package manager is going to need to figure out what
keys are/aren't valid and allow the user to configure this behavior.
Users who want to go editing package innards will presumably adjust
their package manager settings to accept their modifications, whether
it means accepting their own sigs or disabling them.
--
Rich
Zac Medico
2018-11-19 19:40:37 UTC
Post by Rich Freeman
Post by Roy Bamford
"The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages."
Or can the user specify that only some elements need to be signed?
Is it a problem if not all elements are signed with the same key?
That could happen if one person makes a binpackage and someone
else updates the metadata.
IMO this is going a bit into PM details for a GLEP that is about
container formats.
Presumably any package manager is going to need to figure out what
keys are/aren't valid and allow the user to configure this behavior.
Users who want to go editing package innards will presumably adjust
their package manager settings to accept their modifications, whether
it means accepting their own sigs or disabling them.
With the GLEP as it is, the user *must* use a local signing key to sign
installed packages during the installation process if they want to be
able to verify signatures for installed packages at some point in the
future, since the binary package format does not provide a way to use
binary package signatures for this purpose.
--
Thanks,
Zac
Rich Freeman
2018-11-19 19:51:05 UTC
Post by Zac Medico
Post by Rich Freeman
Post by Roy Bamford
"The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages."
Or can the user specify that only some elements need to be signed?
Is it a problem if not all elements are signed with the same key?
That could happen if one person makes a binpackage and someone
else updates the metadata.
IMO this is going a bit into PM details for a GLEP that is about
container formats.
Presumably any package manager is going to need to figure out what
keys are/aren't valid and allow the user to configure this behavior.
Users who want to go editing package innards will presumably adjust
their package manager settings to accept their modifications, whether
it means accepting their own sigs or disabling them.
With the GLEP as it is, the user *must* use a local signing key to sign
installed packages during the installation process if they want to be
able to verify signatures for installed packages at some point in the
future, since the binary package format does not provide a way to use
binary package signatures for this purpose.
I think we might be talking about different signatures?

I think you're referring to signatures of the package files after they
are installed on the local filesystem, while I'm talking about
verifying the integrity of the package files themselves.

If these signatures are applied to different data then obviously you
couldn't just have the one signature serve double duty (unless you
hung onto the binary package, verified the signature on it, then
verified the package contents against the live filesystem).

The simplest solution would be to do as you seem to be suggesting -
verify the signature on the package before installing it, and then
during installation capture whatever metadata is already supported by
portage and sign that using a user's trusted key.

This seems like the most practical solution in any case since we
aren't likely to ever go down the route of using a single signed
squashfs for /usr like a release-based binary distro might.
--
Rich
Michał Górny
2018-11-20 20:34:57 UTC
Post by Roy Bamford
Post by Michał Górny
Hi,
Post by Michał Górny
Here's a pre-GLEP draft based on the earlier discussion on gentoo-
portage-dev mailing list. The specification uses GLEP form as it
provides for cleanly specifying the motivation and rationale.
Changes in -r1: took into account the feedback and restructured
the motivation into pointing out advantages of the existing format,
and focusing on the two real issues of non-transparency and OpenPGP
implementations deficiencies. Also added a section on why there's no
explicit version number.
Post by Michał Górny
rst: https://dev.gentoo.org/~mgorny/tmp/glep-0078.rst
html: https://dev.gentoo.org/~mgorny/tmp/glep-0078.html
[snip]
Team,
Looks good to me. I can manually unpick the binpackage with tar.
Choose, if I will check the signatures or not, then spray files all
over my broken Gentoo with tar in the same way as I do now.
Implementation detail question.
It appears that all members must be signed, or none of them since
"The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages."
Or can the user specify that only some elements need to be signed?
This is really out of scope. The only purpose of this paragraph is to
explain that '(optional)' doesn't mean you can safely ignore the lack of
this file.
--
Best regards,
Michał Górny
Roy Bamford
2018-11-19 20:48:37 UTC
Post by Rich Freeman
Post by Roy Bamford
"The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages."
Or can the user specify that only some elements need to be signed?
Is it a problem if not all elements are signed with the same key?
That could happen if one person makes a binpackage and someone
else updates the metadata.
IMO this is going a bit into PM details for a GLEP that is about
container formats.
Rich,

Not really. The GLEP needs to be clear about the signing.
Is it every element or none?
The GLEP hints that a mix is possible with

If the implementation needs to manipulate archive members, it must
either create a new signature or discard the existing signature.

An individual binpackage could start life with all elements signed
by the same key.

Some element could be updated and the key for the signature of
that element changed.

Later still, another element can be changed and have its signature
dropped.

Should some combinations have no practical value, they should
not be permitted by the GLEP.
Post by Rich Freeman
--
Rich
--
Regards,

Roy Bamford
(Neddyseagoon) a member of
elections
gentoo-ops
forum-mods
Michał Górny
2018-11-20 20:33:17 UTC
Hi,
Post by Michał Górny
Here's a pre-GLEP draft based on the earlier discussion on gentoo-
portage-dev mailing list. The specification uses GLEP form as it
provides for cleanly specifying the motivation and rationale.
Here's third iteration. Changes since r1:
- removed unnecessary OpenPGP details, made them out of scope,
- added explicit section on (lack of) versioning and how to recognize
packages and their compatibility,
- explained why squashfs is a no-go.


---
GLEP: 9999
Title: Gentoo binary package container format
Author: Michał Górny <***@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-20
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are explained. Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed. The rationale for
the design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: header-oriented compressed .tar
format (used to hold package files) and trailer-oriented custom XPAK
format (used to hold metadata) [#MAN-XPAK]_. The format has already
been extended incompatibly twice.

The first time, support for storing multiple successive builds of binary
package for a single ebuild version has been added. This feature relies
on appending additional hyphen, followed by an integer to the package
filename. It is disabled by default (preserving backwards
compatibility) and controlled by ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats has been
added. When format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
compression used. For backwards compatibility, Portage still defaults
to using bzip2; compression program can be switched using
``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies. In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.**
While this might seem unnecessary, it makes it easier for the user
to transfer binary packages without having to be concerned about
finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed
tarballs, most of the time.** With notable exceptions of historical
versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
can be extracted using regular tar utility with a compressor
implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed
without decompressing package contents.** This includes
the possibility of rewriting it (e.g. as a result of package moves)
without the necessity of repacking the files.


Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant
design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use
of binary-encoded file offsets and field lengths.** As such, it is
non-trivial to read or edit without specialized tools. Such tools
are currently implemented separately from the package manager,
as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on an obscure feature of
ignoring trailing garbage in compressed files**. While this is
implemented consistently in most of the compressors, this feature
is not really a part of specification but rather traditional
behavior. Given that the original reasons for this no longer apply,
new compressor implementations are likely to miss support for this.

Both of the issues make the format hard to use without dedicated tools,
or when the tools misbehave. This impacts the following scenarios:

A. **Using binary packages for system recovery.** In case of serious
breakage, it is really preferable that the format depends on as few
tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package
manager facilities.**

C. **Modifying binary packages in ways not predicted by the package
manager authors.** A real-life example of this is working around
broken ``pkg_*`` phases which prevent the package from being
installed.


OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in which the current format could
be extended to support OpenPGP signatures, and each of them has its own
distinct problem:

1. **Adding a detached signature.** This option is non-intrusive but
causes the format to no longer be contained in a single file.

2. **Wrapping the package in OpenPGP message format.** This would use
a standard format and make verification and unpacking relatively
easy. However, it would break backwards compatibility and add
explicit dependency on OpenPGP implementation in order to unpack
the package.

3. **Adding OpenPGP signature as extra XPAK member.** This is
the clever solution. It implies strengthening the dependency
on custom tooling, now additionally necessary to extract
the signature and reconstruct the original file to accommodate
verification.


Goals for a new container format
--------------------------------

All of the above considered, the new format should combine
the advantages of the existing format and at the same time address its
deficiencies whenever possible. Furthermore, since a format replacement
is taking place it is worthwhile to consider additional goals that could
be satisfied with little change.

The following obligatory goals have been set for a replacement format:

1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.

2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** The format should be transparent
enough to let user inspect and manipulate it without special tooling
or detailed knowledge.

3. **The file format must provide support for OpenPGP signatures.**
Preferably, it should use standard OpenPGP message formats.

4. **The file format must allow for efficient metadata updates.**
In particular, it should be possible to update the metadata without
having to recompress package files.

Additionally, the following optional goals have been noted:

A. **The file format should account for easy recognition both through
filename and through contents.** Preferably, it should have distinct
features making it possible to detect it via file(1).

B. **The file format should provide for partial fetching of binary
packages.** It should be possible to easily fetch and read
the package metadata without having to download the whole package.

C. **The file format should allow for metadata compression.**

D. **The file format should make future extensions easily possible
without breaking backwards compatibility.**


Specification
=============

The container format
--------------------

The gpkg package container is an uncompressed .tar archive whose filename
should use ``.gpkg.tar`` suffix. This archive contains the following
members, in order:

1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).

2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
(optional).

3. The metadata archive ``metadata.tar${comp}``, optionally compressed
(required).

4. A signature for the filesystem image archive:
``image.tar${comp}.sig`` (optional).

5. The filesystem image archive ``image.tar${comp}``, optionally
compressed (required).

It is recommended that relative order of the archive members is
preserved. However, implementations must support archives with members
out of order.

The container may be extended with additional members in the future.
The implementations should ignore unrecognized members and preserve
them across package updates.
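
The layout above can be illustrated with a minimal sketch using
Python's standard ``tarfile`` module. It is not the reference
implementation; the helper names (``make_inner_tar``, ``build_gpkg``,
``member_names``), the sample metadata key and the sample file are all
hypothetical, and the optional label and signature members are omitted.

```python
import io
import tarfile


def make_inner_tar(top_dir, files, compression):
    """Return the bytes of a compressed .tar holding `files` under `top_dir`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=f"w:{compression}") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(f"{top_dir}/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_gpkg(metadata, image, compression="gz"):
    """Assemble the outer container: metadata archive first, image last."""
    members = [
        (f"metadata.tar.{compression}",
         make_inner_tar("metadata", metadata, compression)),
        (f"image.tar.{compression}",
         make_inner_tar("image", image, compression)),
    ]
    buf = io.BytesIO()
    # Plain mode "w" keeps the outer container itself uncompressed.
    with tarfile.open(fileobj=buf, mode="w") as outer:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            outer.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def member_names(container):
    """List the outer members; mode "r:" refuses compressed input."""
    with tarfile.open(fileobj=io.BytesIO(container), mode="r:") as outer:
        return outer.getnames()


pkg = build_gpkg(
    {"CATEGORY": b"app-misc\n"},
    {"usr/bin/hello": b"#!/bin/sh\necho hello\n"},
)
print(member_names(pkg))
```

Only the outer archive is left uncompressed here; each inner archive
carries its own compression, which is what allows one member to be
replaced later without touching the other.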


The volume label
----------------

The volume label provides an easy way for users to identify the binary
package without dedicated tooling or specific format knowledge.

The implementations should include a volume label consisting of fixed
string ``gpkg:``, followed by a single space, followed by full package
identifier. However, the implementations must not rely on the volume
label being present or attempt to parse its value when it is.

Furthermore, since the volume label is included in the .tar archive
as the first member, it provides a magic string at a fixed location
that can be used by tools such as file(1) to easily distinguish Gentoo
binary packages from regular .tar archives.
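
A short sketch of why the label works as a magic string, assuming
the label is written as a GNU tar volume header entry (typeflag ``V``).
Python's ``tarfile`` has no named constant for that type, so the sketch
sets it by hand; ``write_label`` and the package identifier are
hypothetical.

```python
import io
import tarfile

GNU_VOLUME_HEADER = b"V"  # GNU tar typeflag for a volume label entry


def write_label(outer, package_id):
    """Add a header-only volume label member to an open outer archive."""
    info = tarfile.TarInfo(f"gpkg: {package_id}")
    info.type = GNU_VOLUME_HEADER
    outer.addfile(info)  # no file object: only the 512-byte header is written


buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as outer:
    write_label(outer, "app-misc/hello-1.0")

raw = buf.getvalue()
# The member name occupies the first field of the first tar header,
# so the label text sits at offset 0 of the container file.
print(raw[:24])
# The typeflag byte lives at offset 156 of the header.
print(raw[156:157])
```

Because the name field is the first field of a tar header, a tool only
needs to compare the leading bytes of the file against ``gpkg: `` to
tell a labelled package from an ordinary tarball.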


The metadata archive
--------------------

The metadata archive stores the package metadata needed for the package
manager to process it. The archive should be included at the beginning
of the binary package in order to make it possible to read it out of
partially fetched binary package, and to avoid fetching the remaining
part of the package if not necessary.

The archive contains a single directory called ``metadata``. In this
directory, the individual metadata keys are stored as files. The exact
keys and metadata format are outside the scope of this specification.

The package manager may need to modify the package metadata. In this
case, it should replace the metadata archive without having to alter
other package members.

The metadata archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.
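
As an illustration of the directory layout, the following sketch reads
every file under ``metadata/`` into a dictionary. The key names
(``CATEGORY``, ``SLOT``) and both helper functions are hypothetical,
since the specification leaves the exact keys out of scope.

```python
import io
import tarfile


def read_metadata_keys(blob):
    """Map each file under metadata/ to its raw contents."""
    keys = {}
    # "r:*" lets tarfile detect the compression (or its absence) itself.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        for member in tar:
            parent, _, key = member.name.partition("/")
            if member.isfile() and parent == "metadata" and key:
                keys[key] = tar.extractfile(member).read()
    return keys


def make_metadata_archive(keys):
    """Build a small bzip2-compressed metadata archive for the demo."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for key, value in keys.items():
            info = tarfile.TarInfo(f"metadata/{key}")
            info.size = len(value)
            tar.addfile(info, io.BytesIO(value))
    return buf.getvalue()


sample = make_metadata_archive({"CATEGORY": b"app-misc\n", "SLOT": b"0\n"})
print(read_metadata_keys(sample))
```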


The image archive
-----------------

The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.

The archive contains a single directory called ``image``. Inside this
directory, all package files are stored in filesystem layout, relative
to the root directory.

The image archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.


Archive member compression
--------------------------

The archive members outlined above support optional compression using
one of the compressed file formats supported by the package manager.
The exact list of compression types is outside the scope of this
specification.

The implementations must support archive members being uncompressed,
and must support using different compression types for different files.

When compressing an archive member, the member filename should be
suffixed using the standard suffix for the particular compressed file
type (e.g. ``.bz2`` for bzip2 format).
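
One possible way to map member filenames to decompressors is sketched
below. The suffix table is an assumption limited to formats available
in the Python standard library; the actual set is up to the package
manager, and ``split_member`` and ``decompress_member`` are
hypothetical names.

```python
import bz2
import gzip
import lzma

# Hypothetical suffix table; "" stands for an uncompressed member.
DECOMPRESSORS = {
    "": lambda data: data,
    ".bz2": bz2.decompress,
    ".gz": gzip.decompress,
    ".xz": lzma.decompress,
}


def split_member(name):
    """Split 'image.tar.bz2' into ('image', '.bz2')."""
    base, sep, suffix = name.partition(".tar")
    if not sep or base not in ("metadata", "image"):
        raise ValueError(f"not an archive member name: {name!r}")
    if suffix not in DECOMPRESSORS:
        raise ValueError(f"unsupported compression suffix: {suffix!r}")
    return base, suffix


def decompress_member(name, data):
    """Decompress a member according to its filename suffix."""
    _, suffix = split_member(name)
    return DECOMPRESSORS[suffix](data)


print(split_member("image.tar.bz2"))
print(decompress_member("metadata.tar.xz", lzma.compress(b"payload")))
```

Since the suffix is looked up per member, the metadata and image
archives may use different compression types, as required above.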


OpenPGP member signatures
-------------------------

The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages.

If the signatures are expected and the archive member is unsigned, the
package manager must reject processing it. If the signature does not
verify, the package manager must reject processing the corresponding
archive member. In particular, it must not attempt decompressing
compressed members in those circumstances.

The signatures are created as binary detached OpenPGP signature files,
with filename corresponding to the member filename with ``.sig`` suffix
appended.

The exact details regarding creating and verifying signatures, as well
as maintaining and distributing keys are outside the scope of this
specification.
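
The acceptance rules above can be sketched as a small decision
function. The ``verify`` callback is a hypothetical stand-in for
whatever OpenPGP implementation the package manager uses, and the
sketch reads the "does not verify" rule as applying even when
signatures are not expected; that reading is an assumption.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"


def check_member(data, signature, signatures_expected, verify):
    """Decide whether a (still compressed) member may be processed.

    `verify(data, signature) -> bool` is a hypothetical callback that
    checks a detached signature over the member bytes.
    """
    if signature is None:
        # Unsigned members pass only if the user does not expect signatures.
        return Verdict.REJECT if signatures_expected else Verdict.ACCEPT
    # A signature that is present but fails to verify blocks processing.
    if not verify(data, signature):
        return Verdict.REJECT
    return Verdict.ACCEPT


def fake_verify(data, signature):
    """Demo-only verifier: accepts one fixed byte string."""
    return signature == b"good"


print(check_member(b"blob", None, True, fake_verify))
print(check_member(b"blob", b"good", True, fake_verify))
```

The check runs on the compressed bytes, so a rejected member is never
handed to a decompressor.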


Versioning and format recognition
---------------------------------

The container format does not provide an explicit magic identifier
or version number. The implementations should recognize binary packages
through recognizing the uncompressed .tar archive format,
and investigating its contents. Generally, the presence of metadata
archive should be sufficient to assume that the package conforms to this
specification.

If the package format needs to be changed in an incompatible way, it
should be done in such a way as to make the above check fail. For
example, the metadata archive can be renamed to ``metadata-r1.tar*``.
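
A minimal sketch of such a recognition check, assuming it is enough to
open the file as an uncompressed tar and look for a ``metadata.tar*``
member. ``looks_like_gpkg`` and ``make_tar`` are hypothetical helpers.

```python
import io
import tarfile


def looks_like_gpkg(fileobj):
    """Return True if fileobj is an uncompressed tar with a metadata archive."""
    try:
        # Mode "r:" accepts only uncompressed tar input.
        with tarfile.open(fileobj=fileobj, mode="r:") as outer:
            return any(
                m.name == "metadata.tar" or m.name.startswith("metadata.tar.")
                for m in outer
                if not m.name.endswith(".sig")
            )
    except tarfile.TarError:
        return False


def make_tar(names):
    """Build an in-memory tar with empty members, for the demo only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            tar.addfile(tarfile.TarInfo(name))
    buf.seek(0)
    return buf


print(looks_like_gpkg(make_tar(["metadata.tar.gz", "image.tar.gz"])))
print(looks_like_gpkg(make_tar(["metadata-r1.tar.gz", "image.tar.gz"])))
```

A package using the renamed ``metadata-r1.tar*`` member fails the check,
which is how an incompatible revision would be kept away from older
implementations.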


Rationale
=========

Nested archive format
---------------------

The basic problem in designing the new format was how to embed multiple
data streams (metadata, image) into a single file. Traditionally, this
has been done via using two non-conflicting file formats. However,
while such a solution is clever, it suffers in terms of transparency.

Therefore, it has been established that the new format should really
consist of a single archive format, with all necessary data
transparently accessible inside the file. Consequently, it has been
debated how different parts of binary package data should be stored
inside that archive.

The proposal to continue storing image data as top-level data
in the package format, and store metadata as special directory in that
structure has been discarded as a case of in-band signalling.

Finally, the proposal has been shaped to store different kinds of data
as nested archives in the outer binary package container. Besides
providing a clean way of accessing different kinds of information, it
makes it possible to add separate OpenPGP signatures to them.


Inner vs. outer compression
---------------------------

One of the points in the new format debate was whether the binary
package as a whole should be compressed vs. compressing individual
members. The first option may seem an obvious choice, especially
given that with a larger data set, the compression may proceed more
effectively. However, it has a single strong disadvantage: compression
prevents random access and manipulation of the binary package members.

While for the purpose of reading binary packages, the problem could be
circumvented through convenient member ordering and avoiding disjoint
reads of the binary package, metadata updates would either require
recompressing the whole package (which could be really time consuming
with large packages) or applying complex techniques such as splitting
the compressed archive into multiple compressed streams.

This considered, the simplest solution is to apply compression to
the individual package members, while leaving the container format
uncompressed. It provides fast random access to the individual members,
as well as capability of updating them without the necessity of
recompressing other files in the container.

This also makes it possible to easily protect compressed files using
standard OpenPGP detached signature format. All this combined,
the package manager may perform partial fetch of binary package, verify
the signature of its metadata member and process it without having to
fetch the potentially-large image part.
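
The metadata update case can be sketched as follows: the container is
rewritten, the metadata member is swapped, and every other member is
copied byte-for-byte, so the image is never decompressed or
recompressed. The helper names are hypothetical, the member payloads in
the demo are placeholder bytes rather than real archives, and dropping
the stale metadata signature is this sketch's choice, not a rule stated
here.

```python
import io
import tarfile


def replace_metadata(container, new_name, new_blob):
    """Return a new container with the metadata archive swapped out."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(container), mode="r:") as src:
        with tarfile.open(fileobj=out, mode="w") as dst:
            for member in src:
                if member.name.startswith("metadata.tar"):
                    if member.name.endswith(".sig"):
                        continue  # the old signature no longer matches
                    info = tarfile.TarInfo(new_name)
                    info.size = len(new_blob)
                    dst.addfile(info, io.BytesIO(new_blob))
                else:
                    # Copy the stored (compressed) bytes unchanged.
                    dst.addfile(member, src.extractfile(member))
    return out.getvalue()


def make_container(members):
    """Build an uncompressed outer tar from (name, bytes) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, blob in members:
            info = tarfile.TarInfo(name)
            info.size = len(blob)
            tar.addfile(info, io.BytesIO(blob))
    return buf.getvalue()


def read_members(container):
    """Return the outer members as (name, bytes) pairs, in order."""
    with tarfile.open(fileobj=io.BytesIO(container), mode="r:") as tar:
        return [(m.name, tar.extractfile(m).read()) for m in tar]


old = make_container([
    ("metadata.tar.gz.sig", b"old-signature"),
    ("metadata.tar.gz", b"old-metadata"),
    ("image.tar.gz", b"compressed-image-bytes"),
])
updated = replace_metadata(old, "metadata.tar.gz", b"new-metadata")
print(read_members(updated))
```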


Container and archive formats
-----------------------------

During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.

Firstly, the ZIP format has been proposed as the only commonly supported
format supporting adding files from stdin (i.e. making it possible to
pipe the inner archives straight into the container without using
temporary files). However, this format has been clearly rejected
as both not being present in the system set, and being trailer-based
and therefore unusable without having to fetch the whole file.

Secondly, the ar and cpio formats were considered. The former is used
by Debian and its derivative binary packages; the latter is used by Red
Hat derivatives. Both formats have the advantage of having less
historical baggage than .tar, and having less overhead. However, both
are also rather obscure (especially given that ar is actually provided
by GNU binutils rather than as a stand-alone archiver), considered
obsolete by POSIX and both have file size limitations smaller than .tar.

Thirdly, SquashFS was another interesting option. Its main advantage is
transparent compression support and ability to mount as a filesystem.
However, it has a significant implementation complexity, including mount
management and necessity of fallback to unsquashfs. Since the image
needs to be writable for the pre-installation manipulations, using it
via a mount would additionally require some kind of overlay filesystem.
Using it as top-level format has no real gain over a pipeline with tar,
and is certainly less portable. Therefore, there does not seem to be
a benefit in using SquashFS.

All that considered, it has been decided that there is no purpose
in using a second archive format in the specification unless it has
significant advantage over .tar. Therefore, .tar has also been used
as outer package format, even though it has larger overhead than other
formats (mostly due to padding).


Member ordering
---------------

The member ordering is explicitly specified in order to provide for
trivially reading metadata from partially fetched archives.
By requiring the metadata archive to be stored before the image archive,
the package manager may stop fetching after reading it and save
bandwidth and/or space.
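
A sketch of reading the metadata member from a partially fetched
package, assuming the recommended member order. The input is consumed
strictly forward using ``tarfile`` stream mode, and the truncated byte
string in the demo stands in for an interrupted download. The helper
names and the placeholder payloads are hypothetical.

```python
import io
import tarfile


def read_metadata_member(stream):
    """Return (name, bytes) of the metadata archive from a forward-only stream."""
    # "r|" treats the input as a non-seekable sequence of tar blocks.
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            is_sig = member.name.endswith(".sig")
            if member.name.startswith("metadata.tar") and not is_sig:
                # Returning here means the image member is never read.
                return member.name, outer.extractfile(member).read()
    raise LookupError("no metadata archive found")


def make_container(members):
    """Build an uncompressed outer tar from (name, bytes) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, blob in members:
            info = tarfile.TarInfo(name)
            info.size = len(blob)
            tar.addfile(info, io.BytesIO(blob))
    return buf.getvalue()


full = make_container([
    ("metadata.tar.gz", b"small-metadata-blob"),
    ("image.tar.gz", bytes(2 * 1024 * 1024)),  # 2 MiB placeholder image
])
partial = full[:4096]  # pretend the fetch stopped after 4 KiB
name, blob = read_metadata_member(io.BytesIO(partial))
print(name, len(blob), len(full))
```

The stream reader may buffer somewhat past the end of the metadata
member, but it never needs the image data, so a fetcher can close the
connection as soon as the function returns.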


Detached OpenPGP signatures
---------------------------

The use of detached OpenPGP signatures is to provide authenticity checks
for binary packages. Covering the complete members with signatures
provides for trivial verification of all metadata and image contents
respectively, without having to invent custom mechanisms for combining
them. Covering the compressed archives helps to prevent zipbomb
attacks. Covering the individual members rather than the whole package
provides for verification of partially fetched binary packages.


Format versioning
-----------------

It has been requested that an explicit version identifier is added
into the binary package containers in order to account for possible
incompatible changes in the format. However, such an explicit notion
does not seem necessary.

Firstly, the format is meant to be extensible while preserving backwards
compatibility. If a backwards-incompatible change needs to be made,
and that change does not make the packages implicitly incompatible
by design, the incompatibility can be easily forced, e.g. by renaming
the metadata archive to ``metadata-r1.tar*``.

Secondly, the only really clean place for such a version would be
an additional file, which would unnecessarily grow the uncompressed
tarball. The label is non-obligatory and user-oriented, and as such
cannot be used to carry information significant to the package manager.

Finally, such a version number can be added into the metadata archive
which needs to be processed by the package manager to extract all
significant binary package information.


Backwards Compatibility
=======================

The format does not preserve backwards compatibility with the tbz2
packages. It has been established that preserving compatibility with
the old format was impossible without making the new format even worse
than the old one was.

For example, adding any visible members to the tarball would cause
them to be installed to the filesystem by old Portage versions. Working
around this would require awful hacks that would run against the goal
of using a simple and transparent package format.


Reference Implementation
========================

A proof-of-concept implementation of a binary package format converter
is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
create packages in the new format for early inspection.


References
==========

.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
packages
(https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)

.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
written in C
(https://packages.gentoo.org/packages/app-portage/portage-utils)

.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
to gpkg binpkg format
(https://github.com/mgorny/xpak2gpkg)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
--
Best regards,
Michał Górny
Fabian Groffen
2018-11-21 13:10:00 UTC
Post by Michał Górny
The volume label
----------------
The volume label provides an easy way for users to identify the binary
package without dedicated tooling or specific format knowledge.
The implementations should include a volume label consisting of fixed
string ``gpkg:``, followed by a single space, followed by full package
identifier. However, the implementations must not rely on the volume
label being present or attempt to parse its value when it is.
Furthermore, since the volume label is included in the .tar archive
as the first member, it provides a magic string at a fixed location
that can be used by tools such as file(1) to easily distinguish Gentoo
binary packages from regular .tar archives.
Just for clarity on this point.
Are you proposing that we patch file(1) to print the Volume Header here?
file-5.35 seems to not say much but "tar archive" or "POSIX tar archive"
for tar-files containing a Volume Header as shown by tar -tv.
Post by Michał Górny
Container and archive formats
-----------------------------
During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.
You mention POSIX, which triggered me. I think it would be good to
specify which tar format to use.

POSIX.1-2001/pax format doesn't have a 100/256 char filename length
restriction, which is good but it is not (yet) used by default by GNU
tar. busybox tar can read pax tars, it seems.

Thanks,
Fabian
--
Fabian Groffen
Gentoo on a different level
Michał Górny
2018-11-21 14:21:48 UTC
Post by Fabian Groffen
Post by Michał Górny
The volume label
----------------
The volume label provides an easy way for users to identify the binary
package without dedicated tooling or specific format knowledge.
The implementations should include a volume label consisting of fixed
string ``gpkg:``, followed by a single space, followed by full package
identifier. However, the implementations must not rely on the volume
label being present or attempt to parse its value when it is.
Furthermore, since the volume label is included in the .tar archive
as the first member, it provides a magic string at a fixed location
that can be used by tools such as file(1) to easily distinguish Gentoo
binary packages from regular .tar archives.
Just for clarity on this point.
Are you proposing that we patch file(1) to print the Volume Header here?
file-5.35 seems to not say much but "tar archive" or "POSIX tar archive"
for tar-files containing a Volume Header as shown by tar -tv.
I'm wondering about that as well, yes. However, my main idea is to
specifically detect 'gpkg:' there and use it to explicitly identify
the file as Gentoo binary package (and print package name).
Post by Fabian Groffen
Post by Michał Górny
Container and archive formats
-----------------------------
During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.
You mention POSIX, which triggered me. I think it would be good to
specify which tar format to use.
POSIX.1-2001/pax format doesn't have a 100/256 char filename length
restriction, which is good but it is not (yet) used by default by GNU
tar. busybox tar can read pax tars, it seems.
I think the modern GNU tar format is the obvious choice here. I think
it doesn't suffer any portability problems these days, and is more
compact than the PAX format.
--
Best regards,
Michał Górny
Michał Górny
2018-11-26 18:58:16 UTC
Here's the newest version.

Changes:

- added explicit notion of parent directory (missing in previous GLEP
but present in implementation),

- explicitly named GNU tar format with list of permitted extensions,

- changed volume label to 'gpkg-1.txt' file to improve portability; made
it explicit version identifier as well,

- added info on other package formats to rationale.


---
GLEP: 9999
Title: Gentoo binary package container format
Author: Michał Górny <***@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-26
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are explained. Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed. The rationale for
the design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: header-oriented compressed .tar
format (used to hold package files) and trailer-oriented custom XPAK
format (used to hold metadata) [#MAN-XPAK]_. The format has already
been extended incompatibly twice.

The first time, support for storing multiple successive builds of
a binary package for a single ebuild version was added. This feature
relies on appending an additional hyphen, followed by an integer, to
the package filename. It is disabled by default (preserving backwards
compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats was
added. When a format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
the compression used. For backwards compatibility, Portage still
defaults to using bzip2; the compression program can be switched using
the ``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies. In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.**
While this might seem unnecessary, it makes it easier for the user
to transfer binary packages without having to be concerned about
finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed
tarballs, most of the time.** With notable exceptions of historical
versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
can be extracted using regular tar utility with a compressor
implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed
without decompressing package contents.** This includes
the possibility of rewriting it (e.g. as a result of package moves)
without the necessity of repacking the files.


Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant
design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use
of binary-encoded file offsets and field lengths.** As such, it is
non-trivial to read or edit without specialized tools. Such tools
are currently implemented separately from the package manager,
as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature
of ignoring trailing garbage in compressed files**. While this is
implemented consistently in most of the compressors, this feature
is not really a part of any specification but rather traditional
behavior. Given that the original reasons for this no longer apply,
new compressor implementations are likely to miss support for this.

Both of the issues make the format hard to use without dedicated tools,
or when the tools misbehave. This impacts the following scenarios:

A. **Using binary packages for system recovery.** In case of serious
breakage, it is really preferable that the format depends on as few
tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package
manager facilities.**

C. **Modifying binary packages in ways not predicted by the package
manager authors.** A real-life example of this is working around
broken ``pkg_*`` phases which prevent the package from being
installed.


OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in which the current format could
be extended to support OpenPGP signatures, and each of them has its own
distinct problem:

1. **Adding a detached signature.** This option is non-intrusive but
causes the format to no longer be contained in a single file.

2. **Wrapping the package in OpenPGP message format.** This would use
a standard format and make verification and unpacking relatively
easy. However, it would break backwards compatibility and add
explicit dependency on OpenPGP implementation in order to unpack
the package.

3. **Adding OpenPGP signature as extra XPAK member.** This is
the clever solution. It implies strengthening the dependency
on custom tooling, now additionally necessary to extract
the signature and reconstruct the original file to accommodate
verification.


Goals for a new container format
--------------------------------

All of the above considered, the new format should combine
the advantages of the existing format and at the same time address its
deficiencies whenever possible. Furthermore, since a format replacement
is taking place it is worthwhile to consider additional goals that could
be satisfied with little change.

The following obligatory goals have been set for a replacement format:

1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.

2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** The format should be transparent
enough to let the user inspect and manipulate it without special
tooling or detailed knowledge.

3. **The file format must provide support for OpenPGP signatures.**
Preferably, it should use standard OpenPGP message formats.

4. **The file format must allow for efficient metadata updates.**
In particular, it should be possible to update the metadata without
having to recompress package files.

Additionally, the following optional goals have been noted:

A. **The file format should account for easy recognition both through
filename and through contents.** Preferably, it should have distinct
features making it possible to detect it via file(1).

B. **The file format should provide for partial fetching of binary
packages.** It should be possible to easily fetch and read
the package metadata without having to download the whole package.

C. **The file format should allow for metadata compression.**

D. **The file format should make future extensions easily possible
without breaking backwards compatibility.**


Specification
=============

The container format
--------------------

The gpkg package container is an uncompressed .tar archive whose
filename should use the ``.gpkg.tar`` suffix. This archive contains
the following members, in order, all placed in a single directory whose
name matches the basename of the package file:

1. The package identifier file ``gpkg-1.txt`` (required).

2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
(optional).

3. The metadata archive ``metadata.tar${comp}``, optionally compressed
(required).

4. A signature for the filesystem image archive:
``image.tar${comp}.sig`` (optional).

5. The filesystem image archive ``image.tar${comp}``, optionally
compressed (required).

It is recommended that relative order of the archive members is
preserved. However, implementations must support archives with members
out of order.

The container may be extended with additional members in the future.
The implementations should ignore unrecognized members and preserve
them across package updates.
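
As an illustration of the layout above, a container could be assembled
with Python's standard tarfile module roughly as follows. This is only
a sketch: the helper names ``add_bytes`` and ``build_gpkg`` are made up
for this example, signatures are omitted, and the caller is assumed to
pass the directory name (the package file's basename) itself.

```python
import io
import tarfile


def add_bytes(tar, name, data):
    """Add an in-memory blob to an open tarfile under the given name."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def build_gpkg(basename, metadata_tar, image_tar, comp=""):
    """Return the bytes of an unsigned gpkg container (hypothetical helper).

    metadata_tar and image_tar are the already built (and optionally
    compressed) inner archives; comp is their suffix, e.g. "" or ".bz2".
    """
    out = io.BytesIO()
    # The outer container itself stays uncompressed.
    with tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as tar:
        add_bytes(tar, f"{basename}/gpkg-1.txt", b"")
        add_bytes(tar, f"{basename}/metadata.tar{comp}", metadata_tar)
        add_bytes(tar, f"{basename}/image.tar{comp}", image_tar)
    return out.getvalue()
```

The members are written in the recommended order, so a reader that
stops early still sees the identifier and the metadata first.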


Permitted .tar format features
------------------------------

The tar archives should use either the POSIX ustar format or a subset
of the GNU format with the following (optional) extensions:

- long pathnames and long linknames,

- base-256 encoding of large file sizes.

Other extensions should be avoided whenever possible.
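
The two permitted extensions correspond to the two limits of plain
ustar. The following sketch uses Python's tarfile module to check
whether a member header can be written in a given format; the function
name ``fits_format`` is an assumption made for this example and is not
part of any existing tool.

```python
import tarfile


def fits_format(name, size, fmt):
    """Return True if a member header with this name and size can be
    encoded in the given tarfile format constant."""
    info = tarfile.TarInfo(name)
    info.size = size
    try:
        info.tobuf(format=fmt)
    except ValueError:
        # tarfile raises ValueError for names or numbers that do not fit.
        return False
    return True
```

With this helper, a 120-character name without any slash and a member
of 8 GiB are both rejected for ``tarfile.USTAR_FORMAT`` but accepted
for ``tarfile.GNU_FORMAT``, which falls back to the long-name and
base-256 extensions respectively.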


The package identifier file
---------------------------

The package identifier file serves the purpose of identifying the binary
package format and its version.

The implementations must include a package identifier file named
``gpkg-1.txt``. The filename includes the package format version;
implementations should reject packages which do not contain this file
as being in an unsupported format.

The file can have any contents. Normally, it should be empty.

Furthermore, this file should be included in the .tar archive
as the first member. This makes it possible to use it as an additional
magic at a fixed location that can be used by tools such as file(1)
to easily distinguish Gentoo binary packages from regular .tar archives.
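
A reader could perform the check described above along these lines.
This is a minimal sketch; ``is_gpkg_v1`` is a hypothetical name, and
the whole stream is scanned because members may also appear out of
order.

```python
import posixpath
import tarfile

IDENTIFIER = "gpkg-1.txt"  # name used by this draft of the specification


def is_gpkg_v1(fileobj):
    """Return True if the uncompressed tar stream contains the package
    identifier file in any directory."""
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if posixpath.basename(member.name) == IDENTIFIER:
                return True
    return False
```

When the identifier is stored as the first member, the loop returns
after reading a single header.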


The metadata archive
--------------------

The metadata archive stores the package metadata needed for the package
manager to process it. The archive should be included at the beginning
of the binary package in order to make it possible to read it out of
a partially fetched binary package, and to avoid fetching the remaining
part of the package if not necessary.

The archive contains a single directory called ``metadata``. In this
directory, the individual metadata keys are stored as files. The exact
keys and metadata format are outside the scope of this specification.

The package manager may need to modify the package metadata. In this
case, it should replace the metadata archive without having to alter
other package members.

The metadata archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.
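
To show how little tooling the nested layout needs, the metadata keys
could be read with Python's tarfile module as sketched below. The
function name ``read_metadata`` is an assumption of this example, no
signature verification is performed, and only the compression types
that tarfile detects on its own are handled.

```python
import io
import tarfile


def read_metadata(gpkg_fileobj):
    """Return {key: bytes} read from the metadata archive of a gpkg stream."""
    with tarfile.open(fileobj=gpkg_fileobj, mode="r|") as outer:
        for member in outer:
            name = member.name.rsplit("/", 1)[-1]
            if not name.startswith("metadata.tar") or name.endswith(".sig"):
                continue
            inner_bytes = outer.extractfile(member).read()
            # "r:*" lets tarfile detect the compression of the inner archive.
            with tarfile.open(fileobj=io.BytesIO(inner_bytes), mode="r:*") as inner:
                return {
                    m.name.split("/", 1)[1]: inner.extractfile(m).read()
                    for m in inner
                    if m.isfile() and m.name.startswith("metadata/")
                }
    raise ValueError("no metadata archive found")
```

The outer archive is read as a stream, so the function returns before
the image archive is touched.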


The image archive
-----------------

The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.

The archive contains a single directory called ``image``. Inside this
directory, all package files are stored in filesystem layout, relative
to the root directory.

The image archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.


Archive member compression
--------------------------

The archive members outlined above support optional compression using
one of the compressed file formats supported by the package manager.
The exact list of compression types is outside the scope of this
specification.

The implementations must support archive members being uncompressed,
and must support using different compression types for different files.

When compressing an archive member, the member filename should be
suffixed using the standard suffix for the particular compressed file
type (e.g. ``.bz2`` for bzip2 format).
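
The suffix rule can be sketched as a small lookup. The table below is
purely illustrative, since the specification leaves the list of
compression types to the package manager, and ``classify_member`` is
a hypothetical helper name.

```python
# Illustrative table only; the supported set is up to the package manager.
COMPRESSION_SUFFIXES = {
    "": None,
    ".gz": "gzip",
    ".bz2": "bzip2",
    ".xz": "xz",
    ".zst": "zstd",
}


def classify_member(filename):
    """Split e.g. 'image.tar.bz2' into ('image', 'bzip2')."""
    stem, sep, suffix = filename.partition(".tar")
    if sep != ".tar" or stem not in ("metadata", "image"):
        raise ValueError(f"not a gpkg archive member: {filename!r}")
    if suffix not in COMPRESSION_SUFFIXES:
        raise ValueError(f"unsupported compression suffix: {suffix!r}")
    return stem, COMPRESSION_SUFFIXES[suffix]
```

Because each member is classified on its own, the metadata and image
archives can use different compression types, and an uncompressed
member simply maps to ``None``.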


OpenPGP member signatures
-------------------------

The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages.

If the signatures are expected and the archive member is unsigned, the
package manager must reject processing it. If the signature does not
verify, the package manager must reject processing the corresponding
archive member. In particular, it must not attempt decompressing
compressed members in those circumstances.

The signatures are created as binary detached OpenPGP signature files,
with filename corresponding to the member filename with ``.sig`` suffix
appended.

The exact details regarding creating and verifying signatures, as well
as maintaining and distributing keys are outside the scope of this
specification.
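
The acceptance rules above can be summarised in a few lines. In this
sketch ``verify`` stands for whatever OpenPGP verification routine the
package manager uses; it is a caller-supplied placeholder, not a real
API. The sketch also assumes that a signature which is present but does
not verify causes rejection even when signatures are not required.

```python
def may_process(member, signature, signatures_expected, verify):
    """Decide whether an archive member may be processed (and decompressed).

    member and signature are raw bytes (signature is None if absent);
    verify(member, signature) is a placeholder returning True on success.
    """
    if signature is None:
        # Unsigned members are acceptable only if signatures are not expected.
        return not signatures_expected
    # A signature that is present must verify before anything is decompressed.
    return bool(verify(member, signature))
```

The decision is made on the compressed member bytes, before any
decompressor is invoked.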


Rationale
=========

Package formats used by other distributions
-------------------------------------------

The research on the new package format included investigating
the possibility of reusing solutions from other operating system
distributions. While reusing a foreign package format would be
interesting, the differences in Gentoo metadata structure would prevent
any real compatibility. Some degree of compatibility might be achieved
through adapting the Gentoo metadata, however the costs of such
a solution would probably outweigh its usefulness.

Debian and its derivatives are using the .deb package format. This is
a nested archive format, with the outer archive being of ar format,
and containing nested tarballs of control information (metadata)
and data [#DEB-FORMAT]_.

Red Hat, its derivatives and some less related distributions are using
the RPM format. It is a custom binary format, storing metadata directly
and using a trailing cpio archive to store package files.

Arch Linux is using xz-compressed tarballs (suffixed ``.pkg.tar.xz``)
as its binary package format. The tarballs contain package files
at the top level, with specially named dotfiles used for package
metadata. OpenPGP signatures are stored as detached ``.sig`` files
alongside packages.

Exherbo is using the pbins format. In this format, the binary package
metadata is stored in a repository like ebuilds, and the binary package
files are stored separately and downloaded like source tarballs.


Nested archive format
---------------------

The basic problem in designing the new format was how to embed multiple
data streams (metadata, image) into a single file. Traditionally, this
has been done by using two non-conflicting file formats. However,
while such a solution is clever, it suffers in terms of transparency.

Therefore, it has been established that the new format should really
consist of a single archive format, with all necessary data
transparently accessible inside the file. Consequently, it has been
debated how different parts of binary package data should be stored
inside that archive.

The proposal to continue storing image data as top-level data
in the package format, and to store metadata as a special directory
in that structure, has been discarded as a case of in-band signalling.

Finally, the proposal has been shaped to store different kinds of data
as nested archives in the outer binary package container. Besides
providing a clean way of accessing different kinds of information, it
makes it possible to add separate OpenPGP signatures to them.


Inner vs. outer compression
---------------------------

One of the points in the new format debate was whether the binary
package should be compressed as a whole or whether its individual
members should be compressed. The first option may seem an obvious
choice, especially given that with a larger data set, the compression
may proceed more effectively. However, it has a single strong
disadvantage: compression prevents random access and manipulation of
the binary package members.

While for the purpose of reading binary packages, the problem could be
circumvented through convenient member ordering and avoiding disjoint
reads of the binary package, metadata updates would either require
recompressing the whole package (which could be really time consuming
with large packages) or applying complex techniques such as splitting
the compressed archive into multiple compressed streams.

This considered, the simplest solution is to apply compression to
the individual package members, while leaving the container format
uncompressed. It provides fast random access to the individual members,
as well as capability of updating them without the necessity of
recompressing other files in the container.

This also makes it possible to easily protect compressed files using
the standard OpenPGP detached signature format. All this combined,
the package manager may perform a partial fetch of a binary package,
verify the signature of its metadata member and process it without
having to fetch the potentially large image part.
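
The following sketch illustrates a metadata update that leaves the
image untouched: the outer container is rewritten, the metadata member
is swapped, and every other member is copied byte for byte without
being decompressed. The name ``replace_metadata`` is hypothetical, and
dropping the now stale metadata signature is a choice made for this
example only.

```python
import copy
import io
import tarfile


def replace_metadata(gpkg_bytes, new_metadata):
    """Return a new container with the metadata archive replaced.

    new_metadata must already use the same compression as the old member,
    since the member name (and thus its suffix) is kept.
    """
    dst = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(gpkg_bytes), mode="r:") as old, \
            tarfile.open(fileobj=dst, mode="w", format=tarfile.GNU_FORMAT) as new:
        for member in old:
            name = member.name.rsplit("/", 1)[-1]
            if name.startswith("metadata.tar") and name.endswith(".sig"):
                continue  # the old signature no longer matches
            if name.startswith("metadata.tar"):
                info = copy.copy(member)
                info.size = len(new_metadata)
                new.addfile(info, io.BytesIO(new_metadata))
            else:
                # Copied verbatim: no decompression or recompression.
                new.addfile(member, old.extractfile(member))
    return dst.getvalue()
```

The cost of the update is therefore a plain copy of the image bytes,
independent of how expensive the image compression was.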


Container and archive formats
-----------------------------

During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.

Firstly, the ZIP format has been proposed as the only commonly supported
format supporting adding files from stdin (i.e. making it possible to
pipe the inner archives straight into the container without using
temporary files). However, this format has been clearly rejected
as both not being present in the system set, and being trailer-based
and therefore unusable without having to fetch the whole file.

Secondly, the ar and cpio formats were considered. The former is used
by the binary packages of Debian and its derivatives; the latter is
used by Red Hat derivatives. Both formats have the advantage of having
less historical baggage than .tar, and having less overhead. However,
both are also rather obscure (especially given that ar is actually
provided by GNU binutils rather than as a stand-alone archiver), are
considered obsolete by POSIX, and have file size limitations smaller
than .tar.

Thirdly, SquashFS was another interesting option. Its main advantage is
transparent compression support and the ability to mount as
a filesystem. However, it has significant implementation complexity,
including mount management and the necessity of a fallback to
unsquashfs. Since the image needs to be writable for
the pre-installation manipulations, using it via a mount would
additionally require some kind of overlay filesystem. Using it as
the top-level format has no real gain over a pipeline with tar, and is
certainly less portable. Therefore, there does not seem to be a benefit
in using SquashFS.

All that considered, it has been decided that there is no purpose
in using a second archive format in the specification unless it has
a significant advantage over .tar. Therefore, .tar has also been used
as the outer package format, even though it has larger overhead than
other formats (mostly due to padding).
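
The size of that overhead can be estimated with simple arithmetic.
The sketch below assumes short member names (one 512-byte header per
member, no long-name extension blocks) and the 10240-byte record size
that GNU tar and Python's tarfile pad archives to by default; the
function name ``tar_overhead`` is made up for this example.

```python
BLOCK = 512


def tar_overhead(member_sizes, record=10240):
    """Bytes a tar archive adds on top of the raw member data."""
    total = 0
    for size in member_sizes:
        total += BLOCK                      # one header block per member
        total += -(-size // BLOCK) * BLOCK  # data padded to a block multiple
    total += 2 * BLOCK                      # end-of-archive marker
    total = -(-total // record) * record    # archive padded to a full record
    return total - sum(member_sizes)
```

For members of 0, 700 and 5000 bytes this gives 4540 bytes of
overhead, which is negligible next to a typical image archive.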


.tar portability issues
-----------------------

The modern .tar dialects could be considered dirty extensions
of the original .tar format. Three variants may be considered
of interest: POSIX ustar, pax (newer POSIX standard) and GNU tar.
All three formats are supported by GNU tar, whose presence on systems
used to create binary packages could be relied on. Therefore,
the portability concerns are related mostly to being able to read
and modify binary packages in scenarios where GNU tar is unavailable.

For the purpose of this specification, detailed research
on the portability of individual tar features has been conducted
[#TAR-PORTABILITY]_. The research concluded:

Judging by the test results, the most portability could be
achieved by:

- using strict POSIX ustar format whenever possible,

- using GNU format for long paths (that do not fit in ustar format),

- using base-256 (+ pax if already used) encoding for large files,

- using pax (+ octal or base-256) for high-range/precision
timestamps and user/group identifiers,

- using pax attributes for extended metadata and/or volume label.

It has been determined that for the purpose of binary packages we
really only need to be concerned about long paths and huge files.
Therefore, the above was limited to the first three points and
a guideline was formed from them.

Debian has created a similar guideline for the inner tar of their
package format [#DEB-FORMAT]_.


Member ordering
---------------

The member ordering is explicitly specified in order to provide for
trivially reading metadata from partially fetched archives.
By requiring the metadata archive to be stored before the image archive,
the package manager may stop fetching after reading it and save
bandwidth and/or space.
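
How much of the file has to be fetched can be computed from the member
headers alone. The sketch below returns the offset at which the
metadata archive ends; ``metadata_end_offset`` is a hypothetical name,
and the input may be just a prefix of the container as long as it
reaches the metadata member's header.

```python
import io
import tarfile


def metadata_end_offset(prefix):
    """Return the number of leading bytes of the container that cover
    everything up to and including the metadata archive."""
    with tarfile.open(fileobj=io.BytesIO(prefix), mode="r:") as tar:
        for member in tar:
            name = member.name.rsplit("/", 1)[-1]
            if name.startswith("metadata.tar") and not name.endswith(".sig"):
                # offset_data is where the member's data starts in the file.
                return member.offset_data + member.size
    raise ValueError("metadata archive not found in the fetched part")
```

A fetcher could download an initial chunk, call this function, and
request only the missing bytes up to the returned offset instead of
the whole package.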


Detached OpenPGP signatures
---------------------------

The use of detached OpenPGP signatures is to provide authenticity checks
for binary packages. Covering the complete members with signatures
provides for trivial verification of all metadata and image contents
respectively, without having to invent custom mechanisms for combining
them. Covering the compressed archives helps to prevent zipbomb
attacks. Covering the individual members rather than the whole package
provides for verification of partially fetched binary packages.


Format versioning
-----------------

The format is versioned through an explicit file, with the version
stored in the filename. If the format changes incompatibly,
the filename changes and old implementations do not recognize it
as a valid package.

Previously, the format tried to avoid an explicit file for this purpose
and used a volume label instead. However, the use of the label has been
abandoned due to unforeseen portability issues.


Backwards Compatibility
=======================

The format does not preserve backwards compatibility with the tbz2
packages. It has been established that preserving compatibility with
the old format was impossible without making the new format even worse
than the old one was.

For example, adding any visible members to the tarball would cause
them to be installed to the filesystem by old Portage versions. Working
around this would require awful hacks that would run against the goal
of using a simple and transparent package format.


Reference Implementation
========================

A proof-of-concept implementation of a binary package format converter
is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
create packages in the new format for early inspection.


References
==========

.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
packages
(https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)

.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
written in C
(https://packages.gentoo.org/packages/app-portage/portage-utils)

.. [#DEB-FORMAT] deb(5) — Debian binary package format
(https://manpages.debian.org/unstable/dpkg-dev/deb.5.en.html)

.. [#TAR-PORTABILITY] Michał Górny, Portability of tar features
(https://dev.gentoo.org/~mgorny/articles/portability-of-tar-features.html)

.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
to gpkg binpkg format
(https://github.com/mgorny/xpak2gpkg)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
--
Best regards,
Michał Górny
Ulrich Mueller
2018-11-26 19:17:02 UTC
Post by Michał Górny
Specification
=============
The container format
--------------------
The gpkg package container is an uncompressed .tar archive whose filename
should use ``.gpkg.tar`` suffix. This archive contains the following
members, all placed in a single directory whose name matches
I see no value in adding another directory indirection, and it will add
more overhead. Also, AFAICS the tar|tar pipeline that you previously
suggested won't work any more (or would at least require additional
arguments).
Post by Michał Górny
1. The package identifier file ``gpkg-1.txt`` (required).
[...]
The implementations must include a package identifier file named
``gpkg-1.txt``. The filename includes package format version;
implementations should reject packages which do not contain this file
as unsupported format.
The file can have any contents. Normally, it should be empty.
If the file is empty, why is it named gpkg-1.txt (instead of just
gpkg-1)?

Ulrich
Michał Górny
2018-11-26 19:51:10 UTC
Post by Ulrich Mueller
Post by Michał Górny
Specification
=============
The container format
--------------------
The gpkg package container is an uncompressed .tar archive whose filename
should use ``.gpkg.tar`` suffix. This archive contains the following
members, all placed in a single directory whose name matches
I see no value in adding another directory indirection, and it will add
more overhead.
Tar bomb is not a good design. Given tar padding, there will be no
overhead unless the full path exceeds ustar limits which is unlikely.
Post by Ulrich Mueller
Also, AFAICS the tar|tar pipeline that you previously
suggested won't work any more (or would at least require additional
arguments).
I'm pretty sure the tar pipeline was actually written to account for
the directory.
Post by Ulrich Mueller
Post by Michał Górny
1. The package identifier file ``gpkg-1.txt`` (required).
[...]
The implementations must include a package identifier file named
``gpkg-1.txt``. The filename includes package format version;
implementations should reject packages which do not contain this file
as unsupported format.
The file can have any contents. Normally, it should be empty.
If the file is empty, why is it named gpkg-1.txt (instead of just
gpkg-1)?
*shrug*. I can make it 'gpkg-1' or 'gpkg.1' or whatever you want ;-).
--
Best regards,
Michał Górny
Roy Bamford
2018-11-26 21:43:07 UTC
Post by Michał Górny
Here's the newest version.
- added explicit notion of parent directory (missing in previous GLEP
but present in implementation),
- explicitly named GNU tar format with list of permitted extensions,
- changed volume label to 'gpkg-1.txt' file to improve portability;
made it explicit version identifier as well,
- added info on other package formats to rationale.
[snip]

The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.

[snip]
Post by Michał Górny
--
Best regards,
Michał Górny
It's a nit today but that says that any future extensions, none
yet planned, should be placed before the image archive.

The specification needs to avoid the use of relative references.
--
Regards,

Roy Bamford
(Neddyseagoon) a member of
elections
gentoo-ops
forum-mods
Michał Górny
2018-11-30 17:06:31 UTC
Post by Roy Bamford
Post by Michał Górny
Here's the newest version.
- added explicit notion of parent directory (missing in previous GLEP
but present in implementation),
- explicitly named GNU tar format with list of permitted extensions,
- changed volume label to 'gpkg-1.txt' file to improve portability;
made it explicit version identifier as well,
- added info on other package formats to rationale.
[snip]
The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.
[snip]
Post by Michał Górny
--
Best regards,
Michał Górny
It's a nit today but that says that any future extensions, none
yet planned, should be placed before the image archive.
Yes.
Post by Roy Bamford
The specification needs to avoid the use of relative references.
I don't understand. Could you be more specific what you expect instead?
--
Best regards,
Michał Górny
Michał Górny
2018-11-30 17:09:54 UTC
Hi,

Here's hopefully the last update for some time (that is, before I get to
working on implementation). There are two small changes:

- clarified the text on top archive directory: mentioned it shouldn't
have an explicit member in the archive and that the implementations
should be ready to handle mismatched directory name (i.e. when archive
ends up being renamed),

- removed .txt suffix from 'gpkg-1' package identifier file.


---
GLEP: 9999
Title: Gentoo binary package container format
Author: Michał Górny <***@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-30
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are explained. Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed. The rationale for
the design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: header-oriented compressed .tar
format (used to hold package files) and trailer-oriented custom XPAK
format (used to hold metadata) [#MAN-XPAK]_. The format has already
been extended incompatibly twice.

The first time, support for storing multiple successive builds of binary
package for a single ebuild version has been added. This feature relies
on appending an additional hyphen followed by an integer to the package
filename. It is disabled by default (preserving backwards
compatibility) and controlled by ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats has been
added. When format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
compression used. For backwards compatibility, Portage still defaults
to using bzip2; compression program can be switched using
``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies. In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.**
While this might seem unnecessary, it makes it easier for the user
to transfer binary packages without having to be concerned about
finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed
tarballs, most of the time.** With notable exceptions of historical
versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
can be extracted using regular tar utility with a compressor
implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed
without decompressing package contents.** This includes
the possibility of rewriting it (e.g. as a result of package moves)
without the necessity of repacking the files.


Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant
design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use
of binary-encoded file offsets and field lengths.** As such, it is
non-trivial to read or edit without specialized tools. Such tools
are currently implemented separately from the package manager,
as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature of
ignoring trailing garbage in compressed files.** While this is
implemented consistently in most of the compressors, this feature
is not really a part of specification but rather traditional
behavior. Given that the original reasons for this no longer apply,
new compressor implementations are likely to miss support for this.

Both of the issues make the format hard to use without dedicated tools,
or when the tools misbehave. This impacts the following scenarios:

A. **Using binary packages for system recovery.** In case of serious
breakage, it is really preferable that the format depends on as few
tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package
manager facilities.**

C. **Modifying binary packages in ways not predicted by the package
manager authors.** A real-life example of this is working around
broken ``pkg_*`` phases which prevent the package from being
installed.


OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in which the current format could
be extended to support OpenPGP signatures, and each of them has its own
distinct problem:

1. **Adding a detached signature.** This option is non-intrusive but
causes the format to no longer be contained in a single file.

2. **Wrapping the package in OpenPGP message format.** This would use
a standard format and make verification and unpacking relatively
easy. However, it would break backwards compatibility and add
explicit dependency on OpenPGP implementation in order to unpack
the package.

3. **Adding OpenPGP signature as extra XPAK member.** This is
the clever solution. It implies strengthening the dependency
on custom tooling, now additionally necessary to extract
the signature and reconstruct the original file to accommodate
verification.


Goals for a new container format
--------------------------------

All of the above considered, the new format should combine
the advantages of the existing format and at the same time address its
deficiencies whenever possible. Furthermore, since a format replacement
is taking place it is worthwhile to consider additional goals that could
be satisfied with little change.

The following obligatory goals have been set for a replacement format:

1. **The packages must remain contained in a single file.** As a matter
of user convenience, it should be possible to transfer binary
packages without having to use multiple files, and to install them
from any location.

2. **The file format must be entirely based on common file formats,
respecting best practices, with as little customization as necessary
to satisfy the requirements.** The format should be transparent
enough to let user inspect and manipulate it without special tooling
or detailed knowledge.

3. **The file format must provide support for OpenPGP signatures.**
Preferably, it should use standard OpenPGP message formats.

4. **The file format must allow for efficient metadata updates.**
In particular, it should be possible to update the metadata without
having to recompress package files.

Additionally, the following optional goals have been noted:

A. **The file format should account for easy recognition both through
filename and through contents.** Preferably, it should have distinct
features making it possible to detect it via file(1).

B. **The file format should provide for partial fetching of binary
packages.** It should be possible to easily fetch and read
the package metadata without having to download the whole package.

C. **The file format should allow for metadata compression.**

D. **The file format should make future extensions easily possible
without breaking backwards compatibility.**


Specification
=============

The container format
--------------------

The gpkg package container is an uncompressed .tar archive whose filename
should use ``.gpkg.tar`` suffix.

The archive contains a number of files, stored in a single directory
whose name should match the basename of the package file. However,
the implementation must be able to process an archive where
the directory name is mismatched. There should be no explicit archive
member entry for the directory.

The package directory contains the following members, in order:

1. The package format identifier file ``gpkg-1`` (required).

2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
(optional).

3. The metadata archive ``metadata.tar${comp}``, optionally compressed
(required).

4. A signature for the filesystem image archive:
``image.tar${comp}.sig`` (optional).

5. The filesystem image archive ``image.tar${comp}``, optionally
compressed (required).

It is recommended that the relative order of the archive members is
preserved. However, implementations must support archives with members
out of order.

The container may be extended with additional members in the future.
The implementations should ignore unrecognized members and preserve
them across package updates.
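
The member layout above can be illustrated with a minimal Python sketch. It is not the reference implementation: the helper names ``add_bytes`` and ``build_gpkg`` are hypothetical, the inner archives are passed in as ready-made byte strings, signatures are left out, and the top directory name is assumed to be the package filename without the ``.gpkg.tar`` suffix.

```python
import io
import tarfile


def add_bytes(tar, name, data):
    """Add an in-memory blob to the archive as a regular file."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def build_gpkg(basename, metadata_tar, image_tar, comp=""):
    """Return the bytes of an unsigned gpkg container (hypothetical helper).

    `metadata_tar` and `image_tar` are the already built (and optionally
    compressed) inner archives; `comp` is their suffix, e.g. ".bz2".
    """
    buf = io.BytesIO()
    # The outer container is an uncompressed tar archive.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # No explicit member is written for the directory itself.
        add_bytes(tar, f"{basename}/gpkg-1", b"")
        add_bytes(tar, f"{basename}/metadata.tar{comp}", metadata_tar)
        add_bytes(tar, f"{basename}/image.tar{comp}", image_tar)
    return buf.getvalue()
```

Writing the identifier first and the image last follows the recommended order; a reader still has to cope with any other order.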


Permitted .tar format features
------------------------------

The tar archives should use either the POSIX ustar format or a subset
of the GNU format with the following (optional) extensions:

- long pathnames and long linknames,

- base-256 encoding of large file sizes.

Other extensions should be avoided whenever possible.
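
One possible way to follow this guideline with Python's ``tarfile`` module is to try strict ustar first and fall back to the GNU format only when a member name does not fit. The functions ``fits_ustar`` and ``pick_format`` are illustrative names, and the sketch checks member names only, not file sizes.

```python
import tarfile


def fits_ustar(name):
    """Return True if `name` can be stored as a member name in strict ustar."""
    info = tarfile.TarInfo(name)
    try:
        # Building the header raises ValueError when the name cannot be
        # split into the ustar prefix and name fields.
        info.tobuf(format=tarfile.USTAR_FORMAT)
    except ValueError:
        return False
    return True


def pick_format(names):
    """Prefer strict POSIX ustar; fall back to GNU only for long names."""
    if all(fits_ustar(n) for n in names):
        return tarfile.USTAR_FORMAT
    return tarfile.GNU_FORMAT
```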


The package identifier file
---------------------------

The package identifier file serves the purpose of identifying the binary
package format and its version.

The implementations must include a package identifier file named
``gpkg-1``. The filename includes package format version;
implementations should reject packages which do not contain this file
as unsupported format.

The file can have any contents. Normally, it should be empty.

Furthermore, this file should be included in the .tar archive
as the first member. This makes it possible to use it as an additional
magic at a fixed location that can be used by tools such as file(1)
to easily distinguish Gentoo binary packages from regular .tar archives.
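
A format check along these lines could look as follows. This is only a sketch: ``is_gpkg`` is a hypothetical name, and the function scans all members rather than only the first one, since members may appear out of order.

```python
import tarfile

IDENTIFIER = "gpkg-1"


def is_gpkg(fileobj):
    """Return True if the tar stream contains a file named `gpkg-1`."""
    try:
        with tarfile.open(fileobj=fileobj, mode="r|") as tar:
            for member in tar:
                # Compare the last path component only, because the top
                # directory name may be mismatched after a rename.
                if member.isfile() and member.name.split("/")[-1] == IDENTIFIER:
                    return True
    except tarfile.TarError:
        # Not a readable tar archive at all.
        return False
    return False
```

When the identifier is the first member, the loop returns after reading a single header block.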


The metadata archive
--------------------

The metadata archive stores the package metadata needed for the package
manager to process it. The archive should be included at the beginning
of the binary package in order to make it possible to read it out of
partially fetched binary package, and to avoid fetching the remaining
part of the package if not necessary.

The archive contains a single directory called ``metadata``. In this
directory, the individual metadata keys are stored as files. The exact
keys and metadata format are outside the scope of this specification.

The package manager may need to modify the package metadata. In this
case, it should replace the metadata archive without having to alter
other package members.

The metadata archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.
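
Reading the metadata out of a partially fetched package can be sketched with a forward-only tar reader. The function name ``read_metadata_member`` is hypothetical, and the sketch assumes the fetched prefix extends at least to the end of the metadata member.

```python
import tarfile


def read_metadata_member(stream):
    """Return (name, bytes) of the metadata archive, reading no further."""
    # "r|" reads the stream strictly forward, so a truncated file is
    # fine as long as the wanted member is complete.
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            base = member.name.split("/")[-1]
            # Match metadata.tar with any compression suffix, but skip
            # the detached signature metadata.tar*.sig.
            if base.startswith("metadata.tar") and not base.endswith(".sig"):
                return base, outer.extractfile(member).read()
    raise ValueError("no metadata archive found")
```

The returned bytes are the inner archive, still compressed if a compression suffix is present.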


The image archive
-----------------

The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.

The archive contains a single directory called ``image``. Inside this
directory, all package files are stored in filesystem layout, relative
to the root directory.

The image archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.


Archive member compression
--------------------------

The archive members outlined above support optional compression using
one of the compressed file formats supported by the package manager.
The exact list of compression types is outside the scope of this
specification.

The implementations must support archive members being uncompressed,
and must support using different compression types for different files.

When compressing an archive member, the member filename should be
suffixed using the standard suffix for the particular compressed file
type (e.g. ``.bz2`` for bzip2 format).
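
Because the compression type is carried by the filename suffix, a reader can dispatch on it per member. The sketch below assumes a small, hypothetical suffix table; the actual set of supported compressors is up to the package manager.

```python
import bz2
import gzip
import lzma

# Hypothetical suffix table; the specification leaves the supported set
# to the package manager.
DECOMPRESSORS = {
    "": lambda data: data,  # uncompressed member
    ".gz": gzip.decompress,
    ".bz2": bz2.decompress,
    ".xz": lzma.decompress,
}


def split_member(filename):
    """Split e.g. 'image.tar.bz2' into ('image.tar', '.bz2')."""
    stem, sep, suffix = filename.partition(".tar")
    if not sep or not stem:
        raise ValueError(f"not a tar member: {filename!r}")
    return stem + ".tar", suffix


def decompress_member(filename, data):
    """Decompress `data` according to the suffix of `filename`."""
    _, suffix = split_member(filename)
    decompress = DECOMPRESSORS.get(suffix)
    if decompress is None:
        raise ValueError(f"unsupported compression suffix: {suffix!r}")
    return decompress(data)
```

Each member is handled independently, so the metadata and image archives may use different compressors.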


OpenPGP member signatures
-------------------------

The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages.

If the signatures are expected and the archive member is unsigned, the
package manager must reject processing it. If the signature does not
verify, the package manager must reject processing the corresponding
archive member. In particular, it must not attempt decompressing
compressed members in those circumstances.

The signatures are created as binary detached OpenPGP signature files,
with filename corresponding to the member filename with ``.sig`` suffix
appended.

The exact details regarding creating and verifying signatures, as well
as maintaining and distributing keys are outside the scope of this
specification.
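
The naming rule and the rejection policy can be sketched without touching any OpenPGP code. The function names and the ``require_signatures`` flag below are hypothetical, and the cryptographic verification itself is deliberately not shown.

```python
ARCHIVE_STEMS = ("metadata.tar", "image.tar")


def unsigned_members(member_names):
    """List archive members that have no accompanying `.sig` file."""
    names = set(member_names)
    return sorted(
        n for n in names
        if n.startswith(ARCHIVE_STEMS)
        and not n.endswith(".sig")
        and n + ".sig" not in names
    )


def check_signature_policy(member_names, require_signatures):
    """Raise ValueError if signatures are expected but some are missing."""
    missing = unsigned_members(member_names)
    if require_signatures and missing:
        raise ValueError(f"unsigned members: {', '.join(missing)}")
    return missing
```

A real implementation would additionally verify each present signature before decompressing the corresponding member.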


Rationale
=========

Package formats used by other distributions
-------------------------------------------

The research on the new package format included investigating
the possibility of reusing solutions from other operating system
distributions. While reusing a foreign package format would be
interesting, the differences in Gentoo metadata structure would prevent
any real compatibility. Some degree of compatibility might be achieved
through adapting the Gentoo metadata, however the costs of such
a solution would probably outweigh its usefulness.

Debian and its derivatives are using the .deb package format. This is
a nested archive format, with the outer archive being of ar format,
and containing nested tarballs of control information (metadata)
and data [#DEB-FORMAT]_.

Red Hat, its derivatives and some less related distributions are using
the RPM format. It is a custom binary format, storing metadata directly
and using a trailer cpio archive to store package files.

Arch Linux is using xz-compressed tarballs (suffixed ``.pkg.tar.xz``)
as its binary package format. The tarballs contain package files
on top-level, with specially named dotfiles used for package metadata.
OpenPGP signatures are stored as detached ``.sig`` files alongside
packages.

Exherbo is using the pbins format. In this format, the binary package
metadata is stored in the repository like ebuilds, and the binary package
files are stored separately and downloaded like source tarballs.


Nested archive format
---------------------

The basic problem in designing the new format was how to embed multiple
data streams (metadata, image) into a single file. Traditionally, this
has been done via using two non-conflicting file formats. However,
while such a solution is clever, it suffers in terms of transparency.

Therefore, it has been established that the new format should really
consist of a single archive format, with all necessary data
transparently accessible inside the file. Consequently, it has been
debated how different parts of binary package data should be stored
inside that archive.

The proposal to continue storing image data as top-level data
in the package format, and store metadata as special directory in that
structure has been discarded as a case of in-band signalling.

Finally, the proposal has been shaped to store different kinds of data
as nested archives in the outer binary package container. Besides
providing a clean way of accessing different kinds of information, it
makes it possible to add separate OpenPGP signatures to them.


Inner vs. outer compression
---------------------------

One of the points in the new format debate was whether the binary
package as a whole should be compressed vs. compressing individual
members. The first option may seem like an obvious choice, especially
given that with a larger data set, the compression may proceed more
effectively. However, it has a single strong disadvantage: compression
prevents random access and manipulation of the binary package members.

While for the purpose of reading binary packages, the problem could be
circumvented through convenient member ordering and avoiding disjoint
reads of the binary package, metadata updates would either require
recompressing the whole package (which could be really time consuming
with large packages) or applying complex techniques such as splitting
the compressed archive into multiple compressed streams.

This considered, the simplest solution is to apply compression to
the individual package members, while leaving the container format
uncompressed. It provides fast random access to the individual members,
as well as capability of updating them without the necessity of
recompressing other files in the container.

This also makes it possible to easily protect compressed files using
standard OpenPGP detached signature format. All this combined,
the package manager may perform partial fetch of binary package, verify
the signature of its metadata member and process it without having to
fetch the potentially-large image part.


Container and archive formats
-----------------------------

During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.

Firstly, the ZIP format has been proposed as the only commonly supported
format supporting adding files from stdin (i.e. making it possible to
pipe the inner archives straight into the container without using
temporary files). However, this format has been clearly rejected
as both not being present in the system set, and being trailer-based
and therefore unusable without having to fetch the whole file.

Secondly, the ar and cpio formats were considered. The former is used
by Debian and its derivative binary packages; the latter is used by Red
Hat derivatives. Both formats have the advantage of having less
historical baggage than .tar, and having less overhead. However, both
are also rather obscure (especially given that ar is actually provided
by GNU binutils rather than as a stand-alone archiver), considered
obsolete by POSIX and both have file size limitations smaller than .tar.

Thirdly, SquashFS was another interesting option. Its main advantage is
transparent compression support and ability to mount as a filesystem.
However, it has a significant implementation complexity, including mount
management and necessity of fallback to unsquashfs. Since the image
needs to be writable for the pre-installation manipulations, using it
via a mount would additionally require some kind of overlay filesystem.
Using it as top-level format has no real gain over a pipeline with tar,
and is certainly less portable. Therefore, there does not seem to be
a benefit in using SquashFS.

All that considered, it has been decided that there is no purpose
in using a second archive format in the specification unless it has
significant advantage over .tar. Therefore, .tar has also been used
as outer package format, even though it has larger overhead than other
formats (mostly due to padding).


.tar portability issues
-----------------------

The modern .tar dialects could be considered dirty extensions
of the original .tar format. Three variants may be considered
of interest: POSIX ustar, pax (newer POSIX standard) and GNU tar.
All three formats are supported by GNU tar, whose presence on systems
used to create binary packages could be relied on. Therefore,
the portability concerns are related mostly to being able to read
and modify binary packages in scenarios of GNU tar being unavailable.

For the purpose of this specification, detailed research on portability
of individual tar features has been conducted. The research concluded:

Judging by the test results, the most portability could be
achieved by:

- using strict POSIX ustar format whenever possible,

- using GNU format for long paths (that do not fix in ustar format),

- using base-256 (+ pax if already used) encoding for large files,

- using pax (+ octal or base-256) for high-range/precision
timestamps and user/group identifiers,

- using pax attributes for extended metadata and/or volume label.

It has been determined that for the purpose of binary package we really
only need to be concerned about long paths and huge files. Therefore,
the above was limited to the three first points and a guideline was
formed from them.

Debian has a similar guideline for the inner tar of their package
format [#DEB-FORMAT]_.


Member ordering
---------------

The member ordering is explicitly specified in order to provide for
trivially reading metadata from partially fetched archives.
By requiring the metadata archive to be stored before the image archive,
the package manager may stop fetching after reading it and save
bandwidth and/or space.
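
As a rough illustration of the savings, the number of bytes to fetch before the image archive can be estimated from the member sizes. The sketch assumes each member occupies exactly one 512-byte header block plus its data padded to a multiple of 512 bytes, which holds when no long-name extension headers are needed.

```python
BLOCK = 512


def member_span(size):
    """Bytes occupied by one member: one header block plus padded data."""
    padded = -(-size // BLOCK) * BLOCK  # round up to a whole block
    return BLOCK + padded


def prefix_length(sizes_before_image):
    """Bytes to fetch in order to read every member before the image."""
    return sum(member_span(s) for s in sizes_before_image)
```

For an empty ``gpkg-1`` file followed by a 4-byte metadata archive, this gives 512 + 1024 = 1536 bytes.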


Detached OpenPGP signatures
---------------------------

The use of detached OpenPGP signatures is to provide authenticity checks
for binary packages. Covering the complete members with signatures
provides for trivial verification of all metadata and image contents
respectively, without having to invent custom mechanisms for combining
them. Covering the compressed archives helps to prevent zipbomb
attacks. Covering the individual members rather than the whole package
provides for verification of partially fetched binary packages.


Format versioning
-----------------

The format is versioned through an explicit file, with the version
stored in the filename. If the format changes incompatibly,
the filename changes and old implementations do not recognize it
as a valid package.

Previously, the format tried to avoid an explicit file for this purpose
and used volume label instead. However, the use of label has been
renounced due to unforeseen portability issues.


Backwards Compatibility
=======================

The format does not preserve backwards compatibility with the tbz2
packages. It has been established that preserving compatibility with
the old format was impossible without making the new format even worse
than the old one was.

For example, adding any visible members to the tarball would cause
them to be installed to the filesystem by old Portage versions. Working
around this would require some kind of awful hacks that would oppose
the goal of using simple and transparent package format.


Reference Implementation
========================

The proof-of-concept implementation of binary package format converter
is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
create packages in the new format for early inspection.


References
==========

.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
packages
(https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)

.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
written in C
(https://packages.gentoo.org/packages/app-portage/portage-utils)

.. [#DEB-FORMAT] deb(5) — Debian binary package format
(https://manpages.debian.org/unstable/dpkg-dev/deb.5.en.html)

.. [#TAR-PORTABILITY] Michał Górny, Portability of tar features
(https://dev.gentoo.org/~mgorny/articles/portability-of-tar-features.html)

.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
to gpkg binpkg format
(https://github.com/mgorny/xpak2gpkg)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
--
Best regards,
Michał Górny
Ulrich Mueller
2018-12-01 10:25:00 UTC
Post by Michał Górny
Here's hopefully the last update for some time (that is, before I get to
- clarified the text on top archive directory: mentioned it shouldn't
have an explicit member in the archive and that the implementations
should be ready to handle mismatched directory name (i.e. when archive
ends up being renamed),
- removed .txt suffix from 'gpkg-1' package identifier file.
LGTM
Post by Michał Górny
- using GNU format for long paths (that do not fix in ustar format),
s/fix/fit/

The style seems still a bit rough here and there (especially, I stumbled
over some of your uses of the perfect passive). I'd better leave that
to the native speakers on this list, though.

Ulrich

Roy Bamford
2018-11-30 21:23:22 UTC
Post by Michał Górny
Post by Roy Bamford
Post by Michał Górny
Here's the newest version.
- added explicit notion of parent directory (missing in previous GLEP
but present in implementation),
- explicitly named GNU tar format with list of permitted extensions,
- changed volume label to 'gpkg-1.txt' file to improve portability;
made it explicit version identifier as well,
- added info on other package formats to rationale.
[snip]
The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.
[snip]
Post by Michał Górny
--
Best regards,
Michał Górny
It's a nit today but that says that any future extensions, none
yet planned, should be placed before the image archive.
Yes.
Post by Roy Bamford
The specification needs to avoid the use of relative references.
I don't understand. Could you be more specific what you expect
instead?
--
Best regards,
Michał Górny
Michał,

Enumerate the elements, in the preferred order, which you have
already done. There is no need, in a specification that is intended
to be easily extensible, to specify that any element should be last.
That constrains extensions.

To build on an example extension given earlier. Suppose an
extension came along to add the ebuild, required eclasses and
sources. The present wording says that they should be included
before image archive.

Implementations may be capable of working with partial
downloads, why force the download of elements that may not be
required to get the payload.

The overhead of the presently defined elements is small compared
to the image and it's useful to be able to check the metadata to
determine if the image is really what is required.

image 'last' works with the presently defined elements but may
not be so good in the years to come.

It's a subtle difference between 'last', which means always at
the end, no matter what, and 'fifth' which is last today but
might not be in the future.
--
Regards,

Roy Bamford
(Neddyseagoon) a member of
elections
gentoo-ops
forum-mods