Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some changes for format 1.2 #410

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

allisonkarlitskaya
Copy link
Collaborator

This is very incomplete but it's enough to get mkcomposefs to output simple images that are identical to the images coming from the (also modified) erofs writer code in composefs-rs.

I tried to keep individual changes as individual commits so we can 👍/👎 each change independently.

We're going to want to add some more stuff around handling of alignment of symlinks and such (which no longer needs to be treated as a special case as of kernel 6.12) and probably some other points.

See #198

c1e9e93 ("writer: Reject too large files") introduced a check to
ensure that we wouldn't try to write out too many null chunks for very
large external files, setting an effective limit at 8PiB (which can be
represented as 1024 chunks with the chunk format set to the maximum of
31 with a 4096 byte blocksize).

For the purposes of this check, we don't actually care what the chunking
format would be for a particular file size: we only care that it's not
going to exceed the limit.  We can calculate the limit directly using
constants and compare against that instead.

Signed-off-by: Allison Karlitskaya <[email protected]>
This is no longer used outside of the erofs writer code, so drop it from
lcfs-erofs-internal.h and mark it static again in the .c file.

Signed-off-by: Allison Karlitskaya <[email protected]>
Introduce "version 2" composefs format by bumping the maximum version.
We intend to use this version to break with some decisions that were
taken for historical reasons and to take advantage of some features that
are only available in newer kernel versions.

There are no changes introduced yet, but the following commits will add
them.

Signed-off-by: Allison Karlitskaya <[email protected]>
Modify erofs_compute_chunking() to take the lcfs_ctx_s pointer (and also
compute_erofs_inode_size(), which calls it).  If we're building a
version 1.2 composefs then hardcode the chunk format to the maximum
value.

Signed-off-by: Allison Karlitskaya <[email protected]>
If we're writing a version 1.2 composefs then only write extended format
inodes.

Justification: composefs was originally imagined for filesystems which
contain many/most/all files with the same timestamp, which is no longer
how we intend to use it.  Any files that have a different timestamp than
the "build time" stored in the superblock will need to be stored in the
extended format anyway, which means that very few compact inodes are
ever written per filesystem (maybe only one).  Let's just use extended
everywhere to simplify things.

Signed-off-by: Allison Karlitskaya <[email protected]>
We write a 'i_ino' field into the erofs inode structs for compatibility
with 32bit systems or 32bit applications running on 64bit systems.  This
is done because the 'nid' that erofs uses for the inode ID on 64bit
might overflow a 32bit integer.

In order for that to happen, though, the inode table alone would have to
grow past 128GiB in size: since inode alignment is 32 bytes, the nid
*with a maximum value of 4Gi) is multiplied by 32B to determine the file
offset of the inode.

This seems exceedingly unlikely to ever occur.  composefs images for
even very large systems are on the order of less than 100MB of total
size, with the inode table even smaller than that.

Let's just set the 32bit inode value the same as the 64bit value.

This also avoids another potential issue: under the old approach, we use
0 as the 32bit value for the root directory inode (which always comes
first).  Some parts of the kernel (`grep is_zero_ino`, for example)
claim that userspace might depend on inodes being non-zero, and all
other filesystems seem to respect this.  nids are always non-zero (since
an inode at the start of the file would overlap the header).

Signed-off-by: Allison Karlitskaya <[email protected]>
Since the kernel added support for data-only overlayfs layers, the
whiteouts in the root directory are unnecessary.  Stop including them.

Signed-off-by: Allison Karlitskaya <[email protected]>
Since we no longer make use of compact inodes, setting the creation
time on the superblock is not useful: nothing else makes use of it.
Leave it as zero.

Signed-off-by: Allison Karlitskaya <[email protected]>
In erofs, finding a shared xattr involves taking the 32bit id,
multiplying it by 4, and using that as an index into the section of the
file starting at the block named for shared xattr data in the
superblock (superblock.xattr_blkaddr).

This is only useful of the inode and xattr data stored in the filesystem
would exceed 16GiB, which is exceedingly unlikely to ever occur for a
composefs.

We can make our lives a fair bit easier if we set the xattr starting
block to 0 and simply store the file offset (divded by 4) as the xattr
id directly.

There is a similar pointer (superblock.meta_blkaddr) for the start of
the inodes.  This calculation only involves constants and always works
out to zero.  We could delete that entirely, but in any case, let's not
bother with it for version 2.

Signed-off-by: Allison Karlitskaya <[email protected]>
@@ -37,10 +37,11 @@ enum lcfs_flags_t {
LCFS_FLAGS_MASK = 0,
};

#define LCFS_VERSION_MAX 1
#define LCFS_VERSION_MAX 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mechanically I think it'd be helpful for us to add the concept of an "experimental" version. And we probably only want at most one of those at a time.

So how about
#define LCFS_VERSION_EXPERIMENTAL 2

And on the CLI it's mkcomposefs --format-version=experimental and in the C API maybe we can at least require them to set something like acknowledge_experimental: bool in struct lcfs_write_options_s?

Basically if we just clearly flag this as experimental I don't see any problems in merging to main and iterating from there, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I think I'd propose a separate --experimental flag in that case, though, since it would likely be easier to implement (since the current version values are expected to be integers).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, SGTM. (bikeshed: --experimental --format-version=2? i.e. --experimental just unlocks all "experimental features" of which is is the only instance currently, and sets the "acknowledge experimental" bit in the struct lcfs_write_options_s ? And --format-version=2 is an error without --experimental set)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought about a potential annoying issue here (although I think it's an issue we'll have anyway): the composefs-rs testsuite calls mkcomposefs (soon with --experimental) to compare that the outputs are the same. If we expect --experimental to shift meaning in the future then this could get annoying....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also add mkcomposefs --version (on general principle) and have it output in yaml format (docker, podman version, ostree do this):

mkcomposefs:
  version: 1.0.7
  experimental-format: 2.0 # like a sub-version in the experimental format

Then the test suite can at least cover the installed mkcomposefs's version; we basically skip in the unit/integration tests unsupported versions. In CI (i.e. GHA) for composefs-rs we'd install a pinned version of composefs by default etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem we have is that we have a goal in mind, where we have a series of increasing version numbers, that have well defined, non-changing meaning. But we also want that to be able to merge partial work on new features that will then eventually combine into the new fully supported version.

So, --experimental would be an opt-in to getting a potentially partial new version. However, I don't really like that we'd have a switch that produces output that varies over time.

Can we maybe introduce a new version number for the currently in-progress "next" version. Say, lets call the "next stable version" 2, and we give the current, in-tree experimental version for it "1". To build such an image you'd have to put in --format-version=2 --experimental-version=1. Whenever we make an unstable version format change we bump this, and now users have to pass --format-version=2 --experimental-version=2. Until eventually the version is stable. (Note: We would only support the latest experimental version, others would just fail).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, to be less painful, maybe we can call it "--format-version=2beta1", which we then bump to "--format-version=2beta2", and eventually "2", when it is stablilized.

@cgwalters
Copy link
Contributor

Also on mechanics, I think we'll need to ensure the integration tests cover all format versions, and the fuzzer path. The fuzzer one in particular adds a lot of coverage nowadays as it ensures we can also read back what we wrote.

allisonkarlitskaya added a commit to allisonkarlitskaya/composefs-rs that referenced this pull request Jan 29, 2025
With this change, we create the erofs filesystem for ourselves.
instead of using the external mkcomposefs CLI.

This currently produces a different output than the output of
mkcomposefs.  We're planning to make changes here to make the output
more similar to mkcomposefs while pursuing changes in mkcomposefs to
make it more similar to the new format.  This work is being tracked as
an issue in containers/composefs#198 and a
first pull request in containers/composefs#410.

Closes containers#56

Signed-off-by: Allison Karlitskaya <[email protected]>
allisonkarlitskaya added a commit to allisonkarlitskaya/composefs-rs that referenced this pull request Jan 29, 2025
With this change, we create the erofs filesystem for ourselves.
instead of using the external mkcomposefs CLI.

This currently produces a different output than the output of
mkcomposefs.  We're planning to make changes here to make the output
more similar to mkcomposefs while pursuing changes in mkcomposefs to
make it more similar to the new format.  This work is being tracked as
an issue in containers/composefs#198 and a
first pull request in containers/composefs#410.

Closes containers#56

Signed-off-by: Allison Karlitskaya <[email protected]>
allisonkarlitskaya added a commit to allisonkarlitskaya/composefs-rs that referenced this pull request Jan 30, 2025
With this change, we create the erofs filesystem for ourselves, instead
of using the external mkcomposefs CLI.

This currently produces a different output than the output of
mkcomposefs.  We're planning to make changes here to make the output
more similar to mkcomposefs while pursuing changes in mkcomposefs to
make it more similar to our format.  This work is being tracked as an
issue in containers/composefs#198 and a first
pull request in containers/composefs#410 .

Closes containers#56

Signed-off-by: Allison Karlitskaya <[email protected]>
allisonkarlitskaya added a commit to allisonkarlitskaya/composefs-rs that referenced this pull request Jan 30, 2025
With this change, we create the erofs filesystem for ourselves, instead
of using the external mkcomposefs CLI.

This currently produces a different output than the output of
mkcomposefs.  We're planning to make changes here to make the output
more similar to mkcomposefs while pursuing changes in mkcomposefs to
make it more similar to our format.  This work is being tracked as an
issue in containers/composefs#198 and a first
pull request in containers/composefs#410 .

Closes containers#56

Signed-off-by: Allison Karlitskaya <[email protected]>
allisonkarlitskaya added a commit to allisonkarlitskaya/composefs-rs that referenced this pull request Jan 30, 2025
With this change, we create the erofs filesystem for ourselves, instead
of using the external mkcomposefs CLI.

This currently produces a different output than the output of
mkcomposefs.  We're planning to make changes here to make the output
more similar to mkcomposefs while pursuing changes in mkcomposefs to
make it more similar to our format.  This work is being tracked as an
issue in containers/composefs#198 and a first
pull request in containers/composefs#410 .

Closes containers#56

Signed-off-by: Allison Karlitskaya <[email protected]>
allisonkarlitskaya added a commit to containers/composefs-rs that referenced this pull request Jan 30, 2025
With this change, we create the erofs filesystem for ourselves, instead
of using the external mkcomposefs CLI.

This currently produces a different output than the output of
mkcomposefs.  We're planning to make changes here to make the output
more similar to mkcomposefs while pursuing changes in mkcomposefs to
make it more similar to our format.  This work is being tracked as an
issue in containers/composefs#198 and a first
pull request in containers/composefs#410 .

Closes #56

Signed-off-by: Allison Karlitskaya <[email protected]>
uint64_t offset = ctx_erofs->inodes_end;
if (ctx->options->version < 2)
offset %= EROFS_BLKSIZ;
offset += xattr->erofs_shared_xattr_offset;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not particularly fond of how this conditional is done. It makes it quite complicated to understand what is going on. I think a more plain if (version < 2) { do it in old way} else { do it in new way } would be easier to understand.

@alexlarsson
Copy link
Collaborator

I gotta say that I'm not overly sold that this small set of changes make any real drastic difference to the file format, but I'm also not against any individual change.

@allisonkarlitskaya
Copy link
Collaborator Author

These changes aren't about "big wins in terms of efficiency" or anything like that, but more about "reducing the complexity of the implementation" with an eye towards being able to succinctly document the process of creating a composefs image in a way that a 3rd party implementation would have a fighting chance of being bit-for-bit compatible with the existing writer implementations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants