Some changes for format 1.2 #410

allisonkarlitskaya · 2025-01-24T14:36:56Z

This is very incomplete but it's enough to get mkcomposefs to output simple images that are identical to the images coming from the (also modified) erofs writer code in composefs-rs.

I tried to keep individual changes as individual commits so we can 👍/👎 each change independently.

We're going to want to add some more stuff around handling of alignment of symlinks and such (which no longer needs to be treated as a special case as of kernel 6.12) and probably some other points.

See #198

c1e9e93 ("writer: Reject too large files") introduced a check to ensure that we wouldn't try to write out too many null chunks for very large external files, setting an effective limit at 8PiB (which can be represented as 1024 chunks with the chunk format set to the maximum of 31 with a 4096 byte blocksize). For the purposes of this check, we don't actually care what the chunking format would be for a particular file size: we only care that it's not going to exceed the limit. We can calculate the limit directly using constants and compare against that instead. Signed-off-by: Allison Karlitskaya <[email protected]>

This is no longer used outside of the erofs writer code, so drop it from lcfs-erofs-internal.h and mark it static again in the .c file. Signed-off-by: Allison Karlitskaya <[email protected]>

Introduce "version 2" composefs format by bumping the maximum version. We intend to use this version to break with some decisions that were taken for historical reasons and to take advantage of some features that are only available in newer kernel versions. There are no changes introduced yet, but the following commits will add them. Signed-off-by: Allison Karlitskaya <[email protected]>

Modify erofs_compute_chunking() to take the lcfs_ctx_s pointer (and also compute_erofs_inode_size(), which calls it). If we're building a version 1.2 composefs then hardcode the chunk format to the maximum value. Signed-off-by: Allison Karlitskaya <[email protected]>

If we're writing a version 1.2 composefs then only write extended format inodes. Justification: composefs was originally imagined for filesystems which contain many/most/all files with the same timestamp, which is no longer how we intend to use it. Any files that have a different timestamp than the "build time" stored in the superblock will need to be stored in the extended format anyway, which means that very few compact inodes are ever written per filesystem (maybe only one). Let's just use extended everywhere to simplify things. Signed-off-by: Allison Karlitskaya <[email protected]>

We write a 'i_ino' field into the erofs inode structs for compatibility with 32bit systems or 32bit applications running on 64bit systems. This is done because the 'nid' that erofs uses for the inode ID on 64bit might overflow a 32bit integer. In order for that to happen, though, the inode table alone would have to grow past 128GiB in size: since inode alignment is 32 bytes, the nid *with a maximum value of 4Gi) is multiplied by 32B to determine the file offset of the inode. This seems exceedingly unlikely to ever occur. composefs images for even very large systems are on the order of less than 100MB of total size, with the inode table even smaller than that. Let's just set the 32bit inode value the same as the 64bit value. This also avoids another potential issue: under the old approach, we use 0 as the 32bit value for the root directory inode (which always comes first). Some parts of the kernel (`grep is_zero_ino`, for example) claim that userspace might depend on inodes being non-zero, and all other filesystems seem to respect this. nids are always non-zero (since an inode at the start of the file would overlap the header). Signed-off-by: Allison Karlitskaya <[email protected]>

Since the kernel added support for data-only overlayfs layers, the whiteouts in the root directory are unnecessary. Stop including them. Signed-off-by: Allison Karlitskaya <[email protected]>

Since we no longer make use of compact inodes, setting the creation time on the superblock is not useful: nothing else makes use of it. Leave it as zero. Signed-off-by: Allison Karlitskaya <[email protected]>

In erofs, finding a shared xattr involves taking the 32bit id, multiplying it by 4, and using that as an index into the section of the file starting at the block named for shared xattr data in the superblock (superblock.xattr_blkaddr). This is only useful of the inode and xattr data stored in the filesystem would exceed 16GiB, which is exceedingly unlikely to ever occur for a composefs. We can make our lives a fair bit easier if we set the xattr starting block to 0 and simply store the file offset (divded by 4) as the xattr id directly. There is a similar pointer (superblock.meta_blkaddr) for the start of the inodes. This calculation only involves constants and always works out to zero. We could delete that entirely, but in any case, let's not bother with it for version 2. Signed-off-by: Allison Karlitskaya <[email protected]>

cgwalters · 2025-01-24T14:50:38Z

libcomposefs/lcfs-writer.h

@@ -37,10 +37,11 @@ enum lcfs_flags_t {
 	LCFS_FLAGS_MASK = 0,
 };

-#define LCFS_VERSION_MAX 1
+#define LCFS_VERSION_MAX 2


Mechanically I think it'd be helpful for us to add the concept of an "experimental" version. And we probably only want at most one of those at a time.

So how about
#define LCFS_VERSION_EXPERIMENTAL 2

And on the CLI it's mkcomposefs --format-version=experimental and in the C API maybe we can at least require them to set something like acknowledge_experimental: bool in struct lcfs_write_options_s?

Basically if we just clearly flag this as experimental I don't see any problems in merging to main and iterating from there, right?

That makes sense. I think I'd propose a separate --experimental flag in that case, though, since it would likely be easier to implement (since the current version values are expected to be integers).

Sure, SGTM. (bikeshed: --experimental --format-version=2? i.e. --experimental just unlocks all "experimental features" of which is is the only instance currently, and sets the "acknowledge experimental" bit in the struct lcfs_write_options_s ? And --format-version=2 is an error without --experimental set)

I just thought about a potential annoying issue here (although I think it's an issue we'll have anyway): the composefs-rs testsuite calls mkcomposefs (soon with --experimental) to compare that the outputs are the same. If we expect --experimental to shift meaning in the future then this could get annoying....

I think we should also add mkcomposefs --version (on general principle) and have it output in yaml format (docker, podman version, ostree do this):

mkcomposefs: version: 1.0.7 experimental-format: 2.0 # like a sub-version in the experimental format

Then the test suite can at least cover the installed mkcomposefs's version; we basically skip in the unit/integration tests unsupported versions. In CI (i.e. GHA) for composefs-rs we'd install a pinned version of composefs by default etc.

The problem we have is that we have a goal in mind, where we have a series of increasing version numbers, that have well defined, non-changing meaning. But we also want that to be able to merge partial work on new features that will then eventually combine into the new fully supported version.

So, --experimental would be an opt-in to getting a potentially partial new version. However, I don't really like that we'd have a switch that produces output that varies over time.

Can we maybe introduce a new version number for the currently in-progress "next" version. Say, lets call the "next stable version" 2, and we give the current, in-tree experimental version for it "1". To build such an image you'd have to put in --format-version=2 --experimental-version=1. Whenever we make an unstable version format change we bump this, and now users have to pass --format-version=2 --experimental-version=2. Until eventually the version is stable. (Note: We would only support the latest experimental version, others would just fail).

Or, to be less painful, maybe we can call it "--format-version=2beta1", which we then bump to "--format-version=2beta2", and eventually "2", when it is stablilized.

cgwalters · 2025-01-24T14:59:35Z

Also on mechanics, I think we'll need to ensure the integration tests cover all format versions, and the fuzzer path. The fuzzer one in particular adds a lot of coverage nowadays as it ensures we can also read back what we wrote.

With this change, we create the erofs filesystem for ourselves. instead of using the external mkcomposefs CLI. This currently produces a different output than the output of mkcomposefs. We're planning to make changes here to make the output more similar to mkcomposefs while pursuing changes in mkcomposefs to make it more similar to the new format. This work is being tracked as an issue in containers/composefs#198 and a first pull request in containers/composefs#410. Closes containers#56 Signed-off-by: Allison Karlitskaya <[email protected]>

With this change, we create the erofs filesystem for ourselves, instead of using the external mkcomposefs CLI. This currently produces a different output than the output of mkcomposefs. We're planning to make changes here to make the output more similar to mkcomposefs while pursuing changes in mkcomposefs to make it more similar to our format. This work is being tracked as an issue in containers/composefs#198 and a first pull request in containers/composefs#410 . Closes containers#56 Signed-off-by: Allison Karlitskaya <[email protected]>

With this change, we create the erofs filesystem for ourselves, instead of using the external mkcomposefs CLI. This currently produces a different output than the output of mkcomposefs. We're planning to make changes here to make the output more similar to mkcomposefs while pursuing changes in mkcomposefs to make it more similar to our format. This work is being tracked as an issue in containers/composefs#198 and a first pull request in containers/composefs#410 . Closes #56 Signed-off-by: Allison Karlitskaya <[email protected]>

alexlarsson · 2025-02-10T12:52:54Z

libcomposefs/lcfs-writer-erofs.c

+				uint64_t offset = ctx_erofs->inodes_end;
+				if (ctx->options->version < 2)
+					offset %= EROFS_BLKSIZ;
+				offset += xattr->erofs_shared_xattr_offset;


I'm not particularly fond of how this conditional is done. It makes it quite complicated to understand what is going on. I think a more plain if (version < 2) { do it in old way} else { do it in new way } would be easier to understand.

alexlarsson · 2025-02-10T12:54:27Z

I gotta say that I'm not overly sold that this small set of changes make any real drastic difference to the file format, but I'm also not against any individual change.

allisonkarlitskaya · 2025-02-19T06:01:33Z

These changes aren't about "big wins in terms of efficiency" or anything like that, but more about "reducing the complexity of the implementation" with an eye towards being able to succinctly document the process of creating a composefs image in a way that a 3rd party implementation would have a fighting chance of being bit-for-bit compatible with the existing writer implementations.

allisonkarlitskaya added 9 commits January 24, 2025 14:53

writer: make erofs_compute_chunking() static again

4650488

This is no longer used outside of the erofs writer code, so drop it from lcfs-erofs-internal.h and mark it static again in the .c file. Signed-off-by: Allison Karlitskaya <[email protected]>

writer: don't write 00-ff whiteouts in version 2

0d91579

Since the kernel added support for data-only overlayfs layers, the whiteouts in the root directory are unnecessary. Stop including them. Signed-off-by: Allison Karlitskaya <[email protected]>

writer: don't set creation time on version 2

ae44269

Since we no longer make use of compact inodes, setting the creation time on the superblock is not useful: nothing else makes use of it. Leave it as zero. Signed-off-by: Allison Karlitskaya <[email protected]>

allisonkarlitskaya requested a review from cgwalters January 24, 2025 14:36

cgwalters reviewed Jan 24, 2025

View reviewed changes

alexlarsson reviewed Feb 10, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some changes for format 1.2 #410

Some changes for format 1.2 #410

allisonkarlitskaya commented Jan 24, 2025

cgwalters Jan 24, 2025

allisonkarlitskaya Jan 24, 2025

cgwalters Jan 24, 2025

allisonkarlitskaya Jan 24, 2025

cgwalters Jan 24, 2025

alexlarsson Feb 10, 2025

alexlarsson Feb 10, 2025

cgwalters commented Jan 24, 2025

alexlarsson Feb 10, 2025

alexlarsson commented Feb 10, 2025

allisonkarlitskaya commented Feb 19, 2025

Some changes for format 1.2 #410

Are you sure you want to change the base?

Some changes for format 1.2 #410

Conversation

allisonkarlitskaya commented Jan 24, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cgwalters commented Jan 24, 2025

Choose a reason for hiding this comment

alexlarsson commented Feb 10, 2025

allisonkarlitskaya commented Feb 19, 2025