-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some changes for format 1.2 #410
base: main
Are you sure you want to change the base?
Conversation
c1e9e93 ("writer: Reject too large files") introduced a check to ensure that we wouldn't try to write out too many null chunks for very large external files, setting an effective limit at 8PiB (which can be represented as 1024 chunks with the chunk format set to the maximum of 31 with a 4096 byte blocksize). For the purposes of this check, we don't actually care what the chunking format would be for a particular file size: we only care that it's not going to exceed the limit. We can calculate the limit directly using constants and compare against that instead. Signed-off-by: Allison Karlitskaya <[email protected]>
This is no longer used outside of the erofs writer code, so drop it from lcfs-erofs-internal.h and mark it static again in the .c file. Signed-off-by: Allison Karlitskaya <[email protected]>
Introduce "version 2" composefs format by bumping the maximum version. We intend to use this version to break with some decisions that were taken for historical reasons and to take advantage of some features that are only available in newer kernel versions. There are no changes introduced yet, but the following commits will add them. Signed-off-by: Allison Karlitskaya <[email protected]>
Modify erofs_compute_chunking() to take the lcfs_ctx_s pointer (and also compute_erofs_inode_size(), which calls it). If we're building a version 1.2 composefs then hardcode the chunk format to the maximum value. Signed-off-by: Allison Karlitskaya <[email protected]>
If we're writing a version 1.2 composefs then only write extended format inodes. Justification: composefs was originally imagined for filesystems which contain many/most/all files with the same timestamp, which is no longer how we intend to use it. Any files that have a different timestamp than the "build time" stored in the superblock will need to be stored in the extended format anyway, which means that very few compact inodes are ever written per filesystem (maybe only one). Let's just use extended everywhere to simplify things. Signed-off-by: Allison Karlitskaya <[email protected]>
We write a 'i_ino' field into the erofs inode structs for compatibility with 32bit systems or 32bit applications running on 64bit systems. This is done because the 'nid' that erofs uses for the inode ID on 64bit might overflow a 32bit integer. In order for that to happen, though, the inode table alone would have to grow past 128GiB in size: since inode alignment is 32 bytes, the nid *with a maximum value of 4Gi) is multiplied by 32B to determine the file offset of the inode. This seems exceedingly unlikely to ever occur. composefs images for even very large systems are on the order of less than 100MB of total size, with the inode table even smaller than that. Let's just set the 32bit inode value the same as the 64bit value. This also avoids another potential issue: under the old approach, we use 0 as the 32bit value for the root directory inode (which always comes first). Some parts of the kernel (`grep is_zero_ino`, for example) claim that userspace might depend on inodes being non-zero, and all other filesystems seem to respect this. nids are always non-zero (since an inode at the start of the file would overlap the header). Signed-off-by: Allison Karlitskaya <[email protected]>
Since the kernel added support for data-only overlayfs layers, the whiteouts in the root directory are unnecessary. Stop including them. Signed-off-by: Allison Karlitskaya <[email protected]>
Since we no longer make use of compact inodes, setting the creation time on the superblock is not useful: nothing else makes use of it. Leave it as zero. Signed-off-by: Allison Karlitskaya <[email protected]>
In erofs, finding a shared xattr involves taking the 32bit id, multiplying it by 4, and using that as an index into the section of the file starting at the block named for shared xattr data in the superblock (superblock.xattr_blkaddr). This is only useful of the inode and xattr data stored in the filesystem would exceed 16GiB, which is exceedingly unlikely to ever occur for a composefs. We can make our lives a fair bit easier if we set the xattr starting block to 0 and simply store the file offset (divded by 4) as the xattr id directly. There is a similar pointer (superblock.meta_blkaddr) for the start of the inodes. This calculation only involves constants and always works out to zero. We could delete that entirely, but in any case, let's not bother with it for version 2. Signed-off-by: Allison Karlitskaya <[email protected]>
@@ -37,10 +37,11 @@ enum lcfs_flags_t { | |||
LCFS_FLAGS_MASK = 0, | |||
}; | |||
|
|||
#define LCFS_VERSION_MAX 1 | |||
#define LCFS_VERSION_MAX 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mechanically I think it'd be helpful for us to add the concept of an "experimental" version. And we probably only want at most one of those at a time.
So how about
#define LCFS_VERSION_EXPERIMENTAL 2
And on the CLI it's mkcomposefs --format-version=experimental
and in the C API maybe we can at least require them to set something like acknowledge_experimental: bool
in struct lcfs_write_options_s
?
Basically if we just clearly flag this as experimental I don't see any problems in merging to main and iterating from there, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I think I'd propose a separate --experimental
flag in that case, though, since it would likely be easier to implement (since the current version values are expected to be integers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, SGTM. (bikeshed: --experimental --format-version=2
? i.e. --experimental
just unlocks all "experimental features" of which is is the only instance currently, and sets the "acknowledge experimental" bit in the struct lcfs_write_options_s ? And --format-version=2
is an error without --experimental
set)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just thought about a potential annoying issue here (although I think it's an issue we'll have anyway): the composefs-rs testsuite calls mkcomposefs
(soon with --experimental
) to compare that the outputs are the same. If we expect --experimental
to shift meaning in the future then this could get annoying....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also add mkcomposefs --version
(on general principle) and have it output in yaml format (docker, podman version
, ostree do this):
mkcomposefs:
version: 1.0.7
experimental-format: 2.0 # like a sub-version in the experimental format
Then the test suite can at least cover the installed mkcomposefs's version; we basically skip in the unit/integration tests unsupported versions. In CI (i.e. GHA) for composefs-rs we'd install a pinned version of composefs by default etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem we have is that we have a goal in mind, where we have a series of increasing version numbers, that have well defined, non-changing meaning. But we also want that to be able to merge partial work on new features that will then eventually combine into the new fully supported version.
So, --experimental would be an opt-in to getting a potentially partial new version. However, I don't really like that we'd have a switch that produces output that varies over time.
Can we maybe introduce a new version number for the currently in-progress "next" version. Say, lets call the "next stable version" 2, and we give the current, in-tree experimental version for it "1". To build such an image you'd have to put in --format-version=2 --experimental-version=1. Whenever we make an unstable version format change we bump this, and now users have to pass --format-version=2 --experimental-version=2. Until eventually the version is stable. (Note: We would only support the latest experimental version, others would just fail).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, to be less painful, maybe we can call it "--format-version=2beta1", which we then bump to "--format-version=2beta2", and eventually "2", when it is stablilized.
Also on mechanics, I think we'll need to ensure the integration tests cover all format versions, and the fuzzer path. The fuzzer one in particular adds a lot of coverage nowadays as it ensures we can also read back what we wrote. |
With this change, we create the erofs filesystem for ourselves. instead of using the external mkcomposefs CLI. This currently produces a different output than the output of mkcomposefs. We're planning to make changes here to make the output more similar to mkcomposefs while pursuing changes in mkcomposefs to make it more similar to the new format. This work is being tracked as an issue in containers/composefs#198 and a first pull request in containers/composefs#410. Closes containers#56 Signed-off-by: Allison Karlitskaya <[email protected]>
With this change, we create the erofs filesystem for ourselves. instead of using the external mkcomposefs CLI. This currently produces a different output than the output of mkcomposefs. We're planning to make changes here to make the output more similar to mkcomposefs while pursuing changes in mkcomposefs to make it more similar to the new format. This work is being tracked as an issue in containers/composefs#198 and a first pull request in containers/composefs#410. Closes containers#56 Signed-off-by: Allison Karlitskaya <[email protected]>
With this change, we create the erofs filesystem for ourselves, instead of using the external mkcomposefs CLI. This currently produces a different output than the output of mkcomposefs. We're planning to make changes here to make the output more similar to mkcomposefs while pursuing changes in mkcomposefs to make it more similar to our format. This work is being tracked as an issue in containers/composefs#198 and a first pull request in containers/composefs#410 . Closes containers#56 Signed-off-by: Allison Karlitskaya <[email protected]>
With this change, we create the erofs filesystem for ourselves, instead of using the external mkcomposefs CLI. This currently produces a different output than the output of mkcomposefs. We're planning to make changes here to make the output more similar to mkcomposefs while pursuing changes in mkcomposefs to make it more similar to our format. This work is being tracked as an issue in containers/composefs#198 and a first pull request in containers/composefs#410 . Closes containers#56 Signed-off-by: Allison Karlitskaya <[email protected]>
With this change, we create the erofs filesystem for ourselves, instead of using the external mkcomposefs CLI. This currently produces a different output than the output of mkcomposefs. We're planning to make changes here to make the output more similar to mkcomposefs while pursuing changes in mkcomposefs to make it more similar to our format. This work is being tracked as an issue in containers/composefs#198 and a first pull request in containers/composefs#410 . Closes containers#56 Signed-off-by: Allison Karlitskaya <[email protected]>
With this change, we create the erofs filesystem for ourselves, instead of using the external mkcomposefs CLI. This currently produces a different output than the output of mkcomposefs. We're planning to make changes here to make the output more similar to mkcomposefs while pursuing changes in mkcomposefs to make it more similar to our format. This work is being tracked as an issue in containers/composefs#198 and a first pull request in containers/composefs#410 . Closes #56 Signed-off-by: Allison Karlitskaya <[email protected]>
uint64_t offset = ctx_erofs->inodes_end; | ||
if (ctx->options->version < 2) | ||
offset %= EROFS_BLKSIZ; | ||
offset += xattr->erofs_shared_xattr_offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not particularly fond of how this conditional is done. It makes it quite complicated to understand what is going on. I think a more plain if (version < 2) { do it in old way} else { do it in new way } would be easier to understand.
I gotta say that I'm not overly sold that this small set of changes make any real drastic difference to the file format, but I'm also not against any individual change. |
These changes aren't about "big wins in terms of efficiency" or anything like that, but more about "reducing the complexity of the implementation" with an eye towards being able to succinctly document the process of creating a composefs image in a way that a 3rd party implementation would have a fighting chance of being bit-for-bit compatible with the existing writer implementations. |
This is very incomplete but it's enough to get mkcomposefs to output simple images that are identical to the images coming from the (also modified) erofs writer code in composefs-rs.
I tried to keep individual changes as individual commits so we can 👍/👎 each change independently.
We're going to want to add some more stuff around handling of alignment of symlinks and such (which no longer needs to be treated as a special case as of kernel 6.12) and probably some other points.
See #198