
feat: download browsers as TAR #34033

Open
wants to merge 36 commits into main

Conversation

@Skn0tt Skn0tt commented Dec 16, 2024

Some of our browsers are already available as .tar.br. Compared to the current .zip archives, the brotli tarballs are ~10-30% smaller. This PR makes us download brotli files for chromium and webkit.

@Skn0tt Skn0tt self-assigned this Dec 16, 2024


@Skn0tt Skn0tt marked this pull request as ready for review December 18, 2024 14:28


@@ -10,6 +10,7 @@
},
"dependencies": {
"extract-zip": "2.0.1",
"tar-fs": "^3.0.6",
Member

I understand the library is popular, but its deps list seems a little excessive for what it does. Did we consider alternatives?

Member

Oh wow, "tar" is even more...

Member Author

@Skn0tt Skn0tt Dec 19, 2024

Yes, we considered tar-fs, tar and writing our own. Writing our own turned out more complex than imagined, because webkit has very long path names and the format becomes tricky when that's involved. Of the three, tar-fs seemed the most focused.
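For context on why long paths make the format tricky (a sketch with an illustrative WebKit-style path, not one taken from the PR): the classic ustar tar header reserves a fixed 100-byte field for the entry name, so longer paths need the 155-byte `prefix` field, a GNU longname entry, or a PAX extended header.

```javascript
// Background sketch, not PR code: names longer than the ustar name field
// force a tar parser to also handle prefix/GNU-longname/PAX mechanisms.
const USTAR_NAME_FIELD = 100; // bytes, per the ustar header layout

// Hypothetical WebKit-style path, chosen only for illustration:
const longPath =
  'Playwright.app/Contents/Frameworks/WebKit.framework/Versions/A/' +
  'Frameworks/WebCore.framework/Versions/A/WebCore';

const fitsInUstarName = Buffer.byteLength(longPath) <= USTAR_NAME_FIELD;
// fitsInUstarName is false: a plain ustar name field cannot hold this path.
```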

@@ -48,8 +48,8 @@
"revision": "1011",
"installByDefault": true,
"revisionOverrides": {
"mac12": "1010",
"mac12-arm64": "1010"
"mac12": "1011",
Member

What's the motivation for changing this?

Member Author

1010 doesn't have .tar.br, and 1010 is identical to 1011 in functionality


@@ -1229,6 +2813,6 @@ END OF [email protected] AND INFORMATION

SUMMARY BEGIN HERE
=========================================
Total Packages: 48
Total Packages: 60
Member

We are paying a 25% bump in # of deps for a feature that does not link to a user report. Usually not a very good sign.

Member Author

True. It also increases the zip bundle size from 112kb to 202kb. We had this attempt of writing our own tar parser, maybe we should give it another try.

@pavelfeldman
Member

Just as an idea, can we utilize extract-zip's non-compression mode for tar? That way we use zip for tar and don't need all this extra code? i.e. the files will be .zip.br, not .tar.br.

@Skn0tt
Member Author

Skn0tt commented Jan 6, 2025

Just as an idea, can we utilize extract-zip's non-compression mode for tar?

That'd save some dependencies, but would result in slightly larger bundles[1] and it'd prevent streaming extraction. I'd prefer to stick with TAR, gonna take a stab at reducing the bundle size for that.

Footnotes

  1. Tested on firefox-mac: .zip is 92 MB, .zip.br is 67.2 MB and .tar.br is 65.36 MB

if (!downloadPathTemplate)
return [];
// old webkit versions don't have brotli
Member

curious why only old webkit revisions don't have brotli

Member Author

@Skn0tt Skn0tt Jan 13, 2025

webkit is the only browser we have revisionOverrides overrides for that point to old versions, so the CI script that created them didn't yet create brotli

}
log(`SUCCESS downloading and extracting ${options.title}`);
} else {
await downloadFile(options);
Member

I'm seeing different error handling code in this branch, including explicit checks for ECONNRESET. Is walking away from them intended? Should we do both changes at the same time? I'd be more comfortable with leaving the download code as is and swapping piping into a file for piping into brotli.

Member Author

The new branch is intended to be as similar as possible, while also making the code a little more linear. The ECONNRESET check only changed the error message, so I didn't include that.

Let me see if I can refactor it to make the change less spooky.

Member Author

@Skn0tt Skn0tt Jan 13, 2025

I've refactored it so we can reuse the existing download function. Good pointer, thanks!

@@ -0,0 +1 @@
This directory contains a modified copy of the `tar-stream` library that's used exclusively to extract TAR files.
Member

Make sure all the third-party files are under the third_party folder and the corresponding license files are provided beside the files. Make sure they end up in the third-party list or in the distributed bundle.

Member Author

Done! See diff.patch for all my changes.

}

shiftFirst (size) {
return this._buffered === 0 ? null : this._next(size)
Member

Is this a bug on the 21st line of this library? (I don't see this._buffered defined.)


@@ -0,0 +1,311 @@
const { Writable, Readable, getStreamError } = require('stream')
Member

getStreamError is not a thing. How is it supposed to work?

Member Author

Good find! Removed the usage of it by moving _predestroy into _destroy. Once I add the diff this will make more sense.

const len = parseInt(buf.toString('ascii', 0, i), 10)
if (!len) return result

const b = buf.subarray('ascii', i + 1, len - 1)
Member

ChatGPT thinks it is a bug since the value called len is used in the subarray(start, end) signature. Given that the start is i + 1, which points to right after the parsed len, len - 1 can't be a valid end, did they want to say i + len here?


Member Author

This seems to be some sort of special case checking. Maybe in some implementations of TAR/PAX, len doesn't contain a length, but an index?
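For background on the reviewer's question (this describes the POSIX pax record format, not code from the PR): each extended-header record is `<len> <key>=<value>\n`, where `<len>` counts the entire record, including its own digits, the space, and the trailing newline. That makes `len - 1` a plausible end index, since it simply drops the newline. Separately, the stray `'ascii'` first argument to `subarray` in the quoted line does look wrong, because `Buffer.subarray` takes numeric start and end offsets. A minimal single-record decoder sketch:

```javascript
// Sketch of decoding one pax record, e.g. "30 mtime=1350244992.023046456\n".
// The leading "30" is the byte length of the WHOLE record, newline included.
function decodePaxRecord (buf) {
  const i = buf.indexOf(32); // first space separates <len> from key=value
  const len = parseInt(buf.toString('ascii', 0, i), 10);
  const body = buf.toString('utf8', i + 1, len - 1); // drop trailing '\n'
  const eq = body.indexOf('=');
  return { key: body.slice(0, eq), value: body.slice(eq + 1) };
}

const record = Buffer.from('30 mtime=1350244992.023046456\n');
// decodePaxRecord(record) → { key: 'mtime', value: '1350244992.023046456' }
```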


} else {
const file = fs.createWriteStream(options.zipPath);
await downloadFile(options, file);
log(`SUCCESS downloading ${options.title}`);
Member

nit: should we place it before downloadFile and print log(`downloading ${options.title}`); ?

Member Author

Happy to do that in a separate PR, but that would be a behaviour change compared to before. What'd be the benefit?

return decodeStr(buf, 0, buf.length, encoding)
}

exports.encodePax = function encodePax (opts) { // TODO: encode more stuff in pax
Member

Let's drop this too, we only need to decode?

Member Author

Good catch! Will do.

Member

What is the base for this patch? I see that upstream headers.js starts with https://github.com/mafintosh/tar-stream/blob/126968fd3c4a39eba5f8318c255e04cedbbad176/headers.js#L1-L4 but I don't see those lines removed in the diff; was there some other modification before? Maybe we can split the patch into removing unnecessary stuff and then applying some changes to avoid third-party deps?

Member Author

It looks like headers.js somehow wasn't included in the diff 🤔 Let me fix that.

const len = parseInt(buf.toString('ascii', 0, i), 10)
if (!len) return result

const b = buf.subarray('ascii', i + 1, len - 1)
Member

Not sure if it's important but len-1 maybe out of the buffer bounds, we may want to throw in that case.

Member Author

I'd prefer to not touch the implementation. The upstream package is widely used and I think there's more risk from changing this than to leave in a potential defect.

const b = buf.subarray('ascii', i + 1, len - 1)
const keyIndex = b.indexOf('=')
if (keyIndex === -1) return result
result[b.slice(0, keyIndex)] = b.slice(keyIndex + 1)
Member

This uses deprecated slice while the rest of the code uses subarray, perhaps it should be explicit toString()?

Member Author

see above, let's not touch the implementation needlessly.

// ustar (posix) format.
// prepend prefix, if present.
if (buf[345]) name = decodeStr(buf, 345, 155, filenameEncoding) + '/' + name
} else if (isGNU(buf)) {
Member

Do we need both POSIX UStar and GNU?

Member Author

see above, let's not touch the implementation needlessly.

}

// to support old tar versions that use trailing / to indicate dirs
if (typeflag === 0 && name && name[name.length - 1] === '/') typeflag = 5
Member

Probably doesn't hurt, but I don't think we need to support the old versions.

Member Author

see above, let's not touch the implementation needlessly.


function onfile () {
const ws = xfs.createWriteStream(name)
const rs = mapStream(stream, header)
Member

Looks like this is always just stream.

Member Author

see above, let's not touch the implementation needlessly.

if (this._parent._stream === this) {
this._parent._update()
}
- cb(null)
Member

Why did this go away?

Member Author

Before, this class extended streamx.Readable, but it now extends node:stream.Readable. The _read implementation is slightly different, see https://github.com/mafintosh/streamx/blob/be4bbc8ba0b4c862a3d996e484675d4b6297136d/README.md?plain=1#L136-L137. Luckily it was synchronous before, so we can remove this callback easily.

Member Author

@Skn0tt Skn0tt left a comment

Thanks for the review. I'll act on removing the dead code and fixing the patch.

I'd prefer to not touch the implementation more than absolutely necessary. The library is widely used enough that any change introduces more risk than it averts. If we want to write our own, I'm open to that in a follow-up.


Member Author

@Skn0tt Skn0tt left a comment

I've removed the dead code and fixed that patch. Not sure how headers.js slipped. The way I generated it is by pasting the files from the linked commits into a folder called original, pasting the edited files into a patched folder, and then running git diff --no-index -- original patched.
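The described workflow can be recreated with throwaway files (the real directories held the tar-stream sources; these placeholder contents are illustrative only):

```shell
# Recreate the patch-generation workflow with placeholder files.
mkdir -p original patched
printf 'upstream\n' > original/headers.js
printf 'modified\n' > patched/headers.js
# --no-index lets git diff compare plain directories outside any repository.
# git diff exits non-zero when the trees differ, so neutralize that in scripts.
git diff --no-index -- original patched > diff.patch || true
```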

Contributor

Test results for "tests 1"

12 flaky
⚠️ [chromium-page] › tests/page/page-event-request.spec.ts:151:3 › should report navigation requests and responses handled by service worker with routing @chromium-ubuntu-22.04-node18
⚠️ [firefox-page] › tests/page/page-evaluate.spec.ts:403:3 › should throw for too deep reference chain @firefox-ubuntu-22.04-node18
⚠️ [playwright-test] › tests/ui-mode-trace.spec.ts:341:5 › should work behind reverse proxy @ubuntu-latest-node22-1
⚠️ [webkit-library] › tests/library/defaultbrowsercontext-2.spec.ts:28:3 › should work in persistent context @webkit-ubuntu-22.04-node18
⚠️ [webkit-library] › tests/library/proxy.spec.ts:44:3 › should use proxy for second page @webkit-ubuntu-22.04-node18
⚠️ [webkit-library] › tests/library/screenshot.spec.ts:278:14 › element screenshot › should restore viewport after page screenshot and exception @webkit-ubuntu-22.04-node18
⚠️ [webkit-library] › tests/library/selector-generator.spec.ts:68:5 › selector generator › should generate text for @webkit-ubuntu-22.04-node18
⚠️ [webkit-library] › tests/library/video.spec.ts:475:5 › screencast › should scale frames down to the requested size @webkit-ubuntu-22.04-node18
⚠️ [webkit-page] › tests/page/page-leaks.spec.ts:82:5 › click should not leak @webkit-ubuntu-22.04-node18
⚠️ [webkit-page] › tests/page/page-leaks.spec.ts:107:5 › fill should not leak @webkit-ubuntu-22.04-node18
⚠️ [webkit-page] › tests/page/page-set-input-files.spec.ts:147:3 › should upload large file @webkit-ubuntu-22.04-node18
⚠️ [playwright-test] › tests/ui-mode-test-watch.spec.ts:145:5 › should watch all @windows-latest-node18-1

37725 passed, 654 skipped
✔️✔️✔️
