Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix error when reading from stdout splits UTF-8 codepoint #144

Merged
merged 1 commit into from
Oct 30, 2023

Conversation

9999years
Copy link
Member

When we read from stdout, we get some number of bytes. In some cases, when ghci is outputting non-ASCII characters, this can lead to a single UTF-8 codepoint (represented as multiple bytes) being split across two reads. (Or potentially more, in perverse cases?)

Fortunately, the std UTF-8 decoder makes this information available and allows us to distinguish between truncated continuation sequences and truly invalid data. For truncated continuation sequences, we decode as much as we can (fortunately the std UTF-8 decoder makes this information available as well) and save the last few bytes (we know there are only 1-3 of them) for the next time we read data.

When we read from stdout, we get some number of bytes. In some cases,
when `ghci` is outputting non-ASCII characters, this can lead to a
single UTF-8 codepoint (represented as multiple bytes) being split
across two reads. (Or potentially more, in perverse cases?)

Fortunately, the `std` UTF-8 decoder makes this information available
and allows us to distinguish between truncated continuation sequences
and truly invalid data. For truncated continuation sequences, we decode
as much as we can (fortunately the `std` UTF-8 decoder makes this
information available as well) and save the last few bytes (we know
there are only 1-3 of them) for the next time we read data.
@9999years 9999years requested a review from parsonsmatt October 25, 2023 00:14
@linear
Copy link

linear bot commented Oct 25, 2023

DUX-1486 UTF8 Parsing Bug

Running the test suite on Ian's branch for reducing GHCi memory usage gives this error:

Error:   × Failed to start `ghci`
  ├─▶ Read invalid UTF-8: "    ... [�"
  ╰─▶ incomplete utf-8 byte sequence from index 1023

I'm suspicious about incomplete utf-8 byte sequence from index 1023. Feels like we've got a vector of 1024 bytes and we just happened to chomp off a UTF8 code point in the middle.

I'm guessing that we just haven't seen this one yet because we can't really run all the tests without OOMing on a 64GB laptop.

@github-actions github-actions bot added the patch Bug fixes or non-functional changes label Oct 25, 2023
// End of input reached unexpectedly.
let valid_utf8 = &buffer[..err.valid_up_to()];
self.non_utf8.extend(&buffer[err.valid_up_to()..]);
unsafe {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the first unsafe block in the codebase? Some dubious honor there.

Copy link
Contributor

@parsonsmatt parsonsmatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job!

lines: String::with_capacity(VEC_BUFFER_CAPACITY * LINE_BUFFER_CAPACITY),
line: String::with_capacity(LINE_BUFFER_CAPACITY),
writer: None,
non_utf8: Vec::with_capacity(SPLIT_UTF8_CODEPOINT_CAPACITY),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're never going to exceed 4 bytes here, would it be more efficient to use &[u8; 4] here instead of Vec<u8>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or more correct, because you couldn't accidentally allocate more space.

@9999years 9999years merged commit 8ee53b9 into main Oct 30, 2023
30 checks passed
@9999years 9999years deleted the rebeccat/dux-1486-utf8-parsing-bug branch October 30, 2023 17:12
@github-actions
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
patch Bug fixes or non-functional changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants