Fix error when reading from stdout splits UTF-8 codepoint #144
Conversation
When we read from stdout, we get some number of bytes. In some cases, when `ghci` is outputting non-ASCII characters, this can lead to a single UTF-8 codepoint (represented as multiple bytes) being split across two reads (or potentially more, in perverse cases). Fortunately, the `std` UTF-8 decoder lets us distinguish between truncated continuation sequences and truly invalid data. For truncated continuation sequences, we decode as much as we can (the decoder also reports how far the input is valid) and save the last few bytes (we know there are only 1–3 of them) for the next time we read data.
DUX-1486 UTF8 Parsing Bug
Running the test suite on Ian's branch for reducing GHCi memory usage gives this error:
I'm guessing that we just haven't seen this one yet because we can't really run all the tests without OOMing on a 64GB laptop.
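The approach described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `decode_chunk` and the `leftover` buffer are hypothetical, but the use of `Utf8Error::valid_up_to()` and `error_len()` (where `None` means the input ended mid-codepoint) is the real `std` API the PR relies on.

```rust
// Hypothetical sketch of decoding a read buffer that may end mid-codepoint.
// `error_len() == None` means the bytes are a truncated (but so far valid)
// sequence; `Some(_)` means the data is genuinely invalid UTF-8.
fn decode_chunk(buffer: &[u8], leftover: &mut Vec<u8>) -> Result<String, std::str::Utf8Error> {
    match std::str::from_utf8(buffer) {
        Ok(s) => Ok(s.to_owned()),
        Err(err) if err.error_len().is_none() => {
            // Input ended in the middle of a codepoint: decode the valid
            // prefix and carry the dangling tail over to the next read.
            let valid = &buffer[..err.valid_up_to()];
            leftover.extend_from_slice(&buffer[err.valid_up_to()..]);
            // Safe: `valid_up_to()` marks the end of the valid UTF-8 prefix.
            Ok(unsafe { std::str::from_utf8_unchecked(valid) }.to_owned())
        }
        Err(err) => Err(err),
    }
}

fn main() {
    // "é" is two bytes (0xC3 0xA9); simulate it being split across reads.
    let mut leftover = Vec::new();
    let first = decode_chunk(b"caf\xC3", &mut leftover).unwrap();
    assert_eq!(first, "caf");
    assert_eq!(leftover, vec![0xC3]);

    // The next read supplies the continuation byte.
    leftover.push(0xA9);
    assert_eq!(std::str::from_utf8(&leftover).unwrap(), "é");
    println!("ok");
}
```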
```rust
// End of input reached unexpectedly.
let valid_utf8 = &buffer[..err.valid_up_to()];
self.non_utf8.extend(&buffer[err.valid_up_to()..]);
unsafe {
```
I think this is the first `unsafe` block in the codebase? Some dubious honor there.
Great job!
```rust
lines: String::with_capacity(VEC_BUFFER_CAPACITY * LINE_BUFFER_CAPACITY),
line: String::with_capacity(LINE_BUFFER_CAPACITY),
writer: None,
non_utf8: Vec::with_capacity(SPLIT_UTF8_CODEPOINT_CAPACITY),
```
If you're never going to exceed 4 bytes here, would it be more efficient to use `&[u8; 4]` here instead of `Vec<u8>`?
Or more correct, because you couldn't accidentally allocate more space.
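The suggested fixed-size alternative could look something like this. This is a hypothetical sketch (the `Partial` type and its methods are not from the PR): since a UTF-8 codepoint is at most 4 bytes, at most 3 bytes can dangle after a split, so a `[u8; 4]` array plus a length covers every case without heap allocation.

```rust
// Hypothetical fixed-size carry buffer for a split UTF-8 codepoint,
// replacing a `Vec<u8>` as suggested in the review.
#[derive(Default)]
struct Partial {
    bytes: [u8; 4],
    len: usize,
}

impl Partial {
    /// Save the dangling tail of an incomplete codepoint.
    fn push(&mut self, tail: &[u8]) {
        // At most 3 bytes of an incomplete codepoint can be carried over,
        // so the fixed array can never overflow here.
        assert!(self.len + tail.len() <= 4);
        self.bytes[self.len..self.len + tail.len()].copy_from_slice(tail);
        self.len += tail.len();
    }

    /// Drain the saved bytes so they can be prepended to the next read.
    fn take(&mut self) -> Vec<u8> {
        let out = self.bytes[..self.len].to_vec();
        self.len = 0;
        out
    }
}

fn main() {
    let mut partial = Partial::default();
    partial.push(&[0xC3]); // first byte of "é", left over from one read

    let mut bytes = partial.take();
    bytes.push(0xA9); // continuation byte arrives on the next read
    assert_eq!(std::str::from_utf8(&bytes).unwrap(), "é");
    println!("ok");
}
```

The trade-off is a small amount of bookkeeping (`len`) in exchange for guaranteeing, by construction, that the carry buffer can never grow past 4 bytes.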