Fix error when reading from stdout splits UTF-8 codepoint #144
Conversation
When we read from stdout, we get some number of bytes. In some cases, when `ghci` is outputting non-ASCII characters, this can lead to a single UTF-8 codepoint (represented as multiple bytes) being split across two reads (or potentially more, in perverse cases). Fortunately, the `std` UTF-8 decoder lets us distinguish between truncated continuation sequences and truly invalid data. For truncated continuation sequences, we decode as much as we can (the decoder also reports how far the input is valid) and save the last few bytes (we know there are only 1–3 of them) for the next time we read data.
DUX-1486 UTF8 Parsing Bug
Running the test suite on Ian's branch for reducing GHCi memory usage gives this error:
I'm guessing that we just haven't seen this one yet because we can't really run all the tests without OOMing on a 64GB laptop.
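The approach described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `decode_chunk` and the `leftover` buffer are hypothetical, but the use of `Utf8Error::valid_up_to()` and `error_len()` (where `None` means the input ended mid-codepoint) is the real `std` API the PR relies on.

```rust
// Hypothetical sketch of decoding a read buffer that may end mid-codepoint.
// `error_len() == None` means the bytes are a truncated (but so far valid)
// sequence; `Some(_)` means the data is genuinely invalid UTF-8.
fn decode_chunk(buffer: &[u8], leftover: &mut Vec<u8>) -> Result<String, std::str::Utf8Error> {
    match std::str::from_utf8(buffer) {
        Ok(s) => Ok(s.to_owned()),
        Err(err) if err.error_len().is_none() => {
            // Input ended in the middle of a codepoint: decode the valid
            // prefix and carry the dangling tail over to the next read.
            let valid = &buffer[..err.valid_up_to()];
            leftover.extend_from_slice(&buffer[err.valid_up_to()..]);
            // Safe: `valid_up_to()` marks the end of the valid UTF-8 prefix.
            Ok(unsafe { std::str::from_utf8_unchecked(valid) }.to_owned())
        }
        Err(err) => Err(err),
    }
}

fn main() {
    // "é" is two bytes (0xC3 0xA9); simulate it being split across reads.
    let mut leftover = Vec::new();
    let first = decode_chunk(b"caf\xC3", &mut leftover).unwrap();
    assert_eq!(first, "caf");
    assert_eq!(leftover, vec![0xC3]);

    // The next read supplies the continuation byte.
    leftover.push(0xA9);
    assert_eq!(std::str::from_utf8(&leftover).unwrap(), "é");
    println!("ok");
}
```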
```rust
// End of input reached unexpectedly.
let valid_utf8 = &buffer[..err.valid_up_to()];
self.non_utf8.extend(&buffer[err.valid_up_to()..]);
unsafe {
```
I think this is the first `unsafe` block in the codebase? Some dubious honor there.
Great job!
```rust
lines: String::with_capacity(VEC_BUFFER_CAPACITY * LINE_BUFFER_CAPACITY),
line: String::with_capacity(LINE_BUFFER_CAPACITY),
writer: None,
non_utf8: Vec::with_capacity(SPLIT_UTF8_CODEPOINT_CAPACITY),
```
If you're never going to exceed 4 bytes here, would it be more efficient to use `&[u8; 4]` here instead of `Vec<u8>`?
Or more correct, because you couldn't accidentally allocate more space.
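The suggested fixed-size alternative could look something like this. This is a hypothetical sketch (the `Partial` type and its methods are not from the PR): since a UTF-8 codepoint is at most 4 bytes, at most 3 bytes can dangle after a split, so a `[u8; 4]` array plus a length covers every case without heap allocation.

```rust
// Hypothetical fixed-size carry buffer for a split UTF-8 codepoint,
// replacing a `Vec<u8>` as suggested in the review.
#[derive(Default)]
struct Partial {
    bytes: [u8; 4],
    len: usize,
}

impl Partial {
    /// Save the dangling tail of an incomplete codepoint.
    fn push(&mut self, tail: &[u8]) {
        // At most 3 bytes of an incomplete codepoint can be carried over,
        // so the fixed array can never overflow here.
        assert!(self.len + tail.len() <= 4);
        self.bytes[self.len..self.len + tail.len()].copy_from_slice(tail);
        self.len += tail.len();
    }

    /// Drain the saved bytes so they can be prepended to the next read.
    fn take(&mut self) -> Vec<u8> {
        let out = self.bytes[..self.len].to_vec();
        self.len = 0;
        out
    }
}

fn main() {
    let mut partial = Partial::default();
    partial.push(&[0xC3]); // first byte of "é", left over from one read

    let mut bytes = partial.take();
    bytes.push(0xA9); // continuation byte arrives on the next read
    assert_eq!(std::str::from_utf8(&bytes).unwrap(), "é");
    println!("ok");
}
```

The trade-off is a small amount of bookkeeping (`len`) in exchange for guaranteeing, by construction, that the carry buffer can never grow past 4 bytes.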