grapheme_extract should pass over invalid surrogate halves #17568
Many systems incorrectly encode surrogate halves from a UTF-16 stream into UTF-8 as two three-byte characters instead of the proper four-byte sequence. These are invalid characters in UTF-8 and should be skipped when decoding with `grapheme_extract`, but it’s not currently handling these properly.

> If offset does not point to the first byte of a UTF-8 character,
> the start position is moved to the next character boundary.

For example, U+1F170 (d83c dd70) should encode as F0 9F 85 B0, but when applying the UTF-8 encoder invalidly to d83c, the output would be ED A0 BC. This entire span of bytes is invalid UTF-8.

```php
grapheme_extract( "\xED\xA0\xBCa", 1, GRAPHEME_EXTR_COUNT, 0, $next );
// returns "\xED", an invalid UTF-8 byte sequence
// $next === 1, pointing into the middle of the invalid sequence
```

Instead, it should return “a” and point `$next` to the end of the string.
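To make the encoding mistake concrete, here is a minimal sketch that encodes the same surrogate pair both ways and checks each result. The helper names (`naive_three_byte_encode`, `combine_surrogates`, `encode_four_byte`, `is_valid_utf8`) are hypothetical and exist only for this illustration; validity is checked with PCRE's `/u` modifier, which rejects encoded surrogates.

```php
<?php
// Hypothetical helper: applies the three-byte UTF-8 bit pattern to a 16-bit
// value without rejecting surrogates. This is the mistake being described.
function naive_three_byte_encode(int $unit): string
{
    return chr(0xE0 | ($unit >> 12))
        . chr(0x80 | (($unit >> 6) & 0x3F))
        . chr(0x80 | ($unit & 0x3F));
}

// Combines a UTF-16 surrogate pair into the scalar value it represents.
function combine_surrogates(int $high, int $low): int
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

// Encodes a supplementary-plane scalar value (U+10000..U+10FFFF) as four bytes.
function encode_four_byte(int $cp): string
{
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// preg_match() with the /u modifier returns false for ill-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$wrong = naive_three_byte_encode(0xD83C) . naive_three_byte_encode(0xDD70);
$right = encode_four_byte(combine_surrogates(0xD83C, 0xDD70));

echo bin2hex($wrong), PHP_EOL; // eda0bcedb5b0
echo bin2hex($right), PHP_EOL; // f09f85b0
var_dump(is_valid_utf8($wrong)); // bool(false)
var_dump(is_valid_utf8($right)); // bool(true)
```

Each half of the six-byte form is individually ill-formed, so a decoder cannot recover a character from either three-byte group on its own.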
My understanding is that surrogate pairs exist only in UTF-16; UTF-8 can cover all code points. |
This is almost correct: UTF-8 encodes Unicode scalar values, which exclude the code points in the surrogate range (U+D800 to U+DFFF).
Further, both Particularly since the documentation states that |
Hmm, the current behavior does not really make sense, but I'm not sure what the proper behavior would be. The
Now, invalid UTF-8 is not UTF-8.
This may still make sense even if we assume valid UTF-8, because a user might just start at a wrong position. And then we have: https://3v4l.org/iLSL9. Given that these are not (valid) code points, what am I missing? |
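The documented rule about moving to the next character boundary can be sketched in user space as follows. This is only an illustration of the rule as quoted from the documentation, not the extension's actual implementation; `next_utf8_boundary` is a hypothetical name. It treats any byte of the form `10xxxxxx` as a continuation byte and skips forward past it.

```php
<?php
// Hypothetical helper: advance $offset past UTF-8 continuation bytes
// (0x80..0xBF) so that it lands on a byte that can start a sequence.
function next_utf8_boundary(string $bytes, int $offset): int
{
    $length = strlen($bytes);
    while ($offset < $length && (ord($bytes[$offset]) & 0xC0) === 0x80) {
        $offset++;
    }
    return $offset;
}

// Valid input: starting inside the four-byte U+1F170 moves to the "a".
var_dump(next_utf8_boundary("\u{1F170}a", 1)); // int(4)

// Encoded surrogate half: starting on a trailing byte also moves to the "a".
var_dump(next_utf8_boundary("A\xED\xA0\xBDa", 2)); // int(4)

// Starting on the 0xED lead byte does not move at all, because a lead-byte
// test alone cannot tell that the sequence it starts is ill-formed.
var_dump(next_utf8_boundary("A\xED\xA0\xBDa", 1)); // int(1)
```

The last case is where the discussion comes in: a boundary test based only on lead and continuation bytes accepts offset 1, even though no valid character starts there.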
@cmb69 I believe it’s the

```php
$bi = IntlBreakIterator::createCodePointInstance();
$bi->setText("A\xED\xA0\xBDa");
foreach ($bi->getPartsIterator() as $cp) {
    var_dump($bi->getErrorMessage(), bin2hex($cp));
}
```

this produces
|
This is a fair point, and if it comes to the function rejecting these inputs that would be tolerable, though less useful. Doing so would require that the string have already been pre-scanned for having a valid UTF-8 encoding, or scan through before reading the first grapheme/code-point/bytes. Given that this is inherently a streaming function, it’s a useful property to return valid sequences where they exist without having to read the entire string first and without rejecting a valid prefix because of later problems. Being able to identify those invalid byte sequences also helps, giving user-space code the choice of whether to replace the sequence with U+FFFD for display, or pass through the invalid bytes unaltered, or remove them. |
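A user-space sketch of that streaming property might look like the following. It walks the string once, yields each run of well-formed UTF-8 as it is found, and reports each run of invalid bytes separately so the caller can decide what to do with it. The names `utf8_spans` and `replace_invalid` are hypothetical, and grouping adjacent invalid bytes into a single span is a simplifying assumption of this sketch, not something the thread specifies.

```php
<?php
// Yields [bool $isValid, string $bytes] pairs covering the whole input.
function utf8_spans(string $bytes): Generator
{
    // Byte-level alternatives for well-formed UTF-8 sequences; surrogates
    // (ED A0..BF xx) and overlong forms are deliberately not matched.
    $forms = [
        '[\x00-\x7F]',
        '[\xC2-\xDF][\x80-\xBF]',
        '\xE0[\xA0-\xBF][\x80-\xBF]',
        '[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}',
        '\xED[\x80-\x9F][\x80-\xBF]',
        '\xF0[\x90-\xBF][\x80-\xBF]{2}',
        '[\xF1-\xF3][\x80-\xBF]{3}',
        '\xF4[\x80-\x8F][\x80-\xBF]{2}',
    ];
    $pattern = '/\G(?:' . implode('|', $forms) . ')+/';

    $offset = 0;
    $length = strlen($bytes);
    $invalidStart = null;

    while ($offset < $length) {
        if (preg_match($pattern, $bytes, $m, 0, $offset) === 1) {
            if ($invalidStart !== null) {
                // Flush the invalid run that ended just before this match.
                yield [false, substr($bytes, $invalidStart, $offset - $invalidStart)];
                $invalidStart = null;
            }
            yield [true, $m[0]];
            $offset += strlen($m[0]);
        } else {
            // No well-formed sequence starts here; extend the invalid run.
            $invalidStart ??= $offset;
            $offset++;
        }
    }
    if ($invalidStart !== null) {
        yield [false, substr($bytes, $invalidStart)];
    }
}

// One possible policy: replace each invalid span with a single U+FFFD.
function replace_invalid(string $bytes): string
{
    $out = '';
    foreach (utf8_spans($bytes) as [$isValid, $chunk]) {
        $out .= $isValid ? $chunk : "\u{FFFD}";
    }
    return $out;
}

echo bin2hex(replace_invalid("A\xED\xA0\xBDa")), PHP_EOL; // 41efbfbd61
```

Passing the invalid bytes through unaltered or dropping them only changes the one branch in `replace_invalid`; the scan itself never needs to see the whole string before yielding the valid prefix.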
That doesn't really matter here. Iterating over the "CodePointIterator" gives the byte offsets (int is converted to string when passed to
I can see valid arguments for both approaches, but I'm unsure what I'd prefer. I would like some feedback from others. |
See #17404