grapheme_extract should pass over invalid surrogate halves
Many systems incorrectly encode surrogate halves from a UTF-16 stream into UTF-8 as two three-byte sequences instead of the proper single four-byte sequence. These are invalid characters in UTF-8 and should be skipped when decoding with `grapheme_extract`, but it is not currently handling them properly.

> If offset does not point to the first byte of a UTF-8 character,
> the start position is moved to the next character boundary.

For example, U+1F170 (d83c dd70) should encode as F0 9F 85 B0, but when the UTF-8 encoder is invalidly applied to d83c alone, the output is ED A0 BC. This entire span of bytes is invalid UTF-8.

```php
grapheme_extract( "\xED\xA0\xBCa", 1, GRAPHEME_EXTR_COUNT, 0, $next );
// returns "\xED", an invalid UTF-8 byte sequence
// $next === 1, pointing into the middle of the invalid sequence
```

Instead, it should return “a” and point `$next` to the end of the string.
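The byte arithmetic behind the invalid sequence can be sketched as follows. This is an illustration only, not code from `grapheme_extract` or from this commit: `naive_encode_code_unit` and `is_encoded_surrogate_half` are hypothetical helpers, and the `preg_match('//u', ...)` call is used here merely as one convenient way to ask PHP whether a string is well-formed UTF-8.

```php
<?php
// Hypothetical helper: applies the generic three-byte UTF-8 pattern to a
// single 16-bit code unit without checking whether it is a surrogate.
function naive_encode_code_unit(int $unit): string {
    return chr(0xE0 | ($unit >> 12))
         . chr(0x80 | (($unit >> 6) & 0x3F))
         . chr(0x80 | ($unit & 0x3F));
}

// Hypothetical helper: reports whether the three bytes at $offset are an
// encoded surrogate half (U+D800..U+DFFF), i.e. ED A0..BF 80..BF.
function is_encoded_surrogate_half(string $bytes, int $offset): bool {
    if (strlen($bytes) < $offset + 3 || $bytes[$offset] !== "\xED") {
        return false;
    }
    $second = ord($bytes[$offset + 1]);
    $third  = ord($bytes[$offset + 2]);
    return $second >= 0xA0 && $second <= 0xBF
        && $third  >= 0x80 && $third  <= 0xBF;
}

$high = naive_encode_code_unit(0xD83C); // high surrogate of U+1F170
$low  = naive_encode_code_unit(0xDD70); // low surrogate of U+1F170

echo bin2hex($high), "\n"; // eda0bc
echo bin2hex($low), "\n";  // edb5b0

// The correct encoding of U+1F170 is well-formed; the encoded half is not.
var_dump(preg_match('//u', "\xF0\x9F\x85\xB0") === 1); // bool(true)
var_dump(preg_match('//u', $high . 'a') === 1);        // bool(false)
var_dump(is_encoded_surrogate_half($high . 'a', 0));   // bool(true)
```

A decoder that treats the whole three-byte span as one invalid unit would resume at offset 3 in the last example, which is where the “a” begins.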