Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reflect changes in GB 18030-2022 #312

Open
achristensen07 opened this issue Jun 6, 2023 · 8 comments · May be fixed by #336
Open

Reflect changes in GB 18030-2022 #312

achristensen07 opened this issue Jun 6, 2023 · 8 comments · May be fixed by #336
Labels
i18n-clreq Notifies Chinese script experts of relevant issues i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@achristensen07
Copy link

https://encoding.spec.whatwg.org/index-gb18030.txt contains 18 code points that have been changed by GB 18030-2022. We should probably update.

@domenic
Copy link
Member

domenic commented Jun 7, 2023

Related: #27 #57. It's not clear whether web browsers want to update these, or leave legacy encodings forever-unchanged.

webkit-commit-queue pushed a commit to achristensen07/WebKit that referenced this issue Jun 7, 2023
https://bugs.webkit.org/show_bug.cgi?id=257770
rdar://110353061

Reviewed by Myles C. Maxfield.

This was already done internally in ICU in rdar://107702106
This reflects changes published as GB-18030-2022
This was proposed as a change to the standard at whatwg/encoding#312
This fixes an assertion when running encoding tests on macOS Sonoma and iOS 17,
and I added test coverage specific to the 18 changed code points.

* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gb18030/gb18030-encoder-expected.txt:
* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gb18030/gb18030-encoder.html:
* Source/WTF/wtf/PlatformHave.h:
* Source/WebCore/PAL/pal/text/EncodingTables.cpp:
(PAL::gb18030):

Canonical link: https://commits.webkit.org/264918@main
@litherum
Copy link

litherum commented Jun 14, 2023

We, at least, definitely want to update (in fact, we already did). The definition of this encoding has been updated upstream. There needs to be a consistent behavior between the web browser on a platform and native apps on the platform. As the adage goes, "the future is longer than the past" and there will be more content produced with the new mappings than there is existing content. We can't just close our eyes and hope that all authors use UTF-8, especially when there are laws requiring that ~all products sold in certain places must conform.

@hsivonen
Copy link
Member

@litherum
Copy link

Yes.

mnutt pushed a commit to movableink/webkit that referenced this issue Jun 28, 2023
https://bugs.webkit.org/show_bug.cgi?id=257770
rdar://110353061

Reviewed by Myles C. Maxfield.

This was already done internally in ICU in rdar://107702106
This reflects changes published as GB-18030-2022
This was proposed as a change to the standard at whatwg/encoding#312
This fixes an assertion when running encoding tests on macOS Sonoma and iOS 17,
and I added test coverage specific to the 18 changed code points.

* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gb18030/gb18030-encoder-expected.txt:
* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gb18030/gb18030-encoder.html:
* Source/WTF/wtf/PlatformHave.h:
* Source/WebCore/PAL/pal/text/EncodingTables.cpp:
(PAL::gb18030):

Canonical link: https://commits.webkit.org/264918@main
@achristensen07
Copy link
Author

We ended up moving away from that recommendation and going with the exact GB 18030-2022 mappings instead.

@xfq xfq added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. i18n-clreq Notifies Chinese script experts of relevant issues labels Oct 13, 2023
@achristensen07
Copy link
Author

We have since moved to the Unicode recommendation. Either way, something related to GB 18030-2022 should probably be reflected in this standard.

@annevk
Copy link
Member

annevk commented Sep 17, 2024

These are the Unicode recommendations:

# GB 18030 & Unicode Standard Transcoding Recommendations
#
# Transcoding Between GB 18030 & Unicode Standard
0xA6D9 <-> U+FE10
0xA6DA <-> U+FE12
0xA6DB <-> U+FE11
0xA6DC <-> U+FE13
0xA6DD <-> U+FE14
0xA6DE <-> U+FE15
0xA6DF <-> U+FE16
0xA6EC <-> U+FE17
0xA6ED <-> U+FE18
0xA6F3 <-> U+FE19
0xFE59 <-> U+9FB4
0xFE61 <-> U+9FB5
0xFE66 <-> U+9FB6
0xFE67 <-> U+9FB7
0xFE6D <-> U+9FB8
0xFE7E <-> U+9FB9
0xFE90 <-> U+9FBA
0xFEA0 <-> U+9FBB
#
# Transcoding From Unicode Standard to GB 18030
U+E78D -> 0xA6D9
U+E78E -> 0xA6DA
U+E78F -> 0xA6DB
U+E790 -> 0xA6DC
U+E791 -> 0xA6DD
U+E792 -> 0xA6DE
U+E793 -> 0xA6DF
U+E794 -> 0xA6EC
U+E795 -> 0xA6ED
U+E796 -> 0xA6F3
U+E81E -> 0xFE59
U+E826 -> 0xFE61
U+E82B -> 0xFE66
U+E82C -> 0xFE67
U+E832 -> 0xFE6D
U+E843 -> 0xFE7E
U+E854 -> 0xFE90
U+E864 -> 0xFEA0
#
# Transcoding From GB 18030 to Unicode Standard
0x82359037 -> U+9FB4
0x82359038 -> U+9FB5
0x82359039 -> U+9FB6
0x82359130 -> U+9FB7
0x82359131 -> U+9FB8
0x82359132 -> U+9FB9
0x82359133 -> U+9FBA
0x82359134 -> U+9FBB
0x84318236 -> U+FE10
0x84318237 -> U+FE11
0x84318238 -> U+FE12
0x84318239 -> U+FE13
0x84318330 -> U+FE14
0x84318331 -> U+FE15
0x84318332 -> U+FE16
0x84318333 -> U+FE17
0x84318334 -> U+FE18
0x84318335 -> U+FE19
#
# EOF

Notably these only impact gb18030, not GBK. So I don't think we want to change index-gb18030 in the Encoding Standard. Although we might want to update the note about it reflecting GB18030-2005 in some way?

Instead we would have to directly patch the gb18030 encoder and decoder. Albeit likely somewhat ugly that does not seem too bad. I think I would also duplicate the code points that can be transcoded in two directions for simplicity, although we could also create a mini index for them.

@annevk
Copy link
Member

annevk commented Sep 17, 2024

I created #336 which I hope addresses this.

@annevk annevk mentioned this issue Sep 18, 2024
5 tasks
annevk added a commit that referenced this issue Sep 18, 2024
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a matter suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030.

In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. The second set of 18 mappings (from PUA to bytes) are encoded as an encoder only table, for both GBK and gb18030.
3. The third set of 18 mappings (from bytes to code points) are ignored, as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)

The reason for changing GBK as well is because Chromium and WebKit have already code in the wild that impacts GBK to some degree (although the encoder only table is excluded for GBK only at the moment, including that would make the most sense compatibility-wise) and no fallout has been recorded. Additionally GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. The aim is to complete that with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312.
@annevk annevk linked a pull request Sep 18, 2024 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
i18n-clreq Notifies Chinese script experts of relevant issues i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.
Development

Successfully merging a pull request may close this issue.

6 participants