Document icu_locid's relationship to backwards compatibility syntax #3989

hsivonen · 2023-09-01T09:29:08Z

ECMA-402 IsStructurallyValidLanguageTag says:

If lowerLocale uses any of the backwards compatibility syntax described in Unicode Technical Standard #35 Part 1 Core, Section 3.3 BCP 47 Conformance, return false.

From reading https://unicode-org.github.io/icu4x/docs/icu_locid/struct.Locale.html , it's unclear to me what the correspondence of icu_locid::Locale::try_from_bytes to IsStructurallyValidLanguageTag is. The docs should say if backwards compatibility syntax is allowed or not.


hsivonen · 2023-09-01T09:29:16Z

CC @zbraniecki

sffc · 2024-12-13T22:48:03Z

The "BCP 47 Conformance" section says

It allows certain syntax for backwards compatibility (not BCP 47-compatible):

The "_" character for field separator characters, as well as the "-" used in [BCP47] (however, the canonical form is with "-")

The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)

The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.

There is this fn in Test262:

https://github.com/tc39/test262/blob/dc0082c5ea347e5ecb585c1d7ebf4555aa429528/harness/testIntl.js#L320

There are also some tests in that file.

I think the next steps on this issue are:

Make a test in ICU4X similar to the Test262 test above.
If it passes, add the appropriate docs to Locale and close this issue.

zbraniecki · 2024-12-15T00:19:37Z

We can see the current conformance in Boa's suite - https://boajs.dev/conformance

zbraniecki · 2024-12-15T00:29:00Z

We're failing the following three tests:

Here are the missed failures from (1):

de_DE is an invalid tag value
DE_de is an invalid tag value
cmn_Hans is an invalid tag value
cmn-hans_cn is an invalid tag value
es_419 is an invalid tag value
es-419-u-nu-latn-cu_bob is an invalid tag value
cmn-hans-cn-t-ca-u-ca-x_t-u is an invalid tag value
de-gregory_u-ca-gregory is an invalid tag value
si-x is an invalid tag value

zbraniecki · 2024-12-15T00:36:44Z

si-x has been fixed in 2.0, and the rest are all about _ being invalid.

zbraniecki · 2024-12-15T06:37:26Z

In (2) the errors are:

new Intl.Locale("en-u-ca-islamicc").calendar returns islamic-civil
new Intl.Locale("en", { calendar: "islamicc" }).calendar returns islamic-civil
new Intl.Locale("en-u-ca-ethiopic-amete-alem").calendar returns ethioaa
new Intl.Locale("en", { calendar: "ethiopic-amete-alem" }).calendar returns ethioaa
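For illustration, the alias step those four tests expect can be sketched as a small lookup. This is a minimal sketch, not ICU4X or Boa code: `canonicalize_calendar_sketch` is a hypothetical name, and it only covers the two aliases listed above rather than the full CLDR alias data.

```rust
/// Maps a `-u-ca-` calendar value to its canonical form.
/// Hypothetical helper: only the two aliases reported in this thread are
/// covered; a real implementation would consult the full CLDR alias data.
fn canonicalize_calendar_sketch(value: &str) -> String {
    let lower = value.to_ascii_lowercase();
    match lower.as_str() {
        // Legacy aliases and the canonical values the tests expect.
        "islamicc" => "islamic-civil".to_string(),
        "ethiopic-amete-alem" => "ethioaa".to_string(),
        // Anything else is passed through, lowercased.
        _ => lower,
    }
}
```

Values that are already canonical, such as `gregory`, pass through unchanged.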

zbraniecki · 2024-12-15T06:49:42Z

(3) is just posix not being a valid locale to parse. V8 also rejects it, while SpiderMonkey and JSCore accept it.

zbraniecki · 2024-12-15T06:50:21Z

I'll mark it for discussion on how to address it.

zbraniecki · 2024-12-15T07:08:59Z

Decisions to be made:

1. How do we want to handle the `_` separator?

Options:
a) Remove support for _ from ICU4X 2.0
b) Introduce a bcp47_mode to our parser
c) Introduce LocaleParserConfig with separator type
d) Introduce Locale::try_from_utf8_bcp47 which for now can just check for _ before parsing.
e) Introduce BCP47Locale ?

2. How to handle alias resolution

a) Include in parsing
b) Advise Boa and document that for ECMA-402 compat LocaleCanonicalizer has to be used

3. What should we do with 5-8 letter language subtags

They are allowed in Unicode Locale Id, but not in Unicode BCP47 Locale Id.

sffc · 2024-12-15T20:48:17Z

I recall a discussion from a little while ago where we said we would accept _ and Boa should just check for the _, and otherwise we would be compliant. I'm okay with that, but we should document clearly with some examples and test cases.

For alias resolution, that should be the job of the LocaleCanonicalizer, I think.
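As a rough sketch of the "caller checks for `_`" approach, assuming the parser itself still accepts `_`: the caller rejects the tag before it ever reaches the parser. The names `TagPrecheckError` and `precheck_ecma402_separators` are made up for this example and are not part of ICU4X or Boa.

```rust
/// Hypothetical error type for the pre-check; not an ICU4X type.
#[derive(Debug, PartialEq)]
enum TagPrecheckError {
    /// The tag uses `_`, which is backwards compatibility syntax.
    UnderscoreSeparator,
}

/// Rejects tags containing `_` so that only `-`-separated input is handed
/// on to a parser that would otherwise accept both separators.
fn precheck_ecma402_separators(tag: &str) -> Result<&str, TagPrecheckError> {
    if tag.contains('_') {
        Err(TagPrecheckError::UnderscoreSeparator)
    } else {
        Ok(tag)
    }
}
```

With this in front of the parser, `de_DE` and `es-419-u-nu-latn-cu_bob` from the failing list are rejected, while `de-DE` is passed through.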

Manishearth · 2024-12-19T20:28:10Z

@hsivonen If no browsers support the long language subtags, we should remove that from ECMA-402
@zbraniecki Agreed. But my concern is that I don't want ECMA-402 to produce a third locale type: we already have the Unicode Locale and the BCP-47 Locale. But I don't want to punish stack size and performance for everyone.
@sffc Seems like we can just say that the ECMA-402 locale is the strict subset.

hsivonen · 2024-12-20T09:29:07Z

Filed https://bugzilla.mozilla.org/show_bug.cgi?id=1938524 . I encourage pursuing a use counter for other engines.

Manishearth · 2024-12-20T23:07:26Z

I documented the differences in terms of EBNF syntax

UTS 35

unicode_language_id            = "root" | (unicode_language_subtag (sep unicode_script_subtag)? | unicode_script_subtag) (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
unicode_language_subtag        = alpha{2,3} | alpha{5,8};
unicode_script_subtag          = alpha{4} ;
unicode_region_subtag          = (alpha{2} | digit{3}) ;
unicode_variant_subtag         = (alphanum{5,8} | digit alphanum{3}) ;
sep                            = [-_] ;
digit                          = [0-9] ;
alpha                          = [A-Z a-z] ;
alphanum                       = [0-9 A-Z a-z] ;



unicode_locale_id              = unicode_language_id extensions*  pu_extensions? ;  
extensions                     = unicode_locale_extensions | transformed_extensions | other_extensions ;    
unicode_locale_extensions      = sep [uU] ((sep keyword)+ |(sep attribute)+ (sep ufield)*) ;    
transformed_extensions         = sep [tT] ((sep tlang (sep tfield)*) | (sep tfield)+) ;    
pu_extensions                  = sep [xX] (sep alphanum{1,8})+ ;  
other_extensions               = sep [alphanum-[tTuUxX]] (sep alphanum{2,8})+ ;  
ufield / keyword               = ukey (sep uvalue)? ;  
ukey   / key                   = alphanum alpha ;
uvalue / type                  = alphanum{3,8} (sep alphanum{3,8})* ; 
attribute                      = alphanum{3,8} ;   
unicode_subdivision_id         = unicode_region_subtag unicode_subdivision_suffix ; 
unicode_subdivision_suffix     = alphanum{1,4} ;   
unicode_measure_unit           = alphanum{3,8} (sep alphanum{3,8})* ;
tlang                          = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
tfield                         = tkey tvalue;
tkey                           = alpha digit ; 
tvalue                         = (sep alphanum{3,8})+ ;

Notable quirk: pu_extensions will include 1-length subtags, so und-x-foo-bar-u-nu-latn does not have a unicode extension: everything after -x- is private use, including the u.
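To make that quirk concrete, here is a minimal sketch that scans singletons and stops at the private use marker. It assumes a well-formed, `-`-separated tag and is not how ICU4X parses; `has_unicode_extension_sketch` is a hypothetical name.

```rust
/// Returns true if the tag has a `-u-` extension that is not swallowed by
/// a preceding `-x-` private use section.
/// Assumes well-formed input: outside private use, every 1-character
/// subtag after the language subtag is an extension singleton.
fn has_unicode_extension_sketch(tag: &str) -> bool {
    // Skip the language subtag, then look only at 1-character subtags.
    for subtag in tag.split('-').skip(1) {
        if subtag.len() == 1 {
            match subtag.to_ascii_lowercase().as_str() {
                // Everything after `x` is private use, including any `u`.
                "x" => return false,
                "u" => return true,
                _ => {}
            }
        }
    }
    false
}
```

So `und-u-nu-latn-x-foo` has a unicode extension, but `und-x-foo-bar-u-nu-latn` does not.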

UTS 35 without backcompat as used by ECMA402

# All other nonterminals from UTS 35 above


# The "_" character for field separator characters, as well as the "-" used in [BCP47] (however, the canonical form is with "-")
sep                            = [-] ;

# The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
# The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.
unicode_language_id            = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;

# Simplification
tlang = unicode_language_id

Further ECMA402 constraints:

no duplicate singleton subtags
no duplicate language variants in the transform
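The "no duplicate singleton subtags" constraint could be checked roughly like this. It is a sketch under the same assumptions as the grammar above (well-formed tag, `-` separator only); `has_duplicate_singleton_sketch` is a hypothetical name, not an ICU4X or ECMA-402 API.

```rust
use std::collections::HashSet;

/// Returns true if the same extension singleton (case-insensitively)
/// appears twice before any private use section.
fn has_duplicate_singleton_sketch(tag: &str) -> bool {
    let mut seen = HashSet::new();
    for subtag in tag.split('-').skip(1) {
        if subtag.len() != 1 {
            continue;
        }
        let singleton = subtag.to_ascii_lowercase();
        if singleton == "x" {
            // The rest of the tag is private use; 1-length subtags there
            // are not singletons, so stop looking.
            return false;
        }
        // `insert` returns false when the singleton was already present.
        if !seen.insert(singleton) {
            return true;
        }
    }
    false
}
```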

ICU4X

# All other nonterminals from UTS 35 above


# No "root" locale
# Does not allow starting with a script, MUST use und
unicode_language_id            = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;

# Does not support larger language subtags
unicode_language_subtag        = alpha{2,3} ;

# Simplification
tlang = unicode_language_id

# These are non syntactic distinctions that affect the format of the parse result
# ICU4X just doesn't distinguish between ufield and keyword
unicode_locale_extensions      = sep [uU] ((sep keyword)+ |(sep attribute)+ (sep keyword)*) ;
# ICU4X doesn't actually *store* values that are called "true", so `u-xy-true` just parses as a keyword `xy`.    
uvalue / type                  = ("true" | alphanum{3,8}) (sep ("true" | alphanum{3,8}))*

Further ICU4X constraints:

no duplicate singleton subtags
merges duplicate unicode attributes
merges duplicate transform fields
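The language subtag difference between the grammars above is small enough to show directly. A minimal sketch, with hypothetical function names, that transcribes `alpha{2,3} | alpha{5,8}` versus `alpha{2,3}`:

```rust
/// True if `s` consists of between `min` and `max` ASCII letters.
fn is_alpha_run(s: &str, min: usize, max: usize) -> bool {
    (min..=max).contains(&s.len()) && s.bytes().all(|b| b.is_ascii_alphabetic())
}

/// unicode_language_subtag as in the UTS 35 grammar: alpha{2,3} | alpha{5,8}.
fn is_uts35_language_subtag(s: &str) -> bool {
    is_alpha_run(s, 2, 3) || is_alpha_run(s, 5, 8)
}

/// unicode_language_subtag as in the ICU4X grammar: alpha{2,3} only.
fn is_icu4x_language_subtag(s: &str) -> bool {
    is_alpha_run(s, 2, 3)
}
```

A made-up 8-letter subtag such as `enochian` passes the first check and fails the second; 4-letter subtags fail both, since that length is taken by scripts.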

Manishearth · 2024-12-20T23:16:00Z

The main differences are:

ICU4X allows _ as a separator, ECMA402 does not
ICU4X language subtags are only 2 or 3 characters, ECMA402 allows longer
ICU4X treats -u-kw-true as identical to -u-kw, which may cause serialization differences. This should not matter practically.

ECMA402 and ICU4X agree on not starting locales with scripts. They also agree on no duplicate singleton subtags.

ICU4X does not error on, but will merge entries for duplicate unicode attributes and transform fields. ECMA402 requires that transform language variants have no duplicates.
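To illustrate the `-u-kw-true` point, here is a sketch of how dropping a `true` value affects serialization. It only handles the case where the whole value is `true`, and `normalize_keyword_sketch` is a hypothetical helper, not the ICU4X implementation.

```rust
/// Serializes a unicode extension keyword, treating a value of `true`
/// the same as no value at all.
fn normalize_keyword_sketch(key: &str, value: Option<&str>) -> String {
    match value {
        // A real value is kept and written after the key.
        Some(v) if !v.eq_ignore_ascii_case("true") => {
            format!("{}-{}", key, v).to_ascii_lowercase()
        }
        // `true` or a missing value both serialize as the bare key.
        _ => key.to_ascii_lowercase(),
    }
}
```

Both `kw-true` and a bare `kw` come out as `kw`, which is where a round-trip through such a parser can differ from the input string.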

I think what we should do is this:

Move ICU4X away from _ in locale parsing. People wishing to use ICU4X with legacy locales can perform the replacement, or we can provide a legacy parsing function
Fix the ICU4X bug
Update ECMA402 to not allow long language subtags. They're not used yet.
Update ICU4X to have the "no duplicate language variants in transform" constraint, it seems reasonable.
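The "no duplicate language variants in transform" constraint from the last item could look something like this. It is a sketch that takes the tlang portion on its own (for example `sl-rozaj-biske`) and assumes it is already well-formed; the function names are hypothetical.

```rust
use std::collections::HashSet;

/// unicode_variant_subtag shape: alphanum{5,8} | digit alphanum{3}.
fn is_variant_shape(s: &str) -> bool {
    let b = s.as_bytes();
    let alnum = b.iter().all(|c| c.is_ascii_alphanumeric());
    alnum && ((5..=8).contains(&b.len()) || (b.len() == 4 && b[0].is_ascii_digit()))
}

/// Returns true if a variant subtag repeats (case-insensitively) in tlang.
/// The first subtag is skipped because it is the language, which under
/// UTS 35 may itself be 5-8 letters long.
fn tlang_has_duplicate_variant_sketch(tlang: &str) -> bool {
    let mut seen = HashSet::new();
    tlang
        .split('-')
        .skip(1)
        .filter(|s| is_variant_shape(s))
        .any(|s| !seen.insert(s.to_ascii_lowercase()))
}
```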

For aliases, I do not think this is a parsing concern. ECMA402 implementations should canonicalize during parsing, as documented in Intl.Locale(). ICU4X may need to learn how to canonicalize calendars.

sffc · 2024-12-21T05:03:23Z

TG2 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-12-19.md#document-icu_locids-relationship-to-backwards-compatibility-syntax-3989

sffc · 2024-12-21T05:07:11Z

402 issue for removing support for long language subtags: tc39/ecma402#951

sffc · 2024-12-26T19:07:39Z

Thought: if we wanted to support long language subtags, we could do it without impacting the stack size of LanguageIdentifier by stuffing it in the Variants list and adding an enum variant to the Language field (there should be enough niches for that).

Manishearth · 2024-12-26T19:17:11Z

@sffc yeah I was considering that. I couldn't come up with a great design for it but there are hacks.


hsivonen added C-locale Component: Locale identifiers, BCP47 U-ecma402 User: ECMA-402 compatibility labels Sep 1, 2023

sffc added T-docs-tests Type: Code change outside core library S-small Size: One afternoon (small bug fix or enhancement) labels Sep 21, 2023

sffc self-assigned this Sep 21, 2023

sffc added this to the 1.x Priority ⟨P2⟩ milestone Sep 21, 2023

sffc assigned zbraniecki and unassigned sffc Dec 14, 2024

zbraniecki added the discuss Discuss at a future ICU4X-SC meeting label Dec 15, 2024

sffc added discuss-priority Discuss at the next ICU4X meeting and removed discuss Discuss at a future ICU4X-SC meeting labels Dec 15, 2024

sffc moved this to Priority Issues in ECMA-402 Meeting Topics Dec 19, 2024

sffc added this to ECMA-402 Meeting Topics Dec 19, 2024

sffc moved this from Priority Issues to Previously Discussed in ECMA-402 Meeting Topics Dec 19, 2024

Manishearth removed the discuss-priority Discuss at the next ICU4X meeting label Dec 19, 2024

sffc assigned Manishearth and unassigned zbraniecki Dec 20, 2024

Manishearth assigned sffc Dec 20, 2024

Manishearth mentioned this issue Dec 20, 2024

Stop accepting underscores as subtag separators #5943

Merged

sffc mentioned this issue Dec 21, 2024

ECMA-402 should stop accepting language subtags with more than 3 letters tc39/ecma402#951

Open

Manishearth added a commit that referenced this issue Jan 14, 2025

Stop accepting underscores as subtag separators (#5943)

77ee3e8

Part of #3989
