Document icu_locid's relationship to backwards compatibility syntax #3989

Open
hsivonen opened this issue Sep 1, 2023 · 18 comments
Assignees
Labels
C-locale Component: Locale identifiers, BCP47 S-small Size: One afternoon (small bug fix or enhancement) T-docs-tests Type: Code change outside core library U-ecma402 User: ECMA-402 compatibility

Comments

@hsivonen
Member

hsivonen commented Sep 1, 2023

ECMA-402 IsStructurallyValidLanguageTag says:

If lowerLocale uses any of the backwards compatibility syntax described in Unicode Technical Standard #35 Part 1 Core, Section 3.3 BCP 47 Conformance, return false.

From reading https://unicode-org.github.io/icu4x/docs/icu_locid/struct.Locale.html , it's unclear to me what the correspondence of icu_locid::Locale::try_from_bytes to IsStructurallyValidLanguageTag is. The docs should say if backwards compatibility syntax is allowed or not.

@hsivonen hsivonen added C-locale Component: Locale identifiers, BCP47 U-ecma402 User: ECMA-402 compatibility labels Sep 1, 2023
@hsivonen
Member Author

hsivonen commented Sep 1, 2023

CC @zbraniecki

@sffc sffc added T-docs-tests Type: Code change outside core library S-small Size: One afternoon (small bug fix or enhancement) labels Sep 21, 2023
@sffc sffc self-assigned this Sep 21, 2023
@sffc sffc added this to the 1.x Priority ⟨P2⟩ milestone Sep 21, 2023
@sffc
Member

sffc commented Dec 13, 2024

The "BCP 47 Conformance" section says

It allows certain syntax for backwards compatibility (not BCP 47-compatible):

  • The "_" character for field separator characters, as well as the "-" used in [BCP47] (however, the canonical form is with "-")
  • The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
  • The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.

There is this fn in Test262:

https://github.com/tc39/test262/blob/dc0082c5ea347e5ecb585c1d7ebf4555aa429528/harness/testIntl.js#L320

There are also some tests in that file.

I think the next steps on this issue are:

  1. Make a test in ICU4X similar to the Test262 test above.
  2. If it passes, add the appropriate docs to Locale and close this issue.

@sffc sffc assigned zbraniecki and unassigned sffc Dec 14, 2024
@zbraniecki
Member

We can see the current conformance in Boa's suite - https://boajs.dev/conformance

@zbraniecki
Member

We're failing the following three tests:

  1. https://github.com/tc39/test262/blob/main/test/intl402/Locale/invalid-tag-throws.js
  2. https://github.com/tc39/test262/blob/main/test/intl402/Locale/constructor-options-canonicalized.js
  3. https://github.com/tc39/test262/blob/main/test/intl402/Locale/constructor-non-iana-canon.js

Here are the missed failures from (1):

de_DE is an invalid tag value
DE_de is an invalid tag value
cmn_Hans is an invalid tag value
cmn-hans_cn is an invalid tag value
es_419 is an invalid tag value
es-419-u-nu-latn-cu_bob is an invalid tag value
cmn-hans-cn-t-ca-u-ca-x_t-u is an invalid tag value
de-gregory_u-ca-gregory is an invalid tag value
si-x is an invalid tag value

@zbraniecki
Member

si-x has been fixed in 2.0, and the rest are all about _ being invalid.

@zbraniecki
Member

In (2) the errors are:

new Intl.Locale("en-u-ca-islamicc").calendar returns islamic-civil
new Intl.Locale("en", { calendar: "islamicc" }).calendar returns islamic-civil
new Intl.Locale("en-u-ca-ethiopic-amete-alem").calendar returns ethioaa
new Intl.Locale("en", { calendar: "ethiopic-amete-alem" }).calendar returns ethioaa
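Reading the messages above as the results the test expects, the missing step is an alias lookup on the calendar value. A minimal sketch of such a lookup follows; `canonicalize_calendar` is a hypothetical helper, not an ICU4X API, and the table holds only the two aliases quoted above rather than the full CLDR alias data.

```rust
/// Hypothetical helper (not part of ICU4X): maps a `ca` value to its
/// canonical form. Only the two aliases quoted in this thread are listed;
/// a real implementation would be driven by CLDR alias data.
fn canonicalize_calendar(value: &str) -> &str {
    match value {
        "islamicc" => "islamic-civil",
        "ethiopic-amete-alem" => "ethioaa",
        // Anything not in the table is returned unchanged.
        other => other,
    }
}
```

Values that are already canonical, such as `gregory`, pass through untouched.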

@zbraniecki
Member

Failure (3) is just posix not being a valid locale to parse. V8 also rejects it, while SpiderMonkey and JSCore accept it.

@zbraniecki zbraniecki added the discuss Discuss at a future ICU4X-SC meeting label Dec 15, 2024
@zbraniecki
Member

I'll mark it for discussion on how to address it.

@zbraniecki
Member

Decisions to be made:

1. How do we want to handle _ separator.

Options:
a) Remove support for _ from ICU4X 2.0
b) Introduce a bcp47_mode to our parser
c) Introduce LocaleParserConfig with separator type
d) Introduce Locale::try_from_utf8_bcp47 which for now can just check for _ before parsing.
e) Introduce BCP47Locale ?

2. How to handle alias resolution

a) Include in parsing
b) Advise Boa and document that for ECMA-402 compat LocaleCanonicalizer has to be used

3. What should we do with 5-8 character language subtags

They are allowed in Unicode Locale Id, but not in Unicode BCP47 Locale Id.

@sffc sffc added discuss-priority Discuss at the next ICU4X meeting and removed discuss Discuss at a future ICU4X-SC meeting labels Dec 15, 2024
@sffc
Member

sffc commented Dec 15, 2024

I recall a discussion from a little while ago where we said we would accept _ and Boa should just check for the _, and otherwise we would be compliant. I'm okay with that, but we should document clearly with some examples and test cases.

For alias resolution, that should be the job of the LocaleCanonicalizer, I think.
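As a rough illustration of the "just check for the _" approach, a caller such as an ECMA-402 engine could run a pre-check before handing the tag to the locale parser. Both function names below are hypothetical and the actual parser call is left out, so this is only a sketch of the check itself.

```rust
/// Hypothetical pre-check, not an ICU4X API: returns true when the tag uses
/// the "_" separator that UTS 35 allows only for backwards compatibility.
fn uses_underscore_separator(tag: &str) -> bool {
    tag.bytes().any(|b| b == b'_')
}

/// Hypothetical wrapper: rejects "_" up front. On success the caller would
/// pass the returned tag on to the real locale parser (not shown here).
fn check_ecma402_separator(tag: &str) -> Result<&str, &'static str> {
    if uses_underscore_separator(tag) {
        Err("backwards compatibility separator '_' is not allowed")
    } else {
        Ok(tag)
    }
}
```

With this in place, `de_DE` and `es-419-u-nu-latn-cu_bob` from the failing test list are rejected before parsing, while `de-DE` goes through unchanged.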

@sffc sffc moved this to Priority Issues in ECMA-402 Meeting Topics Dec 19, 2024
@sffc sffc moved this from Priority Issues to Previously Discussed in ECMA-402 Meeting Topics Dec 19, 2024
@Manishearth
Member

  • @hsivonen If no browsers support the long language subtags, we should remove that from ECMA-402
  • @zbraniecki Agreed. But my concern is that I don't want ECMA-402 to produce a third locale type: we already have the Unicode Locale and the BCP-47 Locale. But I don't want to punish stack size and performance for everyone.
  • @sffc Seems like we can just say that the ECMA-402 locale is the strict subset.

@Manishearth Manishearth removed the discuss-priority Discuss at the next ICU4X meeting label Dec 19, 2024
@sffc sffc assigned Manishearth and unassigned zbraniecki Dec 20, 2024
@hsivonen
Member Author

Filed https://bugzilla.mozilla.org/show_bug.cgi?id=1938524 . I encourage pursuing a use counter for other engines.

@Manishearth
Member

Manishearth commented Dec 20, 2024

I documented the differences in terms of EBNF syntax

UTS 35

unicode_language_id            = "root" | (unicode_language_subtag (sep unicode_script_subtag)? | unicode_script_subtag) (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
unicode_language_subtag        = alpha{2,3} | alpha{5,8};
unicode_script_subtag          = alpha{4} ;
unicode_region_subtag          = (alpha{2} | digit{3}) ;
unicode_variant_subtag         = (alphanum{5,8} | digit alphanum{3}) ;
sep                            = [-_] ;
digit                          = [0-9] ;
alpha                          = [A-Z a-z] ;
alphanum                       = [0-9 A-Z a-z] ;



unicode_locale_id              = unicode_language_id extensions*  pu_extensions? ;  
extensions                     = unicode_locale_extensions | transformed_extensions | other_extensions ;    
unicode_locale_extensions      = sep [uU] ((sep keyword)+ |(sep attribute)+ (sep ufield)*) ;    
transformed_extensions         = sep [tT] ((sep tlang (sep tfield)*) | (sep tfield)+) ;    
pu_extensions                  = sep [xX] (sep alphanum{1,8})+ ;  
other_extensions               = sep [alphanum-[tTuUxX]] (sep alphanum{2,8})+ ;  
ufield / keyword               = ukey (sep uvalue)? ;  
ukey   / key                   = alphanum alpha ;
uvalue / type                  = alphanum{3,8} (sep alphanum{3,8})* ; 
attribute                      = alphanum{3,8} ;   
unicode_subdivision_id         = unicode_region_subtag unicode_subdivision_suffix ; 
unicode_subdivision_suffix     = alphanum{1,4} ;   
unicode_measure_unit           = alphanum{3,8} (sep alphanum{3,8})* ;
tlang                          = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
tfield                         = tkey tvalue;
tkey                           = alpha digit ; 
tvalue                         = alphanum{3,8} (sep alphanum{3,8})+ ;  

Notable quirk: pu_extensions accepts 1-length subtags, so in und-x-foo-bar-u-nu-latn the trailing u-nu-latn is part of the private use extension and the locale does not have a Unicode extension.
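The quirk can be seen by splitting a tag at its first `x` singleton: everything after that point belongs to the private use extension, including anything that looks like a `u` extension. The sketch below uses a hypothetical helper, `split_private_use`, and assumes the tag uses only `-` as a separator; it is not how ICU4X itself parses.

```rust
/// Hypothetical helper: splits a "-"-separated tag at the first "x"
/// singleton (case-insensitive). Returns the part before the private use
/// extension and, if present, the private use subtags that follow the "x".
fn split_private_use(tag: &str) -> (&str, Option<&str>) {
    let mut offset = 0; // byte offset of the current subtag within `tag`
    for (i, subtag) in tag.split('-').enumerate() {
        if i > 0 && subtag.eq_ignore_ascii_case("x") {
            // `offset` points at the "x"; the separator before it is at offset - 1.
            let head = &tag[..offset - 1];
            // Skip "x-" to reach the private use subtags, if there are any.
            let rest = tag.get(offset + 2..).filter(|s| !s.is_empty());
            return (head, rest);
        }
        offset += subtag.len() + 1;
    }
    (tag, None)
}
```

For `und-x-foo-bar-u-nu-latn` this yields `und` plus the private use part `foo-bar-u-nu-latn`, so no `u` extension remains to be parsed.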

UTS 35 without backcompat as used by ECMA402

# All other nonterminals from UTS 35 above


# The "_" character for field separator characters, as well as the "-" used in [BCP47] (however, the canonical form is with "-")
sep                            = [-] ;

# The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
# The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.
unicode_language_id            = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;

# Simplification
tlang = unicode_language_id

Further ECMA402 constraints:

  • no duplicate singleton subtags
  • no duplicate language variants in the transform
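The first constraint can be checked without a full parse, since before the private use extension every 1-length subtag is a singleton. A minimal sketch, assuming a `-`-separated tag and using a hypothetical function name:

```rust
use std::collections::HashSet;

/// Hypothetical check: returns true if the same extension singleton
/// (compared case-insensitively) appears twice before any "x" extension.
fn has_duplicate_singleton(tag: &str) -> bool {
    let mut seen = HashSet::new();
    // Skip the first subtag, which is the language subtag.
    for subtag in tag.split('-').skip(1) {
        if subtag.len() != 1 {
            continue;
        }
        let singleton = subtag.as_bytes()[0].to_ascii_lowercase();
        if singleton == b'x' {
            // Everything after "x" is private use and is not inspected.
            break;
        }
        if !seen.insert(singleton) {
            return true;
        }
    }
    false
}
```

Here `en-u-ca-gregory-u-nu-latn` is flagged, while `en-x-u-u` is not, because both `u` subtags sit inside the private use extension.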

ICU4X

# All other nonterminals from UTS 35 above


# No "root" locale
# Does not allow starting with a script, MUST use und
unicode_language_id            = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;

# Does not support larger language subtags
unicode_language_subtag        = alpha{2,3} ;

# Simplification
tlang = unicode_language_id

# These are non syntactic distinctions that affect the format of the parse result
# ICU4X just doesn't distinguish between ufield and keyword
unicode_locale_extensions      = sep [uU] ((sep keyword)+ |(sep attribute)+ (sep keyword)*) ;
# ICU4X doesn't actually *store* values that are called "true", so `u-xy-true` just parses as a keyword `xy`.    
uvalue / type                  = ("true" | alphanum{3,8}) (sep ("true" | alphanum{3,8}))* 

Further ICU4X constraints:

  • no duplicate singleton subtags
  • merges duplicate unicode attributes
  • merges duplicate transform fields
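The language subtag difference between the UTS 35 grammar and the ICU4X grammar above can be written as two predicates. The function names are made up for illustration and are not ICU4X APIs; the lengths come straight from the `unicode_language_subtag` productions.

```rust
/// True if `s` is non-empty and consists only of ASCII letters.
fn is_alpha(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphabetic())
}

/// UTS 35 production: alpha{2,3} | alpha{5,8}
fn is_uts35_language_subtag(s: &str) -> bool {
    is_alpha(s) && matches!(s.len(), 2 | 3 | 5..=8)
}

/// ICU4X production as documented above: alpha{2,3}
fn is_icu4x_language_subtag(s: &str) -> bool {
    is_alpha(s) && matches!(s.len(), 2 | 3)
}
```

A 5-letter subtag satisfies the first predicate but not the second; a 4-letter subtag satisfies neither, since that length is reserved for scripts.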

@Manishearth
Member

Manishearth commented Dec 20, 2024

The main differences are:

  • ICU4X allows _ as a separator, ECMA402 does not
  • ICU4X language subtags are only 2 or 3 characters, ECMA402 allows longer
  • ICU4X treats -u-kw-true as identical to -u-kw, which may cause serialization differences. This should not matter practically.
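The third point can be illustrated with a small normalization step. This is a sketch based only on the `uvalue / type` production given for ICU4X above, with a hypothetical function name; it assumes every `true` subtag in a value is dropped and does not reproduce ICU4X's actual storage.

```rust
/// Hypothetical normalization: drops every "true" subtag from a keyword
/// value, so that the value of `-u-kw-true` becomes empty, matching `-u-kw`.
fn normalize_uvalue(value: &str) -> String {
    value
        .split('-')
        .filter(|subtag| !subtag.eq_ignore_ascii_case("true"))
        .collect::<Vec<_>>()
        .join("-")
}
```

Under this assumption a parse followed by serialization turns `en-u-kw-true` into `en-u-kw`, which is the serialization difference mentioned above.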

ECMA402 and ICU4X agree on not starting locales with scripts. They also agree on no duplicate singleton subtags.

ICU4X does not error on, but will merge entries for duplicate unicode attributes and transform fields. ECMA402 requires that transform language variants have no duplicates.

I think what we should do is this:

  • Move ICU4X away from _ in locale parsing. People wishing to use ICU4X with legacy locales can perform the replacement, or we can provide a legacy parsing function
  • Fix the ICU4X bug
  • Update ECMA402 to not allow long language subtags. They're not used yet.
  • Update ICU4X to have the "no duplicate language variants in transform" constraint, it seems reasonable.
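For the first bullet, the replacement a caller would perform on a legacy locale string is a one-liner. The helper name is hypothetical, and the sketch only swaps separators; it does not handle "root", case normalization, or any other legacy form.

```rust
/// Hypothetical helper for callers holding legacy locale strings:
/// rewrites the "_" separator to the canonical "-" before parsing.
fn underscores_to_hyphens(tag: &str) -> String {
    tag.replace('_', "-")
}
```

Applied to the failing inputs listed earlier, `de_DE` becomes `de-DE` and `cmn-hans_cn` becomes `cmn-hans-cn`, both of which a strict parser can then accept.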

For aliases, I do not think this is a parsing concern. ECMA402 implementations should canonicalize during parsing, as documented in Intl.Locale(). ICU4X may need to learn how to canonicalize calendars.

@sffc
Member

sffc commented Dec 21, 2024

402 issue for removing support for long language subtags: tc39/ecma402#951

@sffc
Member

sffc commented Dec 26, 2024

Thought: if we wanted to support long language subtags, we could do it without impacting the stack size of LanguageIdentifier by stuffing it in the Variants list and adding an enum variant to the Language field (there should be enough niches for that).

@Manishearth
Member

@sffc yeah I was considering that. I couldn't come up with a great design for it but there are hacks.


4 participants