Document icu_locid's relationship to backwards compatibility syntax #3989
CC @zbraniecki |
The "BCP 47 Conformance" section says
There is this fn in Test262: There are also some tests in that file. I think the next steps on this issue are:
|
We can see the current conformance in Boa's suite - https://boajs.dev/conformance |
We're failing the following three tests:
Here are the missed failures from (1):
|
|
In (2) the errors are:
|
The (3) is just |
I'll mark it for discussion on how to address it. |
Decisions to be made: 1. How do we want to handle
|
I recall a discussion from a little while ago where we said we would accept `_`, Boa should just check for the `_` itself, and otherwise we would be compliant. I'm okay with that, but we should document it clearly with some examples and test cases. I think alias resolution should be the job of the LocaleCanonicalizer. |
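A minimal sketch of the "check for the `_`" idea, assuming the engine wraps a lenient parser that accepts both separators. The names `has_backcompat_separator` and `parse_for_ecma402` are hypothetical and are not part of icu_locid or Boa; `lenient_parse` stands in for whatever parser the engine actually calls.

```rust
/// Hypothetical engine-side guard: ECMA-402 does not accept "_" as a subtag
/// separator, while a UTS 35 style parser does.
fn has_backcompat_separator(tag: &str) -> bool {
    tag.contains('_')
}

/// Hypothetical wrapper: rejects tags containing "_" before delegating to
/// `lenient_parse`, a stand-in for a parser that accepts both "-" and "_".
fn parse_for_ecma402<T, E>(
    tag: &str,
    lenient_parse: impl Fn(&str) -> Result<T, E>,
) -> Option<T> {
    if has_backcompat_separator(tag) {
        // The engine would surface this as a RangeError.
        return None;
    }
    lenient_parse(tag).ok()
}
```

With a guard like this, `en_US` is rejected by the engine even though the underlying parser would have accepted it, while `en-US` passes through unchanged.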
|
Filed https://bugzilla.mozilla.org/show_bug.cgi?id=1938524 . I encourage pursuing a use counter for other engines. |
I documented the differences in terms of EBNF syntax.
UTS 35
unicode_language_id = "root" | (unicode_language_subtag (sep unicode_script_subtag)? | unicode_script_subtag) (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
unicode_language_subtag = alpha{2,3} | alpha{5,8};
unicode_script_subtag = alpha{4} ;
unicode_region_subtag = (alpha{2} | digit{3}) ;
unicode_variant_subtag = (alphanum{5,8} | digit alphanum{3}) ;
sep = [-_] ;
digit = [0-9] ;
alpha = [A-Z a-z] ;
alphanum = [0-9 A-Z a-z] ;
unicode_locale_id = unicode_language_id extensions* pu_extensions? ;
extensions = unicode_locale_extensions | transformed_extensions | other_extensions ;
unicode_locale_extensions = sep [uU] ((sep keyword)+ |(sep attribute)+ (sep ufield)*) ;
transformed_extensions = sep [tT] ((sep tlang (sep tfield)*) | (sep tfield)+) ;
pu_extensions = sep [xX] (sep alphanum{1,8})+ ;
other_extensions = sep [alphanum-[tTuUxX]] (sep alphanum{2,8})+ ;
ufield / keyword = ukey (sep uvalue)? ;
ukey / key = alphanum alpha ;
uvalue / type = alphanum{3,8} (sep alphanum{3,8})* ;
attribute = alphanum{3,8} ;
unicode_subdivision_id = unicode_region_subtag unicode_subdivision_suffix ;
unicode_subdivision_suffix = alphanum{1,4} ;
unicode_measure_unit = alphanum{3,8} (sep alphanum{3,8})* ;
tlang = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
tfield = tkey tvalue;
tkey = alpha digit ;
tvalue = alphanum{3,8} (sep alphanum{3,8})+ ;
Notable quirk: pu_extensions will include 1-length subtags, so
UTS 35 without backcompat as used by ECMA402
# All other nonterminals from UTS 35 above
# The "_" character for field separator characters, as well as the "-" used in [BCP47] (however, the canonical form is with "-")
sep = [-] ;
# The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
# The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.
unicode_language_id = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
# Simplification
tlang = unicode_language_id
Further ECMA402 constraints:
ICU4X
# All other nonterminals from UTS 35 above
# No "root" locale
# Does not allow starting with a script, MUST use und
unicode_language_id = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
# Does not support larger language subtags
unicode_language_subtag = alpha{2,3} ;
# Simplification
tlang = unicode_language_id
# These are non syntactic distinctions that affect the format of the parse result
# ICU4X just doesn't distinguish between ufield and keyword
unicode_locale_extensions = sep [uU] ((sep keyword)+ |(sep attribute)+ (sep keyword)*) ;
# ICU4X doesn't actually *store* values that are called "true", so `u-xy-true` just parses as a keyword `xy`.
uvalue / type = ("true" | alphanum{3,8}) (sep ("true" | alphanum{3,8}))*
Further ICU4X constraints:
|
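A rough sketch of how the `unicode_language_id` differences between the three grammars above could be expressed, assuming a hand-rolled validator rather than the actual icu_locid parser. The `Dialect` enum and all function names are hypothetical, and the sketch covers only the language identifier (no extensions).

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Dialect {
    Uts35,   // full UTS 35 syntax, including backwards-compatibility forms
    Ecma402, // UTS 35 without backcompat
    Icu4x,   // as described in the ICU4X grammar above
}

fn is_alpha(s: &str, min: usize, max: usize) -> bool {
    (min..=max).contains(&s.len()) && s.bytes().all(|b| b.is_ascii_alphabetic())
}

fn is_alphanum(s: &str, min: usize, max: usize) -> bool {
    (min..=max).contains(&s.len()) && s.bytes().all(|b| b.is_ascii_alphanumeric())
}

// alpha{2,3} everywhere; alpha{5,8} everywhere except ICU4X.
fn is_language(s: &str, d: Dialect) -> bool {
    is_alpha(s, 2, 3) || (d != Dialect::Icu4x && is_alpha(s, 5, 8))
}

fn is_script(s: &str) -> bool {
    is_alpha(s, 4, 4)
}

fn is_region(s: &str) -> bool {
    is_alpha(s, 2, 2) || (s.len() == 3 && s.bytes().all(|b| b.is_ascii_digit()))
}

fn is_variant(s: &str) -> bool {
    is_alphanum(s, 5, 8) || (is_alphanum(s, 4, 4) && s.as_bytes()[0].is_ascii_digit())
}

fn is_valid_language_id(tag: &str, d: Dialect) -> bool {
    let uts35 = d == Dialect::Uts35;
    // "root" is only a valid language identifier in full UTS 35.
    // Assumption: matched case-insensitively.
    if uts35 && tag.eq_ignore_ascii_case("root") {
        return true;
    }
    // sep = [-_] for UTS 35 and ICU4X, sep = [-] for ECMA402.
    let accepts_underscore = d != Dialect::Ecma402;
    let subtags: Vec<&str> = tag
        .split(|c: char| c == '-' || (accepts_underscore && c == '_'))
        .collect();

    // `split` always yields at least one (possibly empty) subtag.
    let mut i = 1;
    if is_language(subtags[0], d) {
        if subtags.len() > 1 && is_script(subtags[1]) {
            i = 2;
        }
    } else if !(uts35 && is_script(subtags[0])) {
        // Only full UTS 35 lets the identifier start with a script subtag.
        return false;
    }
    if i < subtags.len() && is_region(subtags[i]) {
        i += 1;
    }
    while i < subtags.len() && is_variant(subtags[i]) {
        i += 1;
    }
    i == subtags.len()
}
```

Under these assumptions, `en_US` is accepted for `Uts35` and `Icu4x` but not `Ecma402`, `Latn-US` and `root` are accepted only for `Uts35`, and an eight-letter language subtag is accepted for `Uts35` and `Ecma402` but not `Icu4x`.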
The main differences are:
ECMA402 and ICU4X agree that locales cannot start with a script subtag, and that duplicate singleton subtags are not allowed. ICU4X does not reject duplicate Unicode attributes or transform fields; it merges the entries instead. ECMA402 requires that transform language variants have no duplicates. I think what we should do is this:
For aliases, I do not think this is a parsing concern. ECMA402 implementations should canonicalize during parsing, as documented in |
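To make the duplicate-singleton rule mentioned in this comment concrete, here is a small sketch, assuming the tag is otherwise well-formed and uses `-` as its separator. The function name `has_duplicate_singleton` is hypothetical and does not mirror any icu_locid API.

```rust
use std::collections::HashSet;

/// Returns true if an extension singleton (a one-character subtag such as
/// `u` or `t`) appears more than once before any private-use `x` section.
fn has_duplicate_singleton(tag: &str) -> bool {
    let mut seen = HashSet::new();
    for subtag in tag.split('-') {
        if subtag.len() != 1 {
            continue;
        }
        // Singletons are compared case-insensitively.
        let singleton = subtag.as_bytes()[0].to_ascii_lowercase();
        if singleton == b'x' {
            // Everything after `x` is private use, where 1-length subtags
            // are allowed and are not singletons.
            break;
        }
        if !seen.insert(singleton) {
            return true;
        }
    }
    false
}
```

For instance, `en-u-ca-gregory-u-nu-latn` would be flagged because `u` appears twice, while `en-x-a-a` would not, since the repeated `a` sits inside the private-use section.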
402 issue for removing support for long language subtags: tc39/ecma402#951 |
Thought: if we wanted to support long language subtags, we could do it without impacting the stack size of LanguageIdentifier by stuffing it in the |
@sffc yeah I was considering that. I couldn't come up with a great design for it but there are hacks. |
ECMA-402 IsStructurallyValidLanguageTag says:
From reading https://unicode-org.github.io/icu4x/docs/icu_locid/struct.Locale.html , it's unclear to me what the correspondence of icu_locid::Locale::try_from_bytes to IsStructurallyValidLanguageTag is. The docs should say whether backwards compatibility syntax is allowed or not.