More precisely describe complement classes and case-insensitive matching (#38012)
Josh-Cena authored Feb 15, 2025
1 parent bcff6fc commit d9e1eba
Showing 3 changed files with 11 additions and 11 deletions.
@@ -31,7 +31,9 @@ console.log(regex2.test("Football"));

If the regex is [Unicode-aware](/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode#unicode-aware_mode), the case mapping happens through _simple case folding_ specified in [`CaseFolding.txt`](https://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt). The mapping always maps to a single code point, so it does not map, for example, `ß` (U+00DF LATIN SMALL LETTER SHARP S) to `ss` (which is _full case folding_, not _simple case folding_). It may however map code points outside the Basic Latin block to code points within it. For example, `ſ` (U+017F LATIN SMALL LETTER LONG S) case-folds to `s` (U+0073 LATIN SMALL LETTER S) and `K` (U+212A KELVIN SIGN) case-folds to `k` (U+006B LATIN SMALL LETTER K). Therefore, `ſ` and `K` can be matched by `/[a-z]/ui`.

If the regex is Unicode-unaware, case mapping uses the [Unicode Default Case Conversion](https://unicode-org.github.io/icu/userguide/transforms/casemappings.html), the same algorithm used in {{jsxref("String.prototype.toUpperCase()")}}. For example, `Ω` (U+2126 OHM SIGN) and `Ω` (U+03A9 GREEK CAPITAL LETTER OMEGA) are both mapped by Default Case Conversion to themselves but by simple case folding to `ω` (U+03C9 GREEK SMALL LETTER OMEGA), so `"ω"` is matched by `/[\u2126]/ui` and `/[\u03a9]/ui` but not by `/[\u2126]/i` or `/[\u03a9]/i`. This algorithm prevents code points outside the Basic Latin block from being mapped to code points within it, so `ſ` and `K` mentioned previously are not matched by `/[a-z]/i`.
If the regex is Unicode-unaware, case mapping uses the [Unicode Default Case Conversion](https://unicode-org.github.io/icu/userguide/transforms/casemappings.html), the same algorithm used in {{jsxref("String.prototype.toUpperCase()")}}. This algorithm prevents code points outside the Basic Latin block from being mapped to code points within it, so `ſ` and `K` mentioned previously are not matched by `/[a-z]/i`.

Unicode-aware case folding generally folds to lower case, while Unicode-unaware case folding folds to upper case. These two are not perfect reverse operations, so there are some subtle behavior differences. For example, `Ω` (U+2126 OHM SIGN) and `Ω` (U+03A9 GREEK CAPITAL LETTER OMEGA) are both mapped by simple case folding to `ω` (U+03C9 GREEK SMALL LETTER OMEGA), so `"\u2126"` is matched by `/[\u03c9]/ui` and `/[\u03a9]/ui`; on the other hand, U+2126 is mapped by Default Case Conversion to itself, while the other two both map to U+03A9, so `"\u2126"` is matched by neither `/[\u03c9]/i` nor `/[\u03a9]/i`.
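
The differences described above can be checked directly. The following is a minimal sketch, not part of the commit; the variable names (`longS`, `kelvin`, `ohm`, `caseFoldingResults`) are chosen here for illustration only.

```js
// Characters discussed above, written as escapes to avoid look-alike confusion.
const longS = "\u017f"; // LATIN SMALL LETTER LONG S
const kelvin = "\u212a"; // KELVIN SIGN
const ohm = "\u2126"; // OHM SIGN

const caseFoldingResults = {
  // Unicode-aware: simple case folding maps both into Basic Latin.
  longSWithU: /[a-z]/iu.test(longS), // true
  kelvinWithU: /[a-z]/iu.test(kelvin), // true
  // Unicode-unaware: mappings into Basic Latin are prevented.
  longSWithoutU: /[a-z]/i.test(longS), // false
  kelvinWithoutU: /[a-z]/i.test(kelvin), // false
  // OHM SIGN folds to U+03C9 with `u`, but maps to itself without it.
  ohmVsSmallOmegaWithU: /[\u03c9]/iu.test(ohm), // true
  ohmVsCapitalOmegaWithU: /[\u03a9]/iu.test(ohm), // true
  ohmVsSmallOmegaWithoutU: /[\u03c9]/i.test(ohm), // false
  ohmVsCapitalOmegaWithoutU: /[\u03a9]/i.test(ohm), // false
};

console.log(caseFoldingResults);
```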

The set accessor of `ignoreCase` is `undefined`. You cannot change this property directly.

@@ -96,25 +96,21 @@ Complement character classes `[^...]` cannot possibly be able to match strings l

### Complement classes and case-insensitive matching

In non-`v`-mode, complement character classes `[^...]` are implemented by simply inverting the match result — that is, `[^...]` matches whenever `[...]` doesn't match, and vice versa. However, the other complement classes, such as `\P{...}` and `\W`, work by eagerly constructing the set consisting of all characters without the specified property. They seem to produce the same behavior, but are made more complex when combined with [case-insensitive](/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/ignoreCase) matching.
[Case-insensitive](/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/ignoreCase) matching works by case-folding both the expected character set and the matched string. When specifying complement classes, the order in which JavaScript performs case-folding and complementing is important. In brief, `[^...]` in `u` mode matches `allCharacters - caseFold(original)`, while in `v` mode matches `caseFold(allCharacters) - caseFold(original)`. This ensures that all complement class syntaxes, including `[^...]`, `\P`, `\W`, etc., cancel each other out.

Consider the following two regexes:
Consider the following two regexes (to simplify things, let's assume that Unicode characters are one of three kinds: lowercase, uppercase, and caseless, and each uppercase letter has a unique lowercase counterpart, and vice versa):

```js
const r1 = /\p{Lowercase_Letter}/iu;
const r2 = /[^\P{Lowercase_Letter}]/iu;
```

`r2` is a double negation and seems to be equivalent to `r1`. But in fact, `r1` matches all lower- and upper-case ASCII letters, while `r2` matches none. To illustrate how it works, pretend that we are only dealing with ASCII characters, not the entire Unicode character set, and `r1` and `r2` are specified as below:
`r2` is a double negation and seems to be equivalent to `r1`. But in fact, `r1` matches all lower- and uppercase ASCII letters, while `r2` matches none. Here's a step-by-step explanation:

```js
const r1 = /[a-z]/iu;
const r2 = /[^A-Z]/iu;
```

Recall that case-insensitive matching happens by folding both the pattern and the input to the same case (see {{jsxref("RegExp/ignoreCase", "ignoreCase")}} for more details). For `r1`, the character class `a-z` stays the same after case folding, while both upper- and lower-case ASCII string inputs are folded to lower-case, so `r1` is able to match both `"A"` and `"a"`. For `r2`, the character class `A-Z` is folded to `a-z`; however, `^` negates the match result, so that `[^A-Z]` in effect only matches upper-case strings. However, both upper- and lower-case ASCII string inputs are still folded to lower-case, causing `r2` to match nothing.
- In `r1`, `\p{Lowercase_Letter}` constructs a set of all lowercase characters. Characters in this set are then case-folded to their lowercase form, so they stay the same. The input string is also case-folded to lowercase. Therefore, `"A"` and `"a"` are both folded to `"a"` and matched by `r1`.
- In `r2`, `\P{Lowercase_Letter}` first constructs a set of all non-lowercase characters, i.e., uppercase letters and caseless characters. Characters in this set are then case-folded to their lowercase form, so the character set becomes all lowercase letters and caseless characters. `[^...]` negates the match, causing it to match anything that's _not_ in this set, i.e., an uppercase letter. However, the input is still case-folded to lowercase, so `"A"` is folded to `"a"` and not matched by `r2`.

In `v` mode, this behavior is fixed: `[^...]` also eagerly constructs the complement class instead of negating the match result. This makes `[^\P{Lowercase_Letter}]` and `\p{Lowercase_Letter}` strictly equivalent.
The main observation here is that after `[^...]` negates the match, the expected character set may not be a subset of the set of case-folded Unicode characters, causing the case-folded input to not be in the expected character set. In `v` mode, the set of all characters is also case-folded. The `\P` character class itself also works slightly differently in `v` mode (see [Unicode character class escape](/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape)). All of these ensure that `[^\P{Lowercase_Letter}]` and `\p{Lowercase_Letter}` are strictly equivalent.
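
The behavior of `r1` and `r2` can be observed with a short sketch. This is an illustration, not part of the commit. The helper `compileOrNull` and the name `r2v` are hypothetical; the helper is used because the `v` flag is not available in older engines, where the constructor throws.

```js
const r1 = /\p{Lowercase_Letter}/iu;
const r2 = /[^\P{Lowercase_Letter}]/iu;

// In `u` mode, the double negation does not cancel out.
console.log(r1.test("A"), r1.test("a")); // true true
console.log(r2.test("A"), r2.test("a")); // false false

// Hypothetical helper: returns null if the engine rejects the flags.
function compileOrNull(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (e) {
    return null;
  }
}

// In `v` mode, the same pattern is equivalent to \p{Lowercase_Letter}.
const r2v = compileOrNull("[^\\P{Lowercase_Letter}]", "iv");
if (r2v !== null) {
  console.log(r2v.test("A"), r2v.test("a")); // true true
}
```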

## Examples

@@ -56,6 +56,8 @@ console.log(sentence.match(regexpCurrencyOrPunctuation));

Every Unicode character has a set of properties that describe it. For example, the character [`a`](https://util.unicode.org/UnicodeJsps/character.jsp?a=0061) has the `General_Category` property with value `Lowercase_Letter`, and the `Script` property with value `Latn`. The `\p` and `\P` escape sequences allow you to match a character based on its properties. For example, `a` can be matched by `\p{Lowercase_Letter}` (the `General_Category` property name is optional) as well as `\p{Script=Latn}`. `\P` creates a _complement class_ that consists of code points without the specified property.

When the [`i`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/ignoreCase) flag is set, `\P` character classes are handled slightly differently in `u` and `v` modes. In `u` mode, case-folding happens after subtraction; in `v` mode, case-folding happens before subtraction. More concretely, in `u` mode, `\P{property}` matches `caseFold(allCharacters - charactersWithProperty)`. This means `/\P{Lowercase_Letter}/iu` still matches `"a"`, because `A` is not a `Lowercase_Letter`. In `v` mode, `\P{property}` matches `caseFold(allCharacters) - caseFold(charactersWithProperty)`. This means `/\P{Lowercase_Letter}/iv` does not match `"a"`, because `A` is not even in the set of all case-folded Unicode characters. See also [complement classes and case-insensitive matching](/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_class#complement_classes_and_case-insensitive_matching).
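
As a rough illustration of the paragraph above (not part of the commit), the two modes can be compared side by side. The helper `compileIfSupported` and the names `notLowerU` and `notLowerV` are hypothetical; the `v`-mode regex is built through the constructor so that engines without `v` flag support skip that part instead of failing to parse.

```js
// Hypothetical helper: returns null if the engine rejects the flags.
function compileIfSupported(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (e) {
    return null;
  }
}

// `u` mode: caseFold(allCharacters - charactersWithProperty).
// "A" is not a Lowercase_Letter and folds to "a", so "a" is matched.
const notLowerU = /\P{Lowercase_Letter}/iu;
console.log(notLowerU.test("a")); // true

// `v` mode: caseFold(allCharacters) - caseFold(charactersWithProperty).
// "a" is removed from the set, and "A" is not in the folded set at all.
const notLowerV = compileIfSupported("\\P{Lowercase_Letter}", "iv");
if (notLowerV !== null) {
  console.log(notLowerV.test("a")); // false
  console.log(notLowerV.test("1")); // true (caseless characters are unaffected)
}
```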

To compose multiple properties, use the [character set intersection](/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_class#v-mode_character_class) syntax enabled with the `v` flag, or see [pattern subtraction and intersection](/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Lookahead_assertion#pattern_subtraction_and_intersection).

In `v` mode, `\p` may match a sequence of code points, defined in Unicode as "properties of strings". This is most useful for emojis, which are often composed of multiple code points. However, `\P` can only complement character properties.