Skip to content

Commit

Permalink
Exclude RtoL characters from paired string delimiters
Browse files Browse the repository at this point in the history
Fixes #22228

Some scripts in the world are written right-to-left, such as Arabic and
Hebrew.  This can result in confusion for quote-like string delimitters
that we have chosen based on left-to_right.  Therefore exclude all such.
Currently, the only pair that falls into this category that we don't
exclude for other reasons are SYRIAC COLON SKEWED LEFT/RIGHT.
  • Loading branch information
khwilliamson authored and book committed May 23, 2024
1 parent 8502140 commit f3e2f6b
Show file tree
Hide file tree
Showing 4 changed files with 37 additions and 11 deletions.
2 changes: 0 additions & 2 deletions pod/perlop.pod
Original file line number Diff line number Diff line change
Expand Up @@ -3850,7 +3850,6 @@ The complete list of accepted paired delimiters as of Unicode 14.0 is:
{ } U+007B, U+007D LEFT/RIGHT CURLY BRACKET
« » U+00AB, U+00BB LEFT/RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
» « U+00BB, U+00AB RIGHT/LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
܆ ܇ U+0706, U+0707 SYRIAC COLON SKEWED LEFT/RIGHT
༺ ༻ U+0F3A, U+0F3B TIBETAN MARK GUG RTAGS GYON, TIBETAN MARK GUG
RTAGS GYAS
༼ ༽ U+0F3C, U+0F3D TIBETAN MARK ANG KHANG GYON, TIBETAN MARK ANG
Expand Down Expand Up @@ -4231,5 +4230,4 @@ The complete list of accepted paired delimiters as of Unicode 14.0 is:
🢩 🢨 U+1F8A9, U+1F8A8 RIGHT/LEFTWARDS BACK-TILTED SHADOWED WHITE ARROW
🢫 🢪 U+1F8AB, U+1F8AA RIGHT/LEFTWARDS FRONT-TILTED SHADOWED WHITE
ARROW

=cut
10 changes: 10 additions & 0 deletions regen/unicode_constants.pl
Original file line number Diff line number Diff line change
Expand Up @@ -378,6 +378,7 @@ END
my $illegal = "Mirror illegal";
my $no_encoded_mate = "Mirrored, but Unicode has no encoded mirror";
my $bidirectional = "Bidirectional";
my $r2l = "Is in a Right to Left script";

my %unused_bidi_pairs;
my %inverted_unused_bidi_pairs;
Expand Down Expand Up @@ -634,6 +635,15 @@ END
next;
}

# Exclude characters that are R to L ordering, as this can cause
# confusion. See GH #22228
if ($chr =~ / (?[ \p{Bidi_Class:R} + \p{Bidi_Class:AL} ]) /x) {
$discards{$code_point} = { reason => $r2l,
mirror => $mirror_code_point
};
next;
}

# We enter the pair with the original code point on the left; if it
# should instead be on the R, swap. Most Symbols that contain the
# word REVERSE go on the rhs, except those whose names explicitly
Expand Down
18 changes: 18 additions & 0 deletions t/lib/croak/toke
Original file line number Diff line number Diff line change
Expand Up @@ -133,6 +133,24 @@ EXPECT
Use of '«' is deprecated as a string delimiter at - line 3.
Can't find string terminator "«" anywhere before EOF at - line 5.
########
# NAME mirrored delimiters in R-to-L scripts are invalid
BEGIN { binmode STDERR, ":utf8" }
use utf8;
use feature 'extra_paired_delimiters';
my $good = q܈this string is delimitted by a symbol in a R-to-L script܈;
$good = q܇this string is delimitted by a symbol in a R-to-L script܇;
my $bad = q܈Can't use mirrored R-to-L script delimiters܇;
EXPECT
Can't find string terminator "܈" anywhere before EOF at - line 6.
########
# NAME mirrored delimiters in R-to-L scripts are invalid in the other order too
BEGIN { binmode STDERR, ":utf8" }
use utf8;
use feature 'extra_paired_delimiters';
my $bad = q܇Can't use mirrored R-to-L script delimiters܈;
EXPECT
Can't find string terminator "܇" anywhere before EOF at - line 4.
########
# NAME paired above Latin1 delimiters need feature enabled
BEGIN { binmode STDERR, ":utf8" }
use utf8;
Expand Down
Loading

0 comments on commit f3e2f6b

Please sign in to comment.