textseg: Update tables for Unicode 13.
Unicode 13's version of UAX #29 still uses the same segmentation algorithm
as Unicode 12, but changes some of the character class tables in terms of
which the algorithm is defined.

The test tables are unchanged in Unicode 13 and the tests still pass,
so the new tables seem to be broadly compatible with the Unicode 12
behavior and unlikely to cause significant problems for applications
previously built for Unicode 12's rules.
apparentlymart committed Feb 22, 2021
1 parent ffeefd9 commit 5b41aa2
Showing 9 changed files with 3,396 additions and 3,081 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -15,15 +15,15 @@ multiple callers may coexist in the same program, there is a separate
major release of this module for each supported major Unicode version.
Therefore you can select the specific version you want by module
path. For example, to use the algorithm and tables defined by Unicode
-version 12:
+version 13:

```
-go get github.com/apparentlymart/go-textseg/v12
+go get github.com/apparentlymart/go-textseg/v13
```

```go
import (
-"github.com/apparentlymart/go-textseg/v12/textseg"
+"github.com/apparentlymart/go-textseg/v13/textseg"
)
```

4 changes: 2 additions & 2 deletions go.mod
@@ -1,3 +1,3 @@
-module github.com/apparentlymart/go-textseg/v12
+module github.com/apparentlymart/go-textseg/v13

-go 1.14
+go 1.16
773 changes: 504 additions & 269 deletions textseg/emoji_table.rl

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions textseg/generate.go
@@ -2,7 +2,7 @@ package textseg

//go:generate go run make_tables.go -output tables.go
//go:generate go run make_test_tables.go -output tables_test.go
-//go:generate ruby unicode2ragel.rb --url=https://www.unicode.org/Public/12.0.0/ucd/auxiliary/GraphemeBreakProperty.txt -m GraphemeCluster -p "Prepend,CR,LF,Control,Extend,Regional_Indicator,SpacingMark,L,V,T,LV,LVT,ZWJ" -o grapheme_clusters_table.rl
-//go:generate ruby unicode2ragel.rb --url=https://www.unicode.org/Public/emoji/12.0/emoji-data.txt -m Emoji -p "Extended_Pictographic" -o emoji_table.rl
+//go:generate ruby unicode2ragel.rb --url=https://www.unicode.org/Public/13.0.0/ucd/auxiliary/GraphemeBreakProperty.txt -m GraphemeCluster -p "Prepend,CR,LF,Control,Extend,Regional_Indicator,SpacingMark,L,V,T,LV,LVT,ZWJ" -o grapheme_clusters_table.rl
+//go:generate ruby unicode2ragel.rb --url=https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt -m Emoji -p "Extended_Pictographic" -o emoji_table.rl
//go:generate ragel -Z grapheme_clusters.rl
//go:generate gofmt -w grapheme_clusters.go
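
Note that the emoji data moved as well as the version number: the 12.0 directive fetched `/Public/emoji/12.0/emoji-data.txt`, while the 13.0.0 directive fetches `/Public/13.0.0/ucd/emoji/emoji-data.txt`. As a rough sketch of the new layout only, the following hypothetical helper (`tableURLs` is not part of this repository) builds both source URLs from a version string, assuming the 13.0.0 directory structure shown in the directives above:

```go
package main

import "fmt"

// ucdBase is the public Unicode data root used by the go:generate directives.
const ucdBase = "https://www.unicode.org/Public/"

// tableURLs returns the GraphemeBreakProperty.txt and emoji-data.txt URLs for
// a Unicode version such as "13.0.0". It assumes the 13.0.0 layout, where
// emoji-data.txt lives under ucd/emoji/. It does not cover the older
// /Public/emoji/12.0/ path used by the removed directive.
func tableURLs(version string) (grapheme, emoji string) {
	root := ucdBase + version + "/ucd/"
	return root + "auxiliary/GraphemeBreakProperty.txt", root + "emoji/emoji-data.txt"
}

func main() {
	g, e := tableURLs("13.0.0")
	fmt.Println(g)
	fmt.Println(e)
}
```

Whether other Unicode versions follow the same layout is an assumption to check against the Unicode site before reusing this for another major release.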
5,660 changes: 2,860 additions & 2,800 deletions textseg/grapheme_clusters.go

Large diffs are not rendered by default.

24 changes: 22 additions & 2 deletions textseg/grapheme_clusters_table.rl
@@ -1,5 +1,5 @@
# The following Ragel file was autogenerated with unicode2ragel.rb
-# from: https://www.unicode.org/Public/12.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
+# from: https://www.unicode.org/Public/13.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
#
# It defines ["Prepend", "CR", "LF", "Control", "Extend", "Regional_Indicator", "SpacingMark", "L", "V", "T", "LV", "LVT", "ZWJ"].
#
@@ -18,6 +18,8 @@
| 0xF0 0x91 0x82 0xBD #Cf KAITHI NUMBER SIGN
| 0xF0 0x91 0x83 0x8D #Cf KAITHI NUMBER SIGN ABOVE
| 0xF0 0x91 0x87 0x82..0x83 #Lo [2] SHARADA SIGN JIHVAMULIYA..SHARA...
+| 0xF0 0x91 0xA4 0xBF #Lo DIVES AKURU PREFIXED NASAL SIGN
+| 0xF0 0x91 0xA5 0x81 #Lo DIVES AKURU INITIAL RA
| 0xF0 0x91 0xA8 0xBA #Lo ZANABAZAR SQUARE CLUSTER-INITIAL L...
| 0xF0 0x91 0xAA 0x84..0x89 #Lo [6] SOYOMBO SIGN JIHVAMULIYA..SOYOM...
| 0xF0 0x91 0xB5 0x86 #Lo MASARAM GONDI REPHA
@@ -130,7 +132,7 @@
| 0xE0 0xAC 0xBF #Mn ORIYA VOWEL SIGN I
| 0xE0 0xAD 0x81..0x84 #Mn [4] ORIYA VOWEL SIGN U..ORIYA VOWEL SI...
| 0xE0 0xAD 0x8D #Mn ORIYA SIGN VIRAMA
-| 0xE0 0xAD 0x96 #Mn ORIYA AI LENGTH MARK
+| 0xE0 0xAD 0x95..0x96 #Mn [2] ORIYA SIGN OVERLINE..ORIYA AI LENG...
| 0xE0 0xAD 0x97 #Mc ORIYA AU LENGTH MARK
| 0xE0 0xAD 0xA2..0xA3 #Mn [2] ORIYA VOWEL SIGN VOCALIC L..ORIYA ...
| 0xE0 0xAE 0x82 #Mn TAMIL SIGN ANUSVARA
@@ -161,6 +163,7 @@
| 0xE0 0xB5 0x8D #Mn MALAYALAM SIGN VIRAMA
| 0xE0 0xB5 0x97 #Mc MALAYALAM AU LENGTH MARK
| 0xE0 0xB5 0xA2..0xA3 #Mn [2] MALAYALAM VOWEL SIGN VOCALIC L..MA...
+| 0xE0 0xB6 0x81 #Mn SINHALA SIGN CANDRABINDU
| 0xE0 0xB7 0x8A #Mn SINHALA SIGN AL-LAKUNA
| 0xE0 0xB7 0x8F #Mc SINHALA VOWEL SIGN AELA-PILLA
| 0xE0 0xB7 0x92..0x94 #Mn [3] SINHALA VOWEL SIGN KETTI IS-PILLA....
@@ -221,6 +224,8 @@
| 0xE1 0xA9 0xBF #Mn TAI THAM COMBINING CRYPTOGRAMMIC DOT
| 0xE1 0xAA 0xB0..0xBD #Mn [14] COMBINING DOUBLED CIRCUMFLEX ACCEN...
| 0xE1 0xAA 0xBE #Me COMBINING PARENTHESES OVERLAY
+| 0xE1 0xAA 0xBF..0xFF #Mn [2] COMBINING LATIN SMALL LETTER W BEL...
+| 0xE1 0xAB 0x00..0x80 #
| 0xE1 0xAC 0x80..0x83 #Mn [4] BALINESE SIGN ULU RICEM..BALINESE ...
| 0xE1 0xAC 0xB4 #Mn BALINESE SIGN REREKAN
| 0xE1 0xAC 0xB5 #Mc BALINESE VOWEL SIGN TEDUNG
@@ -267,6 +272,7 @@
| 0xEA 0xA0 0x86 #Mn SYLOTI NAGRI SIGN HASANTA
| 0xEA 0xA0 0x8B #Mn SYLOTI NAGRI SIGN ANUSVARA
| 0xEA 0xA0 0xA5..0xA6 #Mn [2] SYLOTI NAGRI VOWEL SIGN U..SYLOTI ...
+| 0xEA 0xA0 0xAC #Mn SYLOTI NAGRI SIGN ALTERNATE HASANTA
| 0xEA 0xA3 0x84..0x85 #Mn [2] SAURASHTRA SIGN VIRAMA..SAURASHTRA...
| 0xEA 0xA3 0xA0..0xB1 #Mn [18] COMBINING DEVANAGARI DIGIT ZERO..C...
| 0xEA 0xA3 0xBF #Mn DEVANAGARI VOWEL SIGN AY
@@ -307,6 +313,7 @@
| 0xF0 0x90 0xA8 0xBF #Mn KHAROSHTHI VIRAMA
| 0xF0 0x90 0xAB 0xA5..0xA6 #Mn [2] MANICHAEAN ABBREVIATION MARK AB...
| 0xF0 0x90 0xB4 0xA4..0xA7 #Mn [4] HANIFI ROHINGYA SIGN HARBAHAY.....
+| 0xF0 0x90 0xBA 0xAB..0xAC #Mn [2] YEZIDI COMBINING HAMZA MARK..YE...
| 0xF0 0x90 0xBD 0x86..0x90 #Mn [11] SOGDIAN COMBINING DOT BELOW..SO...
| 0xF0 0x91 0x80 0x81 #Mn BRAHMI SIGN ANUSVARA
| 0xF0 0x91 0x80 0xB8..0xFF #Mn [15] BRAHMI VOWEL SIGN AA..BRAHMI VI...
@@ -322,6 +329,7 @@
| 0xF0 0x91 0x86 0x80..0x81 #Mn [2] SHARADA SIGN CANDRABINDU..SHARA...
| 0xF0 0x91 0x86 0xB6..0xBE #Mn [9] SHARADA VOWEL SIGN U..SHARADA V...
| 0xF0 0x91 0x87 0x89..0x8C #Mn [4] SHARADA SANDHI MARK..SHARADA EX...
+| 0xF0 0x91 0x87 0x8F #Mn SHARADA SIGN INVERTED CANDRABINDU
| 0xF0 0x91 0x88 0xAF..0xB1 #Mn [3] KHOJKI VOWEL SIGN U..KHOJKI VOW...
| 0xF0 0x91 0x88 0xB4 #Mn KHOJKI SIGN ANUSVARA
| 0xF0 0x91 0x88 0xB6..0xB7 #Mn [2] KHOJKI SIGN NUKTA..KHOJKI SIGN ...
@@ -365,6 +373,10 @@
| 0xF0 0x91 0x9C 0xA7..0xAB #Mn [5] AHOM VOWEL SIGN AW..AHOM SIGN K...
| 0xF0 0x91 0xA0 0xAF..0xB7 #Mn [9] DOGRA VOWEL SIGN U..DOGRA SIGN ...
| 0xF0 0x91 0xA0 0xB9..0xBA #Mn [2] DOGRA SIGN VIRAMA..DOGRA SIGN N...
+| 0xF0 0x91 0xA4 0xB0 #Mc DIVES AKURU VOWEL SIGN AA
+| 0xF0 0x91 0xA4 0xBB..0xBC #Mn [2] DIVES AKURU SIGN ANUSVARA..DIVE...
+| 0xF0 0x91 0xA4 0xBE #Mn DIVES AKURU VIRAMA
+| 0xF0 0x91 0xA5 0x83 #Mn DIVES AKURU SIGN NUKTA
| 0xF0 0x91 0xA7 0x94..0x97 #Mn [4] NANDINAGARI VOWEL SIGN U..NANDI...
| 0xF0 0x91 0xA7 0x9A..0x9B #Mn [2] NANDINAGARI VOWEL SIGN E..NANDI...
| 0xF0 0x91 0xA7 0xA0 #Mn NANDINAGARI SIGN VIRAMA
@@ -397,6 +409,7 @@
| 0xF0 0x96 0xAC 0xB0..0xB6 #Mn [7] PAHAWH HMONG MARK CIM TUB..PAHA...
| 0xF0 0x96 0xBD 0x8F #Mn MIAO SIGN CONSONANT MODIFIER BAR
| 0xF0 0x96 0xBE 0x8F..0x92 #Mn [4] MIAO TONE RIGHT..MIAO TONE BELOW
+| 0xF0 0x96 0xBF 0xA4 #Mn KHITAN SMALL SCRIPT FILLER
| 0xF0 0x9B 0xB2 0x9D..0x9E #Mn [2] DUPLOYAN THICK LETTER SELECTOR....
| 0xF0 0x9D 0x85 0xA5 #Mc MUSICAL SYMBOL COMBINING STEM
| 0xF0 0x9D 0x85 0xA7..0xA9 #Mn [3] MUSICAL SYMBOL COMBINING TREMOL...
@@ -548,6 +561,7 @@
| 0xF0 0x91 0x86 0xB3..0xB5 #Mc [3] SHARADA VOWEL SIGN AA..SHARADA ...
| 0xF0 0x91 0x86 0xBF..0xFF #Mc [2] SHARADA VOWEL SIGN AU..SHARADA ...
| 0xF0 0x91 0x87 0x00..0x80 #
+| 0xF0 0x91 0x87 0x8E #Mc SHARADA VOWEL SIGN PRISHTHAMATRA E
| 0xF0 0x91 0x88 0xAC..0xAE #Mc [3] KHOJKI VOWEL SIGN AA..KHOJKI VO...
| 0xF0 0x91 0x88 0xB2..0xB3 #Mc [2] KHOJKI VOWEL SIGN O..KHOJKI VOW...
| 0xF0 0x91 0x88 0xB5 #Mc KHOJKI SIGN VIRAMA
@@ -579,6 +593,11 @@
| 0xF0 0x91 0x9C 0xA6 #Mc AHOM VOWEL SIGN E
| 0xF0 0x91 0xA0 0xAC..0xAE #Mc [3] DOGRA VOWEL SIGN AA..DOGRA VOWE...
| 0xF0 0x91 0xA0 0xB8 #Mc DOGRA SIGN VISARGA
+| 0xF0 0x91 0xA4 0xB1..0xB5 #Mc [5] DIVES AKURU VOWEL SIGN I..DIVES...
+| 0xF0 0x91 0xA4 0xB7..0xB8 #Mc [2] DIVES AKURU VOWEL SIGN AI..DIVE...
+| 0xF0 0x91 0xA4 0xBD #Mc DIVES AKURU SIGN HALANTA
+| 0xF0 0x91 0xA5 0x80 #Mc DIVES AKURU MEDIAL YA
+| 0xF0 0x91 0xA5 0x82 #Mc DIVES AKURU MEDIAL RA
| 0xF0 0x91 0xA7 0x91..0x93 #Mc [3] NANDINAGARI VOWEL SIGN AA..NAND...
| 0xF0 0x91 0xA7 0x9C..0x9F #Mc [4] NANDINAGARI VOWEL SIGN O..NANDI...
| 0xF0 0x91 0xA7 0xA4 #Mc NANDINAGARI VOWEL SIGN PRISHTHAMAT...
@@ -596,6 +615,7 @@
| 0xF0 0x91 0xBB 0xB5..0xB6 #Mc [2] MAKASAR VOWEL SIGN E..MAKASAR V...
| 0xF0 0x96 0xBD 0x91..0xFF #Mc [55] MIAO SIGN ASPIRATION..MIAO VOWE...
| 0xF0 0x96 0xBE 0x00..0x87 #
+| 0xF0 0x96 0xBF 0xB0..0xB1 #Mc [2] VIETNAMESE ALTERNATE READING MA...
| 0xF0 0x9D 0x85 0xA6 #Mc MUSICAL SYMBOL COMBINING SPRECHGES...
| 0xF0 0x9D 0x85 0xAD #Mc MUSICAL SYMBOL COMBINING AUGMENTAT...
;
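
Each table row above is the UTF-8 byte sequence of a code point (or a range over the final byte), not the code point itself. The sketch below is only an illustration of how to read the rows. The helper `utf8Hex` is hypothetical and is not part of the generator. It prints the bytes for U+11941 (DIVES AKURU INITIAL RA) and for the two ends of the new U+1ABF..U+1AC0 range, which crosses a second-byte boundary and so appears as two rows (`0xE1 0xAA 0xBF..0xFF` and `0xE1 0xAB 0x00..0x80`):

```go
package main

import (
	"fmt"
	"strings"
)

// utf8Hex formats the UTF-8 encoding of r in the same "0xF0 0x91 ..." style
// that the generated Ragel table uses.
func utf8Hex(r rune) string {
	var parts []string
	for _, b := range []byte(string(r)) {
		parts = append(parts, fmt.Sprintf("0x%02X", b))
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(utf8Hex(0x11941)) // 0xF0 0x91 0xA5 0x81
	fmt.Println(utf8Hex(0x1ABF))  // 0xE1 0xAA 0xBF
	fmt.Println(utf8Hex(0x1AC0))  // 0xE1 0xAB 0x80
}
```

The wide `..0xFF` and `0x00..` bounds in the split rows appear to be a convenience of the generator rather than literal continuation-byte ranges.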
2 changes: 1 addition & 1 deletion textseg/make_test_tables.go
@@ -26,7 +26,7 @@ import (
)

var url = flag.String("url",
-"http://www.unicode.org/Public/12.0.0/ucd/auxiliary/",
+"http://www.unicode.org/Public/13.0.0/ucd/auxiliary/",
"URL of Unicode database directory")
var verbose = flag.Bool("verbose",
false,
2 changes: 1 addition & 1 deletion textseg/tables_test.go
@@ -1,5 +1,5 @@
// Generated by running
-// maketesttables --url=http://www.unicode.org/Public/12.0.0/ucd/auxiliary/
+// maketesttables --url=http://www.unicode.org/Public/13.0.0/ucd/auxiliary/
// DO NOT EDIT

package textseg
2 changes: 1 addition & 1 deletion textseg/unicode2ragel.rb
@@ -79,7 +79,7 @@
# range and description.

def each_alpha( url, property )
-open( url ) do |file|
+URI.open( url ) do |file|
file.each_line do |line|
next if line =~ /^#/;
next if line !~ /; #{property} *#/;
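
For readers more at home in Go than Ruby, the two `next if` filters in this hunk can be approximated as follows. This is a loose translation for illustration, not code from the repository: `matchesProperty` is a hypothetical name, the sample lines only imitate the layout of the Unicode data files, and unlike the Ruby version the property name is escaped before being placed in the pattern.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchesProperty reports whether a data line assigns the given property.
// It mirrors the two filters in each_alpha: skip lines that start with '#',
// then require "; <property>" followed by optional spaces and a '#'.
func matchesProperty(line, property string) bool {
	if strings.HasPrefix(line, "#") {
		return false
	}
	re := regexp.MustCompile("; " + regexp.QuoteMeta(property) + " *#")
	return re.MatchString(line)
}

func main() {
	// Sample lines imitating the data file layout (illustrative only).
	lines := []string{
		"# comment line ; Prepend #",
		"11941 ; Prepend # Lo DIVES AKURU INITIAL RA",
		"AC00 ; LV # Lo HANGUL SYLLABLE GA",
	}
	for _, l := range lines {
		fmt.Println(matchesProperty(l, "Prepend"), matchesProperty(l, "L"))
	}
}
```

Because the pattern requires only spaces between the property name and the `#`, asking for `L` does not accidentally match a line tagged `LV`.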
