Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/collate: Compare and CompareString broken in "ka-shifted" collation mode #68166

Open
danderson opened this issue Jun 25, 2024 · 12 comments · May be fixed by golang/text#54
Open

x/text/collate: Compare and CompareString broken in "ka-shifted" collation mode #68166

danderson opened this issue Jun 25, 2024 · 12 comments · May be fixed by golang/text#54
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@danderson
Copy link
Contributor

Go version

go version go1.22.3 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/dave/.cache/go-build'
GOENV='/home/dave/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/dave/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/dave/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.3'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/dave/hack/psl/tools/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1869424909=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Tried to compare strings with the "shifted" variable weighting option, which you can select with e.g. language tag en-u-ka-shifted.

Demo: https://go.dev/play/p/afAOLzqh3Oe

What did you see happen?

99% of Compare() and CompareString() calls on different strings return 0.

What did you expect to see?

Comparison results that match the configured collation, or at least a courtesy panic letting me know it's known broken (spoiler alert...)

@gopherbot gopherbot added this to the Unreleased milestone Jun 25, 2024
@gabyhelp
Copy link

Similar Issues

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@danderson
Copy link
Contributor Author

Pretty simple bug: https://cs.opensource.google/go/x/text/+/refs/tags/v0.16.0:collate/collate.go;l=166 . When the collator is in the shifted alternate mode, compare() silently skips the entire primary level, and thus the comparisons are == 0 always when comparing ascii. They're occasionally != 0 if you get lucky and the secondary/tertiary levels have tie-breakers.

Looking at the history, x/text/collate seems unmaintained (last change that wasn't fmts or Unicode upgrades was 2016 :( ). Given this, my suggestion tactically would be to rip out the custom codepaths for Compare/CompareString, and have them generate and bytes.Compare full collation keys. It's less efficient for one-off compares, but I think it would be correct, and a future maintainer can always make it faster again since it's all internal package gubbins.

@ianlancetaylor
Copy link
Member

CC @mpvl

@joedian joedian added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 25, 2024
@danderson
Copy link
Contributor Author

I hacked up a more comprehensive debugging tool: https://go.dev/play/p/GcsuboreVW8 . For a bunch of comparisons and collations, it prints:

  • The result of bytes.Compare() over the collation keys obtained from coll.KeyFromString()
  • The result of coll.CompareString()
  • The raw collation keys, formatted to line up nicely and highlight where the different collation levels are.

There's also main() in there suitable for running from the CLI as well.

What I reported above can be seen in the FULL vs. ITER of collations en and en-u-ka-shifted: the latter reports all inputs as equivalent.

Further weirdness I discovered while writing the tool:

  • If you also request numeric sorting, with en-u-kn-true-u-ka-shifted, this somehow fixes iterative collation to match key compares??
  • ... but only if you request numeric collation before shifted variable weights. If you use en-u-ka-shifted-u-kn-true, iterative collation continues to be wrong.
  • Additionally, if you request ka=shifted before kn=true this breaks numeric sorting for both the sort keys and iterative compare.

Summary:

  • en and en-u-kn-true-u-ka-shifted seem to work as specified by Unicode.
  • en-u-ka-shifted and en-u-ka-shifted-u-kn-true silently breaks iterative comparison
  • en-u-ka-shifted-u-kn-true additionally silently doesn't do numeric sorting

I have no idea where these new behaviors come from, I haven't paged in enough of the code to understand what's happening.

@danderson
Copy link
Contributor Author

Looking more closely at UTS 35, the multi-key language tags above are not well formed because multiple -u- extensions is ill-formed (I would have hoped that language.Parse would tell me :( ), although based on the lang.String() output it looks like the parser is graciously accepting the ill-formed inputs.

I tried collations en-u-ka-shifted-kn-true, en-u-ka-shifted-kn, en-u-ka-shifted-u-kn-true, en-u-ka-shifted-u-kn, en-u-kn-ka-shifted, en-u-kn-true-ka-shifted, en-u-kn-true-u-ka-shifted, all of which are trying to request the same behavior (ka=shifted, kn=true) in various degrees of ill-formed or non-canonicality (the canonical tag is en-u-ka-shifted-kn). All of them break both iterative comparison and numeric sorting, except for the ill-formed and non-canonical en-u-kn-true-u-ka-shifted, where both comparison styles are correct and both use numeric sorting.

@danderson
Copy link
Contributor Author

Correction: en-u-kn-true-u-ka-shifted is also broken, it produces a numeric sort but fails to ignore punctuation. It's hard to see until you start sorting whole slices with these comparators, but afaict there is just no way to enable kn=true and ka=shifted at the same time, even if you avoid Compare/CompareString and use collation keys explicitly, you can only get kn=true xor ka=shifted no matter how you phrase your request.

Given that numeric sorting is a wrapper around the standard weight, I'm assuming that it lacks awareness of ka=shifted and so breaks the weight adjustments that need to happen, but again I haven't paged in enough of the code to be sure.

@gburt
Copy link

gburt commented Dec 24, 2024

I ran into this trying to replicate Postgres/glibc's UTF8 collation/sorting (specifically the glibc 2.26 posix-style aka shifted-trimmed variable weighting option) eg using en-u-ka-posix using CompareString. I'm seeing that spaces aren't sorting after letters like they should for shifted-trimmed, eg CompareString("deluge", "de luge") should be -1 but isn't.

Reading collate.go and sort.go what I don't understand is why the SortStrings/sort.go just does a bytes.Compare on the KeyFromString, but the CompareString func does something entirely different. Shouldn't they be able to use the same algorithm? Would anyone object to me replacing CompareStrings's implementation with just KeyFromString for a and b and then bytes.Compare the keys?

Here's a new unit test trying to test the TR #10 variable-weighting examples:

diff --git collate/sort_test.go collate/sort_test.go
index 4bbb227..6918bb0 100644
--- collate/sort_test.go
+++ collate/sort_test.go
@@ -5,7 +5,9 @@
 package collate_test
 
 import (
+       "bytes"
        "fmt"
+       "strings"
        "testing"
 
        "golang.org/x/text/collate"
@@ -53,3 +55,105 @@ func TestSort(t *testing.T) {
                t.Errorf("found %s; want %s", res, want)
        }
 }
+
+func TestSortStringsAndCompareString(t *testing.T) {
+       for _, tt := range []struct {
+               name string
+               c    *collate.Collator
+               want []string
+       }{
+               {
+                       name: "English default options",
+                       c:    collate.New(language.English),
+                       want: []string{
+                               "abc",
+                               "bcd",
+                               "ddd",
+                       },
+               },
+               {
+                       // From https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples
+                       name: "Blanked",
+                       c:    collate.New(language.MustParse("en-us-u-ka-blanked")),
+                       want: []string{
+                               "death",
+                               "de luge",
+                               "de-luge",
+                               "deluge",
+                               "de-luge",
+                               "de Luge",
+                               "de-Luge",
+                               "deLuge",
+                               "de-Luge",
+                               "demark",
+                       },
+               },
+               {
+                       // From https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples
+                       name: "Shifted",
+                       c:    collate.New(language.MustParse("en-us-u-ka-shifted")),
+                       want: []string{
+                               "death",
+                               "de luge",
+                               "de-luge",
+                               "de-luge",
+                               "deluge",
+                               "de Luge",
+                               "de-Luge",
+                               "de-Luge",
+                               "deLuge",
+                               "demark",
+                       },
+               },
+               {
+                       // From https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples
+                       name: "Shift-Trimmed",
+                       c:    collate.New(language.MustParse("en-us-u-ka-posix")),
+                       want: []string{
+                               "death",
+                               "deluge",
+                               "de luge",
+                               "de-luge",
+                               "de-luge",
+                               "deLuge",
+                               "de Luge",
+                               "de-Luge",
+                               "de-Luge",
+                               "demark",
+                       },
+               },
+       } {
+               t.Run(tt.name, func(t *testing.T) {
+                       actual := make([]string, len(tt.want))
+                       copy(actual, tt.want)
+                       tt.c.SortStrings(actual)
+
+                       p := func(v []string) string { return strings.Join(v, ", ") }
+                       if p(tt.want) != p(actual) {
+                               t.Errorf("SortStrings want: '%v'\n Got: '%v'", p(tt.want), p(actual))
+                       }
+
+                       buf := collate.Buffer{}
+                       for i := 0; i < len(tt.want)-1; i++ {
+                               a, b := tt.want[i], tt.want[i+1]
+                               kA, kB := tt.c.KeyFromString(&buf, a), tt.c.KeyFromString(&buf, b)
+                               if bytes.Compare(kA, kB) > 0 {
+                                       t.Errorf("KeyFromString for %v is bigger than for %v", a, b)
+                               }
+                       }
+
+                       for i := 0; i < len(tt.want)-1; i++ {
+                               a, b := tt.want[i], tt.want[i+1]
+                               cmp := tt.c.CompareString(a, b)
+                               if cmp > 0 {
+                                       t.Errorf("CompareString for '%v' vs '%v' is 1 when should be -1 or 0", a, b)
+                               }
+                       }
+               })
+       }
+}

And here's its output for me on master:

$ go test
--- FAIL: TestSortStringsAndCompareString (0.00s)
    --- FAIL: TestSortStringsAndCompareString/Blanked (0.00s)
        sort_test.go:154: CompareString for 'death' vs 'de luge' is 1 when should be -1 or 0
        sort_test.go:154: CompareString for 'deluge' vs 'de-luge' is 1 when should be -1 or 0
        sort_test.go:154: CompareString for 'de-luge' vs 'de Luge' is 1 when should be -1 or 0
        sort_test.go:154: CompareString for 'deLuge' vs 'de-Luge' is 1 when should be -1 or 0
    --- FAIL: TestSortStringsAndCompareString/Shift-Trimmed (0.00s)
        sort_test.go:154: CompareString for 'deluge' vs 'de luge' is 1 when should be -1 or 0
        sort_test.go:154: CompareString for 'deLuge' vs 'de Luge' is 1 when should be -1 or 0

@gburt
Copy link

gburt commented Dec 24, 2024

Doing that with this patch passes all unit tests (including my new ones):

diff --git collate/collate.go collate/collate.go
index d8c23cb..44a18d8 100644
--- collate/collate.go
+++ collate/collate.go
@@ -103,37 +103,23 @@ func (b *Buffer) Reset() {
 // Compare returns an integer comparing the two byte slices.
 // The result will be 0 if a==b, -1 if a < b, and +1 if a > b.
 func (c *Collator) Compare(a, b []byte) int {
-       // TODO: skip identical prefixes once we have a fast way to detect if a rune is
-       // part of a contraction. This would lead to roughly a 10% speedup for the colcmp regtest.
-       c.iter(0).SetInput(a)
-       c.iter(1).SetInput(b)
-       if res := c.compare(); res != 0 {
-               return res
-       }
-       if !c.ignore[colltab.Identity] {
-               return bytes.Compare(a, b)
-       }
-       return 0
+       var (
+               buf Buffer
+               kA  = c.Key(&buf, a)
+               kB  = c.Key(&buf, b)
+       )
+       return bytes.Compare(kA, kB)
 }
 
 // CompareString returns an integer comparing the two strings.
 // The result will be 0 if a==b, -1 if a < b, and +1 if a > b.
 func (c *Collator) CompareString(a, b string) int {
-       // TODO: skip identical prefixes once we have a fast way to detect if a rune is
-       // part of a contraction. This would lead to roughly a 10% speedup for the colcmp regtest.
-       c.iter(0).SetInputString(a)
-       c.iter(1).SetInputString(b)
-       if res := c.compare(); res != 0 {
-               return res
-       }
-       if !c.ignore[colltab.Identity] {
-               if a < b {
-                       return -1
-               } else if a > b {
-                       return 1
-               }
-       }
-       return 0
+       var (
+               buf Buffer
+               kA  = c.KeyFromString(&buf, a)
+               kB  = c.KeyFromString(&buf, b)
+       )
+       return bytes.Compare(kA, kB)
 }

Probably not great to be creating a new buffer each time. Should we stick one on the Collator struct, since it already has state/isn't threadsafe?

@gburt
Copy link

gburt commented Dec 24, 2024

Ah but in Shift-Trimmed mode the key for "deluge" vs "de luge" are identical when they should not be -- so the Compare/key-comparison is returning 0 for them when it should return -1. I notice that the keys do not have the FFFFs appended. Is this a failure/TODO in KeyFromString or compare() func or both?

@gburt
Copy link

gburt commented Dec 24, 2024

Ah I figured out why they weren't getting the FFFFs, for en-us-u-ka-posix it was set to ignore level4/quaternary but if I use en-us-u-ka-posix-ks-level4 as my language tag then the Key-wise comparison works and it's just CompareString (not SortStrings or the "manual" keywise comparisons) that fails (unless I apply the last patch to Compare/CompareString).

@gburt
Copy link

gburt commented Dec 24, 2024

I opened a PR

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/638717 mentions this issue: collate: fix Compare[String] funcs to use key comparisons

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants