x/text/collate: Compare and CompareString broken in "ka-shifted" collation mode #68166
Pretty simple bug: https://cs.opensource.google/go/x/text/+/refs/tags/v0.16.0:collate/collate.go;l=166 . When the collator is in the shifted alternate mode,
Looking at the history, x/text/collate seems unmaintained (last change that wasn't fmts or Unicode upgrades was 2016 :( ). Given this, my suggestion tactically would be to rip out the custom codepaths for Compare/CompareString, and have them generate and bytes.Compare full collation keys. It's less efficient for one-off compares, but I think it would be correct, and a future maintainer can always make it faster again since it's all internal package gubbins.
CC @mpvl
I hacked up a more comprehensive debugging tool: https://go.dev/play/p/GcsuboreVW8 . For a bunch of comparisons and collations, it prints:
There's also main() in there suitable for running from the CLI as well. What I reported above can be seen in the Further weirdness I discovered while writing the tool:
Summary:
I have no idea where these new behaviors come from; I haven't paged in enough of the code to understand what's happening.
Looking more closely at UTS 35, the multi-key language tags above are not well formed because multiple I tried collations
Correction: Given that numeric sorting is a wrapper around the standard weight, I'm assuming that it lacks awareness of
I ran into this trying to replicate Postgres/glibc's UTF8 collation/sorting (specifically the glibc 2.26 posix-style aka shifted-trimmed variable weighting option) eg using
Reading collate.go and sort.go what I don't understand is why the SortStrings/sort.go just does a bytes.Compare on the KeyFromString, but the
Here's a new unit test trying to test the TR #10 variable-weighting examples:
diff --git collate/sort_test.go collate/sort_test.go
index 4bbb227..6918bb0 100644
--- collate/sort_test.go
+++ collate/sort_test.go
@@ -5,7 +5,9 @@
package collate_test
import (
+ "bytes"
"fmt"
+ "strings"
"testing"
"golang.org/x/text/collate"
@@ -53,3 +55,105 @@ func TestSort(t *testing.T) {
t.Errorf("found %s; want %s", res, want)
}
}
+
+func TestSortStringsAndCompareString(t *testing.T) {
+ for _, tt := range []struct {
+ name string
+ c *collate.Collator
+ want []string
+ }{
+ {
+ name: "English default options",
+ c: collate.New(language.English),
+ want: []string{
+ "abc",
+ "bcd",
+ "ddd",
+ },
+ },
+ {
+ // From https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples
+ name: "Blanked",
+ c: collate.New(language.MustParse("en-us-u-ka-blanked")),
+ want: []string{
+ "death",
+ "de luge",
+ "de-luge",
+ "deluge",
+ "de-luge",
+ "de Luge",
+ "de-Luge",
+ "deLuge",
+ "de-Luge",
+ "demark",
+ },
+ },
+ {
+ // From https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples
+ name: "Shifted",
+ c: collate.New(language.MustParse("en-us-u-ka-shifted")),
+ want: []string{
+ "death",
+ "de luge",
+ "de-luge",
+ "de-luge",
+ "deluge",
+ "de Luge",
+ "de-Luge",
+ "de-Luge",
+ "deLuge",
+ "demark",
+ },
+ },
+ {
+ // From https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples
+ name: "Shift-Trimmed",
+ c: collate.New(language.MustParse("en-us-u-ka-posix")),
+ want: []string{
+ "death",
+ "deluge",
+ "de luge",
+ "de-luge",
+ "de-luge",
+ "deLuge",
+ "de Luge",
+ "de-Luge",
+ "de-Luge",
+ "demark",
+ },
+ },
+ } {
+ t.Run(tt.name, func(t *testing.T) {
+ actual := make([]string, len(tt.want))
+ copy(actual, tt.want)
+ tt.c.SortStrings(actual)
+
+ p := func(v []string) string { return strings.Join(v, ", ") }
+ if p(tt.want) != p(actual) {
+ t.Errorf("SortStrings want: '%v'\n Got: '%v'", p(tt.want), p(actual))
+ }
+
+ buf := collate.Buffer{}
+ for i := 0; i < len(tt.want)-1; i++ {
+ a, b := tt.want[i], tt.want[i+1]
+ kA, kB := tt.c.KeyFromString(&buf, a), tt.c.KeyFromString(&buf, b)
+ if bytes.Compare(kA, kB) > 0 {
+ t.Errorf("KeyFromString for %v is bigger than for %v", a, b)
+ }
+ }
+
+ for i := 0; i < len(tt.want)-1; i++ {
+ a, b := tt.want[i], tt.want[i+1]
+ cmp := tt.c.CompareString(a, b)
+ if cmp > 0 {
+ t.Errorf("CompareString for '%v' vs '%v' is 1 when should be -1 or 0", a, b)
+ }
+ }
+ })
+ }
+}
And here's its output for me on master:
$ go test
--- FAIL: TestSortStringsAndCompareString (0.00s)
--- FAIL: TestSortStringsAndCompareString/Blanked (0.00s)
sort_test.go:154: CompareString for 'death' vs 'de luge' is 1 when should be -1 or 0
sort_test.go:154: CompareString for 'deluge' vs 'de-luge' is 1 when should be -1 or 0
sort_test.go:154: CompareString for 'de-luge' vs 'de Luge' is 1 when should be -1 or 0
sort_test.go:154: CompareString for 'deLuge' vs 'de-Luge' is 1 when should be -1 or 0
--- FAIL: TestSortStringsAndCompareString/Shift-Trimmed (0.00s)
sort_test.go:154: CompareString for 'deluge' vs 'de luge' is 1 when should be -1 or 0
sort_test.go:154: CompareString for 'deLuge' vs 'de Luge' is 1 when should be -1 or 0
Doing that with this patch passes all unit tests (including my new ones):
diff --git collate/collate.go collate/collate.go
index d8c23cb..44a18d8 100644
--- collate/collate.go
+++ collate/collate.go
@@ -103,37 +103,23 @@ func (b *Buffer) Reset() {
// Compare returns an integer comparing the two byte slices.
// The result will be 0 if a==b, -1 if a < b, and +1 if a > b.
func (c *Collator) Compare(a, b []byte) int {
- // TODO: skip identical prefixes once we have a fast way to detect if a rune is
- // part of a contraction. This would lead to roughly a 10% speedup for the colcmp regtest.
- c.iter(0).SetInput(a)
- c.iter(1).SetInput(b)
- if res := c.compare(); res != 0 {
- return res
- }
- if !c.ignore[colltab.Identity] {
- return bytes.Compare(a, b)
- }
- return 0
+ var (
+ buf Buffer
+ kA = c.Key(&buf, a)
+ kB = c.Key(&buf, b)
+ )
+ return bytes.Compare(kA, kB)
}
// CompareString returns an integer comparing the two strings.
// The result will be 0 if a==b, -1 if a < b, and +1 if a > b.
func (c *Collator) CompareString(a, b string) int {
- // TODO: skip identical prefixes once we have a fast way to detect if a rune is
- // part of a contraction. This would lead to roughly a 10% speedup for the colcmp regtest.
- c.iter(0).SetInputString(a)
- c.iter(1).SetInputString(b)
- if res := c.compare(); res != 0 {
- return res
- }
- if !c.ignore[colltab.Identity] {
- if a < b {
- return -1
- } else if a > b {
- return 1
- }
- }
- return 0
+ var (
+ buf Buffer
+ kA = c.KeyFromString(&buf, a)
+ kB = c.KeyFromString(&buf, b)
+ )
+ return bytes.Compare(kA, kB)
}
Probably not great to be creating a new buffer each time. Should we stick one on the Collator struct, since it already has state/isn't threadsafe?
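To make the buffer question concrete, here is a minimal stdlib-only sketch of a comparer that owns one scratch buffer and resets it on every call. Everything in it is an assumption for illustration: `keyBuffer`, `appendKey`, and `comparer` are hypothetical names, and the ASCII case-folding "key" only stands in for a real collation key. It is not how x/text implements `Buffer` or `Collator`.

```go
package main

import (
	"bytes"
	"fmt"
)

// keyBuffer is a hypothetical stand-in for a reusable key buffer.
type keyBuffer struct{ scratch []byte }

// reset discards all keys handed out so far; their slices must not be used afterwards.
func (b *keyBuffer) reset() { b.scratch = b.scratch[:0] }

// appendKey appends a toy key for s (ASCII letters folded to lower case)
// to the scratch space and returns the slice that holds just that key.
func (b *keyBuffer) appendKey(s string) []byte {
	start := len(b.scratch)
	for i := 0; i < len(s); i++ {
		ch := s[i]
		if ch >= 'A' && ch <= 'Z' {
			ch += 'a' - 'A'
		}
		b.scratch = append(b.scratch, ch)
	}
	return b.scratch[start:]
}

// comparer keeps one buffer and reuses it across calls.
// Because of that shared state it is not safe for concurrent use.
type comparer struct{ buf keyBuffer }

// compareString builds both keys in the shared buffer and compares them bytewise.
func (c *comparer) compareString(a, b string) int {
	c.buf.reset()
	kA := c.buf.appendKey(a)
	kB := c.buf.appendKey(b)
	return bytes.Compare(kA, kB)
}

func main() {
	c := &comparer{}
	fmt.Println(c.compareString("death", "deluge"))
	fmt.Println(c.compareString("Deluge", "deluge"))
}
```

After the first few calls the scratch slice stops growing, so later comparisons allocate nothing. The cost is that the comparer carries mutable state and would need a lock or one instance per goroutine.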
Ah, but in Shift-Trimmed mode the keys for "deluge" and "de luge" are identical when they should not be, so the Compare/key-comparison is returning 0 for them when it should return -1. I notice that the keys do not have the FFFFs appended. Is this a failure/TODO in KeyFromString or compare() func or both?
Ah I figured out why they weren't getting the FFFFs, for
I opened a PR
Change https://go.dev/cl/638717 mentions this issue:
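For readers who have not looked at the fourth-level weights before, the following is a toy model of the three variable-weighting options from TR #10 (blanked, shifted, shift-trimmed), written against the standard library only. The weights, the `lookup` table, and the names `mode`, `element`, `sortKey`, and `compare` are all made up for illustration; they are not DUCET values and not the x/text implementation. The sketch only aims to show why the FFFF weights, and trimming them, change the relative order of "deluge" and "de luge".

```go
package main

import (
	"bytes"
	"fmt"
)

// mode selects how variable elements (here: space and hyphen) are weighted.
type mode int

const (
	blanked mode = iota
	shifted
	shiftTrimmed
)

// element is a toy collation element: three weights plus a variable flag.
type element struct {
	primary, secondary, tertiary uint16
	variable                     bool
}

// lookup maps an ASCII byte to a made-up collation element.
func lookup(ch byte) element {
	switch {
	case ch == ' ':
		return element{0x0001, 0x20, 0x02, true}
	case ch == '-':
		return element{0x0002, 0x20, 0x02, true}
	case ch >= 'a' && ch <= 'z':
		return element{0x0100 + uint16(ch-'a'), 0x20, 0x02, false}
	case ch >= 'A' && ch <= 'Z':
		return element{0x0100 + uint16(ch-'A'), 0x20, 0x08, false}
	}
	return element{} // anything else is completely ignorable in this toy
}

// sortKey builds a big-endian key: level 1, 0000, level 2, 0000, level 3, 0000, level 4.
func sortKey(s string, m mode) []byte {
	var l1, l2, l3, l4 []uint16
	for i := 0; i < len(s); i++ {
		e := lookup(s[i])
		switch {
		case e.primary == 0:
			// Completely ignorable: contributes nothing at any level.
		case e.variable:
			// Variable: levels 1-3 are dropped; the shifted modes move
			// the old primary weight to level 4.
			if m != blanked {
				l4 = append(l4, e.primary)
			}
		default:
			l1 = append(l1, e.primary)
			l2 = append(l2, e.secondary)
			l3 = append(l3, e.tertiary)
			if m != blanked {
				l4 = append(l4, 0xFFFF) // regular elements get FFFF at level 4
			}
		}
	}
	if m == shiftTrimmed {
		// Shift-trimmed drops trailing FFFF weights from level 4.
		for len(l4) > 0 && l4[len(l4)-1] == 0xFFFF {
			l4 = l4[:len(l4)-1]
		}
	}
	var key []byte
	for i, level := range [][]uint16{l1, l2, l3, l4} {
		if i > 0 {
			key = append(key, 0, 0) // level separator, lower than any weight
		}
		for _, w := range level {
			key = append(key, byte(w>>8), byte(w))
		}
	}
	return key
}

// compare orders two strings by comparing their full sort keys.
func compare(a, b string, m mode) int {
	return bytes.Compare(sortKey(a, m), sortKey(b, m))
}

func main() {
	names := []string{"blanked", "shifted", "shift-trimmed"}
	for m, name := range names {
		fmt.Printf("%-13s deluge vs de luge: %d\n", name, compare("deluge", "de luge", mode(m)))
	}
}
```

In this model, a shift-trimmed key for "deluge" has an empty fourth level while "de luge" keeps `FFFF FFFF 0001`, so "deluge" sorts first. With the FFFFs kept (shifted), the space's low weight is compared against an FFFF and "de luge" sorts first instead. If a key generator emits neither the FFFFs nor the shifted primaries, the two keys collapse to the same bytes, which is the symptom described above.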
Go version
go version go1.22.3 linux/amd64
Output of go env in your module/workspace:
What did you do?
Tried to compare strings with the "shifted" variable weighting option, which you can select with e.g. language tag en-u-ka-shifted.
Demo: https://go.dev/play/p/afAOLzqh3Oe
What did you see happen?
99% of Compare() and CompareString() calls on different strings return 0.
What did you expect to see?
Comparison results that match the configured collation, or at least a courtesy panic letting me know it's known broken (spoiler alert...)