Function for scanning UTF-8 sequences
This is largely redundant given the built-in functionality in Go, but
sometimes it's convenient to have everything expressed as a
bufio.SplitFunc so that any kind of token can be handled uniformly, as
with the helper functions in this package.
apparentlymart committed May 27, 2017
1 parent 896d476 commit c658837
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions textseg/utf8_seqs.go
@@ -0,0 +1,19 @@
package textseg

import "unicode/utf8"

// ScanUTF8Sequences is a split function for bufio.Scanner that splits
// on UTF-8 sequence boundaries.
//
// This is included largely for completeness, since this behavior is already
// built in to Go when ranging over a string.
func ScanUTF8Sequences(data []byte, atEOF bool) (int, []byte, error) {
	if len(data) == 0 {
		return 0, nil, nil
	}
	if !atEOF && !utf8.FullRune(data) {
		return 0, nil, nil // incomplete sequence: request more data
	}
	_, seqLen := utf8.DecodeRune(data)
	return seqLen, data[:seqLen], nil
}
