Fixing tokenizer not stripping "Ideographic Full Stop" (chinese full … #783
base: master
Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅

@@            Coverage Diff            @@
##           master     #783     +/-   ##
=========================================
  Coverage   99.27%   99.27%
=========================================
  Files          21       21
  Lines        3587     3599     +12
=========================================
+ Hits         3561     3573     +12
  Misses         26       26
Hi @reinoldus, thanks for the detailed PR! Everything works and your logic makes perfect sense, but I have concerns about existing hashes (as you say) and overall speed:

The Python lexer provides a test with
Thank you for your feedback! I updated the logic so this new method is now a fallback. I also ran some benchmarks to figure out the fastest way to implement the fallback logic. Results:

Looking at the data, the translate method is even faster than the current implementation, at least in those cases where both methods produce results (the English test case).
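For reference, such a comparison can be set up with timeit along these lines (a minimal sketch, not the benchmark script used above; the regex variant is an illustrative stand-in, not the library's actual tokenizer):

```python
import re
import sys
import timeit
import unicodedata

# map every Unicode punctuation codepoint to a space
PUNCT_TBL = dict.fromkeys(
    (i for i in range(sys.maxunicode + 1)
     if unicodedata.category(chr(i)).startswith("P")),
    " ",
)

TEXT = "Sample English text, with punctuation marks and words. " * 200

def with_translate():
    # one-pass punctuation removal, then whitespace split
    return TEXT.translate(PUNCT_TBL).split()

def with_regex():
    # illustrative regex-based alternative
    return re.findall(r"[\w-]+", TEXT)

print("translate:", timeit.timeit(with_translate, number=1000))
print("regex:    ", timeit.timeit(with_regex, number=1000))
```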
I'm not too familiar with SimHash, but I think having more tokens is desirable.
@reinoldus Let's go for translate then, thanks for the tests, that's quite helpful for deciding. Just two things before merging the PR:
@adbar I've added a few tests for the sample_tokens method, making sure it is only called when it is supposed to be called, and I added a few additional content_fingerprint tests. Let me know if those tests are sufficient. I also simplified the code according to your suggestions.
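In rough shape, such tests could look like this (a sketch only; the module path and exact assertions are assumptions for illustration, not the PR's actual test code):

```python
from trafilatura.deduplication import content_fingerprint, sample_tokens

def test_sample_tokens_fallback():
    # Latin-script input: the whitespace path already yields tokens
    assert sample_tokens("This is a simple test")
    # CJK input without spaces: only the punctuation fallback yields tokens
    assert sample_tokens("这是一个测试。")

def test_content_fingerprint_stability():
    # identical input keeps producing an identical fingerprint
    assert content_fingerprint("这是一个测试。") == content_fingerprint("这是一个测试。")
```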
STRIP_EXTENSION = re.compile(r"\.[^/?#]{2,63}$")

# int.bit_count() exists from Python 3.10 on, fall back to counting "1"s
BIN_COUNT_FUNC = getattr(int, "bit_count", lambda x: bin(x).count("1"))

# map every Unicode punctuation codepoint to a blank
PUNCT_TBL = dict.fromkeys(
    (i for i in range(sys.maxunicode)
     if unicodedata.category(chr(i)).startswith("P")), " ")
PUNCT_TBL = {i: ' ' for i in range(0x10FFFF) if unicodedata.category(chr(i)).startswith('P')}
This is simpler, and entering the maximum Unicode codepoint manually removes the need for sys.
Hi @reinoldus, we're nearly there, see the comment above. I also believe it would be best to make your dict static with str.maketrans on the same line.
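A minimal sketch of that suggestion, assuming it means building the table once at import time:

```python
import sys
import unicodedata

# build the translation table once, on a single line, with str.maketrans
PUNCT_TBL = str.maketrans({i: " " for i in range(sys.maxunicode + 1)
                           if unicodedata.category(chr(i)).startswith("P")})

# U+3002 Ideographic Full Stop is now mapped to a space as well
print("你好。世界".translate(PUNCT_TBL))  # -> 你好 世界
```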
Continuing the discussion from #782.

After some deeper research, I think the Unicode category fix is more appropriate because it generally results in better tokenization for Mandarin. I also moved the character replacement out of the for-loop so that punctuation can be replaced with blanks (a rough sketch follows):
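A rough sketch of the change, reusing the names from this discussion (sample_tokens, PUNCT_TBL as defined above); this is an illustration, not the exact diff:

```python
import string

def sample_tokens(inputstring):
    # usual path: split on whitespace and strip surrounding punctuation
    tokens = [t.strip(string.punctuation) for t in inputstring.split()]
    tokens = [t for t in tokens if t.isalnum()]
    if not tokens:
        # fallback for scripts without word-delimiting spaces (e.g. Mandarin):
        # replace all Unicode punctuation with blanks in one pass, then split
        translated = inputstring.translate(PUNCT_TBL)
        tokens = [t for t in translated.split() if t.isalnum()]
    return tokens
```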
For this text (copied from news.cn):
The tokenizer in this PR produces:
whereas the current one produces:
Just adding the "Ideographic Full Stop" as punctuation would also result in []. The only downside of this change is that the fingerprints might change quite significantly; maybe we can hide it behind a feature flag to give people time to update their fingerprints?
Here is some code to verify:
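(The snippet below is a minimal reconstruction of such a check; the whitespace-plus-strip tokenizer stands in for the current behavior and is an assumption, not the library's exact code.)

```python
import string
import sys
import unicodedata

# map every Unicode punctuation codepoint, incl. U+3002, to a space
PUNCT_TBL = dict.fromkeys(
    (i for i in range(sys.maxunicode + 1)
     if unicodedata.category(chr(i)).startswith("P")), " ")

text = "新华社北京6月15日电。这是一个测试。"

# stand-in for the current behavior: split on whitespace,
# strip ASCII punctuation, keep alphanumeric tokens
current = [t.strip(string.punctuation) for t in text.split()]
current = [t for t in current if t.isalnum()]

# proposed fallback: blank out all Unicode punctuation first
proposed = [t for t in text.translate(PUNCT_TBL).split() if t.isalnum()]

print(current)
print(proposed)
```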
Output:
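For the sketch above:

```
[]
['新华社北京6月15日电', '这是一个测试']
```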