
Fixing tokenizer not stripping "Ideographic Full Stop" (chinese full … #783

Open

wants to merge 7 commits into master

Conversation

@reinoldus commented Feb 5, 2025

Continuing the discussion from #782

After some deeper research I think the Unicode category fix is the more appropriate one, because it generally results in better tokenization for Mandarin. I also moved the character replacement out of the for-loop so that punctuation can be replaced with blanks:

For this text (copied from news.cn):

  会谈后,两国元首共同签署《中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明》,见证签署共建“一带一路”合作规划以及外交、经贸、农业等领域多项合作文件。

The tokenizer in this PR produces:

['会谈后', '两国元首共同签署', '中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明', '见证签署共建', '一带一路', '合作规划以及外交', '经贸', '农业等领域多项合作文件']

Where the current one produces:

[]

Just adding the "Ideographic Full Stop" as punctuation would also result in [].

The only downside of this change is that the fingerprints might change quite significantly. Maybe we could hide it behind a feature flag to give people time to update their fingerprints?

Here is some code to verify:

import unicodedata
import string
from typing import List


def sample_tokens_CURRENT(inputstring: str, length: int = 64) -> List[str]:
    """Split input into list of tokens and adjust length threshold to make sure
    there is enough data."""
    tokens = []
    for token in inputstring.split():
        token = token.strip(string.punctuation)
        if token.isalnum():
            tokens.append(token)
    sample = []
    for i in range(4, -1, -1):
        sample = [t for t in tokens if len(t) > i]
        if len(sample) >= length / 2:
            return sample
    return sample


################# SAMPLE TOKEN EASY FIX


def sample_tokens_EASY_FIX(inputstring: str, length: int = 64) -> List[str]:
    """Split input into list of tokens and adjust length threshold to make sure
    there is enough data."""
    tokens = []
    for token in inputstring.split():
        token = token.strip(string.punctuation + "。")
        if token.isalnum():
            tokens.append(token)
    sample = []
    for i in range(4, -1, -1):
        sample = [t for t in tokens if len(t) > i]
        if len(sample) >= length / 2:
            return sample
    return sample


################## SAMPLE TOKEN PROPOSE

def strip_all_punctuation(text: str) -> str:
    """Replace all Unicode punctuation characters with spaces."""
    cleaned_chars = []

    for char in text:
        # Check if character is in any Unicode punctuation category
        is_punctuation = unicodedata.category(char).startswith('P')

        # Replace punctuation with space, otherwise keep character
        cleaned_char = ' ' if is_punctuation else char
        cleaned_chars.append(cleaned_char)

    return ''.join(cleaned_chars)


def sample_tokens_unicode_fix(inputstring: str, length: int = 64) -> list[str]:
    """Split input into list of tokens and adjust length threshold to make sure
    there is enough data."""
    tokens = []

    # replace punctuation with blanks beforehand to handle non-Latin languages better
    inputstring = strip_all_punctuation(inputstring)

    for token in inputstring.split():
        if token.isalnum():
            tokens.append(token)
    sample = []
    for i in range(4, -1, -1):
        sample = [t for t in tokens if len(t) > i]
        if len(sample) >= length / 2:
            return sample
    return sample


# this Chinese text was copied from news.cn

from trafilatura import deduplication

## Mix of languages
text = "Hello,World! こんにちは。नमस्ते! ¡Hola!"
print(sample_tokens_CURRENT(text))  # Returns: []
print(sample_tokens_EASY_FIX(text))  # Returns: []
print(sample_tokens_unicode_fix(text))  # Returns: ['Hello', 'World', 'こんにちは', 'Hola']

print("#" * 100)

# Mandarin word
mandarin_word = "行政長官岑浩。"
print(sample_tokens_CURRENT(mandarin_word)) 
print(sample_tokens_EASY_FIX(mandarin_word))
print(sample_tokens_unicode_fix(mandarin_word)) 

print("#" * 100)
# Chinese text
text_chinese = "  会谈后,两国元首共同签署《中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明》,见证签署共建“一带一路”合作规划以及外交、经贸、农业等领域多项合作文件。"

print(sample_tokens_CURRENT(text_chinese))  # Returns: []
print(sample_tokens_EASY_FIX(text_chinese))  # Returns: []
print(sample_tokens_unicode_fix(
    text_chinese))  # Returns: ['会谈后', '两国元首共同签署', '中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明', '见证签署共建', '一带一路', '合作规划以及外交', '经贸', '农业等领域多项合作文件']

Output:

[]
[]
['Hello', 'World', 'こんにちは', 'Hola']
#############################################
[]
['行政長官岑浩']
['行政長官岑浩']
#############################################
[]
[]
['会谈后', '两国元首共同签署', '中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明', '见证签署共建', '一带一路', '合作规划以及外交', '经贸', '农业等领域多项合作文件']

codecov bot commented Feb 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.27%. Comparing base (42ada5a) to head (8be03e8).
Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #783   +/-   ##
=======================================
  Coverage   99.27%   99.27%           
=======================================
  Files          21       21           
  Lines        3587     3599   +12     
=======================================
+ Hits         3561     3573   +12     
  Misses         26       26           


@adbar (Owner) commented Feb 5, 2025

Hi @reinoldus, thanks for the detailed PR! Everything works and your logic makes perfect sense, but I have concerns about existing hashes (as you say) and overall speed:

  • I suggest implementing your logic as a fallback, i.e. only activating it if nothing is found.
  • Writing the new function like this should be more efficient:
    [' ' if unicodedata.category(c).startswith('P') else c for c in text]
    Using str.maketrans would also be an option but I'm not sure about the character range.
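
A minimal sketch of both options, mapping every Unicode punctuation code point to a space (illustrative only, not the code that ended up in the PR):

import sys
import unicodedata

# Option 1: the comprehension joined back into a single string
def strip_punctuation_comprehension(text: str) -> str:
    return "".join(" " if unicodedata.category(c).startswith("P") else c for c in text)

# Option 2: a translation table built once and reused via str.translate
PUNCT_TABLE = str.maketrans({
    chr(i): " " for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
})

def strip_punctuation_translate(text: str) -> str:
    return text.translate(PUNCT_TABLE)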

@adbar (Owner) commented Feb 5, 2025

The Python lexer provides a test via str.isidentifier(), which would also handle Unicode characters. Here is a small demo:

>>> "Test".isidentifier()
True
>>> "Test.".isidentifier()
False
>>> "".join([c for c in "  会谈后," if c.isidentifier()])
'会谈后'
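
Note that filtering characters this way removes punctuation without inserting a space, so adjacent tokens end up merged into one (this is visible in the benchmark comparison further down):

>>> "".join([c for c in "会谈后,两国元首共同签署" if c.isidentifier()])
'会谈后两国元首共同签署'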

@reinoldus (Author) commented Feb 6, 2025

Thank you for your feedback. I updated the logic so the new method is now a fallback. I also ran some benchmarks to figure out the fastest way to implement the fallback logic.
The script can be found here: https://gist.github.com/reinoldus/c7d47feeacfc0a9e4bf0496365d8ac78

Results:

Benchmark Results (times in milliseconds)
================================================================================

Test Case: mixed_languages
----------------------------------------

Performance results:
Function              Mean     StdDev        Min        Max
------------------------------------------------------------
Current              0.003      0.001      0.002      0.023
Easy Fix             0.003      0.001      0.003      0.035
Unicode Fix          0.011      0.001      0.010      0.033
Identifier           0.005      0.001      0.005      0.014
Translate            0.005      0.001      0.005      0.018

Test Case: mandarin
----------------------------------------

Performance results:
Function              Mean     StdDev        Min        Max
------------------------------------------------------------
Current              0.003      0.000      0.003      0.009
Easy Fix             0.003      0.000      0.003      0.009
Unicode Fix          0.143      0.021      0.106      0.228
Identifier           0.036      0.003      0.031      0.061
Translate            0.028      0.003      0.025      0.060

Test Case: chinese_text
----------------------------------------

Performance results:
Function              Mean     StdDev        Min        Max
------------------------------------------------------------
Current              0.005      0.001      0.005      0.014
Easy Fix             0.005      0.001      0.005      0.021
Unicode Fix          0.463      0.023      0.434      0.654
Identifier           0.174      0.007      0.167      0.261
Translate            0.144      0.007      0.137      0.232

Test Case: english
----------------------------------------

Performance results:
Function              Mean     StdDev        Min        Max
------------------------------------------------------------
Current              0.105      0.003      0.100      0.149
Easy Fix             0.147      0.009      0.140      0.230
Unicode Fix          0.620      0.060      0.582      1.192
Identifier           0.193      0.008      0.184      0.266
Translate            0.078      0.005      0.074      0.124

Looking at the data, the translate method is even faster than the current implementation (at least in the cases where both methods produce results, i.e. the english test case).
Identifier is a close contender, but in the current implementation it produces different tokens for the Mandarin text:

['会谈后', '两国元首共同签署', '中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明', '见证签署共建', '一带一路', '合作规划以及外交', '经贸', '农业等领域多项合作文件'] <-- translate
['会谈后两国元首共同签署中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明见证签署共建一带一路合作规划以及外交经贸农业等领域多项合作文件'] <-- identifier

I'm not too familiar with SimHash, but I think having more tokens is preferable.

@adbar (Owner) commented Feb 6, 2025

@reinoldus Let's go for translate then. Thanks for the tests, they are quite helpful for deciding.

Just two things before merging the PR:

  1. The code can be simplified:
    • tokens = [t for t in clean_text.split() if t.isalnum()] in your new function
    • the sampling loop is now present twice, it could be a function
  2. Please add a few of your cases under tests/deduplication_tests.py; it will be good for regression tests.
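
A rough sketch of what this simplified structure could look like, with the sampling loop factored out into a helper and the Unicode-punctuation pass used only as a fallback (names and details are illustrative, not the PR's final code):

import string
import sys
import unicodedata
from typing import List

# Translation table mapping every Unicode punctuation code point to a space
PUNCT_TBL = str.maketrans({
    i: " " for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
})


def adjust_sample(tokens: List[str], length: int) -> List[str]:
    """Lower the length threshold until enough tokens remain."""
    sample = []
    for i in range(4, -1, -1):
        sample = [t for t in tokens if len(t) > i]
        if len(sample) >= length / 2:
            return sample
    return sample


def sample_tokens(inputstring: str, length: int = 64) -> List[str]:
    """Try the current logic first, fall back to Unicode punctuation stripping."""
    tokens = [t.strip(string.punctuation) for t in inputstring.split()]
    tokens = [t for t in tokens if t.isalnum()]
    sample = adjust_sample(tokens, length)
    if not sample:
        clean_text = inputstring.translate(PUNCT_TBL)
        tokens = [t for t in clean_text.split() if t.isalnum()]
        sample = adjust_sample(tokens, length)
    return sample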

@reinoldus (Author) commented

@adbar I've added a few tests for the sample tokens method, making sure it only gets called when it is supposed to, and I added a few additional content_fingerprint tests.

Let me know if those tests are sufficient.

I also simplified the code according to your suggestions.

STRIP_EXTENSION = re.compile(r"\.[^/?#]{2,63}$")

BIN_COUNT_FUNC = getattr(int, "bit_count", lambda x: bin(x).count("1"))

PUNCT_TBL = dict.fromkeys((i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith("P")), " ")
@adbar (Owner) commented on this diff:

PUNCT_TBL = {i: ' ' for i in range(0x10FFFF) if unicodedata.category(chr(i)).startswith('P')}

This is simpler and entering the maximum Unicode codepoint manually removes the need for sys.

@adbar (Owner) commented Feb 7, 2025

Hi @reinoldus, we're nearly there, see the comment above. I also believe it would be best to make your dict static with str.maketrans on the same line. That probably makes str.translate more efficient; otherwise the type has to be converted.
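
A sketch of the suggested one-liner (illustrative; the final merged code may differ):

import unicodedata

# Static translation table built once at import time: every Unicode punctuation
# code point is mapped to a space (0x110000 covers the full code point range)
PUNCT_TBL = str.maketrans({i: " " for i in range(0x110000) if unicodedata.category(chr(i)).startswith("P")})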
