
Fixing tokenizer not stripping "Ideographic Full Stop" (chinese full … #783

Open

wants to merge 7 commits into master

Conversation

@reinoldus commented Feb 5, 2025

Continuing the discussion from #782

After some deeper research I think the Unicode category fix is the more appropriate one, because it generally results in better tokenization for Mandarin. I also moved the character replacement out of the for-loop so that punctuation can be replaced with blanks:

For this text (copied from news.cn):

  会谈后,两国元首共同签署《中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明》,见证签署共建“一带一路”合作规划以及外交、经贸、农业等领域多项合作文件。

The tokenizer in this PR produces:

['会谈后', '两国元首共同签署', '中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明', '见证签署共建', '一带一路', '合作规划以及外交', '经贸', '农业等领域多项合作文件']

Where the current one produces:

[]

Just adding the "Ideographic Full Stop" as punctuation would also result in [].

The only downside of this change is that the fingerprints might change quite significantly. Maybe we could hide it behind a feature flag to give people time to update their fingerprints?

Here is some code to verify:

import unicodedata
import string
from typing import List


def sample_tokens_CURRENT(inputstring: str, length: int = 64) -> List[str]:
    """Split input into list of tokens and adjust length threshold to make sure
    there is enough data."""
    tokens = []
    for token in inputstring.split():
        token = token.strip(string.punctuation)
        if token.isalnum():
            tokens.append(token)
    sample = []
    for i in range(4, -1, -1):
        sample = [t for t in tokens if len(t) > i]
        if len(sample) >= length / 2:
            return sample
    return sample


################# SAMPLE TOKEN EASY FIX


def sample_tokens_EASY_FIX(inputstring: str, length: int = 64) -> List[str]:
    """Split input into list of tokens and adjust length threshold to make sure
    there is enough data."""
    tokens = []
    for token in inputstring.split():
        token = token.strip(string.punctuation + "。")
        if token.isalnum():
            tokens.append(token)
    sample = []
    for i in range(4, -1, -1):
        sample = [t for t in tokens if len(t) > i]
        if len(sample) >= length / 2:
            return sample
    return sample


################## SAMPLE TOKEN PROPOSE

def strip_all_punctuation(text: str) -> str:
    """Replace all Unicode punctuation characters with spaces."""
    cleaned_chars = []

    for char in text:
        # Check if character is in any Unicode punctuation category
        is_punctuation = unicodedata.category(char).startswith('P')

        # Replace punctuation with space, otherwise keep character
        cleaned_char = ' ' if is_punctuation else char
        cleaned_chars.append(cleaned_char)

    return ''.join(cleaned_chars)


def sample_tokens_unicode_fix(inputstring: str, length: int = 64) -> list[str]:
    """Split input into list of tokens and adjust length threshold to make sure
    there is enough data."""
    tokens = []

    # replace punctuation with blanks beforehand to handle non-Latin languages better
    inputstring = strip_all_punctuation(inputstring)

    for token in inputstring.split():
        if token.isalnum():
            tokens.append(token)
    sample = []
    for i in range(4, -1, -1):
        sample = [t for t in tokens if len(t) > i]
        if len(sample) >= length / 2:
            return sample
    return sample


# this Chinese text was copied from news.cn

from trafilatura import deduplication

## Mix of languages
text = "Hello,World! こんにちは。नमस्ते! ¡Hola!"
print(sample_tokens_CURRENT(text))  # Returns: []
print(sample_tokens_EASY_FIX(text))  # Returns: []
print(sample_tokens_unicode_fix(text))  # Returns: ['Hello', 'World', 'こんにちは', 'Hola']

print("#" * 100)

# Mandarin word
mandarin_word = "行政長官岑浩。"
print(sample_tokens_CURRENT(mandarin_word)) 
print(sample_tokens_EASY_FIX(mandarin_word))
print(sample_tokens_unicode_fix(mandarin_word)) 

print("#" * 100)
# Chinese text
text_chinese = "  会谈后,两国元首共同签署《中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明》,见证签署共建“一带一路”合作规划以及外交、经贸、农业等领域多项合作文件。"

print(sample_tokens_CURRENT(text_chinese))  # Returns: []
print(sample_tokens_EASY_FIX(text_chinese))  # Returns: []
print(sample_tokens_unicode_fix(
    text_chinese))  # Returns: ['会谈后', '两国元首共同签署', '中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明', '见证签署共建', '一带一路', '合作规划以及外交', '经贸', '农业等领域多项合作文件']

Output:

[]
[]
['Hello', 'World', 'こんにちは', 'Hola']
#############################################
[]
['行政長官岑浩']
['行政長官岑浩']
#############################################
[]
[]
['会谈后', '两国元首共同签署', '中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明', '见证签署共建', '一带一路', '合作规划以及外交', '经贸', '农业等领域多项合作文件']

codecov bot commented Feb 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.27%. Comparing base (42ada5a) to head (8be03e8).
Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #783   +/-   ##
=======================================
  Coverage   99.27%   99.27%           
=======================================
  Files          21       21           
  Lines        3587     3599   +12     
=======================================
+ Hits         3561     3573   +12     
  Misses         26       26           


@adbar (Owner) commented Feb 5, 2025

Hi @reinoldus, thanks for the detailed PR! Everything works and your logic makes perfect sense, but I have concerns about existing hashes (as you say) and overall speed:

  • I suggest implementing your logic as a fallback, i.e. only activating it if nothing is found.
  • Writing the new function like this should be more efficient:
    [' ' if unicodedata.category(c).startswith('P') else c for c in text]
    Using str.maketrans would also be an option but I'm not sure about the character range.
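
A minimal sketch of both options, mapping every Unicode punctuation code point to a space (illustrative only, not the code that ended up in the PR):

import sys
import unicodedata

# Option 1: the comprehension joined back into a single string
def strip_punctuation_comprehension(text: str) -> str:
    return "".join(" " if unicodedata.category(c).startswith("P") else c for c in text)

# Option 2: a translation table built once and reused via str.translate
PUNCT_TABLE = str.maketrans({
    chr(i): " " for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
})

def strip_punctuation_translate(text: str) -> str:
    return text.translate(PUNCT_TABLE)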

@adbar (Owner) commented Feb 5, 2025

The Python lexer provides a test via str.isidentifier(), which would also handle Unicode characters. Here is a small demo:

>>> "Test".isidentifier()
True
>>> "Test.".isidentifier()
False
>>> "".join([c for c in "  会谈后," if c.isidentifier()])
'会谈后'
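
Note that filtering characters this way removes punctuation without inserting a space, so adjacent tokens end up merged into one (this is visible in the benchmark comparison further down):

>>> "".join([c for c in "会谈后,两国元首共同签署" if c.isidentifier()])
'会谈后两国元首共同签署'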

@reinoldus (Author) commented Feb 6, 2025

Thank you for your feedback. I updated the logic so the new method is now a fallback. I also ran some benchmarks to figure out the fastest way to implement the fallback logic.
The script can be found here: https://gist.github.com/reinoldus/c7d47feeacfc0a9e4bf0496365d8ac78

Results:

Benchmark Results (times in milliseconds)
================================================================================

Test Case: mixed_languages
----------------------------------------

Performance results:
Function              Mean     StdDev        Min        Max
------------------------------------------------------------
Current              0.003      0.001      0.002      0.023
Easy Fix             0.003      0.001      0.003      0.035
Unicode Fix          0.011      0.001      0.010      0.033
Identifier           0.005      0.001      0.005      0.014
Translate            0.005      0.001      0.005      0.018

Test Case: mandarin
----------------------------------------

Performance results:
Function              Mean     StdDev        Min        Max
------------------------------------------------------------
Current              0.003      0.000      0.003      0.009
Easy Fix             0.003      0.000      0.003      0.009
Unicode Fix          0.143      0.021      0.106      0.228
Identifier           0.036      0.003      0.031      0.061
Translate            0.028      0.003      0.025      0.060

Test Case: chinese_text
----------------------------------------

Performance results:
Function              Mean     StdDev        Min        Max
------------------------------------------------------------
Current              0.005      0.001      0.005      0.014
Easy Fix             0.005      0.001      0.005      0.021
Unicode Fix          0.463      0.023      0.434      0.654
Identifier           0.174      0.007      0.167      0.261
Translate            0.144      0.007      0.137      0.232

Test Case: english
----------------------------------------

Performance results:
Function              Mean     StdDev        Min        Max
------------------------------------------------------------
Current              0.105      0.003      0.100      0.149
Easy Fix             0.147      0.009      0.140      0.230
Unicode Fix          0.620      0.060      0.582      1.192
Identifier           0.193      0.008      0.184      0.266
Translate            0.078      0.005      0.074      0.124

Looking at the data, the translate method is even faster than the current implementation (at least in the cases where both methods produce results, i.e. the english test case).
Identifier is a close contender, but in the current implementation it produces different tokens for the Mandarin text:

['会谈后', '两国元首共同签署', '中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明', '见证签署共建', '一带一路', '合作规划以及外交', '经贸', '农业等领域多项合作文件'] <-- translate
['会谈后两国元首共同签署中华人民共和国和吉尔吉斯共和国关于深化新时代全面战略伙伴关系的联合声明见证签署共建一带一路合作规划以及外交经贸农业等领域多项合作文件'] <-- identifier

I'm not too familiar with SimHash, but I think having more tokens is preferable.

@adbar (Owner) commented Feb 6, 2025

@reinoldus Let's go for translate then. Thanks for the tests, they are quite helpful for deciding.

Just two things before merging the PR:

  1. The code can be simplified:
    • tokens = [t for t in clean_text.split() if t.isalnum()] in your new function
    • the sampling loop is now present twice, it could be a function
  2. Please add a few of your cases under tests/deduplication_tests.py; it will be good for regression tests.
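
A rough sketch of what this simplified structure could look like, with the sampling loop factored out into a helper and the Unicode-punctuation pass used only as a fallback (names and details are illustrative, not the PR's final code):

import string
import sys
import unicodedata
from typing import List

# Translation table mapping every Unicode punctuation code point to a space
PUNCT_TBL = str.maketrans({
    i: " " for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
})


def adjust_sample(tokens: List[str], length: int) -> List[str]:
    """Lower the length threshold until enough tokens remain."""
    sample = []
    for i in range(4, -1, -1):
        sample = [t for t in tokens if len(t) > i]
        if len(sample) >= length / 2:
            return sample
    return sample


def sample_tokens(inputstring: str, length: int = 64) -> List[str]:
    """Try the current logic first, fall back to Unicode punctuation stripping."""
    tokens = [t.strip(string.punctuation) for t in inputstring.split()]
    tokens = [t for t in tokens if t.isalnum()]
    sample = adjust_sample(tokens, length)
    if not sample:
        clean_text = inputstring.translate(PUNCT_TBL)
        tokens = [t for t in clean_text.split() if t.isalnum()]
        sample = adjust_sample(tokens, length)
    return sample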

@reinoldus (Author) commented

@adbar I've added a few tests for the sample tokens method, making sure it only gets called when it is supposed to, and I added a few additional content_fingerprint tests.

Let me know if those tests are sufficient.

I also simplified the code according to your suggestions.

STRIP_EXTENSION = re.compile(r"\.[^/?#]{2,63}$")

BIN_COUNT_FUNC = getattr(int, "bit_count", lambda x: bin(x).count("1"))

PUNCT_TBL = dict.fromkeys((i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith("P")), " ")
@adbar (Owner) commented on this diff:

PUNCT_TBL = {i: ' ' for i in range(0x10FFFF) if unicodedata.category(chr(i)).startswith('P')}

This is simpler and entering the maximum Unicode codepoint manually removes the need for sys.

@adbar (Owner) commented Feb 7, 2025

Hi @reinoldus, we're nearly there, see the comment above. I also believe it would be best to make your dict static with str.maketrans on the same line. That probably makes str.translate more efficient; otherwise the type has to be converted.
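
A sketch of the suggested one-liner (illustrative; the final merged code may differ):

import unicodedata

# Static translation table built once at import time: every Unicode punctuation
# code point is mapped to a space (0x110000 covers the full code point range)
PUNCT_TBL = str.maketrans({i: " " for i in range(0x110000) if unicodedata.category(chr(i)).startswith("P")})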
