[Bug] chunking fails on PDFs with one line text #590

sujee · 2024-09-14T07:10:33Z

Search before asking

I searched the issues and found no similar issues.

Component

Transforms/Other

What happened + What you expected to happen

I am trying to process very simple PDFs. Each PDF has one line of text. See the attached examples.

When I supply 2 PDFs, I expect 2 chunks as output. But I get zero chunks as output.

a1.pdf
b1.pdf

Reproduction script

Data and code are here: https://github.com/sujee/data-prep-kit/tree/test-ededupe/test/test-ededupe

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

Yes I am willing to submit a PR!

dolfim-ibm · 2024-09-16T06:14:28Z

In those documents, the text is detected as Page-footer, which is ignored in the markdown export and chunking.

I don't think we want to change the chunker or exporter to include footers. This might be resolved with a new layout model, but we don't have an ETA for it yet.

sujee · 2024-09-16T06:36:25Z

I am on the dev3 release. Here is the output for the input file a1.pdf.

Does this track with your observation?

{"_name": "",
 "description": {"logs": []},
 "equations": [],
 "figures": [],
 "file-info": {"#-pages": 1,
               "document-hash": "4512df83786d672e062f144a718290982e3a8952c20ddb11014cbb3dcb9b507d",
               "filename": "a1.pdf",
               "page-hashes": [{"hash": "1a75ddf16ddb235368915aed32ab00ccd753838488ec3bb785be9bb84c0d9259",
                                "model": "default",
                                "page": 1}]},
 "footnotes": [],
 "main-text": [{"name": "Text",
                "prov": [{"bbox": [132.78564453,
                                   655.18377686,
                                   251.93409729,
                                   665.57006836],
                          "page": 1,
                          "span": [0, 29]}],
                "text": "Twinkle, twinkle, little star",
                "type": "paragraph"},
               {"name": "Page-footer",
                "prov": [{"bbox": [303.13299561,
                                   87.43224335,
                                   308.11428833,
                                   96.62137604],
                          "page": 1,
                          "span": [0, 1]}],
                "text": "1",
                "type": "page-footer"}],
 "page-dimensions": [{"height": 792.0, "page": 1, "width": 612.0}],
 "page-footers": [],
 "page-headers": [],
 "tables": [],
 "type": "pdf-document"}

dolfim-ibm · 2024-09-16T06:43:03Z

Oh right, the footer was actually the page number. This is indeed interesting. We will evaluate it more.

dolfim-ibm · 2024-09-16T08:36:45Z

The issue is actually related to the text length. The chunker has a parameter for it with default value min_chunk_len=64.

In the new version of the library we actually expose it as a parameter, I think we could easily propagate it as an argument for the DPK transform as well.
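To illustrate why the two one-line PDFs yield zero chunks, here is a minimal sketch of a minimum-length filter. The function `filter_short_chunks` is hypothetical and is not the chunker's actual code; only the parameter name `min_chunk_len` and its default of 64 come from the comment above. That the length is measured in characters, and that the comparison is `>=`, are assumptions made for this example.

```python
def filter_short_chunks(texts, min_chunk_len=64):
    """Keep only texts with at least min_chunk_len characters.

    Hypothetical helper; the real chunker may measure or compare differently.
    """
    return [t for t in texts if len(t) >= min_chunk_len]


# Text extracted from a1.pdf in this issue (29 characters).
line = "Twinkle, twinkle, little star"

# With the default threshold of 64 the line is dropped.
default_result = filter_short_chunks([line])

# With a lower threshold the line survives as a chunk.
relaxed_result = filter_short_chunks([line], min_chunk_len=1)

print(len(line), len(default_result), len(relaxed_result))
```

Under these assumptions, the single 29-character line falls below the default threshold, so no chunk is emitted until the minimum is lowered.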

dolfim-ibm · 2024-09-16T09:20:59Z

The PR #591 will add the option to customize the minimum value.

sujee · 2024-09-16T18:59:21Z

> The PR #591 will add the option to customize the minimum value.

very nice 👏

sujee added the "bug" (Something isn't working) label Sep 14, 2024

Bytes-Explorer assigned dolfim-ibm Sep 16, 2024

dolfim-ibm mentioned this issue Sep 16, 2024

doc_chunk updates and new parameters #591

Merged

sujee closed this as completed Sep 20, 2024
