[Bug] improve performance of pdf2parquet #573
Comments
@dolfim-ibm Can you share any updates on this, please?
Soon we will update DPK to use the new Docling v2. As part of the new features (together with support for docx, html, pptx, etc.), we have a new parser which is about 10x faster. See https://github.com/DS4SD/docling-parse/?tab=readme-ov-file#performance-benchmarks. Note that this is not the speedup of the full pipeline, only of one of its important pieces. In the medium term, we are running heavy benchmarks to identify the characteristic timings and compare them with other tools.
Thank you @dolfim-ibm. When do you expect to integrate this change?
Should be doable this week.
Thanks!
@dolfim-ibm Is Docling v2 supported natively on Windows?
Yes, this has been supported since v1.17.0.
The version installed with #756 should now be 20-30% faster. Additionally, you could also use the parameter
Component
Other
What happened + What you expected to happen
Extracting text from PDFs into Parquet seems slow. It processes 1 page per second, so a 300-page PDF takes 300 seconds (5 minutes).
This negatively affects the user experience, as PDF2PQ is usually one of the first few steps in many workflows.
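To make the estimate above easy to redo for other document sizes, here is a minimal sketch of the arithmetic. The function names are hypothetical and not part of DPK or Docling; the default rate of 1 page per second is the figure reported in this issue.

```python
def estimated_seconds(pages: int, pages_per_second: float = 1.0) -> float:
    """Estimate the processing time for a PDF at a given throughput."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return pages / pages_per_second


def format_duration(seconds: float) -> str:
    """Render a duration in seconds as whole minutes and seconds."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes} min {secs} s"


# The case from this report: 300 pages at 1 page per second.
print(format_duration(estimated_seconds(300)))  # 5 min 0 s
```

Passing a different `pages_per_second` value gives a rough idea of how a change in throughput would affect the wait for a given document.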
Reproduction script
Data: https://github.com/sujee/data-prep-kit/tree/perf-1-pdf2pq/test/perf-pdf2pq/input
(These PDFs are about 100 pages each.)
Instructions and minimal code to reproduce the problem: https://github.com/sujee/data-prep-kit/tree/perf-1-pdf2pq/test/perf-pdf2pq
Instructions (README.md): https://github.com/sujee/data-prep-kit/blob/perf-1-pdf2pq/test/perf-pdf2pq/README.md
A py-spy generated speedscope file is attached. It can be viewed at https://www.speedscope.app/
test_pdf2pq_py.speed.txt
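For anyone who wants to produce a comparable profile, the following sketch builds a py-spy command line that writes speedscope output. The script and output file names are assumptions made for illustration (the exact command used for the attached file is not stated here); `record`, `--format speedscope` and `--output` are standard py-spy options.

```python
import shlex


def py_spy_record_command(script: str, output: str) -> list[str]:
    """Build a py-spy invocation that records a speedscope profile of a script."""
    return [
        "py-spy", "record",
        "--format", "speedscope",  # output format readable by speedscope.app
        "--output", output,
        "--", "python", script,
    ]


# Hypothetical file names, chosen only for this example.
cmd = py_spy_record_command("test_pdf2pq.py", "test_pdf2pq.speedscope.json")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once py-spy is installed. The resulting file can then be opened at https://www.speedscope.app/.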
Anything else
No response
OS
Ubuntu
Python
3.11.x