[Bug] Testing Rag notebook with latest release of pdf2Parquet, eDedup and DocID #583
Comments
This was fixed just yesterday. A new install should pull in deepsearch-toolkit 1.0.1 directly, which fixes it.
Yes, I think it should be good to go with
Regarding the models download, I'm able to reproduce it. Can you please try again with the latest version of the branch?
@dolfim-ibm We still have the same problem even when using the latest release. Looking at the changes, I don't see how it would have addressed this problem. Please advise. Thanks
makes required changes. Related: #585
I have made the necessary changes on my branch. Will submit a PR soon.
Search before asking
Component
Transforms/universal/doc_id, Transforms/universal/ededup, Transforms/Other, Other
What happened + What you expected to happen
@dolfim-ibm When running the rag notebook with the latest release of pdf2Parquet, the notebook crashes when downloading the model for the first time. Re-running the cell does not reproduce the error: if the model is already in the .EasyOCR folder, the error does not show up. Details of the error can be found in cell 6 of this notebook: https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.error.ipynb
@sujee There are a few changes that need to be made to the notebook for it to work with the new release. Primarily:
replace
launcher = RayTransformLauncher(EdedupRayTransformConfiguration())
with
launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())
replace
launcher = RayTransformLauncher(DocIDRayTransformConfiguration())
with
launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())
replace
output_df.sample(3)
with
output_df.sample(len(output_df))
For a complete reference on the required changes, please see https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.ipynb.
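The replacements above could also be applied mechanically rather than by hand. The following is a minimal sketch, not part of the project: `patch_source` and `patch_notebook` are hypothetical helper names, and the only strings it rewrites are the ones quoted in the list above. It assumes the notebook has already been loaded as a dict (for example with `json.load`) and only touches code cells. Import statements for the renamed configuration classes are not handled here and may need to be updated separately.

```python
import json

# Old -> new snippets, copied verbatim from the list in this issue.
REPLACEMENTS = {
    "RayTransformLauncher(EdedupRayTransformConfiguration())":
        "RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())",
    "RayTransformLauncher(DocIDRayTransformConfiguration())":
        "RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())",
    "output_df.sample(3)": "output_df.sample(len(output_df))",
}


def patch_source(source: str) -> str:
    """Apply every old -> new replacement to one piece of cell source."""
    for old, new in REPLACEMENTS.items():
        source = source.replace(old, new)
    return source


def patch_notebook(notebook: dict) -> dict:
    """Return a patched copy of a parsed .ipynb dict; the input is not modified."""
    patched = json.loads(json.dumps(notebook))  # deep copy via JSON round-trip
    for cell in patched.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # leave markdown and raw cells untouched
        src = cell.get("source", "")
        # In .ipynb files, "source" is either a single string or a list of lines.
        if isinstance(src, list):
            cell["source"] = [patch_source(line) for line in src]
        else:
            cell["source"] = patch_source(src)
    return patched
```

Because none of the old snippets occurs inside its replacement, running the helper twice on the same notebook leaves it unchanged after the first pass.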
Reproduction script
data-prep-kit/examples/notebooks/rag/requirement.txt in the rag folder was modified to temporarily load the various modules from git. Once this issue is resolved or a workaround has been identified, I will create a dev3 release. For now, please use the git repo as follows:
From the browser, select and run the notebook rag_1A_dpk_process_ray.dev3.ipynb
cc: @shahrok
Anything else
@dolfim-ibm: I noticed that pdf2parquet depends on docling==1.7.0 and doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed the pdf2parquet dependency to docling>=1.7.0.
@dolfim-ibm: deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2.
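To make the two conflicts above concrete, here is a small stdlib-only sketch that checks a version against a comma-separated constraint. It is deliberately simplified and is not PEP 440 compliant: `parse` and `satisfies` are hypothetical helpers that only handle plain numeric `X.Y.Z` versions and the operators `==`, `>=`, `<=`, `>`, `<`. For real dependency work, `packaging.specifiers.SpecifierSet` is the more reliable tool.

```python
import operator

# Two-character operators are listed first so ">=" is not mistaken for ">".
OPERATORS = [
    ("==", operator.eq),
    (">=", operator.ge),
    ("<=", operator.le),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse(version: str) -> tuple:
    """Turn '1.8.2' into (1, 8, 2); only plain numeric versions are supported."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies(version: str, spec: str) -> bool:
    """Return True if version meets every clause in spec, e.g. '>=1.8.2,<2.0.0'."""
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in OPERATORS:
            if clause.startswith(symbol):
                if not compare(parse(version), parse(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported constraint: {clause!r}")
    return True


# The pins reported in this issue:
print(satisfies("1.7.0", ">=1.8.2,<2.0.0"))   # docling pinned by pdf2parquet vs doc_chunk's range
print(satisfies("1.8.2", ">=1.7.0"))          # relaxed pdf2parquet pin vs a version doc_chunk accepts
print(satisfies("4.3.2", "<4.0.0,>=3.5.1"))   # platformdirs preferred by ray vs deepsearch-toolkit 1.0.0
```

The first and third checks fail, which matches the conflicts described above; the second passes, which is why relaxing the pdf2parquet pin to docling>=1.7.0 lets a single docling version satisfy both transforms.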
OS
MacOS (limited support)
Python
3.11.x
Are you willing to submit a PR?