[Bug] Testing Rag notebook with latest release of pdf2Parquet, eDedup and DocID #583

Open
1 of 2 tasks
touma-I opened this issue Sep 10, 2024 · 6 comments

Comments

@touma-I
Collaborator

touma-I commented Sep 10, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/universal/doc_id, Transforms/universal/ededup, Transforms/Other, Other

What happened + What you expected to happen

  1. @dolfim-ibm When running the RAG notebook with the latest release of pdf2Parquet, the notebook crashes when the model is downloaded for the first time. Re-running the cell does not reproduce the error: if the model is already in the .EasyOCR folder, the error does not appear. Details of the error can be found in cell 6 of this notebook: https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.error.ipynb

  2. @sujee There are a few changes that need to be made to the notebook for it to work with the new release. Primarily:
    replace launcher = RayTransformLauncher(EdedupRayTransformConfiguration())
    with launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())
    replace launcher = RayTransformLauncher(DocIDRayTransformConfiguration())
    with launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())
    replace output_df.sample(3)
    with output_df.sample(len(output_df))

    For a complete reference on the required changes, please see https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.ipynb.
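The class renames listed above can be applied mechanically. This is a minimal sketch, assuming the notebook is a standard `.ipynb` JSON document; the helper name `apply_renames` and the one-cell `demo` notebook are hypothetical, and only the two old/new class names come from this issue.

```python
import json

# Old -> new class names, taken from the list above.
RENAMES = {
    "EdedupRayTransformConfiguration": "EdedupRayTransformRuntimeConfiguration",
    "DocIDRayTransformConfiguration": "DocIDRayTransformRuntimeConfiguration",
}


def apply_renames(notebook: dict) -> int:
    """Rewrite code cells in place; return the number of lines changed."""
    changed = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # A cell's source may be a list of lines or a single string.
        lines = source.splitlines(keepends=True) if isinstance(source, str) else source
        new_lines = []
        for line in lines:
            new_line = line
            for old, new in RENAMES.items():
                new_line = new_line.replace(old, new)
            changed += new_line != line
            new_lines.append(new_line)
        cell["source"] = new_lines
    return changed


# Hypothetical one-cell notebook, used only to illustrate the helper.
demo = {
    "cells": [
        {
            "cell_type": "code",
            "source": ["launcher = RayTransformLauncher(EdedupRayTransformConfiguration())\n"],
        }
    ]
}
demo_changed = apply_renames(demo)
```

For a real file, load it with `json.load`, call `apply_renames`, and write it back with `json.dump`. The old names are not substrings of the new ones, so running the helper twice changes nothing the second time.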

Reproduction script

data-prep-kit/examples/notebooks/rag/requirements.txt was modified to temporarily load the various modules from git. Once this issue is resolved or a workaround has been identified, I will create a dev3 release. For now, please use the git repo as follows:

git clone https://github.com/IBM/data-prep-kit.git t2
cd t2/examples/notebooks/rag && git checkout t2
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
./venv/bin/jupyter lab

From the browser, select and run the notebook rag_1A_dpk_process_ray.dev3.ipynb.

cc: @shahrok

Anything else

@dolfim-ibm : I noticed that pdf2parquet depends on docling==1.7.0 while doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed the pdf2parquet dependency to docling>=1.7.0.

@dolfim-ibm : deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2.
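To make the two conflicts above concrete, here is a simplified sketch that checks a version against a comma-separated specifier. It assumes plain dotted numeric versions and only the operators that appear in this issue; it is not what pip does (pip follows PEP 440, which the `packaging` library implements), and the helper names `parse_version` and `satisfies` are hypothetical.

```python
import operator

OPS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text: str) -> tuple:
    """Turn '1.8.2' into (1, 8, 2) so tuples compare component by component."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version: str, spec: str) -> bool:
    """True if `version` meets every comma-separated clause in `spec`."""
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators before one-character ones.
        for symbol in ("==", ">=", "<=", ">", "<"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not OPS[symbol](parse_version(version), bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# pdf2parquet pins docling==1.7.0, which doc_chunk's range rejects.
pinned_ok = satisfies("1.7.0", ">=1.8.2,<2.0.0")
# With the pin relaxed to >=1.7.0,<2.0.0, a version such as 1.8.2 fits both ranges.
relaxed_ok = satisfies("1.8.2", ">=1.7.0,<2.0.0") and satisfies("1.8.2", ">=1.8.2,<2.0.0")
# platformdirs 4.3.2 falls outside deepsearch-toolkit 1.0.0's range.
platformdirs_ok = satisfies("4.3.2", "<4.0.0,>=3.5.1")
```

Here `pinned_ok` and `platformdirs_ok` are `False` and `relaxed_ok` is `True`, which matches the two conflicts described above.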

OS

MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@touma-I touma-I added the bug Something isn't working label Sep 10, 2024
@dolfim-ibm
Member

> @dolfim-ibm : deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2.

This was fixed yesterday. A new install should pick up deepsearch-toolkit 1.0.1 directly, which fixes it.

> @dolfim-ibm : I noticed that pdf2parquet depends on docling==1.7.0 and doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed pdf2parquet dependency to docling>=1.7.0

Yes, I think it should be good to go with docling>=1.7.0,<2.0.0.

@dolfim-ibm
Member

Regarding the models download, I'm able to reproduce it. Can you please try again with the latest version of the branch?

@touma-I
Collaborator Author

touma-I commented Sep 11, 2024

@dolfim-ibm We still have the same problem even when using the latest release. Looking at the changes, I don't see how it would have addressed this problem. Please advise. Thanks

-        num_tables = len(doc.output.tables if doc.output.tables is not None else 0)
-        num_doc_elements = len(
-            doc.output.main_text if doc.output.main_text is not None else 0
-        )
+        num_tables = len(doc.output.tables) if doc.output.tables is not None else 0
+        num_doc_elements = len(doc.output.main_text) if doc.output.main_text is not None else 0
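
For reference, the diff above only changes where `len()` is applied. In the removed lines the conditional sits inside the `len()` call, so a `None` field leads to `len(0)`, which raises `TypeError`. The sketch below reproduces that in isolation; `Output`, `count_old` and `count_new` are hypothetical stand-ins rather than the library's actual classes, and this says nothing about the model download error itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Output:
    # Hypothetical stand-in for doc.output, with only the two fields from the diff.
    tables: Optional[list] = None
    main_text: Optional[list] = None


def count_old(items: Optional[list]) -> int:
    # Removed form: the conditional is evaluated inside len(),
    # so a None value ends up as len(0).
    return len(items if items is not None else 0)


def count_new(items: Optional[list]) -> int:
    # Added form: len() only runs when items is not None.
    return len(items) if items is not None else 0


empty = Output()
try:
    count_old(empty.tables)
    old_error = None
except TypeError as exc:
    old_error = exc

new_count = count_new(empty.tables)
filled_count = count_new(Output(tables=[1, 2]).tables)
```

With a `None` field the old form raises `TypeError` and the new form returns 0. Both forms agree whenever the field is an actual list.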
 

@sujee
Contributor

sujee commented Sep 11, 2024

sujee@08024dc

makes required changes.

Related : #585

@dolfim-ibm dolfim-ibm added the fixed Marks an issues as fixed in the dev branch label Sep 20, 2024
@dolfim-ibm
Member

@sujee @touma-I I think this is now resolved, can you please confirm?

@dolfim-ibm dolfim-ibm removed the fixed Marks an issues as fixed in the dev branch label Sep 20, 2024
@sujee
Contributor

sujee commented Sep 20, 2024

I have made the necessary changes on my branch. I will submit a PR soon.
