[Bug] Testing Rag notebook with latest release of pdf2Parquet, eDedup and DocID #583

touma-I · 2024-09-10T07:10:48Z

Search before asking

I searched the issues and found no similar issues.

Component

Transforms/universal/doc_id, Transforms/universal/ededup, Transforms/Other, Other

What happened + What you expected to happen

@dolfim-ibm When running the RAG notebook with the latest release of pdf2parquet, the notebook crashes the first time the model is downloaded. Re-running the cell does not reproduce the error: if the model is already in the .EasyOCR folder, the error does not appear. Details of the error can be found in cell 6 of this notebook: https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.error.ipynb
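
Since the crash only shows up when the model is not yet cached, it can help to check the cache state before a run. The following is a minimal sketch, assuming the cache lives in a `.EasyOCR` folder under the home directory as described above; the helper name `easyocr_cache_populated` is hypothetical and is not part of pdf2parquet or EasyOCR.

```python
from pathlib import Path
from typing import Optional


def easyocr_cache_populated(cache_dir: Optional[Path] = None) -> bool:
    """Return True if the cache directory contains at least one file.

    Assumption: models are cached under ~/.EasyOCR, as described in this issue.
    """
    root = cache_dir if cache_dir is not None else Path.home() / ".EasyOCR"
    if not root.is_dir():
        return False
    # Any regular file anywhere below the cache root counts as "populated".
    return any(p.is_file() for p in root.rglob("*"))


if __name__ == "__main__":
    state = "present" if easyocr_cache_populated() else "absent (first run will download)"
    print(f"EasyOCR model cache: {state}")
```

To reproduce the first-download path deliberately, one option is to move the existing cache folder aside before running the notebook, so this check reports the cache as absent.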
@sujee There are a few changes that need to be made to the notebook for it to work with the new release. Primarily:
replace launcher = RayTransformLauncher(EdedupRayTransformConfiguration())
with launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())
replace launcher = RayTransformLauncher(DocIDRayTransformConfiguration())
with launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())
replace output_df.sample(3)
with output_df.sample(len(output_df))

For a complete reference on the required changes, please see https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.ipynb.

Reproduction script

The requirements.txt file in data-prep-kit/examples/notebooks/rag was modified to temporarily load the various modules from git. Once this issue is resolved or a workaround has been identified, I will create a dev3 release. For now, please use the git repo as follows:

git clone https://github.com/IBM/data-prep-kit.git t2
cd t2/examples/notebooks/rag && git checkout t2
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
./venv/bin/jupyter lab

From the browser, select and run the notebook rag_1A_dpk_process_ray.dev3.ipynb.

cc: @shahrok

Anything else

@dolfim-ibm : I noticed that pdf2parquet depends on docling==1.7.0 and doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed pdf2parquet dependency to docling>=1.7.0

@dolfim-ibm : deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2.

OS

MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

Yes I am willing to submit a PR!


dolfim-ibm · 2024-09-10T07:59:15Z

@dolfim-ibm : deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2.

This was fixed yesterday. A new install should pick up deepsearch-toolkit 1.0.1 directly, which fixes it.

@dolfim-ibm : I noticed that pdf2parquet depends on docling==1.7.0 and doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed pdf2parquet dependency to docling>=1.7.0

Yes, I think it should be good to go with docling>=1.7.0,<2.0.0.
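
To make the two conflicts discussed above concrete, here is a minimal sketch of how the version ranges intersect. It assumes plain numeric versions and uses a simplified comparison, not pip's actual PEP 440 resolver; the helpers `parse`, `satisfies`, and `compatible` are hypothetical, and the candidate version numbers are illustrative rather than a list of real docling releases.

```python
import operator
from typing import List, Tuple

# Simplified comparison operators; real pip resolution follows PEP 440.
_OPS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def parse(version: str) -> Tuple[int, ...]:
    """Turn '1.8.2' into (1, 8, 2). Handles plain numeric releases only."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a comma-separated spec such as '>=1.7.0,<2.0.0'."""
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators first so '>=' is not read as '>'.
        for symbol in ("==", ">=", "<=", ">", "<"):
            if clause.startswith(symbol):
                if not _OPS[symbol](parse(version), parse(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


def compatible(candidates: List[str], *specs: str) -> List[str]:
    """Return the candidates that satisfy every spec at once."""
    return [v for v in candidates if all(satisfies(v, s) for s in specs)]


candidates = ["1.7.0", "1.8.2", "1.9.0", "2.0.0"]  # illustrative versions only

# Original pins: pdf2parquet (==1.7.0) and doc_chunk (>=1.8.2,<2.0.0) share nothing.
print(compatible(candidates, "==1.7.0", ">=1.8.2,<2.0.0"))          # []
# Relaxed pin suggested in this thread: the two ranges now overlap.
print(compatible(candidates, ">=1.7.0,<2.0.0", ">=1.8.2,<2.0.0"))   # ['1.8.2', '1.9.0']
# platformdirs 4.3.2 falls outside the range required by deepsearch-toolkit 1.0.0.
print(satisfies("4.3.2", "<4.0.0,>=3.5.1"))                         # False
```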

dolfim-ibm · 2024-09-11T15:12:39Z

Regarding the models download, I'm able to reproduce it. Can you please try again with the latest version of the branch?

touma-I · 2024-09-11T17:43:36Z

@dolfim-ibm We still have the same problem even when using the latest release. Looking at the changes, I don't see how it would have addressed this problem. Please advise. Thanks

-        num_tables = len(doc.output.tables if doc.output.tables is not None else 0)
-        num_doc_elements = len(
-            doc.output.main_text if doc.output.main_text is not None else 0
-        )
+        num_tables = len(doc.output.tables) if doc.output.tables is not None else 0
+        num_doc_elements = len(doc.output.main_text) if doc.output.main_text is not None else 0
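
For reference, the diff above only moves the conditional outside the `len()` call. The standalone sketch below shows the behavioral difference using plain lists in place of `doc.output.tables`; the function names `count_buggy` and `count_fixed` are hypothetical and are not taken from the pdf2parquet source.

```python
from typing import Optional, Sequence


def count_buggy(items: Optional[Sequence]) -> int:
    # Conditional is inside len(): when items is None this calls len(0).
    return len(items if items is not None else 0)


def count_fixed(items: Optional[Sequence]) -> int:
    # Conditional is outside len(): None short-circuits to 0.
    return len(items) if items is not None else 0


print(count_fixed(["t1", "t2"]))  # 2
print(count_fixed(None))          # 0
print(count_buggy(["t1", "t2"]))  # 2
try:
    count_buggy(None)
except TypeError as exc:
    # len() is not defined for int, so the old form fails on None input.
    print(f"TypeError: {exc}")
```

Both forms agree whenever the value is not None, so this change affects only documents with no tables or no main text and does not touch the model download path.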

sujee · 2024-09-11T22:28:34Z

sujee@08024dc

makes required changes.

Related : #585

dolfim-ibm · 2024-09-20T14:12:37Z

@sujee @touma-I I think this is now resolved, can you please confirm?

sujee · 2024-09-20T17:28:42Z

I have made the necessary changes on my branch. Will submit a PR soon

touma-I added the bug Something isn't working label Sep 10, 2024

daw3rd assigned daw3rd and dolfim-ibm Sep 12, 2024

dolfim-ibm added the fixed Marks an issues as fixed in the dev branch label Sep 20, 2024

dolfim-ibm removed the fixed Marks an issues as fixed in the dev branch label Sep 20, 2024

sujee mentioned this issue Oct 4, 2024

[Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667

Open

2 tasks
