[Bug] possible regression on ededupe code in release dev3 #585

sujee · 2024-09-11T22:20:54Z

Search before asking

I searched the issues and found no similar issues.

Component

Transforms/universal/tokenization

What happened + What you expected to happen

In dev1 release (expected behavior)

https://github.com/sujee/data-prep-kit/blob/rag-example1/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb

Step-3: Chunking

output

Files processed : 3
Chunks created : 2,042
Input data dimensions (rows x columns)=  (3, 12)
Output data dimensions (rows x columns)=  (2042, 15)

Step-4 : EDedupe

Output

Input data dimensions (rows x columns)=  (2042, 15)
Output data dimensions (rows x columns)=  (1324, 15)
Input chunks before exact dedupe : 2,042
Output chunks after exact dedupe : 1,324
Duplicate chunks removed :   718

In Dev-3 (incorrect behavior)

https://github.com/sujee/data-prep-kit/blob/rag-example1-dev/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb

Step-3: Chunking

output

Files processed : 3
Chunks created : 1,973
Input data dimensions (rows x columns)=  (3, 12)
Output data dimensions (rows x columns)=  (1973, 15)

Step-4: EDedupe

output

Input data dimensions (rows x columns)=  (1973, 15)
Output data dimensions (rows x columns)=  (2, 16)
Input chunks before exact dedupe : 1,973
Output chunks after exact dedupe : 2
Duplicate chunks removed :   1971

Note that the resulting number of chunks is only 2.

This is the problem: the result is not chunks but documents!

This in turn breaks vector search and RAG responses.

Reproduction script

dev1 (expected behavior) : https://github.com/sujee/data-prep-kit/blob/rag-example1/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb

dev3 (incorrect behavior) : https://github.com/sujee/data-prep-kit/blob/rag-example1-dev/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

Yes I am willing to submit a PR!


daw3rd · 2024-09-12T17:31:06Z

@sujee can you reproduce this problem in ededup without first running the chunker, that is, with ededup all by itself? The chunker does seem to produce different results in the two cases above, so is that somehow the real problem?

sujee · 2024-09-13T15:20:57Z

Another example, working on simpler PDFs, is here: https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb

dolfim-ibm · 2024-09-18T08:36:10Z

After reviewing with @blublinsky, we found that:

What you observe happens because ededupe uses the doc_id column as its first criterion.
The doc_chunk transform keeps the source doc_id, i.e. all chunks belonging to the same source document have the same doc_id.
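
The collapse described above can be reproduced with a toy model. The sketch below is plain Python and is not DPK's ededupe implementation; `dedupe_by_key` is a hypothetical helper that keeps the first row seen for each key value, assumed here as a stand-in for "using doc_id as a first criterion". The sample rows are made up for illustration.

```python
def dedupe_by_key(rows, key):
    """Keep the first row for each distinct value of `key` (hypothetical helper)."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

# Two source documents, split into chunks that all inherit the source doc_id.
chunks = [
    {"doc_id": "doc-A", "contents": "intro"},
    {"doc_id": "doc-A", "contents": "methods"},
    {"doc_id": "doc-A", "contents": "results"},
    {"doc_id": "doc-B", "contents": "intro"},
    {"doc_id": "doc-B", "contents": "summary"},
]

by_doc_id = dedupe_by_key(chunks, "doc_id")      # keyed on the shared id
by_contents = dedupe_by_key(chunks, "contents")  # keyed on the chunk text

print(len(by_doc_id))    # one row per source document
print(len(by_contents))  # one row per distinct chunk text
```

Keyed on the shared doc_id, the toy dedupe leaves one row per source document, which mirrors the 1,973 to 2 drop reported above.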

We have two solutions:

1) In the current notebook, after doc_chunk, run the doc_id transform pointing it to a different column, i.e. hash_column="chunk_id"; ededupe can then be configured with doc_id_column="chunk_id".
2) Update the doc_chunk transform so that it renames doc_id to source_doc_id. This way doc_id can still be used as a "row identifier".

If nobody objects, we can go with both solutions. 1) should quickly unblock you; 2) will also make sure we don't fall into the same trap in the future.
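
As a rough illustration of the idea behind solution 1), the sketch below gives every chunk its own identifier derived from its text and then dedupes on that identifier. It is plain Python, not the DPK doc_id or ededupe API: `add_chunk_id` and `dedupe_by_key` are hypothetical helpers, and SHA-256 over a `contents` column is an assumption made for the example. Only the names hash_column, doc_id_column and chunk_id come from the thread.

```python
import hashlib

def add_chunk_id(rows, source_column="contents", id_column="chunk_id"):
    """Return copies of rows with a SHA-256 hex digest of the chunk text in id_column."""
    out = []
    for row in rows:
        digest = hashlib.sha256(row[source_column].encode("utf-8")).hexdigest()
        out.append({**row, id_column: digest})
    return out

def dedupe_by_key(rows, key):
    """Keep the first row for each distinct value of `key` (hypothetical helper)."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

# Chunks still carry the source doc_id; one chunk of doc-A is repeated verbatim.
chunks = [
    {"doc_id": "doc-A", "contents": "intro"},
    {"doc_id": "doc-A", "contents": "methods"},
    {"doc_id": "doc-A", "contents": "methods"},
    {"doc_id": "doc-B", "contents": "summary"},
]

with_ids = add_chunk_id(chunks)                 # stand-in for the doc_id step
deduped = dedupe_by_key(with_ids, "chunk_id")   # stand-in for ededupe on chunk_id

print([row["contents"] for row in deduped])
```

Because the identifier is now per chunk rather than per document, only the verbatim repeat is dropped and the source doc_id stays available on every surviving row.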

blublinsky · 2024-09-18T09:41:27Z

I still think that the second solution is the better option. From the user's point of view, doc_id is a row identifier, and I would prefer to keep it that way to avoid confusion in the future.

daw3rd · 2024-09-18T12:35:56Z

This seems like it could be a problem for any transform that splits rows. Would another solution be to run the doc_id transform just prior to ededup, and make this a general recommendation when using e/fdedup?
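
One way to catch this situation after any row-splitting step is to check whether the identifier column is still unique per row before deduplicating. The sketch below is plain Python with a hypothetical helper, `find_shared_ids`; it is not part of DPK and the sample rows are invented.

```python
from collections import Counter

def find_shared_ids(rows, id_column="doc_id"):
    """Return {id_value: row_count} for every id that appears on more than one row."""
    counts = Counter(row[id_column] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

# Rows as they might look after a splitting transform that keeps the source id.
chunks = [
    {"doc_id": "doc-A", "contents": "intro"},
    {"doc_id": "doc-A", "contents": "methods"},
    {"doc_id": "doc-A", "contents": "results"},
    {"doc_id": "doc-B", "contents": "intro"},
    {"doc_id": "doc-B", "contents": "summary"},
]

shared = find_shared_ids(chunks)
if shared:
    print(f"id column is not unique per row: {shared}")
```

A non-empty result suggests the ids should be regenerated per row before running a dedupe keyed on that column.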

blublinsky · 2024-09-18T12:48:34Z

For me the bigger issue was that the doc_id column was misleading.

sujee · 2024-09-18T15:17:02Z

@dolfim-ibm @blublinsky thanks for investigating.

Is (1) something I can do entirely within my notebook? If so, I will try that while we put in the long-term solution (2).

If I can get some sample code for (1), that would be very helpful. Thanks.


sujee · 2024-09-19T05:22:34Z

I was able to use method (1) to get ededupe working as expected. Thanks, everyone! :-)

blublinsky · 2024-09-19T07:26:42Z

Great.
Thanks @dolfim-ibm for fixing this so quickly.
Can we please close this one?

sujee added the bug label Sep 11, 2024

sujee mentioned this issue Sep 11, 2024: [Bug] Testing Rag notebook with latest release of pdf2Parquet, eDedup and DocID #583 (open)

daw3rd assigned daw3rd and dolfim-ibm Sep 12, 2024

dolfim-ibm mentioned this issue Sep 18, 2024: doc_id and source_doc_id params in doc_chunk #598 (merged)

dolfim-ibm added the fixed label Sep 20, 2024

sujee closed this as completed Sep 25, 2024
