Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] possible regression on ededupe code in release dev3 #585

Closed
1 of 2 tasks
sujee opened this issue Sep 11, 2024 · 9 comments
Closed
1 of 2 tasks

[Bug] possible regression on ededupe code in release dev3 #585

sujee opened this issue Sep 11, 2024 · 9 comments
Assignees
Labels
bug Something isn't working fixed Marks an issues as fixed in the dev branch

Comments

@sujee
Copy link
Contributor

sujee commented Sep 11, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/universal/tokenization

What happened + What you expected to happen

In dev1 release (expected behavior)

https://github.com/sujee/data-prep-kit/blob/rag-example1/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb

Step-3: Chunking

output

Files processed : 3
Chunks created : 2,042
Input data dimensions (rows x columns)=  (3, 12)
Output data dimensions (rows x columns)=  (2042, 15)

Step-4 : EDedupe

Output

Input data dimensions (rows x columns)=  (2042, 15)
Output data dimensions (rows x columns)=  (1324, 15)
Input chunks before exact dedupe : 2,042
Output chunks after exact dedupe : 1,324
Duplicate chunks removed :   718

In Dev-3 (incorrect behavior)

https://github.com/sujee/data-prep-kit/blob/rag-example1-dev/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb

Step-3: Chunking

output

Files processed : 3
Chunks created : 1,973
Input data dimensions (rows x columns)=  (3, 12)
Output data dimensions (rows x columns)=  (1973, 15)

Step-4: EDedupe

output

Input data dimensions (rows x columns)=  (1973, 15)
Output data dimensions (rows x columns)=  (2, 16)
Input chunks before exact dedupe : 1,973
Output chunks after exact dedupe : 2
Duplicate chunks removed :   1971

Note: how resulting number of chunks is only 2

This is the problem.

The result is not chunks but the documents!

This in turn breaks vector search and RAG responses.

Reproduction script

dev1 (expected behavior) : https://github.com/sujee/data-prep-kit/blob/rag-example1/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb

dev3 (incorrect behavior) : https://github.com/sujee/data-prep-kit/blob/rag-example1-dev/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@daw3rd
Copy link
Member

daw3rd commented Sep 12, 2024

@sujee can you reproduce this problem in ededup w/o first running the chunker. That is, ededup all by itself. The chunker does seem to produce different results in the two cases above, so is that somehow the real problem?

@sujee
Copy link
Contributor Author

sujee commented Sep 13, 2024

another example working on simpler PDFs is here : https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb

@dolfim-ibm
Copy link
Member

After reviewing with @blublinsky we found out that

  • what you observe is because ededupe is using the doc_id column as a first criteria
  • the doc_chunk transform is keeping the source doc_id, i.e. all chunks belonging to the same source document will have the same doc_id

We have two solutions

  1. In the current notebook, after doc_chunk we could run the doc_id transform pointing it to a different column, i.e. hash_column="chunk_id", then ededupe can be configured with doc_id_column="chunk_id".
  2. We update the doc_chunk transform such that it renames doc_id to source_doc_id. This way the doc_id can still be used as a "row identifier".

If nobody objects, we can go with both solutions. 1) should easily unlock you, 2) will also make sure we don't fall in the same trap in the future.

@blublinsky
Copy link
Collaborator

I still think that the second solution is a better option. From user's point of view Doc_id is a row identifier and I will preffere to keep it this way to avoid confusion in the future

@daw3rd
Copy link
Member

daw3rd commented Sep 18, 2024

this seems like it could be a problem from any transform that does splitting of rows. Would another solution be to run the doc_id transform just prior to ededup and make this a general recommendation when using e/fdedup?

@blublinsky
Copy link
Collaborator

For me the bigger issue was that doc-id column was misleading

@sujee
Copy link
Contributor Author

sujee commented Sep 18, 2024

@dolfim-ibm @blublinsky thanks for investigating.

Is (1) I can do within my notebook 100% ? If so, I will try that, while we put in a long term solution (2).

If I can get some sample code for (1) that would be very helpful. thx

We have two solutions

  1. In the current notebook, after doc_chunk we could run the doc_id transform pointing it to a different column, i.e. hash_column="chunk_id", then ededupe can be configured with doc_id_column="chunk_id".
  2. We update the doc_chunk transform such that it renames doc_id to source_doc_id. This way the doc_id can still be used as a "row identifier".

If nobody objects, we can go with both solutions. 1) should easily unlock you, 2) will also make sure we don't fall in the same trap in the future.

@sujee
Copy link
Contributor Author

sujee commented Sep 19, 2024

I was able to use method (1) get ededupe work as expected. Thanks every one! :-)

@blublinsky
Copy link
Collaborator

Great.
Thanks @dolfim-ibm for fixing this so quickly
Can we, please, close this one.

@dolfim-ibm dolfim-ibm added the fixed Marks an issues as fixed in the dev branch label Sep 20, 2024
@sujee sujee closed this as completed Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working fixed Marks an issues as fixed in the dev branch
Projects
None yet
Development

No branches or pull requests

4 participants