[Bug] possible regression on ededupe code in release dev3 #585
Comments
@sujee can you reproduce this problem in ededup without first running the chunker, that is, with ededup all by itself? The chunker does seem to produce different results in the two cases above, so is that somehow the real problem?
Another example, working on simpler PDFs, is here: https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb
After reviewing with @blublinsky we found out that
We have two solutions
If nobody objects, we can go with both solutions: 1) should easily unblock you, and 2) will also make sure we don't fall into the same trap in the future.
I still think that the second solution is the better option. From the user's point of view, Doc_id is a row identifier, and I would prefer to keep it that way to avoid confusion in the future.
This seems like it could be a problem for any transform that does splitting of rows. Would another solution be to run the doc_id transform just prior to ededup and make this a general recommendation when using e/fdedup?
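The idea in the comment above (re-assign row identifiers after any row-splitting step, then deduplicate) can be sketched in plain Python. This is only an illustration of the ordering, not the data-prep-kit API: the helper names `assign_row_ids` and `exact_dedup` and the column names `row_id`, `source_doc`, and `contents` are all assumptions made up for this example.

```python
import hashlib
import uuid


def assign_row_ids(rows, id_column="row_id"):
    """Return copies of the rows, each with a fresh unique identifier.

    Hypothetical stand-in for running an id-assignment step after chunking.
    """
    return [{**row, id_column: uuid.uuid4().hex} for row in rows]


def exact_dedup(rows, text_column="contents"):
    """Keep the first row seen for each distinct value of text_column."""
    seen = set()
    kept = []
    for row in rows:
        digest = hashlib.sha256(row[text_column].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept


# Four chunks produced by splitting two documents; one chunk text repeats.
chunks = [
    {"source_doc": "a.pdf", "contents": "intro"},
    {"source_doc": "a.pdf", "contents": "methods"},
    {"source_doc": "b.pdf", "contents": "intro"},
    {"source_doc": "b.pdf", "contents": "results"},
]

chunks = assign_row_ids(chunks)   # every chunk now has its own identifier
deduped = exact_dedup(chunks)     # deduplicate on the chunk text itself
print(len(chunks), len(deduped))
```

Because the identifier is assigned per row after splitting and the hash is taken over the chunk text, only the repeated chunk is dropped and the remaining chunks survive as separate rows.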
For me the bigger issue was that the doc-id column was misleading.
@dolfim-ibm @blublinsky thanks for investigating. Is (1) something I can do entirely within my notebook? If so, I will try that while we put the long-term solution (2) in place. Some sample code for (1) would be very helpful. Thanks.
I was able to use method (1) to get ededupe to work as expected. Thanks, everyone! :-)
Great.
Search before asking
Component
Transforms/universal/tokenization
What happened + What you expected to happen
In dev1 release (expected behavior)
https://github.com/sujee/data-prep-kit/blob/rag-example1/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb
Step-3: Chunking
output
Step-4 : EDedupe
Output
In dev3 release (incorrect behavior)
https://github.com/sujee/data-prep-kit/blob/rag-example1-dev/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb
Step-3: Chunking
output
Step-4: EDedupe
output
Note how the resulting number of chunks is only 2.
This is the problem.
The result is not the chunks but the documents!
This in turn breaks vector search and RAG responses.
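A quick sanity check can make this symptom visible before it reaches vector search. The sketch below is a hypothetical helper, not part of data-prep-kit: it assumes each row is a dict with a `source_doc` column naming the originating document, and it flags a dedup result whose row count has dropped to the number of source documents or fewer.

```python
def summarize_dedup(before, after, doc_column="source_doc"):
    """Compare row counts around a dedup step (column name is an assumption)."""
    n_docs = len({row[doc_column] for row in before})
    return {
        "rows_before": len(before),
        "rows_after": len(after),
        "documents": n_docs,
        # Suspicious: there were more chunks than documents going in, but no
        # more rows than documents coming out.
        "suspicious": len(after) <= n_docs < len(before),
    }


# Illustrative data only: four chunks from two documents, collapsed to two rows.
rows_before = [
    {"source_doc": "a.pdf", "contents": "chunk 1"},
    {"source_doc": "a.pdf", "contents": "chunk 2"},
    {"source_doc": "b.pdf", "contents": "chunk 3"},
    {"source_doc": "b.pdf", "contents": "chunk 4"},
]
rows_after = [rows_before[0], rows_before[2]]

report = summarize_dedup(rows_before, rows_after)
print(report)
```

A `suspicious` value of `True` does not prove a bug, since a corpus can contain many genuinely duplicated chunks, but it is a cheap signal that the dedup step may be operating per document rather than per chunk.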
Reproduction script
dev1 (expected behavior) : https://github.com/sujee/data-prep-kit/blob/rag-example1/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb
dev3 (incorrect behavior) : https://github.com/sujee/data-prep-kit/blob/rag-example1-dev/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb
Anything else
No response
OS
Ubuntu
Python
3.11.x
Are you willing to submit a PR?