Add Image Curation Tutorial #254
Signed-off-by: Ryan Wolf <[email protected]>
@VibhuJawa I made a few minor modifications to semdedup so that the embedding column name was more flexible. Let me know what you think.
Also, if anyone has ideas on how to make the modules more flexible such that we don't need to do the silly conversion between
Thanks @ryantwolf this is a really cool tutorial! Generally looks good, I added a bunch of nits.
Thanks for the feedback, should be good for another review @sarahyurick
Thanks! The outputs are very nice. I just have 2 tiny nits:
- At the beginning of the notebook, add punctuation at the end of "In the following notebook, we'll be exploring all of the functionality that NeMo Curator has for image dataset curation. NeMo Curator has a few built-in modules for:".
- At the end of the notebook, add a period to "Feel free to adjust the epsilon threshold and see what kinds of images are considered duplicates."
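For readers unfamiliar with the epsilon threshold quoted above, here is a minimal sketch of the general idea behind threshold-based semantic deduplication: an item is flagged as a duplicate when its cosine similarity to an already kept item is at least `1 - eps`. The helper names (`cosine_similarity`, `find_duplicates`) and the toy embeddings are hypothetical and chosen for illustration only; this is not NeMo Curator's implementation, which may differ in details such as how it picks which item of a pair to keep.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_duplicates(embeddings, eps):
    """Return indices flagged as duplicates of an earlier, kept item.

    Assumption for this sketch: an item is a duplicate when its cosine
    similarity to any kept item is at least 1 - eps.
    """
    kept, duplicates = [], []
    for i, emb in enumerate(embeddings):
        if any(cosine_similarity(emb, embeddings[j]) >= 1.0 - eps for j in kept):
            duplicates.append(i)
        else:
            kept.append(i)
    return duplicates


# Toy 2-D "embeddings": the second vector points almost the same way as the first.
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.14],
    [0.0, 1.0],
]

strict = find_duplicates(toy_embeddings, eps=0.001)  # tight threshold
loose = find_duplicates(toy_embeddings, eps=0.05)    # more permissive threshold
print("eps=0.001 ->", strict)
print("eps=0.05  ->", loose)
```

A larger epsilon treats more loosely related items as duplicates, which is why adjusting it changes which images get removed.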
@sarahyurick nits should be addressed
Everything looks great to me. I have some minor nits regarding the notebook.
Can't leave line-by-line comments on the notebook. Mostly nits.
- We should maybe add a table of contents (the snippet below adds one) so that it is easier for folks to navigate to the section they want.
from IPython.display import Markdown, display
toc = """
# Table of Contents
1. [Download a Sample Dataset](#Download-a-Sample-Dataset)
2. [Install NeMo Curator](#Install-NeMo-Curator)
3. [Create CLIP Image Embeddings](#Create-CLIP-Image-Embeddings)
4. [Aesthetic Classifier](#Aesthetic-Classifier)
5. [Semantic Deduplication](#Semantic-Deduplication)
"""
display(Markdown(toc))
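If the hand-written links become tedious to maintain, the same Markdown could be generated from the section titles. This is only a sketch: `build_toc` is a hypothetical helper, and it assumes the anchor convention used in the links above, where spaces in a heading are replaced with hyphens.

```python
def build_toc(titles):
    """Build a numbered Markdown table of contents from section titles."""
    lines = ["# Table of Contents"]
    for number, title in enumerate(titles, start=1):
        # Assumed anchor convention: heading text with spaces replaced by hyphens.
        anchor = title.replace(" ", "-")
        lines.append(f"{number}. [{title}](#{anchor})")
    return "\n".join(lines)


section_titles = [
    "Download a Sample Dataset",
    "Install NeMo Curator",
    "Create CLIP Image Embeddings",
    "Aesthetic Classifier",
    "Semantic Deduplication",
]

generated_toc = build_toc(section_titles)
print(generated_toc)
```

In a notebook, the resulting string could be passed to `display(Markdown(...))` exactly as in the snippet above.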
- We should probably add a small heading with details about creating the Dask cluster, plus a line or two explaining it. The snippet below should be its own subsection (probably after installing Curator):
from nemo_curator import get_client
client = get_client(cluster_type="gpu")
- For semdedup, we should probably add a comment about:
1. `n_clusters`, for example: "Increase 'n_clusters' in ClusteringModel for more efficient processing of large datasets."
2. `max_iters`, and how to set it for real-world, large-scale datasets.
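To make the two knobs concrete, here is a toy, pure-Python k-means sketch. It is not NeMo Curator's `ClusteringModel`; the function names, the deterministic initialization, and the sample points are assumptions made for illustration, and the parameter names simply mirror the ones in the comment above. It shows that more clusters mean fewer within-cluster pairwise comparisons for the later deduplication step, and that too few iterations can leave points in the wrong cluster.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, n_clusters, max_iters):
    """Toy k-means; returns one cluster index per point."""
    # Assumption for the sketch: initialize centroids with the first n_clusters points.
    centroids = [list(p) for p in points[:n_clusters]]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(n_clusters), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged before reaching max_iters
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


def pairwise_comparisons(assignments):
    """Number of within-cluster pairs a dedup step would have to compare."""
    sizes = {}
    for a in assignments:
        sizes[a] = sizes.get(a, 0) + 1
    return sum(n * (n - 1) // 2 for n in sizes.values())


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

one_cluster = kmeans(points, n_clusters=1, max_iters=10)
two_clusters = kmeans(points, n_clusters=2, max_iters=10)
cut_short = kmeans(points, n_clusters=2, max_iters=1)

print("n_clusters=1:", pairwise_comparisons(one_cluster), "comparisons")
print("n_clusters=2:", pairwise_comparisons(two_clusters), "comparisons")
print("n_clusters=2, max_iters=1:", cut_short)
```

On this toy data, stopping after a single iteration leaves one point grouped with the far-away cluster, which is the kind of effect a notebook comment about iteration counts could warn about.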
LGTM. In favor of Vibhu's suggestions.
Description
Adds a tutorial showcasing a few of the key image curation features currently offered in NeMo Curator.
Usage
Run the notebook from start to finish. It will download a sample dataset and perform the operations listed above on it.
Checklist