Add Image Curation Tutorial #254

Open
wants to merge 6 commits into
base: main

@ryantwolf ryantwolf commented Sep 18, 2024

Description

Adds a tutorial showcasing a few of the key image curation features currently offered in NeMo Curator:

  1. Embedding Creation
  2. Aesthetic Classification
  3. Semantic Deduplication
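
For readers new to the third item, the general idea behind semantic deduplication can be sketched in plain Python. This is a toy illustration under stated assumptions, not NeMo Curator's implementation: the function names, the greedy keep-first strategy, and the convention that two items are duplicates when their cosine similarity exceeds `1 - eps` are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def semantic_dedup_sketch(embeddings, eps):
    """Greedily keep an item unless it is too similar to one already kept.

    Hypothetical helper: an item counts as a duplicate when its cosine
    similarity to any kept item is greater than 1 - eps.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        is_duplicate = any(
            cosine_similarity(vec, embeddings[j]) > 1.0 - eps for j in kept
        )
        if not is_duplicate:
            kept.append(i)
    return kept

# Toy 2-D "embeddings": the second vector points almost the same way as the first.
toy_embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept_indices = semantic_dedup_sketch(toy_embeddings, eps=0.01)
print(kept_indices)
```

In this toy run the second vector is dropped as a near-duplicate of the first, while the orthogonal third vector is kept.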

Usage

Run the notebook from start to finish. It will download a sample dataset and perform the operations listed above on it.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@ryantwolf
Collaborator Author

@VibhuJawa I made a few minor modifications to semdedup so that the embedding column name was more flexible. Let me know what you think.

@ryantwolf
Collaborator Author

ryantwolf commented Sep 18, 2024

Also, if anyone has ideas on how to make the modules flexible enough that we don't need the silly conversion between DocumentDataset and ImageTextPairDataset for semdedup, that would be good.

Collaborator

@sarahyurick sarahyurick left a comment

Thanks @ryantwolf this is a really cool tutorial! Generally looks good, I added a bunch of nits.

Review threads:
nemo_curator/modules/semantic_dedup.py (1 thread, resolved)
tutorials/image-curation/image-curation.ipynb (9 threads, outdated, resolved)

@ryantwolf
Collaborator Author

Thanks for the feedback, should be good for another review @sarahyurick

Collaborator

@sarahyurick sarahyurick left a comment

Thanks! The outputs are very nice. I just have 2 tiny nits:

  • At the beginning of the notebook, add punctuation at the end of "In the following notebook, we'll be exploring all of the functionality that NeMo Curator has for image dataset curation. NeMo Curator has a few built-in modules for:".
  • At the end of the notebook, add a period to "Feel free to adjust the epsilon threshold and see what kinds of images are considered duplicates".
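
To make the epsilon remark concrete, here is a small sketch of how the number of pairs flagged as duplicates changes with the threshold. It assumes a pair is flagged when its cosine similarity exceeds `1 - eps`; that convention, the helper names, and the toy vectors are assumptions for illustration and are not taken from the notebook.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def count_duplicate_pairs(embeddings, eps):
    """Count pairs whose cosine similarity is greater than 1 - eps."""
    return sum(
        1
        for a, b in combinations(embeddings, 2)
        if cosine_similarity(a, b) > 1.0 - eps
    )

# Four toy 2-D "embeddings" spread between the x-axis and the y-axis.
toy_embeddings = [[1.0, 0.0], [0.99, 0.14], [0.7, 0.7], [0.0, 1.0]]

# A larger eps is more permissive, so more pairs count as duplicates.
counts = {eps: count_duplicate_pairs(toy_embeddings, eps) for eps in (0.001, 0.05, 0.5)}
for eps, n_pairs in counts.items():
    print(f"eps={eps}: {n_pairs} duplicate pair(s)")
```

A tight threshold flags nothing here, a moderate one flags only the two nearly parallel vectors, and a loose one starts grouping images that are only loosely related.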

Signed-off-by: Ryan Wolf <[email protected]>
@ryantwolf
Collaborator Author

@sarahyurick nits should be addressed

Collaborator

@VibhuJawa VibhuJawa left a comment

Everything looks great to me. I have some minor nits regarding the notebook, but otherwise it looks good.

Collaborator

@VibhuJawa VibhuJawa Sep 19, 2024

Can't leave line-by-line comments on the notebook. Mostly nits.

We should maybe add a table of contents (the code below adds one) so that it is easier for folks to navigate to the section they want.

from IPython.display import Markdown, display

toc = """
# Table of Contents
1. [Download a Sample Dataset](#Download-a-Sample-Dataset)
2. [Install NeMo Curator](#Install-NeMo-Curator)
3. [Create CLIP Image Embeddings](#Create-CLIP-Image-Embeddings)
4. [Aesthetic Classifier](#Aesthetic-Classifier)
5. [Semantic Deduplication](#Semantic-Deduplication)
"""

display(Markdown(toc))

  1. We should probably add a small heading with details about creating the Dask cluster, and a line or two explaining it.

Something like the code below should be its own subsection (probably after installing Curator):

from nemo_curator import get_client
client = get_client(cluster_type="gpu")

  2. For semdedup, we should probably add a comment about:
    1. n_clusters, for example: "Increase 'n_clusters' in ClusteringModel for more efficient processing of large datasets."
    2. max_iters, for real-world large-scale datasets.
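
As a rough illustration of why `n_clusters` matters at scale, the sketch below counts pairwise comparisons under two assumptions that are not stated in this thread: that duplicates are searched for only within each cluster, and that items are spread evenly across clusters. The function name `pairwise_comparisons` is hypothetical.

```python
def pairwise_comparisons(n_items: int, n_clusters: int) -> int:
    """Number of within-cluster pairs when items are split as evenly as possible."""
    base, extra = divmod(n_items, n_clusters)
    # `extra` clusters hold one more item than the rest.
    sizes = [base + 1] * extra + [base] * (n_clusters - extra)
    return sum(size * (size - 1) // 2 for size in sizes)

# Hypothetical dataset of 10,000 embeddings.
for n_clusters in (1, 10, 100):
    print(n_clusters, pairwise_comparisons(10_000, n_clusters))
```

Under these assumptions the comparison count falls roughly in proportion to the number of clusters. Real clusters are rarely balanced, so treat this as an order-of-magnitude guide rather than a prediction.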

Collaborator

@sarahyurick sarahyurick left a comment

LGTM. In favor of Vibhu's suggestions.
