Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve vectorizer type support #291

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

tylerhutcherson
Copy link
Collaborator

@tylerhutcherson tylerhutcherson commented Feb 27, 2025

Changes Made

  1. Expanded Type Support:

    • Updated return type signatures across all vectorizers to properly reflect the ability to return either data lists (List[float]) or binary buffers (bytes)
    • Added special handling for Cohere's integer embedding types (List[int])
  2. Standardized Interface:

    • Uniform type annotations and docstrings across all vectorizer implementations
    • Consistent default batch sizes (10) for better predictability
  3. Improved Provider-Specific Support:

    • Enhanced kwargs forwarding to allow passing provider-specific parameters
    • Better warnings for deprecated parameters (like Cohere's embedding_types)
  4. Fixed Type Checking:

    • Added strategic type ignores to resolve MyPy errors
    • Made minimal changes to consumer code to handle the expanded return types

Motivation

These changes create a more consistent and flexible vectorizer interface that:

  • Accurately represents what the methods can return
  • Accommodates provider-specific features (like Cohere's integer embeddings)
  • Provides clearer documentation for users
  • Maintains backward compatibility

Future Improvements

For future consideration:

  • Introduce helper methods (like embed_as_list()) that guarantee specific return types when needed
  • Add more robust type conversion in consumer code that relies on specific types
  • Develop a cleaner separation between the base vectorizer interface and provider-specific extensions
  • Consider a more structured approach to provider-specific parameters

@tylerhutcherson tylerhutcherson added the enhancement New feature or request label Feb 27, 2025
@tylerhutcherson tylerhutcherson changed the title Add kwargs support to all vectorizer embed methods Improve vectorizer type support Feb 27, 2025
@tylerhutcherson tylerhutcherson marked this pull request as ready for review February 27, 2025 20:47
@tylerhutcherson
Copy link
Collaborator Author

Looking for feedback on the approach here, team. Let me know what you think. This doesn't break the interface or user experience, but enables us to leverage the cohere model for int8 embeddings. It also exposes maybe some gaps in the current broader implementation that we should review in prep for a 1.x.x release later this year.

@abrookins @rbs333 @bsbodden @justin-cechmanek

@justin-cechmanek
Copy link
Collaborator

I like the standardization, and better support for passing through kwargs. No critiques from me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants