Add support for loading single file CLIPEmbedding models #6813
Summary
We're starting to see fine-tuned CLIPEmbedding models that improve FLUX performance appearing in the wild, for example https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14. However, these fine-tunes are distributed as single "checkpoint"-style files rather than Transformers-compatible folders. This PR adds support for installing and loading these models.
Related Issues / Discussions
There is a problem with this implementation. The CLIP text embedder needs two models: the encoder and the tokenizer. When FLUX support was added to InvokeAI, these two models were grouped together under a single folder, and are treated as two submodels:
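As a rough illustration of the folder-installed case (a sketch only, not the actual InvokeAI loader code; the install path below is hypothetical):

```python
# Sketch: for a folder-installed CLIP embed model, both submodels are loaded
# from the same locally-installed directory.
from transformers import CLIPTextModel, CLIPTokenizer

clip_embed_dir = "/path/to/models/clip-vit-large-patch14"  # hypothetical install path

text_encoder = CLIPTextModel.from_pretrained(clip_embed_dir)  # "text_encoder" submodel
tokenizer = CLIPTokenizer.from_pretrained(clip_embed_dir)     # "tokenizer" submodel
```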
However, the single-file format contains just the text encoder and not the auxiliary files needed for the tokenizer. As a workaround, when a single-file CLIPEmbed model's tokenizer is requested, I call `CLIPTokenizer.from_pretrained()` to download the tokenizer from the `InvokeAI/clip-vit-large-patch14` HF repository. Once downloaded, it is cached in the HuggingFace cache, so subsequent accesses do not require network access. This is preferable to loading the tokenizer from a locally-installed CLIP model because (1) there is no guarantee that such a model has been installed previously; and (2) doing so would be incredibly ugly, requiring the low-level loader to communicate with the high-level model manager. The main downside is that the first time the tokenizer is needed, the backend will hit the network, which is something we are trying to avoid (see PR #6740).
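The workaround is roughly the following (a sketch, not the exact loader code; only the repo id comes from this PR):

```python
# Sketch of the fallback: when only a single-file checkpoint is installed,
# pull the tokenizer from the HF Hub instead of a local folder.
from transformers import CLIPTokenizer

# The first call downloads the tokenizer files over the network; later calls
# are served from the local HuggingFace cache (~/.cache/huggingface by default).
tokenizer = CLIPTokenizer.from_pretrained("InvokeAI/clip-vit-large-patch14")
```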
QA Instructions
Use the model manager tab to install one of the "HF" format CLIPTextModel models located at https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14. Try to render with it. The Tokenizer and TextEncoder should load and run successfully.
Merge Plan
Merge when approved.
Checklist