Why optimize the full model? #7
If I understand correctly, all the weights of the CLIP text encoder are optimized, which naturally has a non-negligible computational cost.
Why was this chosen rather than training only part of the model?
My intuition would be to optimize just the last ~2 layers of the CLIP encoder.
Were there any experiments in this direction?

Comments

Hi @jorgemcgomes!

Thanks for the reply @vicgalle. I see why optimizing just the final projection could fail. After all, it's just a simple linear projection. I've also tried doing that (in a different context) and reached the same conclusion: very little can be done by changing only the final CLIP projection. But in my experience, you can change quite a lot by optimizing only the last few layers of the CLIP model, plus everything that comes after them (layer norm, projection). The number of optimization steps might need adjusting, but each step would be a lot faster. Regarding the computational cost, I think the most significant part is VRAM, not necessarily time: the optimizer states and gradients for the full CLIP text model take several GB of VRAM.
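For concreteness, here is a minimal sketch of that partial-unfreezing idea, assuming the Hugging Face `transformers` CLIP text tower (`CLIPTextModelWithProjection`); the checkpoint name, number of unfrozen layers, and learning rate are illustrative assumptions, not values taken from this repo:

```python
import torch
from transformers import CLIPTextModelWithProjection

# Illustrative setup: Stable Diffusion v1's text encoder checkpoint; adjust as needed.
model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

# Freeze everything first.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze only the last N transformer layers, the final layer norm,
# and the text projection.
N = 2
trainable_modules = [
    *model.text_model.encoder.layers[-N:],
    model.text_model.final_layer_norm,
    model.text_projection,
]
for module in trainable_modules:
    for p in module.parameters():
        p.requires_grad = True

# Only the unfrozen parameters get gradients and optimizer states,
# which is where most of the VRAM savings come from.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```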
I am also curious about the VRAM consumption of optimizing the CLIP text model at runtime.
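For a rough sense of scale, a back-of-envelope sketch, assuming fp32 training with `torch.optim.Adam` (which keeps two moment buffers per parameter) and the `openai/clip-vit-large-patch14` text encoder; whether that matches the exact setup here is an assumption:

```python
from transformers import CLIPTextModel

model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
n_params = sum(p.numel() for p in model.parameters())  # ~123M for the ViT-L/14 text tower

# fp32 weights + gradients + Adam's two moment buffers: four 4-byte copies per parameter.
bytes_needed = n_params * 4 * 4
print(f"{n_params / 1e6:.0f}M params -> ~{bytes_needed / 1e9:.1f} GB before counting activations")
```

That works out to roughly 2 GB just for weights, gradients, and optimizer states; activations saved for the backward pass come on top of that, so peak usage during a run will be higher.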