update
wondervictor authored Feb 4, 2025
1 parent 7dc8020 commit b93669f
Showing 2 changed files with 19 additions and 3 deletions.
Binary file added assets/image_prompt.png
22 changes: 19 additions & 3 deletions docs/update_20250123.md
@@ -1,7 +1,7 @@
## YOLO-World-V2.1 Update Blog

**Contributors:** [Tianheng Cheng](), [Haokun Lin](), and [Yixiao Ge]().\
-**Date:** 2025.02.
+**Date:** 2025.02.05

### Summary
Hey guys, long time no see. Recently, we've made a series of updates to YOLO-World, including improvements to the pre-trained models, and we've also fully released the training code for YOLO-World Image Prompts. We will continue to optimize YOLO-World in the future.
@@ -30,12 +30,28 @@ Currently, users still need to consider padding (`" "`) in the input vocabulary.
In previous versions, we used named entity extraction to annotate image-text data, such as CC3M. This approach introduced considerable noise and resulted in sparse image annotations, leading to low image utilization. To address this, we employed RAM++ for image tag annotation and combined [RAM++](https://github.com/xinyu1205/recognize-anything) annotations with extracted named entities to form the annotation vocabulary. Additionally, we used the YOLO-World-X-v2 model for annotation, generating an equal amount of 250k data samples.
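
As a rough sketch of how such an annotation vocabulary can be assembled, the snippet below merges RAM++ tags with named entities extracted from the caption into a deduplicated per-image vocabulary. The function and variable names are illustrative, not the exact pipeline used.

```python
# Illustrative sketch: merge RAM++ tags with caption-derived named entities
# into one per-image annotation vocabulary. Names and limits are hypothetical.
def build_vocabulary(ram_tags, named_entities, max_terms=80):
    """Union of RAM++ tags and named entities, deduplicated, order-preserving."""
    seen, vocab = set(), []
    for term in list(ram_tags) + list(named_entities):
        term = term.strip().lower()
        if term and term not in seen:
            seen.add(term)
            vocab.append(term)
    return vocab[:max_terms]

# Example with one CC3M-style sample
ram_tags = ["dog", "grass", "frisbee"]        # from RAM++ tagging
entities = ["golden retriever", "frisbee"]    # from caption named-entity extraction
print(build_vocabulary(ram_tags, entities))
# ['dog', 'grass', 'frisbee', 'golden retriever']
```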


-**3. YOLO-World-Image: Visual Prompts**
+**3. YOLO-World-Image: Image Prompts**

*Image Prompt:*
We've noticed that many users are very interested in using image prompts with YOLO-World, and we previously provided a [preview version](https://huggingface.co/spaces/wondervictor/YOLO-World-Image). In this update, we provide a detailed introduction to the Image Prompt model and its training process.

*Image Prompt Adapter:*
YOLO-World uses CLIP-Text as the text encoder to encode text prompts into text embeddings. Since CLIP's pre-training has aligned the text and visual encoders, it naturally follows that we can directly use CLIP's visual encoder to encode image prompts into corresponding image embeddings, replacing text embeddings to achieve object detection with image prompts.
After obtaining the image embeddings, all subsequent steps remain identical to those in YOLO-World with text embeddings, including the text-to-image T-CSPLayer.
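
A minimal sketch of this substitution, using the `open_clip` package for illustration; the `detector` / `class_embeddings` interface in the last lines is a hypothetical stand-in for the YOLO-World head that normally consumes CLIP text embeddings.

```python
# Sketch (not the exact YOLO-World API): encode an image prompt with CLIP's
# visual encoder and use it where the text embeddings would normally go.
import torch
import open_clip
from PIL import Image

clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
clip_model.eval()

# Crop of the object we want to detect, used as the "prompt".
prompt_crop = Image.open("query_crop.jpg").convert("RGB")
with torch.no_grad():
    image_embed = clip_model.encode_image(preprocess(prompt_crop)[None])
    image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)

# Hypothetical detector call: the image embedding simply takes the place of
# one text embedding per class in the YOLO-World head.
# results = detector(image, class_embeddings=image_embed)
```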

*Prompt Adapter:* While this approach is feasible, its performance in practice is mediocre: CLIP aligns its visual and textual embeddings only at the contrastive level, so direct substitution is ineffective.
To this end, we introduced a simple adapter, consisting of a straightforward MLP, to further align the visual prompt embeddings with the text embeddings, as shown in the figure below.

<div align="center">
<img width=50% src="../assets/image_prompt.png">
</div>
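
A sketch of what such an adapter might look like in PyTorch; the layer widths and the residual connection are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class ImagePromptAdapter(nn.Module):
    """Hypothetical MLP adapter: maps CLIP image embeddings closer to the
    text-embedding space consumed by the YOLO-World head."""
    def __init__(self, embed_dim=512, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, image_embeds):
        # Residual connection keeps the adapted embedding close to the CLIP one.
        adapted = image_embeds + self.mlp(image_embeds)
        return adapted / adapted.norm(dim=-1, keepdim=True)

# adapter = ImagePromptAdapter()
# class_embeds = adapter(image_embed)  # drop-in replacement for text embeddings
```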


*Training:* Taking the COCO dataset as an example, for each existing category in each image, we randomly select a **query bbox** and crop out the corresponding image region. We then use the CLIP Image Encoder to extract the corresponding **image embeddings**.
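
A rough sketch of that sampling step on COCO-style annotations, assuming an open_clip-style image encoder; the box format and helper names are assumptions for illustration.

```python
import random
import torch
from PIL import Image

def sample_query_embeddings(image_path, annotations, clip_model, preprocess):
    """For each category present in the image, pick one ground-truth box at
    random (the "query bbox"), crop it, and encode it with the CLIP image
    encoder. `annotations` is a list of dicts with 'category_id' and
    'bbox' = [x, y, w, h] (COCO style)."""
    image = Image.open(image_path).convert("RGB")
    boxes_per_cat = {}
    for ann in annotations:
        boxes_per_cat.setdefault(ann["category_id"], []).append(ann["bbox"])

    embeddings = {}
    with torch.no_grad():
        for cat_id, boxes in boxes_per_cat.items():
            x, y, w, h = random.choice(boxes)            # the query bbox
            crop = image.crop((x, y, x + w, y + h))      # cropped image region
            emb = clip_model.encode_image(preprocess(crop)[None])
            embeddings[cat_id] = emb / emb.norm(dim=-1, keepdim=True)
    return embeddings  # per-category image embeddings for this training sample
```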


### Evaluation Results
-We evaluate all YOLO-World models on LVIS, LVIS-mini, and COCO in the zero-shot manner.
+We evaluate all YOLO-World-V2.1 models on LVIS, LVIS-mini, and COCO in a zero-shot manner.

<table>
<tr>
