[TIP2023] The code of “Plug-and-Play Regulators for Image-Text Matching”

RCAR

PyTorch implementation of the TIP 2023 paper “Plug-and-Play Regulators for Image-Text Matching”.

It is built on top of SGRAF, GPO, and Awesome_Matching.

If you run into any problems, please contact me at [email protected] ([email protected] is deprecated).

Introduction

The framework of RCAR:

The reported results (using GloVe embeddings or BERT can give better results):

Dataset     Module   Sentence retrieval      Image retrieval
                     R@1    R@5    R@10      R@1    R@5    R@10
Flickr30K   T2I      79.7   95.0   97.4      60.9   84.4   90.1
            I2T      76.9   95.5   98.0      58.8   83.9   89.3
            ALL      82.3   96.0   98.4      62.6   85.8   91.1
MSCOCO1k    T2I      79.1   96.5   98.8      63.9   90.7   95.9
            I2T      79.3   96.5   98.8      63.8   90.4   95.8
            ALL      80.9   96.9   98.9      65.7   91.4   96.4
MSCOCO5k    T2I      59.1   84.8   91.8      42.8   71.5   81.9
            I2T      58.4   84.6   91.9      41.7   71.4   81.7
            ALL      61.3   86.1   92.6      44.3   73.2   83.2
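
For readers unfamiliar with the R@K columns, the following is a minimal sketch of how Recall@K is commonly computed from an image-by-caption similarity matrix. It is not taken from this repository: the function names, the plain list-of-lists matrix, and the assumption that the captions of image i occupy consecutive columns (five per image in the SCAN-style data) are illustrative choices only.

```python
def recall_at_k(ranks, ks=(1, 5, 10)):
    """Percentage of queries whose ground-truth match is ranked in the top k (ranks are 0-based)."""
    n = len(ranks)
    return {k: 100.0 * sum(r < k for r in ranks) / n for k in ks}


def sentence_retrieval_ranks(sims, caps_per_image=5):
    """For each image (row of sims), the rank of its best-placed ground-truth caption."""
    ranks = []
    for i, row in enumerate(sims):
        # Caption indices sorted from most to least similar to image i.
        order = sorted(range(len(row)), key=lambda j: -row[j])
        # Assumption: captions of image i sit in columns [i*caps, (i+1)*caps).
        gt = range(caps_per_image * i, caps_per_image * (i + 1))
        ranks.append(min(order.index(j) for j in gt))
    return ranks


def image_retrieval_ranks(sims, caps_per_image=5):
    """For each caption (column of sims), the rank of its ground-truth image."""
    n_images = len(sims)
    n_caps = len(sims[0])
    ranks = []
    for j in range(n_caps):
        column = [sims[i][j] for i in range(n_images)]
        # Image indices sorted from most to least similar to caption j.
        order = sorted(range(n_images), key=lambda i: -column[i])
        ranks.append(order.index(j // caps_per_image))
    return ranks
```

With a toy matrix of two images and one caption each, `recall_at_k(sentence_retrieval_ranks(sims, 1))` gives the sentence-retrieval columns and `recall_at_k(image_retrieval_ranks(sims, 1))` the image-retrieval columns.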

Requirements

Run pip install -r requirements.txt to install the following dependencies.

  • Python 3.7.11
  • PyTorch 1.7.1
  • NumPy 1.21.5
  • Punkt Sentence Tokenizer:
import nltk
# Non-interactive equivalent of calling nltk.download() and entering "d punkt" at the prompt
nltk.download('punkt')

Download data and vocab

We follow SCAN to obtain image features and vocabularies, which can be downloaded from:

https://www.kaggle.com/datasets/kuanghueilee/scan-features

Another download link is available below:

https://drive.google.com/drive/u/0/folders/1os1Kr7HeTbh8FajBNegW8rjJf6GIhFqC

The data should be organized as follows:

data
├── coco
│   ├── precomp  # pre-computed BUTD region features for COCO, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   └── id_mapping.json  # mapping from coco-id to image's file name
│   
│
├── f30k
│   ├── precomp  # pre-computed BUTD region features for Flickr30K, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   └── id_mapping.json  # mapping from f30k index to image's file name
│   
│
└── vocab  # vocab files provided by SCAN (only used when the text backbone is BiGRU)
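
Before training or evaluation, it can help to confirm that the downloaded files match the layout above. The snippet below is a small sketch for that purpose; the helper `missing_paths` and the `EXPECTED` list are hypothetical names introduced here, and only the entries spelled out in the tree are checked (the files hidden behind "......" are not).

```python
from pathlib import Path

# Paths relative to the data root, copied from the directory tree above.
EXPECTED = [
    "coco/precomp/train_ids.txt",
    "coco/precomp/train_caps.txt",
    "coco/id_mapping.json",
    "f30k/precomp/train_ids.txt",
    "f30k/precomp/train_caps.txt",
    "f30k/id_mapping.json",
    "vocab",
]


def missing_paths(data_root):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # "data" is the root folder name used in the tree above.
    for rel in missing_paths("data"):
        print("missing:", rel)
```

An empty result means every listed entry was found; the vocab folder is only needed when the text backbone is BiGRU.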

Pre-trained models and evaluation

Modify model_path, split, and fold5 in the eval.py file. Note that fold5=True applies only to evaluation on MSCOCO 1K (results averaged over 5 folds), while fold5=False is used for MSCOCO 5K and Flickr30K. Pretrained models and log files can be downloaded from Flickr30K_RCAR and MSCOCO_RCAR.
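
To illustrate what the 5-fold average means, here is a rough sketch under the common convention that the 5K MSCOCO test images are split into five consecutive 1K subsets, each subset is evaluated separately, and the metrics are averaged. The helpers `fold_ranges` and `average_over_folds` are hypothetical and are not functions from eval.py; the actual script may organize this differently.

```python
def fold_ranges(n_images=5000, n_folds=5):
    """Split image indices [0, n_images) into consecutive, equally sized folds."""
    size = n_images // n_folds
    return [(f * size, (f + 1) * size) for f in range(n_folds)]


def average_over_folds(per_fold_metrics):
    """Average each metric (e.g. "R@1") over the per-fold result dictionaries."""
    keys = per_fold_metrics[0].keys()
    n = len(per_fold_metrics)
    return {k: sum(m[k] for m in per_fold_metrics) / n for k in keys}
```

With fold5=False, the whole test set is instead evaluated in a single pass, which corresponds to the MSCOCO 5K and Flickr30K rows in the results table.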

Then run python eval.py in the terminal.

Training new models from scratch

Uncomment the part you need (BASELINE, RAR, RCR, or RCAR) in the train_xxxx_xxx.sh file.

Then run ./train_xxx_xxx.sh in the terminal.

Reference

If RCAR is useful for your research, please cite the following paper:

  @article{Diao2023RCAR,
     author={Diao, Haiwen and Zhang, Ying and Liu, Wei and Ruan, Xiang and Lu, Huchuan},
     journal={IEEE Transactions on Image Processing}, 
     title={Plug-and-Play Regulators for Image-Text Matching}, 
     year={2023},
     volume={32},
     pages={2322-2334}
  }

License

Apache License 2.0.