This repository is the official implementation of our ICMR 2025 paper, "Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval".
conda create -n MaTIR python=3.12
conda activate MaTIR
pip install -r requirements.txt
git clone https://github.com/facebookresearch/sam2.git
mv sam2 SAM2 && cd SAM2 && pip install -e . && cd ..
git clone https://github.com/SunzeY/AlphaCLIP.git
cd AlphaCLIP && pip install -e . && cd ..
Optionally, install FlashAttention:
pip install flash-attn --no-build-isolation
Download the Alpha-CLIP checkpoint:
mkdir -p AlphaCLIP/checkpoints
wget -O AlphaCLIP/checkpoints/clip_l14_grit20m_fultune_2xe.pth \
https://download.openxlab.org.cn/models/SunzeY/AlphaCLIP/weight/clip_l14_grit20m_fultune_2xe.pth
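After downloading, a quick sanity check can catch a truncated download or an HTML error page saved in place of the weights. This is an optional sketch; the size threshold and the zip heuristic are assumptions, not anything documented by Alpha-CLIP:

```python
import os
import zipfile

def checkpoint_looks_ok(path, min_bytes=100 * 1024 * 1024):
    """Heuristic sanity check for a downloaded .pth checkpoint.

    Checkpoints saved by recent PyTorch versions are zip archives, so a
    tiny or non-zip file usually means the download failed. A legacy
    (pre-zip) checkpoint would also report False here, so treat a
    failure as a prompt to re-check, not proof of corruption.
    """
    if not os.path.isfile(path):
        return False
    if os.path.getsize(path) < min_bytes:
        return False
    return zipfile.is_zipfile(path)

# e.g. checkpoint_looks_ok("AlphaCLIP/checkpoints/clip_l14_grit20m_fultune_2xe.pth")
```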
The $D^3$ dataset is from "Described Object Detection: Liberating Object Detection with Flexible Expressions" (NeurIPS 2023). Download it and organize the files as follows:
datasets/
├── coco/
│   ├── annotations/
│   └── val2017/
├── d3/
│   ├── d3_images/
│   ├── d3_json/
│   └── d3_pkl/
├── coco.py
└── d3.py
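A small helper (hypothetical, not part of the repository) can verify the layout above before running evaluation:

```python
import os

# Relative paths from the tree above that evaluation expects to find.
EXPECTED = [
    "coco/annotations",
    "coco/val2017",
    "d3/d3_images",
    "d3/d3_json",
    "d3/d3_pkl",
    "coco.py",
    "d3.py",
]

def missing_entries(root="datasets"):
    """Return the expected entries that are absent under `root`."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]

# An empty result means the layout is complete: print(missing_entries())
```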
Our pipeline consists of three steps: segmentation-aware text-to-image retrieval (TIR), MLLM-based reranking, and referring expression segmentation (RES).
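The three steps above can be sketched as follows. The function names, the precomputed score fields, and the head-only reranking are illustrative assumptions, not the repository's actual API:

```python
def retrieve(query, gallery, top_k=100):
    """Step 1: segmentation-aware TIR -- rank every gallery image by its
    query relevance score (e.g. Alpha-CLIP over SAM 2 mask proposals)
    and keep the top-k candidates, highest score first."""
    scored = sorted(gallery, key=lambda img: img["score"], reverse=True)
    return scored[:top_k]

def rerank(query, candidates, rerank_k=10):
    """Step 2: MLLM-based reranking -- re-order only the head of the
    ranked list with a stronger (but slower) relevance model."""
    head, tail = candidates[:rerank_k], candidates[rerank_k:]
    head = sorted(head, key=lambda img: img["mllm_score"], reverse=True)
    return head + tail

def segment(query, image):
    """Step 3: RES -- return the mask of the best-matching proposal."""
    return image["best_mask"]
```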
The following shows the commands for evaluating on the D3 dataset; to evaluate on COCO, replace D3 with COCO in the arguments and output paths.
python eval_retrieval.py --dataset D3 \
--save_predictions_file saved_files/D3/pred.pkl \
--save_sam_output_dir saved_files/D3/sam_output
python eval_reranking.py --dataset D3 \
--original_predictions_file saved_files/D3/pred.pkl \
--save_reranked_predictions_file saved_files/D3/pred_reranked.pkl
python eval_res.py --dataset D3 \
--retrieval_predictions_file saved_files/D3/pred_reranked.pkl \
--sam_output_dir saved_files/D3/sam_output
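The three commands chain through the saved files: eval_retrieval.py writes pred.pkl and the SAM outputs, eval_reranking.py rewrites the ranking, and eval_res.py consumes both. As a hedged sketch of that handoff (the actual schema inside pred.pkl is an assumption), the reranking step amounts to loading the pickled ranking, re-ordering it, and writing it back:

```python
import pickle

def reorder_predictions(in_path, out_path, new_order):
    """Load pickled per-query rankings, permute each ranked list by the
    given index order (hypothetical schema: {query_id: [image ids]}),
    and save the result for the downstream RES step."""
    with open(in_path, "rb") as f:
        preds = pickle.load(f)
    reranked = {qid: [ranking[i] for i in new_order[qid]]
                for qid, ranking in preds.items()}
    with open(out_path, "wb") as f:
        pickle.dump(reranked, f)
```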
If you find this useful in your research, please cite our paper:
@inproceedings{shen2025mask,
title = {Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval},
author = {Shen, Li-Cheng and Hsieh, Jih-Kang and Li, Wei-Hua and Chen, Chu-Song},
booktitle = {Proceedings of the 2025 International Conference on Multimedia Retrieval},
pages = {2028--2032},
year = {2025}
}