This repository is the official implementation of our ICMR 2025 paper, "Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval".
conda create -n MaTIR python=3.12
conda activate MaTIR
pip install -r requirements.txt
git clone https://github.com/facebookresearch/sam2.git
mv sam2 SAM2 && cd SAM2 && pip install -e . && cd ..
git clone https://github.com/SunzeY/AlphaCLIP.git
cd AlphaCLIP && pip install -e . && cd ..
Optionally, install FlashAttention:
pip install flash-attn --no-build-isolation
Download the Alpha-CLIP checkpoint:
mkdir -p AlphaCLIP/checkpoints
wget -O AlphaCLIP/checkpoints/clip_l14_grit20m_fultune_2xe.pth \
https://download.openxlab.org.cn/models/SunzeY/AlphaCLIP/weight/clip_l14_grit20m_fultune_2xe.pth
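After downloading, a quick sanity check can catch a truncated download or an HTML error page saved in place of the weights. This is an optional sketch; the size threshold and the zip heuristic are assumptions, not anything documented by Alpha-CLIP:

```python
import os
import zipfile

def checkpoint_looks_ok(path, min_bytes=100 * 1024 * 1024):
    """Heuristic sanity check for a downloaded .pth checkpoint.

    Checkpoints saved by recent PyTorch versions are zip archives, so a
    tiny or non-zip file usually means the download failed. A legacy
    (pre-zip) checkpoint would also report False here, so treat a
    failure as a prompt to re-check, not proof of corruption.
    """
    if not os.path.isfile(path):
        return False
    if os.path.getsize(path) < min_bytes:
        return False
    return zipfile.is_zipfile(path)

# e.g. checkpoint_looks_ok("AlphaCLIP/checkpoints/clip_l14_grit20m_fultune_2xe.pth")
```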
The $D^3$ dataset is from "Described Object Detection: Liberating Object Detection with Flexible Expressions" (NeurIPS 2023). Download it and organize the files as follows:
datasets/
├── coco/
│   ├── annotations/
│   └── val2017/
├── d3/
│   ├── d3_images/
│   ├── d3_json/
│   └── d3_pkl/
├── coco.py
└── d3.py
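A small helper (hypothetical, not part of the repository) can verify the layout above before running evaluation:

```python
import os

# Relative paths from the tree above that evaluation expects to find.
EXPECTED = [
    "coco/annotations",
    "coco/val2017",
    "d3/d3_images",
    "d3/d3_json",
    "d3/d3_pkl",
    "coco.py",
    "d3.py",
]

def missing_entries(root="datasets"):
    """Return the expected entries that are absent under `root`."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]

# An empty result means the layout is complete: print(missing_entries())
```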
Our pipeline consists of three steps: segmentation-aware text-to-image retrieval (TIR), MLLM-based reranking, and referring expression segmentation (RES).
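The three steps above can be sketched as follows. The function names, the precomputed score fields, and the head-only reranking are illustrative assumptions, not the repository's actual API:

```python
def retrieve(query, gallery, top_k=100):
    """Step 1: segmentation-aware TIR -- rank every gallery image by its
    query relevance score (e.g. Alpha-CLIP over SAM 2 mask proposals)
    and keep the top-k candidates, highest score first."""
    scored = sorted(gallery, key=lambda img: img["score"], reverse=True)
    return scored[:top_k]

def rerank(query, candidates, rerank_k=10):
    """Step 2: MLLM-based reranking -- re-order only the head of the
    ranked list with a stronger (but slower) relevance model."""
    head, tail = candidates[:rerank_k], candidates[rerank_k:]
    head = sorted(head, key=lambda img: img["mllm_score"], reverse=True)
    return head + tail

def segment(query, image):
    """Step 3: RES -- return the mask of the best-matching proposal."""
    return image["best_mask"]
```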
The following shows the commands for evaluating on the D3 dataset; to evaluate on COCO, replace D3 with COCO in the arguments and output paths.
python eval_retrieval.py --dataset D3 \
--save_predictions_file saved_files/D3/pred.pkl \
--save_sam_output_dir saved_files/D3/sam_output
python eval_reranking.py --dataset D3 \
--original_predictions_file saved_files/D3/pred.pkl \
--save_reranked_predictions_file saved_files/D3/pred_reranked.pkl
python eval_res.py --dataset D3 \
--retrieval_predictions_file saved_files/D3/pred_reranked.pkl \
--sam_output_dir saved_files/D3/sam_output
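The three commands chain through the saved files: eval_retrieval.py writes pred.pkl and the SAM outputs, eval_reranking.py rewrites the ranking, and eval_res.py consumes both. As a hedged sketch of that handoff (the actual schema inside pred.pkl is an assumption), the reranking step amounts to loading the pickled ranking, re-ordering it, and writing it back:

```python
import pickle

def reorder_predictions(in_path, out_path, new_order):
    """Load pickled per-query rankings, permute each ranked list by the
    given index order (hypothetical schema: {query_id: [image ids]}),
    and save the result for the downstream RES step."""
    with open(in_path, "rb") as f:
        preds = pickle.load(f)
    reranked = {qid: [ranking[i] for i in new_order[qid]]
                for qid, ranking in preds.items()}
    with open(out_path, "wb") as f:
        pickle.dump(reranked, f)
```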
If you find this useful in your research, please cite our paper:
@inproceedings{shen2025mask,
title = {Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval},
author = {Shen, Li-Cheng and Hsieh, Jih-Kang and Li, Wei-Hua and Chen, Chu-Song},
booktitle = {Proceedings of the 2025 International Conference on Multimedia Retrieval},
pages = {2028--2032},
year = {2025}
}