
Flex-Judge Official Repository


Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators (NeurIPS 2025)
Jongwoo Ko¹*, Sungnyun Kim²*, Sungwoo Cho², Se-Young Yun²
¹ Microsoft, ² KAIST AI, * equal contribution

  • We propose Flex-Judge, a reasoning-guided multimodal evaluator that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats.
  • Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge evaluation.

[Teaser figure]

🔧 Install Requirements

Our training codebase is built upon the s1 repo. The following steps will guide you through the installation process.

First, create a conda virtual environment using:

conda create -n flex python=3.9 && conda activate flex

You can then install the remaining package dependencies of s1 as follows:

pip install -r requirements.txt

If you want to train Flex-Omni-7B, you’ll need to install a specific version of transformers released for Qwen2.5-Omni-7B:

pip uninstall transformers
pip install git+https://github.com/huggingface/[email protected]
pip install accelerate

You will also need Flash Attention 2 installed, which can be done by running:

python -m pip install flash-attn --no-build-isolation
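
To quickly confirm the setup, a minimal sanity check (run inside the activated environment) is sketched below; the expected transformers version is inferred from the tag above:

# Quick environment sanity check.
import transformers
import flash_attn

print("transformers:", transformers.__version__)  # for Flex-Omni-7B training, expect the 4.52.x Qwen2.5-Omni preview build
print("flash-attn:", flash_attn.__version__)      # confirms Flash Attention 2 is importable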

🚀 Generation

  1. Generate responses using the language model:
python utils/judgelrm_generation.py --seed 13 --temperature 0.1
  2. Select only high-quality reasoning samples:
python lrm_process.py

You should select training samples with longer reasoning traces, while also considering format divergence. We have also attached the final curated data in data/train.jsonl.
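
The exact filtering logic lives in lrm_process.py; as a rough sketch of the idea (the response field, the <think> tags, and the file names are illustrative assumptions, not the actual schema):

import json

MIN_REASONING_CHARS = 1500  # illustrative threshold; tune it against the length distribution

def keep(sample):
    response = sample["response"]  # assumed record field
    # Drop responses that diverge from the expected reasoning format.
    if "<think>" not in response or "</think>" not in response:
        return False
    reasoning = response.split("<think>")[1].split("</think>")[0]
    return len(reasoning) >= MIN_REASONING_CHARS  # prefer longer reasoning traces

with open("judgelrm_outputs.jsonl") as f:  # hypothetical output file from step 1
    samples = [json.loads(line) for line in f]

with open("data/train.jsonl", "w") as f:
    for s in filter(keep, samples):
        f.write(json.dumps(s) + "\n")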

🏋️ Training Flex-Judge

We provide four training config files for the four training setups reported in our paper. The training configs are set for 2xA6000 or 4xA6000 GPUs. You may need to adjust num_processes and per_device_train_batch_size based on your computation environment; a quick effective-batch-size check is sketched after the list below.

  • Flex-VL-7B (4xA6000):
bash train/vl.sh
  • Flex-Omni-7B (4xA6000):
bash train/omni.sh
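
When adjusting those values, it helps to keep the effective (global) batch size roughly constant. A minimal sketch of the arithmetic, where gradient_accumulation_steps is an assumption about your config:

# Effective batch size under data parallelism with gradient accumulation.
num_processes = 4                  # e.g., 4xA6000
per_device_train_batch_size = 1    # illustrative values; match your config
gradient_accumulation_steps = 4

print(num_processes * per_device_train_batch_size * gradient_accumulation_steps)  # 16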

📊 Evaluation

For our evaluation benchmark, we use MLLM-as-a-Judge, VL-RewardBench, MJ-Bench, and GenAI-Bench for vision tasks. All responses are generated using vLLM.

To run Flex-VL-7B, create a conda environment with the latest version of vllm:

conda create -n vllm python=3.11 && conda activate vllm
pip install vllm==0.8.5

To run Flex-Omni-7B -- which is used for audio tasks such as NISQA, BVCC, SOMOS, and VoxSim, and is also compatible with vision tasks -- create a separate environment named vllm-omni:

conda create -n vllm-omni python=3.11 && conda activate vllm-omni
pip install git+https://github.com/huggingface/[email protected]
pip install vllm==0.8.5
  1. Generate responses from the prompts:
python test/{benchmark}_vllm.py --ckpt $CKPT --split $SPLIT

For example, you can generate judge results for the image editing task of GenAI-Bench as follows:

python test/genai_bench_vllm.py --ckpt ckpts/vl --split editing

or, for NISQA speech quality assessment:

python test/nisqa_vllm.py --ckpt ckpts/omni --root /path/to/nisqa
  2. Evaluate the judge results against human (or GPT-4) evaluations. For example:
python test/evaluate.py --benchmark genai_bench --model ckpts/vl --split editing

Change the --benchmark argument to evaluate on other benchmarks.
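
Under the hood, the test/*_vllm.py scripts batch judge prompts through vLLM. A stripped-down, text-only sketch of the pattern (the prompt wording and checkpoint path are illustrative, and the actual scripts additionally pass image or audio inputs):

from vllm import LLM, SamplingParams

# Illustrative checkpoint path; point this at your trained Flex-VL-7B / Flex-Omni-7B checkpoint.
llm = LLM(model="ckpts/vl", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=1024)

# Hypothetical pairwise judge prompt; the real templates are defined in the test scripts.
prompt = (
    "You are a judge. Compare the two responses to the question below, "
    "reason step by step, then output the better one as [[A]] or [[B]].\n\n"
    "Question: ...\nResponse A: ...\nResponse B: ..."
)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)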

📚 Resources

Here are the resources for evaluation benchmarks such as MLLM-as-a-Judge, MJ-Bench, and GenAI-Bench.

Disclaimer: We're currently fixing our code for VL-RewardBench and MJ-Bench as the original datasets have recently been changed.

🧪 Examples

  • An example of assessing image generation

[Example figure: image generation assessment]

  • We also provide a molecule-specific evaluator called Flex-Mol-LLaMA. To use our Flex-Mol-LLaMA judge model for Best-of-N sampling and DPO training, refer to example/README.md. A rough sketch of the Best-of-N idea is shown after the figure below.

[Example figure: Flex-Mol-LLaMA]
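
As an illustration of Best-of-N selection with a judge (judge_score below is a placeholder for a call into the actual Flex-Mol-LLaMA judge described in example/README.md):

# Best-of-N: generate N candidates, score each with the judge, keep the highest-scoring one.
def best_of_n(question, candidates, judge_score):
    # judge_score(question, candidate) is assumed to return a scalar quality score.
    scored = [(judge_score(question, c), c) for c in candidates]
    return max(scored, key=lambda x: x[0])[1]

# Usage sketch: pick the best of 8 sampled answers.
# best = best_of_n(question, [sample_answer(question) for _ in range(8)], judge_score)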

📖 BibTeX

If you find this repo useful for your research, please consider citing our paper:

@article{ko2025flex,
  title={Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators},
  author={Ko, Jongwoo and Kim, Sungnyun and Cho, Sungwoo and Yun, Se-Young},
  journal={arXiv preprint arXiv:2505.18601},
  year={2025}
}

📬 Contact
