M4R: Measuring Massive Multimodal Understanding and Reasoning in Open Space

Website · Code · Leaderboard · Dataset · Dataset-Zip · Issue


Leaderboard

Evaluation of open-space domains (land, water, air) using the M4R benchmark. Each reported number is the average score in the corresponding category (overall, temporal, spatial, or intent reasoning).

| Difficulty | Model | Size | Overall Avg. | Temporal | Spatial | Intent |
|------------|-------|------|--------------|----------|---------|--------|
| Hard | GPT 4o | - | 22.21 | 24.92 | 27.14 | 13.80 |
| Hard | Gemini 2.5 Pro 🥇 | - | 31.01 | 38.18 | 30.08 | 25.20 |
| Hard | Gemini 1.5 Pro | - | 19.07 | 22.53 | 21.57 | 17.25 |
| Hard | Claude 3.5 | - | 28.89 | 32.84 | 29.18 | 23.41 |
| Hard | InternVL2.5 | 26B | 22.45 | 25.33 | 27.42 | 12.64 |
| Hard | InternVL2.5 | 8B | 20.39 | 21.30 | 29.41 | 11.42 |
| Hard | InternVL2.5 | 4B | 17.31 | 17.39 | 23.04 | 13.13 |
| Hard | LLaVA Next | 32B | 17.83 | 11.28 | 26.09 | 10.10 |
| Hard | LLaVA Video | 7B | 17.35 | 13.02 | 27.49 | 10.18 |
| Hard | LLaVA OneVision | 7B | 14.27 | 9.55 | 24.74 | 10.15 |
| Hard | Qwen2.5 VL | 32B | 19.39 | 13.19 | 27.85 | 14.05 |
| Hard | Qwen2.5 VL | 7B | 20.34 | 12.31 | 28.40 | 15.48 |
| Medium | GPT 4o 🥇 | - | 41.21 | 44.89 | 47.03 | 28.19 |
| Medium | Gemini 2.5 Pro | - | 41.07 | 41.31 | 48.33 | 33.06 |
| Medium | Gemini 1.5 Pro | - | 37.13 | 40.69 | 43.81 | 31.06 |
| Medium | Claude 3.5 | - | 37.99 | 36.46 | 47.34 | 31.09 |
| Medium | InternVL2.5 | 26B | 36.39 | 37.85 | 47.51 | 27.55 |
| Medium | InternVL2.5 | 8B | 35.44 | 39.85 | 51.07 | 18.98 |
| Medium | InternVL2.5 | 4B | 36.53 | 31.21 | 45.36 | 32.68 |
| Medium | LLaVA Next | 32B | 21.07 | 13.57 | 33.08 | 14.24 |
| Medium | LLaVA Video | 7B | 24.04 | 19.33 | 30.50 | 19.72 |
| Medium | LLaVA OneVision | 7B | 17.76 | 17.81 | 24.71 | 17.12 |
| Medium | Qwen2.5 VL | 32B | 29.93 | 23.34 | 41.94 | 25.82 |
| Medium | Qwen2.5 VL | 7B | 28.79 | 22.18 | 34.64 | 22.89 |
| Easy | GPT 4o | - | 45.01 | 55.33 | 38.08 | 43.72 |
| Easy | Gemini 2.5 Pro 🥇 | - | 59.36 | 61.16 | 54.51 | 58.09 |
| Easy | Gemini 1.5 Pro | - | 48.05 | 53.22 | 47.85 | 45.37 |
| Easy | Claude 3.5 | - | 50.14 | 53.28 | 48.51 | 46.40 |
| Easy | InternVL2.5 | 26B | 55.08 | 58.41 | 53.46 | 44.45 |
| Easy | InternVL2.5 | 8B | 51.03 | 53.64 | 54.52 | 42.20 |
| Easy | InternVL2.5 | 4B | 48.93 | 46.55 | 52.31 | 43.65 |
| Easy | LLaVA Next | 32B | 35.32 | 31.22 | 40.09 | 34.34 |
| Easy | LLaVA Video | 7B | 30.44 | 29.41 | 34.12 | 31.64 |
| Easy | LLaVA OneVision | 7B | 31.10 | 29.46 | 33.78 | 29.88 |
| Easy | Qwen2.5 VL | 32B | 48.35 | 50.68 | 47.82 | 44.97 |
| Easy | Qwen2.5 VL | 7B | 37.97 | 38.87 | 33.20 | 36.45 |

More results are available at https://open-space-reasoning.github.io/.

About the Dataset:

This benchmark includes approximately 2,000 videos and 19,000 human-annotated question-answer pairs covering a wide range of reasoning tasks (as shown in Figure 1). For more efficient evaluation, we also provide a sample set of approximately 4K examples, randomly selected from the full 19K set. All annotations were produced by highly educated annotators, each holding at least a master's degree in an engineering-related field such as mathematics or computer science.

The dataset features a variety of video lengths, categories, and frame counts, and spans three primary open-space reasoning scenarios: land space, water space, and air space. An overview of the dataset's characteristics is shown in Figure 2, which illustrates the distributions of video duration, domain coverage, and reasoning styles.

During annotation, we first design the hard-level tasks and label each question with its ground-truth answer. Based on these, we then construct the medium and easy tasks; the primary differences between difficulty levels lie in the number and types of answer choices. Details of the annotation procedure and difficulty levels are provided in our paper.
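
As a rough sanity check of a downloaded copy, the sketch below tallies question-answer pairs per space, video length, difficulty, and reasoning style. It assumes the directory layout used in the Basic Usage section (/your-dataset-path/<space>/<length>/<difficulty>/<reasoning_style>.json) and that each task file is a JSON list of entries; both are assumptions you may need to adjust for your local copy.

import json
from pathlib import Path

root = Path("/your-dataset-path")  # hypothetical dataset root; adjust as needed

# Tally QA pairs per (space, video length, difficulty, reasoning style),
# assuming <root>/<space>/<length>/<difficulty>/<reasoning_style>.json files.
counts = {}
for task_file in sorted(root.glob("*/*/*/*.json")):
    space, length, difficulty = task_file.parts[-4:-1]
    with open(task_file) as f:
        counts[(space, length, difficulty, task_file.stem)] = len(json.load(f))

for key, n in counts.items():
    print(*key, n)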

Dataset Format:

{
  "id": "int",                   // Unique example ID
  "dataset": "str",              // e.g., sub-dataset name
  "scene_name": "str",           // e.g., video filename
  "reasoning_style": "str",      // e.g., temporal_reasoning, intent_goal_reasoning, etc.
  "question": "str",             // The reasoning question related to the scene
  "ground_truth": "str",         // Correct answer key (e.g., "A", "B", etc.)
  "options": ["str", "str", "str", "str", "str", "str"]  // Multiple-choice options
}

One example from air space:

  {
    "id": 1,
    "dataset": "air_space_long",
    "scene_name": "air_space_long_1.mp4",
    "reasoning_style": "intent_goal_reasoning",
    "question": "How many moving airplanes are observed in this video?",
    "ground_truth": "A",
    "options": [
      "E. [0,1]",
      "C. [8,9]",
      "A. [4,5]",
      "D. [6,7]",
      "B. [10,11]",
      "F. [2,3]"
    ]
  }
Figure 1. A question and answer example: For each open-space reasoning setting, we include three types of video lengths: short, medium, and long. Each video length includes tasks designed to evaluate temporal reasoning, spatial reasoning, and intent reasoning.
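
As a quick illustration of this format, here is a minimal sketch for reading one task file and resolving the ground-truth option. The path is hypothetical (it follows the layout used in the Basic Usage section below), and the sketch assumes each task file is a JSON list of entries in the format above.

import json

# Hypothetical task file path; adjust to your local dataset layout.
path = "/your-dataset-path/air_space/long/hard/intent_goal_reasoning.json"

with open(path) as f:
    items = json.load(f)  # assumed to be a list of annotation entries

item = items[0]
# Options are stored as "<letter>. <text>" strings in shuffled order,
# while ground_truth holds only the correct letter key.
answer = next(o for o in item["options"] if o.startswith(item["ground_truth"] + "."))
print(item["question"])
print("Ground truth:", answer)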

Dataset Distribution:

Figure 2. Distribution of video and task properties in the M4R benchmark.

Three Space Settings

Figure 3. Examples of multimodal understanding and reasoning in open-space scenarios.

Reasoning Settings:

Figure 4. Examples of reasoning question settings in M4R across three key reasoning types: Temporal Reasoning, which involves understanding event sequences and motion over time; Spatial Reasoning, which focuses on relative positioning and orientation in space; and Intent Reasoning, which evaluates understanding of goal-directed behaviors and decision-making in dynamic environments.

One Example in Land Space Settings:

Figure 5. Land-space traffic accident scenarios for open-space video understanding and reasoning include intersection collisions, urban road accidents, nighttime incidents, rural road accidents, snow-covered road collisions, and freeway accidents.

Installation

For development, you can install the package by cloning the repository and running the following commands:

pip install uv
git clone git@github.com:SafeRL-Lab/m4r.git
cd m4r
uv venv dev
source dev/bin/activate
uv pip install -e .
uv pip install -U "qwen-vl-utils"   

Download Dataset

You can download the dataset directly from our Hugging Face repository.

git lfs install
git clone https://huggingface.co/datasets/Open-Space-Reasoning/M4R

If you encounter any issues during the download, we also provide a zipped version for convenience: Download Dataset (ZIP)
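
If git-lfs is not convenient, the dataset can also be fetched with the Hugging Face Python client. A minimal sketch, assuming huggingface_hub is installed (e.g., uv pip install huggingface_hub):

from huggingface_hub import snapshot_download

# Download the full M4R dataset repository into a local directory.
snapshot_download(
    repo_id="Open-Space-Reasoning/M4R",
    repo_type="dataset",
    local_dir="./M4R",  # hypothetical target directory; adjust as needed
)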

Basic Usage

Here's a basic evaluation example:

Download the dataset from Hugging Face and set the dataset path in the corresponding task configuration file. For example, specify the dataset path as /your-dataset-path/land_space/short/hard/spatial_reasoning.json in the task configuration file located at /Open-Space-Reasoning/lmms_eval/tasks/land_space_short/land_space_hard.yaml.

accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
        --model qwen2_5_vl \
        --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=12845056,use_flash_attention_2=False,interleave_visuals=True \
        --tasks land_space_hard \
        --batch_size 1 \
        --log_samples \
        --output_path /your-output-path/land_space_hard/

To evaluate other models, modify the example scripts below in the same way as the script above.

More examples can be found in examples/models

Evaluation of OpenAI-Compatible Model

bash examples/models/openai_compatible.sh
bash examples/models/xai_grok.sh

Evaluation of vLLM

bash examples/models/vllm_qwen2vl.sh

Evaluation of LLaVA-OneVision

bash examples/models/llava_onevision.sh

Evaluation of LLaMA-3.2-Vision

bash examples/models/llama_vision.sh

Evaluation of Qwen2-VL

bash examples/models/qwen2_vl.sh
bash examples/models/qwen2_5_vl.sh

Evaluation of LLaVA on MME

If you want to test LLaVA 1.5, you will have to clone and install their repository from LLaVA first, and then run:

bash examples/models/llava_next.sh

Evaluation with tensor parallelism for bigger models (llava-next-72b)

bash examples/models/tensor_parallel.sh

Evaluation with SGLang for bigger models (llava-next-72b)

bash examples/models/sglang.sh

Evaluation with vLLM for bigger models (llava-next-72b)

bash examples/models/vllm_qwen2vl.sh

More Parameters

python3 -m lmms_eval --help

Environment Variables

Before running experiments and evaluations, we recommend exporting the following environment variables. Some are required for certain tasks to run.

export OPENAI_API_KEY="<YOUR_API_KEY>"
export HF_HOME="<Path to HF cache>" 
export HF_TOKEN="<YOUR_HF_TOKEN>"
export HF_HUB_ENABLE_HF_TRANSFER="1"
export REKA_API_KEY="<YOUR_API_KEY>"
# Other possible environment variables include
# ANTHROPIC_API_KEY, DASHSCOPE_API_KEY, etc.

Common Environment Issues

You might occasionally run into common environment issues, for example errors related to httpx or protobuf. To resolve them, first try:

python3 -m pip install httpx==0.23.3;
python3 -m pip install protobuf==3.20;
# If you are using numpy==2.x, it may sometimes cause errors
python3 -m pip install numpy==1.26;
# Sometimes sentencepiece is required for the tokenizer to work
python3 -m pip install sentencepiece;

Citation

If you find this repository useful, please cite our study:

@article{gu2025m4r,
  title={Measuring Massive Multimodal Understanding and Reasoning in Open Space},
  author={Gu, Shangding and Wang, Xiaohan and Ying, Donghao and Zhao, Haoyu and Yang, Runing and Li, Boyi and Jin, Ming and Pavone, Marco and Yeung-Levy, Serena and Wang, Jun and Song, Dawn and Spanos, Costas},
  journal={Github},
  year={2025}
}

Acknowledgment

This repository is adapted from lmms-eval for use in our benchmark. We thank the contributors of lmms-eval for their efforts.
