Meng Wei*
Chenyang Wan*
Xiqian Yu*
Tai Wang*‡
Yuqiang Yang
Xiaohan Mao
Chenming Zhu
Wenzhe Cai
Hanqing Wang
Yilun Chen
Xihui Liu†
Jiangmiao Pang†
Shanghai AI Laboratory · The University of Hong Kong · Zhejiang University · Shanghai Jiao Tong University
StreamVLN generates action outputs from continuous video input in an online, multi-turn dialogue manner. Built on LLaVA-Video as the foundational Video-LLM, it extends the model for interleaved vision, language, and action modeling. To model long-sequence context effectively while keeping computation efficient enough for real-time interaction, StreamVLN uses: (1) a fast-streaming dialogue context with a sliding-window KV cache; and (2) a slow-updating memory built via token pruning.
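The interplay of the two components can be pictured with a small, framework-agnostic sketch. This is not StreamVLN's implementation: the class name, window size, keep ratio, and the norm-based pruning score below are all illustrative assumptions.

```python
import torch
from collections import deque

class SlowFastContext:
    """Illustrative sketch: a fast sliding window of recent turns plus a
    slow-updating memory built by pruning tokens evicted from the window.
    (Names, sizes, and the pruning score are assumptions, not StreamVLN's API.)"""

    def __init__(self, window_size: int = 8, keep_ratio: float = 0.25):
        self.window = deque(maxlen=window_size)  # recent turns, full resolution
        self.memory = []                         # older turns, kept as pruned tokens
        self.keep_ratio = keep_ratio

    def add_turn(self, vision_tokens: torch.Tensor) -> None:
        """vision_tokens: (num_tokens, dim) visual features for one new observation."""
        if len(self.window) == self.window.maxlen:
            evicted = self.window[0]                  # oldest turn leaves the fast window
            self.memory.append(self._prune(evicted))  # and enters the slow memory, pruned
        self.window.append(vision_tokens)

    def _prune(self, tokens: torch.Tensor) -> torch.Tensor:
        """Keep a small subset of tokens; L2 norm stands in for any importance score."""
        k = max(1, int(tokens.shape[0] * self.keep_ratio))
        idx = tokens.norm(dim=-1).topk(k).indices.sort().values  # keep original order
        return tokens[idx]

    def context_tokens(self) -> torch.Tensor:
        """Tokens attended to at the current step: pruned memory,
        then the full-resolution recent window."""
        parts = self.memory + list(self.window)
        return torch.cat(parts, dim=0) if parts else torch.empty(0)
```

In the real system the fast window corresponds to cached keys/values for the streaming dialogue, while the pruned tokens approximate a long-horizon memory at much lower cost.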
[2025-07-18] We've fixed a bug where `num_history` was not correctly passed to the model during evaluation, causing it to default to `None`. This had a significant impact on performance; please pull the latest code for correct evaluation.
We test under the following environment:
- Python 3.9
- PyTorch 2.1.2
- CUDA 12.4
- Create a conda env with Python 3.9 and install habitat-sim and habitat-lab:

  ```bash
  conda create -n streamvln python=3.9
  conda activate streamvln
  conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat
  git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
  cd habitat-lab
  pip install -e habitat-lab        # install habitat_lab
  pip install -e habitat-baselines  # install habitat_baselines
  ```
- Clone this repository:

  ```bash
  git clone https://github.com/OpenRobotLab/StreamVLN.git
  cd StreamVLN
  ```
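With the environment created and the repository cloned, an optional sanity check along these lines can confirm that the versions listed above are the ones actually installed (this helper is not part of the repo, just an illustrative sketch):

```python
# Optional environment check -- not part of the repo, just a quick sanity test.
import torch
import habitat
import habitat_sim

print("PyTorch:", torch.__version__)            # expected: 2.1.2
print("CUDA available:", torch.cuda.is_available())
print("habitat-sim:", habitat_sim.__version__)  # expected: 0.2.4
print("habitat-lab:", habitat.__version__)      # expected: 0.2.4
```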
To get started, you need to prepare three types of data:
- Matterport3D (MP3D) Scenes

  Download the MP3D scenes from the official project page and place them under `data/scene_datasets/mp3d/`.

- VLN-CE Episodes

  Download the VLN-CE episodes:
  - r2r (rename `R2R_VLNCE_v1/` -> `r2r/`)
  - rxr (rename `RxR_VLNCE_v0/` -> `rxr/`)
  - envdrop (rename `R2R_VLNCE_v1-3_preprocessed/envdrop/` -> `envdrop/`)

  Extract them into the `data/datasets/` directory.

- Collected Trajectory Data

  We provide pre-collected observation-action trajectory data for training. These trajectories were collected using the training episodes from R2R and RxR in the Matterport3D environments. For the EnvDrop subset, please refer to DATASET.md for instructions on how to collect it yourself. Download the observation-action trajectory data from Hugging Face and extract it to `data/trajectory_data/`.
Your final folder structure should look like this:
```
data/
├── datasets/
│   ├── r2r/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   └── val_seen.json.gz
│   │   └── val_unseen/
│   │       └── val_unseen.json.gz
│   ├── rxr/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   ├── val_seen_guide.json.gz
│   │   │   └── ...
│   │   └── val_unseen/
│   │       ├── val_unseen_guide.json.gz
│   │       └── ...
│   └── envdrop/
│       ├── envdrop.json.gz
│       └── ...
│
├── scene_datasets/
│   └── mp3d/
│       ├── 17DRP5sb8fy/
│       ├── 1LXtFkjw3qL/
│       └── ...
└── trajectory_data/
    ├── R2R/
    │   ├── images/
    │   └── annotations.json
    ├── RxR/
    │   ├── images/
    │   └── annotations.json
    └── EnvDrop/
        ├── images/
        └── annotations.json
```
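Before training or evaluation, a quick check against this layout can catch missing downloads early. The paths below are taken from the tree above; the helper itself is only an illustrative sketch, not a script shipped with the repo.

```python
from pathlib import Path

# Expected paths, taken from the directory tree above.
EXPECTED = [
    "data/datasets/r2r/val_unseen/val_unseen.json.gz",
    "data/datasets/rxr/val_unseen/val_unseen_guide.json.gz",
    "data/datasets/envdrop/envdrop.json.gz",
    "data/scene_datasets/mp3d",
    "data/trajectory_data/R2R/annotations.json",
    "data/trajectory_data/RxR/annotations.json",
]

def check_data_layout(root: str = ".") -> None:
    """Report any expected file or directory that is missing under `root`."""
    missing = [p for p in EXPECTED if not (Path(root) / p).exists()]
    if missing:
        print("Missing:\n  " + "\n  ".join(missing))
    else:
        print("Data layout looks complete.")

if __name__ == "__main__":
    check_data_layout()
```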
We provide two model checkpoints for different use cases:
- Benchmark Reproduction

  Use this checkpoint to reproduce results on the VLN-CE benchmark.

- Real-World Deployment

  This checkpoint is recommended for deployment on physical robots. We made two modifications:
  - Removed redundant initial turn actions: initial left/right turns not mentioned in the instructions are removed for better instruction alignment.
  - Trajectory safety: enhanced obstacle avoidance ensures more reliable navigation in real-world environments.
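If the checkpoints are hosted on the Hugging Face Hub, they can be fetched with `huggingface_hub`; the repo id below is a placeholder, so substitute the checkpoint repository linked above.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- substitute the checkpoint repository linked above.
ckpt_dir = snapshot_download(
    repo_id="<org>/<streamvln-checkpoint>",
    local_dir="checkpoints/streamvln",
)
print("Checkpoint files downloaded to:", ckpt_dir)
```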
To launch multi-node, multi-GPU distributed training, run:

```bash
sbatch scripts/streamvln_train_slurm.sh
```
To perform multi-GPU evaluation with key-value cache support, run:

```bash
sh scripts/streamvln_eval_multi_gpu.sh
```
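Under the hood, multi-GPU evaluation typically splits the episode set across processes, one per GPU. The sketch below only illustrates that sharding pattern (assuming a `torchrun`-style launcher that sets `RANK` and `WORLD_SIZE`); it is not the repository's actual evaluation loop.

```python
import os
import torch

def shard_episodes(episodes, rank: int, world_size: int):
    """Round-robin split of the evaluation episodes across ranks."""
    return episodes[rank::world_size]

# Each process reads its rank and world size from the launcher's environment.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
if torch.cuda.is_available():
    torch.cuda.set_device(rank % torch.cuda.device_count())

my_episodes = shard_episodes(list(range(100)), rank, world_size)  # toy episode ids
print(f"rank {rank}/{world_size}: evaluating {len(my_episodes)} episodes")
```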
- ✅ Release the arXiv paper (Jul. 8, 2025)
- ✅ Provide inference scripts and model checkpoints
- ✅ Release training code and configurations
- ✅ Release training data
- ⏳ Support co-training with LLaVA-Video-178K, ScanQA, MMC4
- ⏳ DAgger data collection
If you encounter any problems or have questions about StreamVLN, please feel free to open an issue.
If you find our work helpful, please consider starring this repo 🌟 and citing:
```bibtex
@article{wei2025streamvln,
  title={StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling},
  author={Wei, Meng and Wan, Chenyang and Yu, Xiqian and Wang, Tai and Yang, Yuqiang and Mao, Xiaohan and Zhu, Chenming and Cai, Wenzhe and Wang, Hanqing and Chen, Yilun and others},
  journal={arXiv preprint arXiv:2507.05240},
  year={2025}
}
```
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This repo is based on LLaVA-NeXT.