🌐 English | 简体中文 | 繁體中文 | Español | Français | 日本語
If you find this project useful, a star ⭐ on GitHub would be greatly appreciated!
ThinkSound is a unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning.
PyTorch implementation for multimodal audio generation and editing: generate or edit audio from video, text, and audio, powered by step-by-step reasoning from Multimodal Large Language Models (MLLMs).
- 2025.07 🔥Online demo on Hugging Face Spaces and ModelScope for an interactive experience!
- 2025.07 🔥Released inference scripts and web interface.
- 2025.06 🔥ThinkSound paper released on arXiv!
- 2025.06 🔥Online Demo is live - try it now!
- Any2Audio: Generate audio from arbitrary modalities — video, text, audio, or their combinations.
- Video-to-Audio SOTA: Achieves state-of-the-art results on multiple V2A benchmarks.
- CoT-Driven Reasoning: Chain-of-Thought reasoning for compositional and controllable audio generation via MLLMs.
- Interactive Object-centric Editing: Refine or edit specific sound events by clicking on visual objects or using text instructions.
- Unified Framework: One foundation model supports generation, editing, and interactive workflows.
ThinkSound decomposes audio generation and editing into three interactive stages, all guided by MLLM-based Chain-of-Thought (CoT) reasoning:
- Foley Generation: Generate foundational, semantically and temporally aligned soundscapes from video.
- Object-Centric Refinement: Refine or add sounds for user-specified objects via clicks or regions in the video.
- Targeted Audio Editing: Modify generated audio using high-level natural language instructions.
Environment Preparation:
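Optionally, create and activate a fresh conda environment first to isolate dependencies; the environment name and Python version below are assumptions, not stated project requirements:

```bash
# Hypothetical environment setup; adjust the Python version as needed
conda create -n thinksound python=3.10 -y
conda activate thinksound
```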
git clone https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound
pip install -r requirements.txt
conda install -y -c conda-forge 'ffmpeg<7'
# Download the pretrained weights from https://huggingface.co/liuhuadai/ThinkSound into the ckpts/ directory
# The model weights can also be downloaded from https://www.modelscope.cn/models/iic/ThinkSound
git lfs install
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
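As a quick sanity check after setup (the exact file names inside `ckpts/` depend on the Hugging Face repo layout, so treat the listing as illustrative):

```bash
# Verify ffmpeg is on PATH and the checkpoint directory is populated
ffmpeg -version | head -n 1
ls -lh ckpts/
```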
Make the script executable:
chmod +x scripts/demo.sh
Run the script:
./scripts/demo.sh <video_path> <title> <CoT description> [use-half]
Append `use-half` at the end to enable half-precision inference, which reduces GPU memory usage.
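For example, a hypothetical invocation (the video path, title, and CoT description are placeholders; replace them with your own):

```bash
# Placeholder inputs; any video with visible sound sources works
./scripts/demo.sh examples/typing.mp4 "Typing on a keyboard" "Generate crisp keystroke sounds aligned with the typing motions in the video." use-half
```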
For an interactive experience, launch the Gradio web interface:
python app.py
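Gradio apps serve on http://localhost:7860 by default; if that port is taken, Gradio's standard environment variable can move it (assuming `app.py` keeps the library defaults):

```bash
# Launch the web UI on an alternate port via Gradio's standard env var
GRADIO_SERVER_PORT=7861 python app.py
```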
- [ ] Release training scripts for ThinkSound models
- [ ] Open-source AudioCoT dataset and automated pipeline
- [ ] Provide detailed documentation and API reference
- [ ] Add support for additional modalities and downstream tasks
This project is released under the Apache 2.0 License.
Note:
The code, models, and dataset are for research and educational purposes only.
Commercial use is NOT permitted. For commercial licensing, please contact the authors.
If you find ThinkSound useful in your research or work, please cite our paper:
@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
year={2025},
eprint={2506.21448},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2506.21448},
}
✨ Feel free to open an issue or contact us via email ([email protected]) if you have any questions or suggestions!