This is an evolving repo for the survey: Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey. If you find our survey useful for your research, please 📚cite📚 the following paper:
@article{xie2024towards,
title={Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey},
author={Xie, Tianxin and Rong, Yan and Zhang, Pengfei and Wang, Wenwu and Liu, Li},
journal={arXiv preprint arXiv:2412.06602},
year={2024}
}
If you find any mistakes, please don’t hesitate to open an issue.
Below are representative non-autoregressive controllable TTS methods. Each entry follows this format: method name, zero-shot capability, controllability, acoustic model, vocoder, acoustic feature, release date, and code/demo.
NOTE: MelS and LinS represent Mel Spectrogram and Linear Spectrogram, respectively. Among today’s TTS systems, MelS, latent features (from VAEs, diffusion models, and other flow-based methods), and various types of discrete tokens are the most commonly used acoustic representations.
- ProEmo, Zero-shot (✗), Controllability (Pitch, Energy, Emotion, Description), Transformer, HiFi-GAN, MelS, 2025.01, Code
- DrawSpeech, Zero-shot (✗), Controllability (Energy, Prosody), Diffusion, HiFi-GAN, MelS, 2025.01, Demo, Code
- DiffStyleTTS, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre), Transformer + Diffusion, HiFi-GAN, MelS, 2025.01, Demo
- HED, Zero-shot (✓), Controllability (Emotion), Flow-based Diffusion, Vocos, MelS, 2024.12, Demo
- EmoDubber, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Transformer + Flow, Flow-based Vocoder, MelS, 2024.12, Demo
- EmoSphere++, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Transformer + Flow, BigVGAN, MelS, 2024.11, Demo, Code
- MS$^{2}$KU-VTTS, Zero-shot (✗), Controllability (Environment, Description), Diffusion, BigVGAN, MelS, 2024.10
- NanoVoice, Zero-shot (✓), Controllability (Timbre), Diffusion, BigVGAN, MelS, 2024.09
- NansyTTS, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Description), Transformer, NANSY++, MelS, 2024.09, Demo
- StyleTTS-ZS, Zero-shot (✓), Controllability (Timbre), Flow-based Diffusion + GAN, Mel-based Decoder, MelS, 2024.09, Demo
- E1 TTS, Zero-shot (✓), Controllability (Timbre), DiT + Flow, BigVGAN, Token + MelS, 2024.09, Demo
- SimpleSpeech 2, Zero-shot (✓), Controllability (Speed, Timbre), Flow-based DiT, SQ Codec, Token, 2024.08, Demo, Code
- CCSP, Zero-shot (✓), Controllability (Timbre), Diffusion, RVQ-based Codec, Token, 2024.07, Demo
- ArtSpeech, Zero-shot (✓), Controllability (Timbre), RNN + CNN, HiFi-GAN, MelS, 2024.07, Demo, Code
- DEX-TTS, Zero-shot (✓), Controllability (Timbre), Diffusion, HiFi-GAN, MelS, 2024.06, Code
- MobileSpeech, Zero-shot (✓), Controllability (Timbre), Transformer, Vocos, Token, 2024.06, Demo
- E2 TTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, BigVGAN, MelS, 2024.06, Demo, Code (unofficial)
- DiTTo-TTS, Zero-shot (✓), Controllability (Speed, Timbre), DiT + VAE, BigVGAN, MelS, 2024.06, Demo
- SimpleSpeech, Zero-shot (✓), Controllability (Timbre), Transformer + Diffusion, SQ Codec, Token, 2024.06, Demo, Code
- AST-LDM, Zero-shot (✗), Controllability (Timbre, Environment, Description), Diffusion + VAE, HiFi-GAN, MelS, 2024.06, Demo
- ControlSpeech, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, FACodec, Token, 2024.06, Demo, Code
- InstructTTS, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, HiFi-GAN, Token, 2024.05, Demo
- NaturalSpeech 3, Zero-shot (✓), Controllability (Speed, Prosody, Timbre), Transformer + Diffusion, FACodec, Token, 2024.04, Demo
- FlashSpeech, Zero-shot (✓), Controllability (Timbre), Latent Consistency Model, EnCodec, Token, 2024.04, Demo, Code
- Audiobox, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Environment, Description), Transformer + Flow, EnCodec, MelS, 2023.12, Demo
- HierSpeech++, Zero-shot (✓), Controllability (Timbre), Transformer + VAE + Flow, BigVGAN, MelS, 2023.11, Demo, Code
- E3 TTS, Zero-shot (✓), Controllability (Timbre), Diffusion, Not required, Waveform, 2023.11, Demo
- P-Flow, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.10, Demo, Code (unofficial)
- SpeechFlow, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.10, Demo
- PromptTTS++, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, BigVGAN, MelS, 2023.09, Demo, Code
- DurIAN-E, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, HiFi-GAN, MelS, 2023.09, Demo
- VoiceLDM, Zero-shot (✗), Controllability (Pitch, Prosody, Timbre, Emotion, Environment, Description), Diffusion, HiFi-GAN, MelS, 2023.09, Demo, Code
- PromptTTS 2, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Description), Diffusion, RVQ-based Codec, Latent Feature, 2023.09, Demo
- MegaTTS 2, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Decoder-only Transformer + GAN, HiFi-GAN, MelS, 2023.07, Demo, Code (unofficial)
- VoiceBox, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.06, Demo, Code (unofficial)
- StyleTTS 2, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Flow-based Diffusion + GAN, HiFi-GAN / iSTFTNet, MelS, 2023.06, Demo, Code
- PromptStyle, Zero-shot (✓), Controllability (Pitch, Prosody, Timbre, Emotion, Description), VITS + Flow, HiFi-GAN, MelS, 2023.05, Demo
- NaturalSpeech 2, Zero-shot (✓), Controllability (Timbre), Diffusion, RVQ-based Codec, Token, 2023.04, Demo, Code (unofficial)
- Grad-StyleSpeech, Zero-shot (✓), Controllability (Timbre), Score-based Diffusion, HiFi-GAN, MelS, 2022.11, Demo
- PromptTTS, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), BERT + Transformer, HiFi-GAN, MelS, 2022.11, Demo
- CLONE, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), Transformer + CNN, WaveNet, MelS + LinS, 2022.07, Demo
- Cauliflow, Zero-shot (✗), Controllability (Speed, Prosody), BERT + Flow, UP WaveNet, MelS, 2022.06
- GenerSpeech, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2022.05, Demo
- StyleTTS, Zero-shot (✓), Controllability (Timbre), CNN + RNN, HiFi-GAN, MelS, 2022.05, Code
- YourTTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, LinS, 2021.12, Demo & Checkpoint
- DelightfulTTS, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), Transformer + CNN, HiFiNet, MelS, 2021.11, Demo
- Meta-StyleSpeech, Zero-shot (✓), Controllability (Timbre), Transformer, MelGAN, MelS, 2021.06, Code
- SC-GlowTTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2021.06, Demo, Code
- StyleTagging-TTS, Zero-shot (✓), Controllability (Timbre, Emotion), Transformer + CNN, HiFi-GAN, MelS, 2021.04, Demo
- Parallel Tacotron, Zero-shot (✗), Controllability (Prosody), Transformer + CNN, WaveRNN, MelS, 2020.10, Demo
- FastPitch, Zero-shot (✗), Controllability (Pitch, Prosody), Transformer, WaveGlow, MelS, 2020.06, Code
- FastSpeech 2, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody), Transformer, Parallel WaveGAN, MelS, 2020.06, Code (unofficial)
- FastSpeech, Zero-shot (✗), Controllability (Speed, Prosody), Transformer, WaveGlow, MelS, 2019.05, Code (unofficial)
Below are representative autoregressive controllable TTS methods. Each entry follows this format: method name, zero-shot capability, controllability, acoustic model, vocoder, acoustic feature, release date, and code/demo.
- EmoVoice, Zero-shot (✗), Controllability (Emotion, Description), Decoder-only Transformer, HiFi-GAN, Token, 2025.04, Demo
- Spark-TTS, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre), Decoder-only Transformer, BiCodec, Token, 2025.03, Code
- Vevo, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion), Decoder-only Transformer, BigVGAN, Token + MelS, 2025.02, Demo, Code
- Step-Audio, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion, Description), Decoder-only Transformer, Flow-based Vocoder, Token, 2025.02, Code
- FleSpeech, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Flow-based DiT, WaveGAN, Latent Feature, 2025.01, Demo
- IDEA-TTS, Zero-shot (✓), Controllability (Timbre, Environment), Transformer, Flow-based Vocoder, LinS + MelS, 2024.12, Demo, Code
- KALL-E, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Decoder-only Transformer, WaveVAE, Latent Feature, 2024.12, Demo
- IST-LM, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.12
- SLAM-Omni, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.12, Demo, Code
- FishSpeech, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Firefly-GAN, Token, 2024.11, Code
- HALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2024.10
- Takin, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer + Flow, HiFi-GAN, Token + MelS, 2024.09, Demo
- CoFi-Speech, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, BigVGAN, Token + MelS, 2024.09, Demo
- FireRedTTS, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer + Flow, BigVGAN-v2, Token + MelS, 2024.09, Demo, Code
- Emo-DPO, Zero-shot (✗), Controllability (Emotion), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.09, Demo
- VoxInstruct, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer, Vocos, Token, 2024.08, Demo, Code
- MELLE, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, HiFi-GAN, MelS, 2024.07, Demo
- CosyVoice, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer + Flow, HiFi-GAN, Token, 2024.07, Demo, Code
- XTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + GAN, HiFi-GAN-based Vocoder, Token + MelS, 2024.06, Demo, Code
- VoiceCraft, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, HiFi-GAN, Token, 2024.06, Code
- Seed-TTS, Zero-shot (✓), Controllability (Timbre, Emotion), Decoder-only Transformer + DiT, Unknown Vocoder, Latent Feature, 2024.06, Demo
- VALL-E 2, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Vocos, Token, 2024.06, Demo, Code (unofficial 1), Code (unofficial 2)
- VALL-E R, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Vocos, Token, 2024.06, Demo
- ARDiT, Zero-shot (✓), Controllability (Speed, Timbre), Decoder-only DiT, BigVGAN, MelS, 2024.06, Demo
- RALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, SoundStream, Token, 2024.05, Demo
- CLaM-TTS, Zero-shot (✓), Controllability (Timbre), Encoder-decoder Transformer, BigVGAN, Token + MelS, 2024.04, Demo
- BaseTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Speechcode Decoder, Token, 2024.02, Demo
- ELLA-V, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2024.01, Demo
- UniAudio, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Description), Decoder-only Transformer, UniAudio Codec, Token, 2023.10, Demo, Code
- Salle, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer, EnCodec, Token, 2023.08, Demo
- SC VALL-E, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion), Decoder-only Transformer, EnCodec, Token, 2023.07, Demo, Code
- MegaTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + GAN, HiFi-GAN, MelS, 2023.06, Demo
- TorToise, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + Diffusion, UnivNet, MelS, 2023.05, Code
- Make-a-voice, Zero-shot (✓), Controllability (Timbre), Encoder-decoder Transformer, Unit-based Vocoder, Token, 2023.05, Demo
- VALL-E X, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2023.03, Demo, Code (unofficial)
- SpearTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, SoundStream, Token, 2023.02, Demo, Code (unofficial)
- VALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2023.01, Demo, Code (unofficial 1), Code (unofficial 2)
- MsEmoTTS, Zero-shot (✓), Controllability (Pitch, Prosody, Emotion), CNN + RNN, WaveRNN, MelS, 2022.01, Demo
- Flowtron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, WaveGlow, MelS, 2020.07, Demo, Code
- DurIAN, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, MB-WaveRNN, MelS, 2019.09, Demo, Code (unofficial)
- VAE-Tacotron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), VAE, WaveNet, MelS, 2019.02, Code (unofficial 1), Code (unofficial 2)
- GMVAE-Tacotron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Description), VAE, WaveRNN, MelS, 2018.12, Demo, Code (unofficial)
- GST-Tacotron, Zero-shot (✗), Controllability (Pitch, Prosody), CNN + RNN, Griffin-Lim, LinS, 2018.03, Demo, Code (unofficial)
- Prosody-Tacotron, Zero-shot (✗), Controllability (Pitch, Prosody), RNN, WaveNet, MelS, 2018.03, Demo
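Because every entry in the two lists above follows the same comma-separated format, the lists can be loaded into structured records for filtering (for example, to find all zero-shot methods that support emotion control). The following is a minimal sketch, not part of the survey or this repo: the `Entry` class, its field names, and `parse_entry` are hypothetical helpers introduced here for illustration. It assumes each line starts with `- `, and it uses a regular expression because the controllability field contains commas inside its parentheses, so a plain split on commas would not work.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical pattern for one list entry; the group names are our own.
ENTRY_RE = re.compile(
    r"^- (?P<name>.+?), "
    r"Zero-shot \((?P<zero_shot>[✓✗])\), "
    r"Controllability \((?P<controls>[^)]*)\), "
    r"(?P<rest>.+)$"
)


@dataclass
class Entry:
    name: str
    zero_shot: bool
    controls: List[str]
    acoustic_model: str
    vocoder: str
    acoustic_feature: str
    release_date: str
    links: List[str]  # e.g. ["Demo", "Code"]; may be empty


def parse_entry(line: str) -> Entry:
    """Parse one list line into an Entry, or raise ValueError."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    # The remaining fields contain no nested commas, so a plain split is safe.
    rest = [part.strip() for part in match.group("rest").split(",")]
    if len(rest) < 4:
        raise ValueError(f"missing fields in entry: {line!r}")
    model, vocoder, feature, date, *links = rest
    return Entry(
        name=match.group("name"),
        zero_shot=match.group("zero_shot") == "✓",
        controls=[c.strip() for c in match.group("controls").split(",")],
        acoustic_model=model,
        vocoder=vocoder,
        acoustic_feature=feature,
        release_date=date,
        links=links,
    )


entry = parse_entry(
    "- EmoSphere++, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), "
    "Transformer + Flow, BigVGAN, MelS, 2024.11, Demo, Code"
)
print(entry.name, entry.zero_shot, entry.controls)
```

With a list of parsed entries, a filter such as `[e for e in entries if e.zero_shot and "Emotion" in e.controls]` selects the zero-shot methods with emotion control.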
A summary of open-source datasets for controllable TTS:
Dataset | Hours | #Speakers | Pit. | Ene. | Spe. | Age | Gen. | Emo. | Emp. | Acc. | Top. | Des. | Dia. | Lang | Release Time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SpeechCraft | 2,391 | 3,200 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en,zh | 2024 | ||
Parler-TTS | 50,000 | / | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2024 | |||||
MSceneSpeech | 13 | 13 | ✓ | zh | 2024 | ||||||||||
VccmDataset | 330 | 1,324 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2024 | |||||
CLESC | <1 | / | ✓ | ✓ | ✓ | ✓ | en | 2024 | |||||||
TextrolSpeech | 330 | 1,324 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2023 | |||||
DailyTalk | 20 | 2 | ✓ | ✓ | ✓ | en | 2023 | ||||||||
MagicData-RAMC | 180 | 663 | ✓ | ✓ | zh | 2022 | |||||||||
PromptSpeech | / | / | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2022 | ||||||
WenetSpeech | 10,000 | / | ✓ | zh | 2021 | ||||||||||
GigaSpeech | 10,000 | / | ✓ | en | 2021 | ||||||||||
ESD | 29 | 10 | ✓ | en,zh | 2021 | ||||||||||
CommonVoice | 2,500 | 50,000 | ✓ | ✓ | ✓ | multi | 2020 | ||||||||
AISHELL-3 | 85 | 218 | ✓ | ✓ | ✓ | zh | 2020 | ||||||||
Taskmaster-1 | / | / | ✓ | en | 2019 | ||||||||||
CMU-MOSEI | 65 | 1,000 | ✓ | en | 2018 | ||||||||||
RAVDESS | / | 24 | ✓ | ✓ | en | 2018 | |||||||||
RECOLA | 3.8 | 46 | ✓ | fr | 2013 | ||||||||||
IEMOCAP | 12 | 10 | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2008 |
Label abbreviations: Pit(ch), Ene(rgy) (i.e., volume), Spe(ed), Gen(der), Emo(tion), Emp(hasis), Acc(ent), Top(ic), Des(cription), Env(ironment), Dia(logue).
Metric | Type | Eval Target | GT Required |
---|---|---|---|
Mel-Cepstral Distortion (MCD) | Objective | Acoustic similarity | ✓ |
Frequency Domain Score Difference (FDSD) | Objective | Acoustic similarity | ✓ |
Word Error Rate (WER) | Objective | Intelligibility | ✓ |
Cosine Similarity | Objective | Speaker similarity | ✓ |
Perceptual Evaluation of Speech Quality (PESQ) | Objective | Perceptual quality | ✓ |
Signal-to-Noise Ratio (SNR) | Objective | Perceptual quality | ✓ |
Mean Opinion Score (MOS) | Subjective | Preference | |
Comparison Mean Opinion Score (CMOS) | Subjective | Preference | |
AB Test | Subjective | Preference | |
ABX Test | Subjective | Perceptual similarity | ✓ |
GT: ground truth.
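To make the objective metrics in the table more concrete, here is a minimal pure-Python sketch of four of them, written from their commonly used textbook definitions rather than taken from the survey. The function names are our own, and the sketch rests on several assumptions: WER operates on whitespace-split words with no text normalization, cosine similarity takes speaker embeddings that some speaker verification model (not shown) has already produced, SNR treats the sample-wise difference between the reference and the degraded signal as noise, and MCD expects mel-cepstral frames that are already time-aligned (for example by dynamic time warping) and skips the 0th (energy) coefficient. Evaluation toolkits differ on these details, so absolute values may not match published numbers.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def snr_db(reference, degraded) -> float:
    """SNR in dB, treating (reference - degraded) as the noise signal."""
    if len(reference) != len(degraded):
        raise ValueError("signals must have the same length")
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, degraded))
    if noise_power == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


def mel_cepstral_distortion(ref_frames, syn_frames) -> float:
    """Mean MCD in dB over pre-aligned frames, skipping coefficient 0."""
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        total += math.sqrt(sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:])))
    return scale * total / len(ref_frames)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

All four functions need a ground-truth reference, which matches the "GT Required" column; the subjective metrics (MOS, CMOS, AB, ABX) are collected from human listeners and have no comparable closed-form computation.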