Awesome Controllable Speech Synthesis

This is an evolving repo for the survey: Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey. If you find our survey useful for your research, please 📚cite📚 the following paper:

@article{xie2024towards,
  title={Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey},
  author={Xie, Tianxin and Rong, Yan and Zhang, Pengfei and Wang, Wenwu and Liu, Li},
  journal={arXiv preprint arXiv:2412.06602},
  year={2024}
}

If you find any mistakes, please don’t hesitate to open an issue.

🚀 Non-autoregressive Controllable TTS

Below are representative non-autoregressive controllable TTS methods. Each entry follows this format: method name, zero-shot capability, controllability, acoustic model, vocoder, acoustic feature, release date, and code/demo.

NOTE: MelS and LinS represent Mel Spectrogram and Linear Spectrogram, respectively. Among today’s TTS systems, MelS, latent features (from VAEs, diffusion models, and other flow-based methods), and various types of discrete tokens are the most commonly used acoustic representations.

ProEmo, Zero-shot (✗), Controllability (Pitch, Energy, Emotion, Description), Transformer, HiFi-GAN, MelS, 2025.01, Code
DrawSpeech, Zero-shot (✗), Controllability (Energy, Prosody), Diffusion, HiFi-GAN, MelS, 2025.01, Demo, Code
DiffStyleTTS, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre), Transformer + Diffusion, HiFi-GAN, MelS, 2025.01, Demo
HED, Zero-shot (✓), Controllability (Emotion), Flow-based Diffusion, Vocos, MelS, 2024.12, Demo
EmoDubber, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Transformer + Flow, Flow-based Vocoder, MelS, 2024.12, Demo
EmoSphere++, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Transformer + Flow, BigVGAN, MelS, 2024.11, Demo, Code
MS$^{2}$KU-VTTS, Zero-shot (✗), Controllability (Environment, Description), Diffusion, BigVGAN, MelS, 2024.10
NanoVoice, Zero-shot (✓), Controllability (Timbre), Diffusion, BigVGAN, MelS, 2024.09
NansyTTS, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Description), Transformer, NANSY++, MelS, 2024.09, Demo
StyleTTS-ZS, Zero-shot (✓), Controllability (Timbre), Flow-based Diffusion + GAN, Mel-based Decoder, MelS, 2024.09, Demo
E1 TTS, Zero-shot (✓), Controllability (Timbre), DiT + Flow, BigVGAN, Token + MelS, 2024.09, Demo
SimpleSpeech 2, Zero-shot (✓), Controllability (Speed, Timbre), Flow-based DiT, SQ Codec, Token, 2024.08, Demo, Code
CCSP, Zero-shot (✓), Controllability (Timbre), Diffusion, RVQ-based Codec, Token, 2024.07, Demo
ArtSpeech, Zero-shot (✓), Controllability (Timbre), RNN + CNN, HiFi-GAN, MelS, 2024.07, Demo, Code
DEX-TTS, Zero-shot (✓), Controllability (Timbre), Diffusion, HiFi-GAN, MelS, 2024.06, Code
MobileSpeech, Zero-shot (✓), Controllability (Timbre), Transformer, Vocos, Token, 2024.06, Demo
E2 TTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, BigVGAN, MelS, 2024.06, Demo, Code (unofficial)
DiTTo-TTS, Zero-shot (✓), Controllability (Speed, Timbre), DiT + VAE, BigVGAN, MelS, 2024.06, Demo
SimpleSpeech, Zero-shot (✓), Controllability (Timbre), Transformer + Diffusion, SQ Codec, Token, 2024.06, Demo, Code
AST-LDM, Zero-shot (✗), Controllability (Timbre, Environment, Description), Diffusion + VAE, HiFi-GAN, MelS, 2024.06, Demo
ControlSpeech, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, FACodec, Token, 2024.06, Demo, Code
InstructTTS, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, HiFi-GAN, Token, 2024.05, Demo
NaturalSpeech 3, Zero-shot (✓), Controllability (Speed, Prosody, Timbre), Transformer + Diffusion, FACodec, Token, 2024.04, Demo
FlashSpeech, Zero-shot (✓), Controllability (Timbre), Latent Consistency Model, EnCodec, Token, 2024.04, Demo, Code
Audiobox, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Environment, Description), Transformer + Flow, EnCodec, MelS, 2023.12, Demo
HierSpeech++, Zero-shot (✓), Controllability (Timbre), Transformer + VAE + Flow, BigVGAN, MelS, 2023.11, Demo, Code
E3 TTS, Zero-shot (✓), Controllability (Timbre), Diffusion, Not required, Waveform, 2023.11, Demo
P-Flow, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.10, Demo, Code (unofficial)
SpeechFlow, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.10, Demo
PromptTTS++, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, BigVGAN, MelS, 2023.09, Demo, Code
DurIAN-E, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, HiFi-GAN, MelS, 2023.09, Demo
VoiceLDM, Zero-shot (✗), Controllability (Pitch, Prosody, Timbre, Emotion, Environment, Description), Diffusion, HiFi-GAN, MelS, 2023.09, Demo, Code
PromptTTS 2, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Description), Diffusion, RVQ-based Codec, Latent Feature, 2023.09, Demo
MegaTTS 2, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Decoder-only Transformer + GAN, HiFi-GAN, MelS, 2023.07, Demo, Code (unofficial)
VoiceBox, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.06, Demo, Code (unofficial)
StyleTTS 2, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Flow-based Diffusion + GAN, HiFi-GAN / iSTFTNet, MelS, 2023.06, Demo, Code
PromptStyle, Zero-shot (✓), Controllability (Pitch, Prosody, Timbre, Emotion, Description), VITS + Flow, HiFi-GAN, MelS, 2023.05, Demo
NaturalSpeech 2, Zero-shot (✓), Controllability (Timbre), Diffusion, RVQ-based Codec, Token, 2023.04, Demo, Code (unofficial)
Grad-StyleSpeech, Zero-shot (✓), Controllability (Timbre), Score-based Diffusion, HiFi-GAN, MelS, 2022.11, Demo
PromptTTS, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), BERT + Transformer, HiFi-GAN, MelS, 2022.11, Demo
CLONE, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), Transformer + CNN, WaveNet, MelS + LinS, 2022.07, Demo
Cauliflow, Zero-shot (✗), Controllability (Speed, Prosody), BERT + Flow, UP WaveNet, MelS, 2022.06
GenerSpeech, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2022.05, Demo
StyleTTS, Zero-shot (✓), Controllability (Timbre), CNN + RNN, HiFi-GAN, MelS, 2022.05, Code
YourTTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, LinS, 2021.12, Demo & Checkpoint
DelightfulTTS, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), Transformer + CNN, HiFiNet, MelS, 2021.11, Demo
Meta-StyleSpeech, Zero-shot (✓), Controllability (Timbre), Transformer, MelGAN, MelS, 2021.06, Code
SC-GlowTTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2021.06, Demo, Code
StyleTagging-TTS, Zero-shot (✓), Controllability (Timbre, Emotion), Transformer + CNN, HiFi-GAN, MelS, 2021.04, Demo
Parallel Tacotron, Zero-shot (✗), Controllability (Prosody), Transformer + CNN, WaveRNN, MelS, 2020.10, Demo
FastPitch, Zero-shot (✗), Controllability (Pitch, Prosody), Transformer, WaveGlow, MelS, 2020.06, Code
FastSpeech 2, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody), Transformer, Parallel WaveGAN, MelS, 2020.06, Code (unofficial)
FastSpeech, Zero-shot (✗), Controllability (Speed, Prosody), Transformer, WaveGlow, MelS, 2019.05, Code (unofficial)
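
Because every entry above follows the same comma-separated format, the list can be loaded into structured records for filtering or sorting. The following is a minimal sketch, not tooling shipped with this repo: the `Entry` class and `parse_entry` function are hypothetical names, and the parser assumes that method names and field values contain no commas outside the `Controllability (...)` group.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    """One list entry; field names are illustrative, not defined by the repo."""
    name: str
    zero_shot: bool
    controls: List[str]
    acoustic_model: str
    vocoder: str
    feature: str
    date: str
    links: List[str]


# Name, zero-shot flag, and the parenthesised control list come first;
# everything after that is a plain comma-separated tail.
ENTRY_RE = re.compile(
    r"^(?P<name>.+?), Zero-shot \((?P<zs>[✓✗])\), "
    r"Controllability \((?P<ctrl>[^)]*)\), (?P<rest>.+)$"
)


def parse_entry(line: str) -> Entry:
    """Parse one entry line into an Entry, or raise ValueError."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {line!r}")
    rest = [part.strip() for part in match["rest"].split(",")]
    if len(rest) < 4:
        raise ValueError(f"too few fields in entry: {line!r}")
    # Anything after the release date is treated as a code/demo label.
    model, vocoder, feature, date, *links = rest
    return Entry(
        name=match["name"],
        zero_shot=match["zs"] == "✓",
        controls=[c.strip() for c in match["ctrl"].split(",") if c.strip()],
        acoustic_model=model,
        vocoder=vocoder,
        feature=feature,
        date=date,
        links=links,
    )


if __name__ == "__main__":
    line = ("FastSpeech 2, Zero-shot (✗), Controllability (Pitch, Energy, "
            "Speed, Prosody), Transformer, Parallel WaveGAN, MelS, 2020.06, "
            "Code (unofficial)")
    print(parse_entry(line))
```

Entries without code or demo links simply yield an empty `links` list.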

🎞️ Autoregressive Controllable TTS

Below are representative autoregressive controllable TTS methods. Each entry follows this format: method name, zero-shot capability, controllability, acoustic model, vocoder, acoustic feature, release date, and code/demo.

NOTE: MelS and LinS represent Mel Spectrogram and Linear Spectrogram, respectively. Among today’s TTS systems, MelS, latent features (from VAEs, diffusion models, and other flow-based methods), and various types of discrete tokens are the most commonly used acoustic representations.

EmoVoice, Zero-shot (✗), Controllability (Emotion, Description), Decoder-only Transformer, HiFi-GAN, Token, 2025.04, Demo
Spark-TTS, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre), Decoder-only Transformer, BiCodec, Token, 2025.03, Code
Vevo, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion), Decoder-only Transformer, BigVGAN, Token + MelS, 2025.02, Demo, Code
Step-Audio, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion, Description), Decoder-only Transformer, Flow-based Vocoder, Token, 2025.02, Code
FleSpeech, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Flow-based DiT, WaveGAN, Latent Feature, 2025.01, Demo
IDEA-TTS, Zero-shot (✓), Controllability (Timbre, Environment), Transformer, Flow-based Vocoder, LinS + MelS, 2024.12, Demo, Code
KALL-E, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Decoder-only Transformer, WaveVAE, Latent Feature, 2024.12, Demo
IST-LM, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.12
SLAM-Omni, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.12, Demo, Code
FishSpeech, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Firefly-GAN, Token, 2024.11, Code
HALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2024.10
Takin, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer + Flow, HiFi-GAN, Token + MelS, 2024.09, Demo
CoFi-Speech, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, BigVGAN, Token + MelS, 2024.09, Demo
FireRedTTS, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer + Flow, BigVGAN-v2, Token + MelS, 2024.09, Demo, Code
Emo-DPO, Zero-shot (✗), Controllability (Emotion), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.09, Demo
VoxInstruct, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer, Vocos, Token, 2024.08, Demo, Code
MELLE, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, HiFi-GAN, MelS, 2024.07, Demo
CosyVoice, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer + Flow, HiFi-GAN, Token, 2024.07, Demo, Code
XTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + GAN, HiFi-GAN-based Vocoder, Token + MelS, 2024.06, Demo, Code
VoiceCraft, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, HiFi-GAN, Token, 2024.06, Code
Seed-TTS, Zero-shot (✓), Controllability (Timbre, Emotion), Decoder-only Transformer + DiT, Unknown Vocoder, Latent Feature, 2024.06, Demo
VALL-E 2, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Vocos, Token, 2024.06, Demo, Code (unofficial 1), Code (unofficial 2)
VALL-E R, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Vocos, Token, 2024.06, Demo
ARDiT, Zero-shot (✓), Controllability (Speed, Timbre), Decoder-only DiT, BigVGAN, MelS, 2024.06, Demo
RALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, SoundStream, Token, 2024.05, Demo
CLaM-TTS, Zero-shot (✓), Controllability (Timbre), Encoder-decoder Transformer, BigVGAN, Token + MelS, 2024.04, Demo
BaseTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Speechcode Decoder, Token, 2024.02, Demo
ELLA-V, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2024.01, Demo
UniAudio, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Description), Decoder-only Transformer, UniAudio Codec, Token, 2023.10, Demo, Code
Salle, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer, EnCodec, Token, 2023.08, Demo
SC VALL-E, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion), Decoder-only Transformer, EnCodec, Token, 2023.07, Demo, Code
MegaTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + GAN, HiFi-GAN, MelS, 2023.06, Demo
TorToise, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + Diffusion, UnivNet, MelS, 2023.05, Code
Make-a-voice, Zero-shot (✓), Controllability (Timbre), Encoder-decoder Transformer, Unit-based Vocoder, Token, 2023.05, Demo
VALL-E X, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2023.03, Demo, Code (unofficial)
SpearTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, SoundStream, Token, 2023.02, Demo, Code (unofficial)
VALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2023.01, Demo, Code (unofficial 1), Code (unofficial 2)
MsEmoTTS, Zero-shot (✓), Controllability (Pitch, Prosody, Emotion), CNN + RNN, WaveRNN, MelS, 2022.01, Demo
Flowtron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, WaveGlow, MelS, 2020.07, Demo, Code
DurIAN, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, MB-WaveRNN, MelS, 2019.09, Demo, Code (unofficial)
VAE-Tacotron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), VAE, WaveNet, MelS, 2019.02, Code (unofficial 1), Code (unofficial 2)
GMVAE-Tacotron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Description), VAE, WaveRNN, MelS, 2018.12, Demo, Code (unofficial)
GST-Tacotron, Zero-shot (✗), Controllability (Pitch, Prosody), CNN + RNN, Griffin-Lim, LinS, 2018.03, Demo, Code (unofficial)
Prosody-Tacotron, Zero-shot (✗), Controllability (Pitch, Prosody), RNN, WaveNet, MelS, 2018.03, Demo

💾 Datasets

A summary of open-source datasets for controllable TTS. The columns from Pit. to Dia. are label types, and ✓ means the dataset provides that label:

Dataset	Hours	#Speakers	Pit.	Ene.	Spe.	Age	Gen.	Emo.	Emp.	Acc.	Top.	Des.	Dia.	Lang	Release Time
SpeechCraft	2,391	3,200	✓	✓	✓	✓	✓	✓	✓		✓	✓		en,zh	2024
Parler-TTS	50,000	/	✓		✓		✓	✓		✓		✓		en	2024
MSceneSpeech	13	13									✓			zh	2024
VccmDataset	330	1,324	✓	✓	✓		✓	✓				✓		en	2024
CLESC	<1	/	✓	✓	✓			✓						en	2024
TextrolSpeech	330	1,324	✓	✓	✓		✓	✓				✓		en	2023
DailyTalk	20	2						✓			✓		✓	en	2023
MagicData-RAMC	180	663									✓		✓	zh	2022
PromptSpeech	/	/	✓	✓	✓			✓				✓		en	2022
WenetSpeech	10,000	/									✓			zh	2021
GigaSpeech	10,000	/									✓			en	2021
ESD	29	10						✓						en,zh	2021
CommonVoice	2,500	50,000				✓	✓			✓				multi	2020
AISHELL-3	85	218				✓	✓			✓				zh	2020
Taskmaster-1	/	/											✓	en	2019
CMU-MOSEI	65	1,000						✓						en	2018
RAVDESS	/	24				✓		✓						en	2018
RECOLA	3.8	46						✓						fr	2013
IEMOCAP	12	10	✓	✓	✓		✓	✓						en	2008

Abbreviations: Pit(ch), Ene(rgy)=volume, Spe(ed), Gen(der), Emo(tion), Emp(hasis), Acc(ent), Top(ic), Des(cription), Env(ironment), Dia(logue).
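
When choosing a dataset for a particular control task, it can help to treat each table row as a set of labels and filter on the labels you need. The sketch below is illustrative only: `DATASETS` and `datasets_with` are hypothetical names, and the four rows were transcribed by hand from the table above, so check them against the table before relying on them.

```python
# Label sets for a few rows of the table, keyed by dataset name.
# Transcribed by hand for illustration; the table remains the reference.
DATASETS = {
    "SpeechCraft": {"Pit", "Ene", "Spe", "Age", "Gen", "Emo", "Emp", "Top", "Des"},
    "TextrolSpeech": {"Pit", "Ene", "Spe", "Gen", "Emo", "Des"},
    "DailyTalk": {"Emo", "Top", "Dia"},
    "ESD": {"Emo"},
}


def datasets_with(required):
    """Return the names of datasets that provide every label in `required`."""
    wanted = set(required)
    return sorted(name for name, labels in DATASETS.items() if wanted <= labels)


if __name__ == "__main__":
    # Datasets usable for emotion control driven by text descriptions.
    print(datasets_with({"Emo", "Des"}))
```

A set-based representation keeps the query a single subset test, so adding rows or labels does not change the filtering code.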

📏 Evaluation Metrics

Metric	Type	Eval Target	GT Required
Mel-Cepstral Distortion (MCD) $\downarrow$	Objective	Acoustic similarity	✓
Frequency Domain Score Difference (FDSD) $\downarrow$	Objective	Acoustic similarity	✓
Word Error Rate (WER) $\downarrow$	Objective	Intelligibility	✓
Cosine Similarity $\uparrow$	Objective	Speaker similarity	✓
Perceptual Evaluation of Speech Quality (PESQ) $\uparrow$	Objective	Perceptual quality	✓
Signal-to-Noise Ratio (SNR) $\uparrow$	Objective	Perceptual quality	✓
Mean Opinion Score (MOS) $\uparrow$	Subjective	Preference
Comparison Mean Opinion Score (CMOS) $\uparrow$	Subjective	Preference
AB Test	Subjective	Preference
ABX Test	Subjective	Perceptual similarity	✓

GT: Ground truth, $\downarrow$: Lower is better, $\uparrow$: Higher is better.
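
To make three of the objective metrics concrete, here is a minimal pure-Python sketch of WER, cosine similarity, and SNR. It rests on assumptions not stated in the table: WER is computed on whitespace-separated words, with the hypothesis typically coming from an ASR transcript of the synthesized speech; cosine similarity is applied to fixed-length speaker embeddings from a speaker encoder of your choice; and SNR assumes a time-aligned clean reference signal of the same length. The function names are illustrative and do not come from any toolkit.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length (lower is better)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors (higher is better)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def snr_db(clean, degraded) -> float:
    """SNR in dB, treating (degraded - clean) as the noise (higher is better)."""
    if len(clean) != len(degraded):
        raise ValueError("signals must have the same length")
    signal_power = sum(x * x for x in clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, degraded))
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(cosine_similarity([1.0, 0.0], [0.5, 0.5]))
    print(snr_db([1.0, -1.0], [1.1, -0.9]))
```

MCD, FDSD, and PESQ need spectral features or standardized reference implementations, so they are left out of this sketch.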
