This is an evolving repo for the paper "Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey".

imxtx/awesome-controllable-speech-synthesis

Awesome Controllable Speech Synthesis

This is an evolving repo for the survey: Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey. If you find our survey useful for your research, please 📚cite📚 the following paper:

@article{xie2024towards,
  title={Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey},
  author={Xie, Tianxin and Rong, Yan and Zhang, Pengfei and Wang, Wenwu and Liu, Li},
  journal={arXiv preprint arXiv:2412.06602},
  year={2024}
}

If you find any mistakes, please don’t hesitate to open an issue.

🚀 Non-autoregressive Controllable TTS

Below are representative non-autoregressive controllable TTS methods. Each entry follows this format: method name, zero-shot capability, controllability, acoustic model, vocoder, acoustic feature, release date, and code/demo.

NOTE: MelS and LinS represent Mel Spectrogram and Linear Spectrogram, respectively. Among today’s TTS systems, MelS, latent features (from VAEs, diffusion models, and other flow-based methods), and various types of discrete tokens are the most commonly used acoustic representations.

🎞️ Autoregressive Controllable TTS

Below are representative autoregressive controllable TTS methods. Each entry follows this format: method name, zero-shot capability, controllability, acoustic model, vocoder, acoustic feature, release date, and code/demo.

NOTE: MelS and LinS represent Mel Spectrogram and Linear Spectrogram, respectively. Among today’s TTS systems, MelS, latent features (from VAEs, diffusion models, and other flow-based methods), and various types of discrete tokens are the most commonly used acoustic representations.

  • EmoVoice, Zero-shot (✗), Controllability (Emotion, Description), Decoder-only Transformer, HiFi-GAN, Token, 2025.04, Demo
  • Spark-TTS, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre), Decoder-only Transformer, BiCodec, Token, 2025.03, Code
  • Vevo, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion), Decoder-only Transformer, BigVGAN, Token + MelS, 2025.02, Demo, Code
  • Step-Audio, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion, Description), Decoder-only Transformer, Flow-based Vocoder, Token, 2025.02, Code
  • FleSpeech, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Flow-based DiT, WaveGAN, Latent Feature, 2025.01, Demo
  • IDEA-TTS, Zero-shot (✓), Controllability (Timbre, Environment), Transformer, Flow-based Vocoder, LinS + MelS, 2024.12, Demo, Code
  • KALL-E, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Decoder-only Transformer, WaveVAE, Latent Feature, 2024.12, Demo
  • IST-LM, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.12
  • SLAM-Omni, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.12, Demo, Code
  • FishSpeech, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Firefly-GAN, Token, 2024.11, Code
  • HALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2024.10
  • Takin, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer + Flow, HiFi-GAN, Token + MelS, 2024.09, Demo
  • CoFi-Speech, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, BigVGAN, Token + MelS, 2024.09, Demo
  • FireRedTTS, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer + Flow, BigVGAN-v2, Token + MelS, 2024.09, Demo, Code
  • Emo-DPO, Zero-shot (✗), Controllability (Emotion), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.09, Demo
  • VoxInstruct, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer, Vocos, Token, 2024.08, Demo, Code
  • MELLE, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, HiFi-GAN, MelS, 2024.07, Demo
  • CosyVoice, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer + Flow, HiFi-GAN, Token, 2024.07, Demo, Code
  • XTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + GAN, HiFi-GAN-based Vocoder, Token + MelS, 2024.06, Demo, Code
  • VoiceCraft, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, HiFi-GAN, Token, 2024.06, Code
  • Seed-TTS, Zero-shot (✓), Controllability (Timbre, Emotion), Decoder-only Transformer + DiT, Unknown Vocoder, Latent Feature, 2024.06, Demo
  • VALL-E 2, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Vocos, Token, 2024.06, Demo, Code (unofficial 1), Code (unofficial 2)
  • VALL-E R, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Vocos, Token, 2024.06, Demo
  • ARDiT, Zero-shot (✓), Controllability (Speed, Timbre), Decoder-only DiT, BigVGAN, MelS, 2024.06, Demo
  • RALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, SoundStream, Token, 2024.05, Demo
  • CLaM-TTS, Zero-shot (✓), Controllability (Timbre), Encoder-decoder Transformer, BigVGAN, Token + MelS, 2024.04, Demo
  • BaseTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Speechcode Decoder, Token, 2024.02, Demo
  • ELLA-V, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2024.01, Demo
  • UniAudio, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Description), Decoder-only Transformer, UniAudio Codec, Token, 2023.10, Demo, Code
  • Salle, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer, EnCodec, Token, 2023.08, Demo
  • SC VALL-E, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion), Decoder-only Transformer, EnCodec, Token, 2023.07, Demo, Code
  • MegaTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + GAN, HiFi-GAN, MelS, 2023.06, Demo
  • TorToise, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + Diffusion, UnivNet, MelS, 2023.05, Code
  • Make-a-voice, Zero-shot (✓), Controllability (Timbre), Encoder-decoder Transformer, Unit-based Vocoder, Token, 2023.05, Demo
  • VALL-E X, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2023.03, Demo, Code (unofficial)
  • SpearTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, SoundStream, Token, 2023.02, Demo, Code (unofficial)
  • VALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2023.01, Demo, Code (unofficial 1), Code (unofficial 2)
  • MsEmoTTS, Zero-shot (✓), Controllability (Pitch, Prosody, Emotion), CNN + RNN, WaveRNN, MelS, 2022.01, Demo
  • Flowtron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, WaveGlow, MelS, 2020.07, Demo, Code
  • DurIAN, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, MB-WaveRNN, MelS, 2019.09, Demo, Code (unofficial)
  • VAE-Tacotron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), VAE, WaveNet, MelS, 2019.02, Code (unofficial 1), Code (unofficial 2)
  • GMVAE-Tacotron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Description), VAE, WaveRNN, MelS, 2018.12, Demo, Code (unofficial)
  • GST-Tacotron, Zero-shot (✗), Controllability (Pitch, Prosody), CNN + RNN, Griffin-Lim, LinS, 2018.03, Demo, Code (unofficial)
  • Prosody-Tacotron, Zero-shot (✗), Controllability (Pitch, Prosody), RNN, WaveNet, MelS, 2018.03, Demo
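
The entry format described above is regular enough to parse mechanically. The following is a minimal sketch, not part of the survey or this repo: the `Entry` class and the `split_top_level` and `parse_entry` helpers are hypothetical names introduced for illustration. It assumes each entry is comma-separated in the documented order and that commas inside parentheses belong to a single field.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One list entry; field names are illustrative, not from the repo."""
    name: str
    zero_shot: bool
    controls: list
    acoustic_model: str
    vocoder: str
    feature: str
    date: str
    links: list = field(default_factory=list)


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_entry(line):
    """Parse one bullet line into an Entry (assumes the documented field order)."""
    fields = split_top_level(line.strip().lstrip("•").strip())
    name, zero, ctrl, model, vocoder, feature, date, *links = fields
    inside = re.search(r"\((.*)\)", ctrl).group(1)
    return Entry(
        name=name,
        zero_shot="✓" in zero,
        controls=[c.strip() for c in inside.split(",")],
        acoustic_model=model,
        vocoder=vocoder,
        feature=feature,
        date=date,
        links=links,
    )


entry = parse_entry(
    "  • Spark-TTS, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre), "
    "Decoder-only Transformer, BiCodec, Token, 2025.03, Code"
)
print(entry.name, entry.zero_shot, entry.controls)
```

With entries parsed this way, filtering the list (for example, all zero-shot methods that control emotion) becomes a one-line comprehension over the `controls` and `zero_shot` fields.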

💾 Datasets

A summary of open-source datasets for controllable TTS:

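| Dataset | Hours | #Speakers | Lang | Release Time |
|---|---|---|---|---|
| SpeechCraft | 2,391 | 3,200 | en,zh | 2024 |
| Parler-TTS | 50,000 | / | en | 2024 |
| MSceneSpeech | 13 | 13 | zh | 2024 |
| VccmDataset | 330 | 1,324 | en | 2024 |
| CLESC | <1 | / | en | 2024 |
| TextrolSpeech | 330 | 1,324 | en | 2023 |
| DailyTalk | 20 | 2 | en | 2023 |
| MagicData-RAMC | 180 | 663 | zh | 2022 |
| PromptSpeech | / | / | en | 2022 |
| WenetSpeech | 10,000 | / | zh | 2021 |
| GigaSpeech | 10,000 | / | en | 2021 |
| ESD | 29 | 10 | en,zh | 2021 |
| CommonVoice | 2,500 | 50,000 | multi | 2020 |
| AISHELL-3 | 85 | 218 | zh | 2020 |
| Taskmaster-1 | / | / | en | 2019 |
| CMU-MOSEI | 65 | 1,000 | en | 2018 |
| RAVDESS | / | 24 | en | 2018 |
| RECOLA | 3.8 | 46 | fr | 2013 |
| IEMOCAP | 12 | 10 | en | 2008 |

Label types tracked for these datasets: Pit., Ene., Spe., Age, Gen., Emo., Emp., Acc., Top., Des., Dia.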

Abbreviations: Pit(ch), Ene(rgy)=volume, Spe(ed), Gen(der), Emo(tion), Emp(hasis), Acc(ent), Top(ic), Des(cription), Env(ironment), Dia(logue).
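
When shortlisting datasets from the summary above, it can help to treat the rows as records. The sketch below is illustrative only: `ROWS` copies a handful of rows from the table, `parse_count` and `select` are hypothetical helpers, and it assumes that `/` means the value is not reported and that `<1` can be treated as an upper bound of 1.

```python
# A few rows copied from the dataset summary: (name, hours, speakers, langs, year).
ROWS = [
    ("SpeechCraft", "2,391", "3,200", "en,zh", 2024),
    ("CLESC", "<1", "/", "en", 2024),
    ("WenetSpeech", "10,000", "/", "zh", 2021),
    ("ESD", "29", "10", "en,zh", 2021),
    ("AISHELL-3", "85", "218", "zh", 2020),
]


def parse_count(cell):
    """Convert a table cell to a float; '/' (assumed: not reported) gives None."""
    cell = cell.strip()
    if cell == "/":
        return None
    # Assumption: '<1' is read as its upper bound, 1.0.
    return float(cell.lstrip("<").replace(",", ""))


def select(rows, lang, min_hours):
    """Names of datasets that cover `lang` and report at least `min_hours` hours."""
    return [
        name
        for name, hours, _speakers, langs, _year in rows
        if lang in langs.split(",") and (parse_count(hours) or 0) >= min_hours
    ]


print(select(ROWS, "zh", 50))
```

Datasets with unreported hours are excluded by any positive threshold here; a different policy (for example, keeping them for manual review) may suit some uses better.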

📏 Evaluation Metrics

| Metric | Type | Eval Target | GT Required |
|---|---|---|---|
| Mel-Cepstral Distortion (MCD) $\downarrow$ | Objective | Acoustic similarity | |
| Frequency Domain Score Difference (FDSD) $\downarrow$ | Objective | Acoustic similarity | |
| Word Error Rate (WER) $\downarrow$ | Objective | Intelligibility | |
| Cosine Similarity $\uparrow$ | Objective | Speaker similarity | |
| Perceptual Evaluation of Speech Quality (PESQ) $\uparrow$ | Objective | Perceptual quality | |
| Signal-to-Noise Ratio (SNR) $\uparrow$ | Objective | Perceptual quality | |
| Mean Opinion Score (MOS) $\uparrow$ | Subjective | Preference | |
| Comparison Mean Opinion Score (CMOS) $\uparrow$ | Subjective | Preference | |
| AB Test | Subjective | Preference | |
| ABX Test | Subjective | Perceptual similarity | |

GT: Ground truth, $\downarrow$: Lower is better, $\uparrow$: Higher is better.
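
To make three of the objective metrics concrete, here is a minimal pure-Python sketch of their standard textbook definitions. It is not the evaluation code used by the survey or by any listed system. The function names are illustrative, and the assumptions are: WER is computed on whitespace-split words with a non-empty reference, speaker embeddings come from some external speaker encoder, and MCD frames are already time-aligned with the 0th cepstral coefficient excluded.

```python
import math


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (higher = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over aligned frames, skipping coefficient 0 (energy)."""
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    dists = [
        math.sqrt(sum((x - y) ** 2 for x, y in zip(r[1:], s[1:])))
        for r, s in zip(ref_frames, syn_frames)
    ]
    return const * sum(dists) / len(dists)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(mel_cepstral_distortion([[0.0, 1.0, 0.0]], [[5.0, 0.0, 0.0]]))
```

In practice the hypothesis transcript for WER would come from an ASR model run on the synthesized audio, and MCD would typically be computed after aligning frames (for example with dynamic time warping); both steps are outside this sketch.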
