Abstract
Expressive text-to-speech aims to generate high-quality samples with rich and diverse prosody, but it is hampered by two challenges: 1) prosodic attributes in highly dynamic voices are difficult to capture and model without intonation; and 2) highly multimodal prosodic representations cannot be learned well with simple regression (e.g., MSE) objectives, which leads to blurry, over-smoothed predictions. This paper proposes Prosody-TTS, a two-stage pipeline that enhances prosody modeling and sampling with two components: 1) a self-supervised masked autoencoder that models the prosodic representation without relying on text transcriptions or local prosody attributes, which allows it to cover diverse speaking voices with superior generalization; and 2) a diffusion model that samples diverse prosodic patterns within the latent space, which prevents TTS models from generating samples with dull prosodic performance. Experimental results show that Prosody-TTS achieves a new state of the art in text-to-speech, with natural and expressive synthesis. Both subjective and objective evaluations demonstrate that it delivers superior audio quality and prosody naturalness, with rich and diverse prosodic attributes. Audio samples are available at https://improve-prosody.github.io/
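The over-smoothing problem named in the abstract can be illustrated with a toy example. The sketch below is not part of Prosody-TTS; the pitch values, the two "readings", and all variable names are hypothetical and chosen only for illustration. It shows that when the targets for the same input are bimodal, the prediction that minimizes MSE is their mean, which lies between the two modes and matches neither, whereas a sampled value stays on one of the modes.

```python
import random
import statistics

random.seed(0)

# Hypothetical bimodal "pitch" targets (in Hz) for one and the same text:
# a low-pitched reading and a high-pitched reading. Values are made up.
low = [random.gauss(120.0, 5.0) for _ in range(500)]
high = [random.gauss(220.0, 5.0) for _ in range(500)]
targets = low + high


def mse(pred, ys):
    """Mean squared error of a single constant prediction against all targets."""
    return sum((pred - y) ** 2 for y in ys) / len(ys)


# The constant prediction that minimizes MSE is the sample mean.
mse_optimal = statistics.fmean(targets)

# Distance from the MSE-optimal prediction to the nearest observed target:
# large, because the mean falls in the empty region between the two modes.
gap_regression = min(abs(mse_optimal - y) for y in targets)

# Sampling instead of regressing (here: drawing one observed target at random)
# always lands on one of the modes.
sampled = random.choice(targets)
gap_sampled = min(abs(sampled - y) for y in targets)

print(f"MSE-optimal prediction: {mse_optimal:.1f} Hz")
print(f"distance to nearest real target (regression): {gap_regression:.1f} Hz")
print(f"distance to nearest real target (sampling): {gap_sampled:.1f} Hz")
```

In this toy setting the regression output has the lowest squared error yet corresponds to no reading that actually occurs, which is the intuition behind replacing a regression objective with a generative sampler.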
Comparison with other models
Diverse Multi-Speaker LibriTTS
Text: There was not a worse vagabond in Shrewsbury than old Barney the piper.
GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS |
---|---|---|---|---|---|---|---|
Text: He spoke instead, in a light tone, as his pen still ran along.
GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS |
---|---|---|---|---|---|---|---|
Text: The young Englishmen were introduced to everybody, entertained by everybody, intimate with everybody.
GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS |
---|---|---|---|---|---|---|---|
Text: In a few days I had so far recovered my health that I could sit up all day, and walk out sometimes.
GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS |
---|---|---|---|---|---|---|---|
Single-Speaker LJSpeech
Text: the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.
GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS |
---|---|---|---|---|---|---|---|
Text: printing, then, for our purpose, may be considered as the art of making books by means of movable types.
GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS |
---|---|---|---|---|---|---|---|
Ablation Study
Text: There was not a worse vagabond in Shrewsbury than old Barney the piper.
Prosody-MAE | w/o LDM | w/o VQ | Local Prosody | Variational Inference |
---|---|---|---|---|
Text: In a few days I had so far recovered my health that I could sit up all day, and walk out sometimes.
Prosody-MAE | w/o LDM | w/o VQ | Local Prosody | Variational Inference |
---|---|---|---|---|