Abstract

Expressive text-to-speech aims to generate high-quality samples with rich and diverse prosody, but it is hampered by two challenges: 1) prosodic attributes in highly dynamic voices are difficult to capture and model without intonation; and 2) highly multimodal prosodic representations cannot be learned well by simple regression objectives (e.g., MSE), which leads to blurry, over-smoothed predictions. This paper proposes Prosody-TTS, a two-stage pipeline that enhances prosody modeling and sampling with two components: 1) a self-supervised masked autoencoder that models the prosodic representation without relying on text transcriptions or local prosody attributes, which allows it to cover diverse speaking voices with superior generalization; and 2) a diffusion model that samples diverse prosodic patterns within the latent space, which prevents TTS models from generating samples with dull prosody. Experimental results show that Prosody-TTS achieves new state-of-the-art results in text-to-speech, with natural and expressive synthesis. Both subjective and objective evaluations demonstrate that it achieves superior audio quality and prosody naturalness, with rich and diverse prosodic attributes. Audio samples are available at https://improve-prosody.github.io/
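The over-smoothing effect of regression objectives can be illustrated with a toy example. The sketch below is not taken from the paper: the two "pitch" modes (120 Hz and 220 Hz), the function names, and the sample size are hypothetical values chosen only for illustration. It shows that the single prediction minimizing MSE on a bimodal target is the mean, which falls between the modes where no real sample lies.

```python
import random
import statistics


def sample_bimodal_pitch(n, seed=0):
    """Draw n toy pitch targets from two equally likely modes.

    The modes (120 Hz and 220 Hz) are hypothetical illustration values,
    not figures from the paper.
    """
    rng = random.Random(seed)
    return [
        rng.gauss(120.0, 5.0) if rng.random() < 0.5 else rng.gauss(220.0, 5.0)
        for _ in range(n)
    ]


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)


targets = sample_bimodal_pitch(10_000)

# The constant that minimizes MSE is the arithmetic mean of the targets.
mse_optimal = statistics.fmean(targets)

# Count how many real targets lie within 20 Hz of the MSE-optimal prediction.
targets_near_optimum = sum(1 for t in targets if abs(t - mse_optimal) < 20.0)

print(f"MSE-optimal prediction: {mse_optimal:.1f} Hz")
print(f"MSE at the optimum:     {mse(mse_optimal, targets):.1f}")
print(f"MSE at the 120 Hz mode: {mse(120.0, targets):.1f}")
print(f"Targets within 20 Hz of the optimum: {targets_near_optimum}")
```

The regression optimum has a lower MSE than either mode yet matches none of the observed values. A sampling-based model, such as the diffusion model described above, is instead meant to draw outputs from the modes themselves.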

Comparison with other models

Diverse Multi-Speaker LibriTTS

Text: There was not a worse vagabond in Shrewsbury than old Barney the piper.

GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS

Text: He spoke instead, in a light tone, as his pen still ran along.

GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS

Text: The young Englishmen were introduced to everybody, entertained by everybody, intimate with everybody.

GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS

Text: In a few days I had so far recovered my health that I could sit up all day, and walk out sometimes.

GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS

Single-Speaker LJSpeech

Text: the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.

GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS

Text: printing, then, for our purpose, may be considered as the art of making books by means of movable types.

GT | GT (voc) | FastSpeech 2 | Meta-StyleSpeech | Glow-TTS | Grad-TTS | YourTTS | Prosody-TTS

Ablation Study

Text: There was not a worse vagabond in Shrewsbury than old Barney the piper.

Prosody-MAE | w/o LDM | w/o VQ | Local Prosody | Variational Inference

Text: In a few days I had so far recovered my health that I could sit up all day, and walk out sometimes.

Prosody-MAE | w/o LDM | w/o VQ | Local Prosody | Variational Inference