State-of-the-art text-to-speech models can produce snippets that sound nearly human on first listen. Indeed, they underpin the neural voices available through Google Assistant, as well as the newscaster voice that recently came to Alexa and Amazon’s Polly service. But because most of these models share the same synthesis approach — that is, they generate a mel-spectrogram (a representation of a sound’s power over time) from text and then synthesize speech using a vocoder (a component that analyzes and synthesizes voice signals) — they suffer the same shortcomings, namely slow inference for mel-spectrogram generation and skipped or repeated words in synthesized speech.
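The two-stage pipeline described above can be sketched as follows. Both stages here are hypothetical stand-ins (random frames, silent audio) meant only to show the shape of the data flowing between them, not a real model; the frames-per-character and hop-length figures are illustrative assumptions.

```python
import numpy as np

def text_to_mel(text: str, n_mels: int = 80) -> np.ndarray:
    """Stage 1 stand-in: map text to a mel-spectrogram (frames x mel bins).
    A real acoustic model predicts these frames from text, which is the
    slow step in autoregressive systems."""
    n_frames = 5 * len(text)  # illustrative frames-per-character ratio
    rng = np.random.default_rng(0)
    return rng.random((n_frames, n_mels))

def vocoder(mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
    """Stage 2 stand-in: turn mel frames into a waveform.
    A real vocoder synthesizes roughly hop_length audio samples
    per spectrogram frame; here we just emit silence of that length."""
    return np.zeros(mel.shape[0] * hop_length)

mel = text_to_mel("Hello world")
audio = vocoder(mel)
print(mel.shape, audio.shape)
```

Note that the waveform length is tied directly to the number of spectrogram frames, which is why errors in the first stage (too few or too many frames for a word) surface as skipped or repeated words in the audio.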
In an attempt to solve these and other text-to-speech challenges, researchers from Microsoft and Zhejiang University developed FastSpeech, a novel machine learning model that they detail in a paper (“FastSpeech: Fast, Robust and Controllable Text to Speech”) accepted to the NeurIPS 2019 conference in Vancouver. It features a unique architecture that not only improves performance in a number of areas compared with other text-to-speech models (its mel-spectrogram generation is 270 times faster than the baseline’s, and its voice generation 38 times faster), but also eliminates errors like word skipping and affords fine-grained adjustment of voice speed and word breaks.
Importantly, FastSpeech contains a length regulator that reconciles the difference between mel-spectrogram sequences and sequences of phonemes (perceptually distinct units of sound). Since a phoneme sequence is always shorter than its corresponding mel-spectrogram sequence, one phoneme corresponds to several mel-spectrogram frames. The length regulator, then, expands the phoneme sequence according to each phoneme’s duration so that it matches the length of the mel-spectrogram sequence. (A complementary duration predictor component determines the duration of each phoneme.) Increasing or decreasing the number of mel-spectrogram frames that align to a phoneme — that is, the phoneme duration — adjusts the voice speed proportionally.
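The expansion mechanism can be sketched as follows; the phoneme labels and durations below are made up for illustration, and real inputs would be hidden-state vectors rather than strings. Scaling the predicted durations by a factor alpha is what gives the proportional speed control described above.

```python
def length_regulate(phonemes, durations, alpha=1.0):
    """Repeat each phoneme's representation by its (scaled) predicted
    duration, so the expanded sequence is as long as the target
    mel-spectrogram sequence. alpha < 1 shortens durations (faster
    speech); alpha > 1 lengthens them (slower speech)."""
    expanded = []
    for ph, dur in zip(phonemes, durations):
        expanded.extend([ph] * round(dur * alpha))
    return expanded

# Hypothetical example: three phonemes with predicted durations 2, 4, 2,
# i.e. an eight-frame mel-spectrogram.
seq = length_regulate(["HH", "AH", "L"], [2, 4, 2])
print(seq)  # ['HH', 'HH', 'AH', 'AH', 'AH', 'AH', 'L', 'L']

# Halving every duration yields a four-frame sequence: speech
# roughly twice as fast, with no phoneme skipped or repeated.
fast = length_regulate(["HH", "AH", "L"], [2, 4, 2], alpha=0.5)
print(fast)
```

Because the expansion is a deterministic copy driven by explicit durations, the alignment between text and audio cannot drift the way a learned attention alignment can, which is the intuition behind the robustness gains reported below.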
To verify FastSpeech’s effectiveness, the researchers tested it against the open source LJ Speech data set, which contains 13,100 English audio clips (amounting to 24 hours of audio) and the corresponding text transcripts. After randomly splitting the corpus into 12,500 samples for training, 300 samples for validation, and 300 samples for testing, they conducted a series of evaluations on voice quality, robustness, and more.
The team reports that FastSpeech nearly matched the quality of Google’s Tacotron 2 text-to-speech model and handily outperformed a leading Transformer-based model in terms of robustness, managing an effective error rate of 0% compared with the baseline’s 34%. (Admittedly, the robustness test only involved 50 sentences, albeit sentences selected for their semantic complexity.) Moreover, it was able to vary the speed of generated voices from 0.5 times to 1.5 times without a loss of accuracy.
Future work will involve combining FastSpeech and a speedier vocoder into a single model for a “purely end-to-end” text-to-speech solution.