In August, Google AI researchers working with the ALS Therapy Development Institute shared details about Project Euphonia, a speech-to-text transcription service for people with speech impairments. They showed that, using data sets of audio from both native and non-native English speakers with neurodegenerative diseases and techniques from Parrotron, an AI tool for people with speech impediments, they could drastically improve the quality of speech synthesis and generation.
Recently, in something of a case study, Google researchers and a team from Alphabet’s DeepMind employed Euphonia in an effort to recreate the original voice of Tim Shaw, a former NFL linebacker who played for the Carolina Panthers, Jacksonville Jaguars, Chicago Bears, and Tennessee Titans before retiring in 2013. Roughly six years ago, Shaw was diagnosed with ALS, which has since required him to use a wheelchair and left him unable to speak, swallow, or breathe without assistance.
Over the course of six months, the joint research team adapted a generative AI model, WaveNet, to the task of synthesizing speech from samples of Shaw’s voice recorded prior to his ALS diagnosis.
WaveNet mimics things like stress and intonation, referred to in linguistics as prosody, by identifying tonal patterns in speech. It produces much more convincing voice snippets than previous speech generation models (Google says it has already closed the quality gap with human speech by 70% based on mean opinion score), and it’s also more efficient. Running on Google’s tensor processing units (TPUs), custom chips packed with circuits optimized for AI model training, WaveNet can create a one-second voice sample in just 50 milliseconds.
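To put that latency figure in perspective, 50 milliseconds of compute per second of audio can be expressed as a real-time factor. The short Python sketch below is illustrative only: the function name `real_time_factor` is hypothetical, and the only inputs are the two numbers quoted above.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute time divided by audio duration; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Figures from the article: 50 ms of compute for 1 s of generated audio.
rtf = real_time_factor(synthesis_seconds=0.050, audio_seconds=1.0)
speedup = 1.0 / rtf  # how many seconds of audio are produced per second of compute

print(f"real-time factor: {rtf:.2f}, speedup: {speedup:.0f}x")
```

In other words, the quoted figure works out to roughly 20 seconds of audio for every second of compute.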
In production, WaveNet has been used to generate custom voices for Google’s conversational platform Google Assistant, most recently nine new voices with unique cadences in English in the U.K. and India, as well as in French, German, Japanese, Dutch, Norwegian, Korean, and Italian. It’s also been used to produce dozens of new voices and voice variants (38 in August alone, and 31 in February) for Google’s Cloud Text-to-Speech service in Google Cloud Platform.
Fine-tuning proved to be the key to achieving high-quality synthesis from a minimal amount of training data. To recreate Shaw’s voice, Google and the DeepMind team adopted an approach proposed in a research paper published last year (“Sample Efficient Adaptive Text-to-Speech”), which involves pretraining a large WaveNet model on up to thousands of speakers over a few days until it can produce the basics of natural-sounding speech. At this point, the model is fed a small corpus of data from the target speaker, such that its generated speech takes on the characteristics of that speaker.
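The pretrain-then-adapt idea can be sketched with a deliberately tiny stand-in. The Python below is not WaveNet and not the method from the paper: it is a toy linear model, written under the assumption that all "speakers" share one slope and differ only in an offset, and every name in it (`train`, `speaker_data`, the learning rates and step counts) is hypothetical. It shows only the general pattern of learning shared parameters from many speakers, then adapting a speaker-specific parameter from a handful of target samples.

```python
def mse(w, b, data):
    """Mean squared error of the model y ~ w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(w, b, data, steps, lr, update_w=True):
    """Plain gradient descent on mean squared error; optionally freeze w."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        if update_w:
            w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def speaker_data(offset, xs):
    """Toy speaker: shared slope 2.0 plus a speaker-specific offset (an assumption)."""
    return [(x, 2.0 * x + offset) for x in xs]


grid = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Stage 1: pretrain on pooled data from several speakers to learn the shared slope.
pretrain_set = [pair for off in (-1.0, 0.0, 1.0) for pair in speaker_data(off, grid)]
w, b = train(0.0, 0.0, pretrain_set, steps=200, lr=0.5)

# Stage 2: adapt to an unseen target speaker using only three samples,
# keeping the shared slope frozen and updating the speaker-specific offset.
target_few = speaker_data(3.0, [-0.5, 0.5, 1.0])
target_eval = speaker_data(3.0, grid)

error_before = mse(w, b, target_eval)
w_ft, b_ft = train(w, b, target_few, steps=50, lr=0.1, update_w=False)
error_after = mse(w_ft, b_ft, target_eval)

print(f"error before adaptation: {error_before:.4f}")
print(f"error after adaptation:  {error_after:.2e}")
```

Because the shared slope is already learned, three samples are enough to pin down the one remaining speaker-specific value. A real system adapts far more parameters, but the division of labor is the same.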
Architectural tweaks improved the process’s overall efficiency. The team migrated from WaveNet to WaveRNN, a more compact model that makes it possible to generate 24kHz 16-bit audio (up to 16 samples per step) four times faster than real time on a graphics card, and to sample high-fidelity audio on a mobile system-on-chip in real time. Separately, DeepMind in collaboration with Google applied a fine-tuning technique to Tacotron 2, a text-to-speech system that builds voice synthesis models based on spectrograms, or visual representations of the spectrum of frequencies of an audio signal as it varies with time. This improved the quality of Tacotron 2’s output, the team says, while cutting down on the amount of required training data.
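For readers unfamiliar with spectrograms, the sketch below computes one from scratch in Python. It is a minimal illustration, not Tacotron 2’s pipeline: production systems typically use mel-scaled spectrograms computed with a fast Fourier transform, while this version uses a direct discrete Fourier transform for clarity. The function name `magnitude_spectrogram`, the 8kHz sample rate, the 1kHz test tone, and the frame and hop sizes are all assumptions chosen for the example.

```python
import cmath
import math

FRAME_LEN = 64  # samples per analysis frame (illustrative)
HOP = 32        # samples between the starts of consecutive frames (illustrative)


def magnitude_spectrogram(signal, frame_len=FRAME_LEN, hop=HOP):
    """Return one list of magnitudes (bins 0..frame_len//2) per overlapping frame."""
    # A Hann window tapers each frame's edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


sample_rate = 8000  # Hz, assumed for the example
tone_hz = 1000      # a pure test tone
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = magnitude_spectrogram(signal)

# Each row is a moment in time; each column is a frequency band.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / FRAME_LEN
print(f"{len(spec)} frames x {len(spec[0])} bins, strongest frequency: {peak_hz:.0f} Hz")
```

Plotting the rows as columns of an image, with brightness showing magnitude, gives the familiar picture of frequency content changing over time. For a steady tone like this one, that picture is a single horizontal line.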
“While the voice is not yet perfect — lacking the expressiveness, quirks, and controllability of a real voice — we’re excited that the combination of WaveRNN and Tacotron may help people like Tim preserve an important part of their identity, and we would like to one day integrate it into speech generation devices,” wrote Google and DeepMind in a blog post. “At this stage, it’s too early to know where our research will take us, but we are looking at ways to combine the Euphonia speech recognition systems with the speech synthesis technology so that people like Tim can more easily communicate.”
A demonstration of the AI-generated voice will feature in The Age of A.I., a new YouTube Originals miniseries about emerging technologies hosted by Robert Downey Jr. In one of the first episodes, Tim and his family hear his old voice for the first time in years as the model, trained on about 30 minutes of Tim’s NFL audio recordings, reads out the letter he’d recently written to his younger self.
“Our text-to-speech system, WaveNet, was introduced in 2016 as a prototype addressing one of the core challenges in AI research,” said DeepMind vice president of research Koray Kavukcuoglu in a statement. “It’s been amazing to see its real-world utility evolve over time: first generating the voices for Google Assistant, and now its potential to help people with ALS, like Tim Shaw, to recreate and hear their original speaking voice. This project is an early proof of concept, but I look forward to seeing where the research goes next.”
This latest project follows on the heels of accessibility efforts announced at Google’s I/O 2019 developer conference, including Project Euphonia. Another, Live Relay, is designed to assist deaf users, while Project Diva aims to give people with various disabilities some independence and autonomy via Google Assistant.