In October 2018, months after a brief reveal, Amazon brought Whisper Mode to select first- and third-party Alexa devices. It expanded the feature to all locales in November 2019, so that all smart devices and smart home appliances powered by Alexa, the company’s virtual assistant, now respond to whispered speech by whispering back.
Amazon was initially light on technical details, saying only that Whisper Mode uses a neural network (layers of mathematical functions loosely modeled after the human brain’s neurons) to distinguish between normal and whispered words. But in an academic paper appearing in the January 2020 issue of the journal IEEE Signal Processing Letters and an accompanying blog post, it detailed the research that led to the expansion.
The principal challenge was converting normal speech into whispered speech while maintaining naturalness and speaker identity, explained Marius Cotescu, an applied scientist in Amazon’s text-to-speech research group. He and colleagues investigated several conversion techniques, including a handcrafted digital-signal-processing (DSP) technique based on acoustic analysis of whispered speech, but they ultimately chose two machine learning approaches for their robustness (they generalized readily to unfamiliar speakers) and their performance (they outperformed the handcrafted signal processor).
Both approaches, which drew on Gaussian mixture models (GMMs) and deep neural networks (DNNs), involved training algorithms to map the acoustic features of normally voiced speech onto those of whispered speech. The GMMs attempted to identify a range of values for each output feature corresponding to a related distribution of input values, while the DNNs (dense networks of simple processing nodes) adjusted their internal settings through a training process in which the networks attempted to predict the outputs associated with particular inputs.
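To make the GMM idea concrete, here is a minimal Python sketch of one common way such a mapping can be computed: a joint Gaussian mixture over paired features, used to predict the expected whispered-speech feature for a given normal-speech feature. It is not the researchers’ implementation. The single scalar feature, the two mixture components, every parameter value, and the names `COMPONENTS`, `responsibilities`, and `convert_feature` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical joint GMM over (x, y): x is one acoustic feature of normally
# voiced speech, y is the matching feature of whispered speech.
# All numbers are invented for illustration, not taken from the paper.
COMPONENTS = [
    {"weight": 0.5, "mean_x": 0.0, "mean_y": 1.0, "var_x": 1.0, "cov_xy": 0.5},
    {"weight": 0.5, "mean_x": 4.0, "mean_y": 3.0, "var_x": 1.0, "cov_xy": -0.2},
]


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, components):
    """Posterior probability of each mixture component given the input x."""
    scores = [c["weight"] * gaussian_pdf(x, c["mean_x"], c["var_x"])
              for c in components]
    total = sum(scores)
    return [s / total for s in scores]


def convert_feature(x, components):
    """Expected output feature E[y | x] under the joint GMM.

    Each component contributes a linear prediction of y from x, and the
    predictions are blended by how likely each component is given x.
    """
    posteriors = responsibilities(x, components)
    return sum(
        p * (c["mean_y"] + c["cov_xy"] / c["var_x"] * (x - c["mean_x"]))
        for p, c in zip(posteriors, components)
    )


if __name__ == "__main__":
    for x in (0.0, 2.0, 4.0):
        print(x, round(convert_feature(x, COMPONENTS), 3))
```

In a real system, the features would be multidimensional vectors and the mixture parameters would be estimated from paired recordings rather than written by hand.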
The researchers’ system passed acoustic-feature representations to a vocoder, which converted them into continuous signals. While the experimental version relied on an open-source vocoder called WORLD, the version of Whisper Mode deployed to customers leverages a neural vocoder that enhances whispered speech quality further.
The team used two data sets to train their voice conversion systems: one they produced themselves using five professional voice actors from Australia, Canada, Germany, India, and the U.S., and another that is a popular benchmark in the field. (Both corpora included pairs of utterances, one in full voice and one whispered, from many speakers.) To evaluate their systems, they compared the outputs to both recordings of natural speech and recordings of speech fed through a vocoder.
In a first set of experiments, the team trained the voice conversion systems on data from individual speakers and tested them on data from the same speakers. They found that, while the raw recordings sounded most natural, whispers synthesized by the models sounded more natural than “vocoded” human speech.
State-of-the-art text-to-speech models can produce snippets that sound nearly humanlike on first listen. In fact, they underpin the neural voices available through Google Assistant, the newscaster voice that recently came to Alexa and Amazon’s Polly service, and the Samuel L. Jackson celebrity Alexa voice skill that became available last December.