Health organizations including the U.S. Centers for Disease Control and Prevention, the World Health Organization, and the U.K. National Health Service advocate wearing masks to prevent the spread of infection. But masks attenuate speech, which has implications for the accuracy of speech recognition systems like Google Assistant, Alexa, and Siri. In an effort to quantify the degree to which mask materials impact acoustics, researchers at the University of Illinois conducted a study examining 12 different types of face coverings in total. They found that transparent masks had the worst acoustics compared with both medical and cloth masks, but that most masks had “little effect” on lapel microphones, suggesting existing systems might be able to recognize muffled speech without issue.
While it’s intuitive to assume mask-distorted speech would prove to be challenging for speech recognition, the evidence so far paints a mixed picture. Research published by the Educational Testing Service (ETS) concluded that while differences existed between recordings of mask wearers and those who didn’t wear masks during an English proficiency exam, the distortion didn’t lead to “significant” variations in automated exam scoring. But in a separate study, scientists at Duke Kunshan University, Lenovo, and Wuhan University found an AI system could be trained to detect whether someone’s wearing a mask from the sound of their muffled speech.
A Google spokesperson told VentureBeat there hasn’t been a measurable impact on the company’s speech recognition systems since the start of the pandemic, when mask-wearing became more common. Amazon also says it hasn’t observed a shift in speech recognition accuracy correlated with mask-wearing.
The University of Illinois researchers looked at the acoustic effects of a polypropylene surgical mask, N95 and KN95 respirators, six cloth masks made from different fabrics, two cloth masks with transparent windows, and a plastic shield. They took measurements within an “acoustically-treated” lab using a head-shaped loudspeaker and a human volunteer, both of whom had microphones placed on and near their lapel, cheek, forehead, and mouth. (The head-shaped loudspeaker, which was made of plywood, used a two-inch driver with a pattern close to that of a human speaker.)
After taking measurements without face coverings to establish a baseline, the researchers set the loudspeaker on a turntable and rotated it to capture various angles of the tested masks. Then, for each mask, they had the volunteer speak in three 30-second increments at a constant volume.
The results show that most masks had “little effect” below a frequency of 1 kHz but were muffled at higher frequencies in varying degrees. The surgical mask and KN95 respirator had peak attenuation of around 4 dB, while the N95 attenuated at high frequencies by about 6 dB. As for the cloth masks, material and weave proved to be the key variables — 100% cotton masks had the best acoustic performance while masks made from tightly woven denim and bedsheets performed the worst. Transparent masks blocked between 8 dB and 14 dB at high frequencies, making them by far the worst of the bunch.
“For all masks tested, acoustic attenuation was strongest in the front. Sound transmission to the side of and behind the talker was less strongly affected by the masks, and the shield amplified sound behind the talker,” the researchers in a paper describing their work. “These results suggest that masks may deflect sound energy to the sides rather than absorbing it. Therefore, it may be possible to use microphones placed to the side of the mask for sound reinforcement.”
The researchers recommend avoiding cotton and spandex-blended masks for the clearest and crispest speech, but they note that recordings captured by the lapel mic showed “small” and “uniform” attenuation — the sort of attenuation that recognition systems can easily correct for. For instance, Amazon recently launched Whisper Mode for Alexa, which taps AI trained on a corpus of professional voice recordings to respond to whispered (i.e., low-decibel) speech by whispering back. An Amazon spokesperson didn’t say whether Whisper Mode is being used to improve masked speech performance, but they told VentureBeat that when Alexa speech recognition systems’ signal-to-noise ratios are lower due to customers wearing masks, engineering teams are able to address fluctuations in confidence through an active learning pipeline.
In any case, assuming the results of the University of Illinois stand up to peer review, they bode well for smart speakers, smart displays, and other voice-powered smart devices. Next time you lift your phone to summon Siri, you shouldn’t have to ditch the mask.