Can face masks affect the accuracy of automatic speech recognition systems? That’s the question researchers at the Educational Testing Service (ETS), the nonprofit assessment organization headquartered in Princeton, New Jersey, sought to answer in a study published this week. Drawing on recordings from ETS’ English language proficiency test, for which exam-takers were required to wear face masks, they found that while differences between the recordings and no-mask baselines existed, they didn’t lead to “significant” variations in scores.
The pandemic has led to a dramatic increase in the use of face masks worldwide, with 65% of U.S. adults saying they wore a mask in stores during the month of May, according to the Pew Research Center. This has potential implications for the speech algorithms underpinning smart speakers, smart displays, mobile apps, and indeed automated language proficiency tests. Face coverings come in all sizes and thicknesses and can impact a wearer’s speech patterns, for example by distorting the sound of a person’s speech or by greatly attenuating it.
The researchers set out to determine whether masks might introduce bias in automated language proficiency exams — a salient question considering regulations in some regions of the world make mask-wearing compulsory for test-takers. They collated a corpus of 1,188 responses from 597 people collected in a language test in Hong Kong between February and March, during which exam-takers were tasked with answering questions for between 45 seconds and 1 minute each.
For the purposes of comparison, the researchers created a baseline with 1,200 spoken responses from 300 people who took the test in fall 2019, before mask-wearing rules were implemented.
To suss out the differences between the two data sets, the researchers extracted and compared 88 acoustic features including measurements related to frequency, amplitude, and spectral characteristics. They also considered whether face masks had any effect on test-taker speech patterns, using speech recognition hypotheses and timestamps to compute features designed to capture whether those wearing masks made more pauses, spoke more slowly, or showed different patterns of disfluencies.
The coauthors report that four of the features they considered showed differences: the average duration of silences, the number of silences per word, and the length of chunks between pauses, measured in both words and seconds. Speakers wearing masks spoke at about the same articulation rate as those not wearing masks but paused slightly more often, and wearing masks shortened the chunks between two pauses by 0.6 words (or 0.2 seconds).
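To make the pause-related measurements above concrete, here is a minimal Python sketch of how such features could be derived from word-level timestamps in a speech recognizer's output. It is not the researchers' code: the `Word` record, the `pause_features` function, and the 0.15-second threshold for what counts as a pause are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with start/end times in seconds (hypothetical format)."""
    text: str
    start: float
    end: float


def pause_features(words, min_pause=0.15):
    """Compute simple pause statistics from time-ordered word timestamps.

    min_pause is an assumed threshold: gaps between consecutive words
    shorter than this are not counted as silences.
    """
    if not words:
        raise ValueError("need at least one word")

    silences = []           # durations of gaps that count as pauses
    chunks = [[words[0]]]   # runs of words uninterrupted by a pause
    for prev, cur in zip(words, words[1:]):
        gap = cur.start - prev.end
        if gap >= min_pause:
            silences.append(gap)
            chunks.append([cur])
        else:
            chunks[-1].append(cur)

    chunk_words = [len(c) for c in chunks]
    chunk_secs = [c[-1].end - c[0].start for c in chunks]
    speaking_time = sum(chunk_secs)  # time spent talking, pauses excluded

    return {
        "mean_silence_sec": sum(silences) / len(silences) if silences else 0.0,
        "silences_per_word": len(silences) / len(words),
        "mean_chunk_words": sum(chunk_words) / len(chunks),
        "mean_chunk_sec": sum(chunk_secs) / len(chunks),
        # words per second of actual speech, so extra pausing does not lower it
        "articulation_rate": len(words) / speaking_time if speaking_time > 0 else 0.0,
    }
```

Because the articulation rate here divides by speaking time rather than total time, a speaker can pause more often, producing shorter chunks, without the rate changing, which is the pattern the study describes for mask wearers.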
These differences, however, didn’t appear to manifest in speech recognition performance metrics. After comparing a sample of 55 transcribed responses (28 from the mask-wearing group and 27 from the baseline group), the researchers found the word error rate was lower for mask wearers than for those not wearing masks (20.7% versus 27.6%). They also report that the mean test scores across both groups were virtually the same: 2.79 for the baseline and 2.80 for the mask-wearers.
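For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the human transcript, divided by the number of words in the transcript. The sketch below shows that standard calculation in Python; the function name and the whitespace tokenization are illustrative assumptions, and the study's exact scoring setup may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two strings, divided by reference length.

    Tokenization is a plain whitespace split (an assumption for this sketch);
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur

    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the transcript "the cat sat on the mat" gives one substitution and one deletion over six reference words, or a word error rate of about 33%.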
The researchers might have a conflict of interest, given that the test they evaluated is ETS’ own, and their sample size is on the small side. But in support of their findings, they cite previous, smaller studies showing masks have “no significant effect” on language proficiency scores assigned by raters or on the accuracy of “closed-set” speaker identification systems.
“Our classifier experiments showed that it is possible to predict with almost 80% accuracy whether a test-taker is wearing a mask or not … However, these differences in acoustics and speech patterns did not have a further effect on the performance of automated speech recognition or the automated scoring engine,” the researchers wrote. “The differences we observed for low-level acoustic features suggest that some types of technologies and applications may be more affected than others.”
The work follows a study from Duke Kunshan University, Wuhan University, Lenovo, and Sun Yat-sen University in Guangzhou describing a system that can detect with 78.8% accuracy whether a person is wearing a mask from the sound of their speech. Masked speech detection is a sub-challenge at the 11th annual Computational Paralinguistics Challenge (ComParE), which is scheduled to take place during the upcoming Interspeech 2020 conference in October.