AI and machine learning algorithms that can read lips from video are nothing out of the ordinary. Back in 2016, researchers from Google and the University of Oxford detailed a system that could annotate video footage with 46.8% accuracy, outperforming a professional human lip-reader’s 12.4% accuracy. But even state-of-the-art systems struggle to overcome ambiguities in lip movements, which keeps their performance below that of audio-based speech recognition.
In pursuit of a more performant system, researchers at Alibaba, Zhejiang University, and the Stevens Institute of Technology devised a method dubbed Lip by Speech (LIBS), which uses features extracted from speech recognizers as complementary clues. They say it achieves industry-leading accuracy on two benchmarks, besting the baseline by margins of 7.66% and 2.75% in character error rate.
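For readers unfamiliar with the metric, character error rate is commonly computed as the character-level edit distance between the predicted and reference transcripts, divided by the reference length. The sketch below follows that common definition; the function names and sample strings are illustrative assumptions, not taken from the LIBS paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] is the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length (lower is better)."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical transcripts: one substitution and one deletion over 11 characters.
cer = character_error_rate("hello world", "hallo wrld")
```

Under this definition, a lower score means fewer character-level mistakes relative to the length of the reference transcript.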
LIBS and other solutions like it could help those hard of hearing to follow videos that lack subtitles. It’s estimated that 466 million people, or about 5% of the world’s population, suffer from disabling hearing loss. By 2050, the number could rise to over 900 million, according to the World Health Organization.
LIBS distills useful audio information from videos of human speakers at multiple scales: the sequence level, the context level, and the frame level. Because of different sampling rates and blanks that sometimes appear at the beginning or end, the video and audio sequences have inconsistent lengths, so LIBS aligns the audio data with the video data by identifying the correspondence between them. It also leverages a filtering technique to refine the distilled features.
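To make the length mismatch concrete, here is a minimal sketch of one way to align two feature sequences of different lengths and then compare them. It uses plain dot-product attention as a stand-in: each video frame takes a weighted average of all audio frames, and a mean squared distance then serves as a simple distillation loss. The functions, feature values, and the choice of dot-product attention are assumptions for illustration, not the alignment or filtering procedure described in the paper.

```python
import math


def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def align_audio_to_video(video_feats, audio_feats):
    """Return one audio-derived vector per video frame.

    Each video frame attends over every audio frame, so the output has
    the video sequence's length regardless of the audio length.
    """
    dim = len(audio_feats[0])
    aligned = []
    for v in video_feats:
        weights = softmax([dot(v, a) for a in audio_feats])
        aligned.append([
            sum(w * a[d] for w, a in zip(weights, audio_feats))
            for d in range(dim)
        ])
    return aligned


def distillation_loss(video_feats, aligned_audio):
    """Mean squared distance between video features and aligned audio features."""
    total = 0.0
    for v, a in zip(video_feats, aligned_audio):
        total += sum((x - y) ** 2 for x, y in zip(v, a))
    return total / len(video_feats)


# Hypothetical toy features: 3 video frames and 5 audio frames, 2 dimensions each.
video = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
audio = [[1.0, 0.1], [0.9, 0.2], [0.5, 0.5], [0.2, 0.9], [0.1, 1.0]]

aligned = align_audio_to_video(video, audio)
loss = distillation_loss(video, aligned)
```

In a trained system, a loss of this kind would push the lip reader's features toward those of the speech recognizer; here it only shows that the two sequences become comparable once they share a length.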
Both the speech recognizer and lip reader components of LIBS are based on an attention-based sequence-to-sequence architecture, a method of machine translation that maps an input sequence (in this case, audio or video) to an output with a tag and attention value. The researchers trained them on LRS2, which contains more than 45,000 spoken sentences from the BBC, and on CMLR, the largest available Chinese Mandarin lip-reading corpus, with over 100,000 natural sentences from the China Network Television website (including over 3,000 Chinese characters and 20,000 phrases).
The team notes that the model struggled to achieve “reasonable” results on the LRS2 data set, owing to the shortness of some sentences. (The decoder struggles to extract relevant information from sentences with fewer than 14 characters.) However, once it was pre-trained on sentences with a maximum length of 16 words, the decoder improved the quality of the end parts of sentences in the LRS2 data set by leveraging context-level knowledge. “[LIBS reduces] the focus on unrelated frames,” wrote the researchers in a paper describing their work. “[T]he frame-level knowledge distillation further improves the discriminability of the video frame features, making the attention more focused.”