In a paper published on the preprint server Arxiv.org, researchers affiliated with Carnegie Mellon, the University of California, Santa Barbara, and Microsoft’s Dynamics 365 AI Research describe a challenge — video-and-language inference — that tasks AI with inferring whether a statement is entailed or contradicted by a given video clip. The idea is to spur investigations into video-and-language understanding, they say, which could enhance tools used in the enterprise for automatic meeting transcription.
As the researchers explain, video-and-language inference requires a thorough interpretation of both visual and textual clues. They to this end introduce a video data set comprising realistic scenes paired with statements from crowdsourced workers via Amazon Mechanical Turk, who watched the videos accompanied by subtitles. The workers wrote statements based on their understanding of both the videos and subtitles, which not only describe explicit information in the video (e.g., objects, locations, characters, and social activity) but that also reveal comprehension of complex plots (understanding events, interpreting human emotions and relations, and inferring causal relations of events).
In total, the data set contains over 95,322 video-statement pairs and 15,887 movie clips from YouTube and TV series — including Friends, Desperate Housewives, How I Met Your Mother, Modern Family — spanning over 582 hours. Each roughly 30-second video is paired with six either positive or negative statements that identify characters, recognize actions, reason about conversations, infer reasons, or make reference to human dynamics. (In order to prevent bias from creeping in, when collecting negative statements, the researchers asked annotators to use a positive statement as a reference and only modify a small portion of it to make it negative.)
To benchmark the data set, the coauthors used a bi-directional long short-term memory model, a type of AI model capable of learning long-term dependencies, to encode video features as numerical representations. A separate model encoded statements and subtitles. Given a video, subtitle, and statement, yet another model — which was trained on 80% of the data set, with 10% reserved for validation and 10% for testing — determined whether the statement entailed or contradicted with the video and subtitles. They say that the best-performing baseline achieved 59.45% accuracy, compared with human evaluators’ 85.20% accuracy.
“The gap between the baseline models and human performance is significant. We encourage the community to participate in this task and invent stronger methods to push the state of the art on multimodal inference,” wrote the researchers. “Possible future directions include developing models to localize key frames, as well as better utilizing the alignment between video and subtitles to improve reasoning ability.”
The research follows a study by Microsoft Research Asia and Harbin Institute of Technology that sought to generate live video captions with AI by capturing the representations among comments, video, and audio. The system — the code for which is available on Github — matches the most relevant comments with videos from a candidate set so that it jointly learns cross-modal representations.