A paper coauthored by researchers at IBM describes an AI system — Navsynth — that generates videos seen during training as well as unseen videos. While this in and of itself isn’t novel — it’s an acute area of interest for Alphabet’s DeepMind and others — the researchers say the approach produces superior quality videos compared with existing methods. If the claim holds water, their system could be used to synthesize videos on which other AI systems train, supplementing real-world data sets that are incomplete or marred by corrupted samples.
As the researchers explain, the bulk of work in the video synthesis domain leverages generative adversarial networks (GANs), or two-part neural networks consisting of generators that produce samples and discriminators that attempt to distinguish between the generated samples and real-world samples. They’re highly capable but suffer from a phenomenon called mode collapse, where the generator produces only a narrow range of samples (or even the same sample) regardless of the input.
By contrast, IBM’s system consists of a variable representing video content features, a frame-specific transient variable (more on that later), a generator, and a recurrent machine learning model. It breaks videos down into a static constituent that captures the constant portion of the video common to all frames and a transient constituent that represents the temporal dynamics (i.e., periodic regularity driven by time-based events) between all the frames in the video. Effectively, the system jointly learns the static and transient constituents, which it uses to generate videos at inference time.
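The split between a shared static variable and a per-frame transient variable can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers’ implementation: the names `step_transient` and `generate_video`, the latent sizes, and the tanh recurrence are all hypothetical, and the "generator" here simply concatenates the two latents where a real system would use a trained neural network to produce pixels.

```python
import math
import random


def step_transient(prev, rng, decay=0.9):
    """Hypothetical recurrent update: derive the next transient latent
    from the previous one plus a small amount of noise."""
    return [math.tanh(decay * p + rng.gauss(0.0, 0.1)) for p in prev]


def generate_video(num_frames, static_dim=4, transient_dim=2, seed=0):
    """Return one toy 'frame' per time step.

    The static latent is sampled once and shared by every frame; the
    transient latent is updated at each step by the recurrent function.
    """
    rng = random.Random(seed)
    static = [rng.gauss(0.0, 1.0) for _ in range(static_dim)]
    transient = [0.0] * transient_dim
    frames = []
    for _ in range(num_frames):
        transient = step_transient(transient, rng)
        # Stand-in generator (assumption): a trained network would map
        # both latents to an image instead of concatenating them.
        frames.append(static + transient)
    return frames


video = generate_video(num_frames=16)
```

Every frame in `video` carries the same first four values (the static part), while the last two values drift from frame to frame (the transient part).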
To capture the static portion evenly across the video, the researchers’ system randomly chooses a frame during training and compares it with its corresponding generated frame. This keeps the generated frame close to the ground truth frame.
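The random-frame comparison can be illustrated with a small sketch. The article does not say which distance the researchers use, so mean squared error is an assumption here, as are the function name `random_frame_loss` and the representation of each frame as a flat list of pixel values.

```python
import random


def random_frame_loss(real_video, generated_video, rng=random):
    """Mean squared error on one randomly chosen pair of frames.

    MSE is an assumption for illustration; the source does not name
    the distance used to compare the two frames.
    """
    if len(real_video) != len(generated_video):
        raise ValueError("videos must have the same number of frames")
    t = rng.randrange(len(real_video))  # same index for both videos
    real, fake = real_video[t], generated_video[t]
    return sum((r - f) ** 2 for r, f in zip(real, fake)) / len(real)


# Two tiny 3-frame "videos", each frame flattened to 4 pixel values.
real = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
fake = [[0.5, 0.5, 0.5, 0.5], [1.5, 1.5, 1.5, 1.5], [2.5, 2.5, 2.5, 2.5]]
loss = random_frame_loss(real, fake, random.Random(0))
```

Because a different frame is drawn on each call, repeated training steps would, over time, penalize deviations at every position in the video rather than at one fixed frame.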
In experiments, the research team trained, validated, and tested the system on three publicly available data sets: Chair-CAD, which consists of 1,393 3D models of chairs (of which 820 were selected, using the first 16 frames of each); Weizmann Human Action, which provides 10 different actions performed by 9 people, amounting to 90 videos; and the Golf scene data set, which contains 20,268 golf videos (of which 500 were selected).
The researchers say that, compared with the videos generated by several baseline models, their system produced “visually more appealing” videos that “maintained consistency” with sharper frames. Moreover, it reportedly demonstrated a knack for frame interpolation, a form of video processing in which intermediate frames are generated between existing ones in an attempt to make animation more fluid.
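For readers unfamiliar with frame interpolation, the simplest possible version is a linear blend between two existing frames. The sketch below shows only that naive pixel-space baseline; the article does not describe how Navsynth itself interpolates, so this should not be read as the researchers’ method. The function name `interpolate_frames` and the flat-list frame format are assumptions.

```python
def interpolate_frames(frame_a, frame_b, num_intermediate):
    """Return `num_intermediate` frames linearly blended between two frames."""
    frames = []
    for i in range(1, num_intermediate + 1):
        alpha = i / (num_intermediate + 1)  # fraction of the way from a to b
        frames.append(
            [(1 - alpha) * a + alpha * b for a, b in zip(frame_a, frame_b)]
        )
    return frames


# Three in-between frames for two 2-pixel frames.
middle = interpolate_frames([0.0, 10.0], [4.0, 2.0], num_intermediate=3)
```

A blend like this tends to produce ghosting when objects move, which is one reason learned approaches to interpolation attract interest.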