Whether it’s language, music, speech, or video, sequential data isn’t easy for AI and machine learning models to comprehend, particularly when it depends on extensive surrounding context. For instance, if a person or an object disappears from view in a video only to reappear much later, many algorithms will forget how it looked. Researchers at Google set out to solve this with Transformer, an architecture that extended context windows to thousands of words, dramatically improving performance in tasks like song composition, image synthesis, sentence-by-sentence text translation, and document summarization.
But Transformer isn’t perfect by any stretch: extending it to larger contexts exposes its limitations. Applications that use large windows have memory requirements ranging from gigabytes to terabytes, meaning models can only ingest a few paragraphs of text or generate short pieces of music. That’s why Google today introduced Reformer, an evolution of Transformer that’s designed to handle context windows of up to 1 million words. By leveraging techniques like locality-sensitive hashing (LSH) and reversible residual layers to use memory efficiently and reduce complexity over long sequences, it’s able to run on a single AI accelerator chip using only 16GB of memory.
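To get a rough sense of where those gigabyte-to-terabyte figures come from, the back-of-the-envelope sketch below counts the memory needed to hold a single full attention matrix. It assumes one 4-byte (float32) score per pair of tokens and ignores everything else a real model stores; the function name and constant are illustrative, not taken from the Reformer code.

```python
BYTES_PER_SCORE = 4  # assumption: one float32 per attention score


def attention_matrix_bytes(num_tokens, bytes_per_score=BYTES_PER_SCORE):
    """Memory to hold one full num_tokens x num_tokens attention matrix."""
    return num_tokens * num_tokens * bytes_per_score


for tokens in (1_000, 100_000, 1_000_000):
    gigabytes = attention_matrix_bytes(tokens) / 1e9
    print(f"{tokens:>9,} tokens -> {gigabytes:,.3f} GB")
```

Under these assumptions, a 100,000-token window already needs about 40GB for one matrix, and a 1-million-token window needs about 4TB, because the cost grows with the square of the sequence length.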
The code and several example applications are available in open source, ahead of the Reformer paper’s presentation at the 2020 International Conference on Learning Representations in Addis Ababa, Ethiopia in April.
As with all deep neural networks, Transformers contain neurons (mathematical functions) arranged in interconnected layers that transmit signals from input data and slowly adjust the synaptic strength (weights) of each connection. That’s how all such models extract features and learn to make predictions, but what sets Transformer apart is attention: every output element is connected to every input element, and the weightings between them are calculated dynamically.
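The idea that every output element is connected to every input element, with weightings computed on the fly, can be illustrated with a minimal sketch of scaled dot-product attention. This is a toy, single-head version in plain Python with no learned parameters; the function names and the three-vector input are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def full_attention(queries, keys, values):
    """Scaled dot-product attention over every query/key pair."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # The weights depend on the data itself, so they are recomputed
        # for every input rather than stored as fixed parameters.
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: three 2-dimensional vectors attending to one another.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = full_attention(toy, toy, toy)
print(len(weights) * len(weights[0]), "pairwise weights for", len(toy), "elements")
```

Each row of `weights` holds one weight per input element, so a sequence of n elements produces n × n weights. That all-pairs table is what becomes costly as the context window grows.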
As my colleague Khari Johnson notes, one of the biggest machine learning trends of 2019 was the continued growth and proliferation of natural language models based on this Transformer design. Google open-sourced BERT, a Transformer-based model, in 2018. And a number of the top-performing models released this year, according to the GLUE leaderboard — including Nvidia’s Megatron, Google’s XLNet, Microsoft’s MT-DNN, and Facebook’s RoBERTa — were based on Transformers. XLNet 2 is due out later this month, a company spokesperson recently told VentureBeat.
Attention’s all-pairs comparison grows quadratically more expensive as sequences get longer. Reformer instead uses locality-sensitive hash functions (functions that map data of arbitrary size to fixed-size values, chosen here so that similar inputs tend to receive the same value) to match up similar vectors (algebraic constructions used to represent human-readable data in machine learning) rather than searching through all possible pairs of vectors. (For example, in a translation task, where each vector from the first layer of the network represents a word, vectors corresponding to the same words in different languages may get the same hash.) Once the hashes are assigned, the sequence is rearranged to bring elements with the same hash together and divided into segments to enable parallel processing. Attention is then applied within these much shorter segments and their adjoining neighbors, greatly reducing the computational load.
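The hash, sort, and segment steps described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the released Reformer code: it uses a single round of random-projection hashing (take the index of the largest value among the projections and their negations), and the function names, sizes, and seeds are assumptions chosen for the example.

```python
import random


def lsh_bucket(vec, projections):
    """Hash a vector to a bucket id; vectors pointing the same way collide."""
    scores = [sum(v * p for v, p in zip(vec, proj)) for proj in projections]
    scores += [-s for s in scores]  # also consider the negated projections
    return max(range(len(scores)), key=scores.__getitem__)


def sort_and_chunk(vectors, n_buckets, chunk_size, seed=0):
    """Assign buckets, sort positions by bucket, then cut into segments."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    projections = [
        [rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_buckets // 2)
    ]
    buckets = [lsh_bucket(v, projections) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: (buckets[i], i))
    chunks = [order[i:i + chunk_size] for i in range(0, len(order), chunk_size)]
    return buckets, chunks


def attended_pairs(chunks):
    """Count pairs when each segment attends to itself and its predecessor."""
    total = 0
    for idx, chunk in enumerate(chunks):
        keys = len(chunk) + (len(chunks[idx - 1]) if idx > 0 else 0)
        total += len(chunk) * keys
    return total


# Toy data: 64 random 8-dimensional vectors; the last one is a scaled
# copy of the first, so the two should land in the same bucket.
rng = random.Random(1)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(63)]
vectors.append([2 * x for x in vectors[0]])

buckets, chunks = sort_and_chunk(vectors, n_buckets=8, chunk_size=8)
print("pairs with segments:", attended_pairs(chunks))
print("pairs with full attention:", len(vectors) ** 2)
```

With 64 elements split into segments of 8, the segmented scheme compares 960 pairs instead of 4,096, and the gap widens as the sequence grows because the segment size stays fixed.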
Reformer also recomputes the input of each layer on demand rather than storing it in memory, thanks to the aforementioned reversible layers. Activations (the intermediate outputs that each layer computes) from the last layer of the network are used to recover activations from any intermediate layer, using two sets of activations for each layer. One is progressively updated from one layer to the next, while the other captures only the changes to the first.
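A minimal sketch of the reversible idea follows, assuming the standard reversible residual formulation in which a layer maps the pair (x1, x2) to y1 = x1 + F(x2) and y2 = x2 + G(y1). Here `f` and `g` are simple stand-ins for a layer's attention and feed-forward parts, not the real sublayers, and all names are illustrative.

```python
import math


def f(x):
    """Stand-in for the attention sublayer (hypothetical)."""
    return [math.tanh(v) for v in x]


def g(x):
    """Stand-in for the feed-forward sublayer (hypothetical)."""
    return [0.5 * math.sin(v) for v in x]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def forward(x1, x2):
    """One reversible layer: each stream is updated using the other."""
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def inverse(y1, y2):
    """Recover the layer's inputs from its outputs alone."""
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2


NUM_LAYERS = 3
start = ([0.5, -1.0], [2.0, 0.25])

# Forward pass: keep only the final pair of activations.
state = start
for _ in range(NUM_LAYERS):
    state = forward(*state)

# Backward pass: rebuild earlier activations layer by layer.
recovered = state
for _ in range(NUM_LAYERS):
    recovered = inverse(*recovered)

print("recovered:", recovered)
```

Because each layer's inputs can be rebuilt from its outputs, only the final activations need to be kept during the forward pass. The trade-off is extra computation when the earlier activations are reconstructed.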
“Since Reformer has such high efficiency, it can be applied directly to data with context windows much larger than virtually all current state-of-the-art text domain [data sets],” wrote contributing researchers Łukasz Kaiser, a Google staff research scientist, and Nikita Kitaev, a student at the University of California, Berkeley, in a blog post. “Perhaps Reformer’s ability to deal with such large datasets will stimulate the community to create them.”
The research team experimented with Reformer-based models on images and text, using them to generate missing details in images and process the entire novel Crime and Punishment (which contains 211,591 words). They show that Reformer can generate full-frame images pixel by pixel, and that it can take in novel-length text in a single round of training.
The authors leave to future work applying the technique to even longer sequences and improving their handling of positional encodings. “We believe Reformer gives the basis for future use of Transformer models, both for long text and applications outside of natural language processing,” added Kaiser and Kitaev.
In an interview late last year, Google AI chief Jeff Dean told VentureBeat that larger context would be a principal focus of Google’s work going forward. “We’d still like to be able to do much more contextual kinds of models,” he said. “Like right now BERT and other models work well on hundreds of words, but not 10,000 words as context. So that’s kind of [an] interesting direction.”
Reformer would appear to be a promising first step in that direction.