Even the best text-parsing recommendation algorithms can be stymied by sufficiently large data sets. In an effort to deliver faster, better classification performance than the bulk of existing methods, a team at the MIT-IBM Watson AI Lab and MIT’s Geometric Data Processing Group devised a technique that combines popular AI tools including embeddings and optimal transport. They say that their approach can scan millions of possibilities given only the historical preferences of a person, or the preferences of a group of people.
“There’s a ton of text on the internet,” said Justin Solomon, an MIT assistant professor and lead author on the research, in a statement. “Anything to help cut through all that material is extremely useful.”
To this end, Solomon and colleagues’ algorithm summarizes collections of text into topics based on commonly used words in the collection. Next, it divides each text into its five to 15 most important topics, with a ranking indicating each topic’s importance to the text overall. Embeddings, which are numerical representations of data (in this case, words), help make evident the similarity among words, while optimal transport helps to calculate the most efficient way of moving objects (or data points) among multiple destinations.
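To make the interplay of embeddings and optimal transport concrete, here is a minimal sketch in Python. It is not the researchers' code: the two-dimensional embedding values, the word lists, and the helper names (`EMBEDDINGS`, `sinkhorn_plan`, `transport_cost`) are all hypothetical, and it uses the Sinkhorn iteration, an entropy-regularized approximation of optimal transport, chosen here only because it fits in a few lines.

```python
import math

# Hypothetical 2-D word embeddings (made-up values for illustration only).
EMBEDDINGS = {
    "king": (0.90, 0.80),
    "queen": (0.85, 0.90),
    "apple": (-0.70, 0.20),
    "banana": (-0.80, 0.10),
}

def sinkhorn_plan(a, b, cost, eps=0.05, iters=500):
    """Approximate transport plan moving histogram a onto histogram b."""
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * len(a), [1.0] * len(b)
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two histograms.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(len(b)))
             for i in range(len(a))]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(len(a)))
             for j in range(len(b))]
    return [[u[i] * kernel[i][j] * v[j] for j in range(len(b))]
            for i in range(len(a))]

def transport_cost(words_a, weights_a, words_b, weights_b):
    """Cost of moving one weighted word list onto another."""
    # Ground cost: Euclidean distance between the word embeddings.
    cost = [[math.dist(EMBEDDINGS[wa], EMBEDDINGS[wb]) for wb in words_b]
            for wa in words_a]
    plan = sinkhorn_plan(weights_a, weights_b, cost)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(words_a)) for j in range(len(words_b)))

royal = (["king", "queen"], [0.5, 0.5])
royal_skewed = (["king", "queen"], [0.8, 0.2])
fruit = (["apple", "banana"], [0.5, 0.5])

near = transport_cost(*royal, *royal_skewed)
far = transport_cost(*royal, *fruit)
print(f"royal vs. royal_skewed: {near:.3f}")
print(f"royal vs. fruit:        {far:.3f}")
```

In this toy setup, the two word lists about royalty come out much closer to each other than either is to the fruit list, because moving weight between nearby embeddings is cheap and moving it across the embedding space is expensive.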
The embeddings make it possible to leverage optimal transport twice: first to compare topics within the collection, and then to measure how closely common themes overlap. This works especially well when scanning large collections of books and documents, according to the researchers; in an evaluation involving 1,720 pairs of titles in the Project Gutenberg data set, the algorithm managed to compare all of them in one second, or more than 800 times faster than the next-best method.
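The two-level use of optimal transport described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the published implementation: the embeddings, topics, and documents are invented, the names `WORD_VECS`, `TOPICS`, `ot_distance`, and `document_distance` are hypothetical, and the Sinkhorn approximation stands in for whatever solver the researchers used.

```python
import math

# Hypothetical 2-D word embeddings; real ones have hundreds of dimensions.
WORD_VECS = {
    "king": (0.90, 0.80), "queen": (0.85, 0.90), "castle": (0.70, 0.60),
    "apple": (-0.70, 0.20), "banana": (-0.80, 0.10),
}

# Hypothetical topics: each is a probability distribution over words.
TOPICS = {
    "monarchy": {"king": 0.6, "queen": 0.4},
    "court": {"queen": 0.5, "castle": 0.5},
    "fruit": {"apple": 0.5, "banana": 0.5},
}

def ot_distance(p, q, ground, eps=0.05, iters=500):
    """Entropy-regularized transport cost between weighted dicts p and q."""
    xs, ys = list(p), list(q)
    cost = [[ground(x, y) for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * len(xs), [1.0] * len(ys)
    for _ in range(iters):
        u = [p[x] / sum(kernel[i][j] * v[j] for j in range(len(ys)))
             for i, x in enumerate(xs)]
        v = [q[y] / sum(kernel[i][j] * u[i] for i in range(len(xs)))
             for j, y in enumerate(ys)]
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(len(xs)) for j in range(len(ys)))

def word_distance(w1, w2):
    """Ground cost between words: distance between their embeddings."""
    return math.dist(WORD_VECS[w1], WORD_VECS[w2])

# Level 1: topic-to-topic costs, computed once for the whole collection.
TOPIC_COST = {
    (s, t): ot_distance(TOPICS[s], TOPICS[t], word_distance)
    for s in TOPICS for t in TOPICS
}

def document_distance(doc_a, doc_b):
    """Level 2: transport between topic mixtures using the level-1 costs."""
    return ot_distance(doc_a, doc_b, lambda s, t: TOPIC_COST[(s, t)])

# Each document is a short mixture over topics, not a long list of words.
doc_kings = {"monarchy": 0.7, "court": 0.3}
doc_castles = {"monarchy": 0.2, "court": 0.8}
doc_orchard = {"fruit": 0.9, "court": 0.1}

near = document_distance(doc_kings, doc_castles)
far = document_distance(doc_kings, doc_orchard)
print(f"kings vs. castles: {near:.3f}")
print(f"kings vs. orchard: {far:.3f}")
```

One plausible reading of the reported speedup is visible in this structure: the topic-to-topic costs are computed once per collection, so each document comparison only has to solve a small transport problem over a handful of topics rather than over every word in both documents.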
Moreover, the algorithm does a better job of sorting documents than rival methods, for example grouping books in the Project Gutenberg data set by author and product reviews on Amazon by department. It’s also more explainable in that it provides lists of topics, enabling users to better understand why it’s recommending a given document.
The researchers leave to future work the development of an end-to-end training technique that optimizes the embeddings, topic models, and optimal transport jointly rather than separately, as in the current implementation. They also hope to apply their approach to larger data sets, and to investigate applications to the modeling of images or three-dimensional data.
“[Our algorithm] appears to capture differences in the same way a person asked to compare two documents would: by breaking down each document into easy to understand concepts, and then comparing the concepts,” wrote Solomon and coauthors in a paper summarizing their work. “[W]ord embeddings provide global semantic language information, while … topic models provide corpus-specific topics and topic distributions. Empirically these combine to give superior performance on various metric-based tasks.”