In a new study, researchers at Amazon describe a technique that factors in information about knowledge graphs to perform entity alignment, the task of determining which elements of different graphs refer to the same “entities” (anything from products to song titles). The goal is to improve computational efficiency and performance at the same time, speeding up graph-related tasks like product searches on Amazon and question answering via Alexa.
The work, which was accepted to the 2020 Web Conference, might also benefit graphs beyond Amazon, such as those that underpin social networks like Facebook and Twitter, as well as graphs used by enterprises to organize various digital catalogs.
As Amazon product graph applied scientist Hao Wei explains in a blog post, the advantage of knowledge graphs — mathematical objects consisting of nodes and edges — is that they can capture complex relationships more easily than conventional databases. (For example, in a movie data set, a node might represent an actor, a director, a film, or a film genre, while the edges represent who acted in what, who directed what, and so on.) Expanding a graph often involves integrating it with another knowledge graph, but different graphs might use different terms for the same entities, which can lead to errors.
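To make the naming problem concrete, here is a minimal sketch in Python that stores each graph as a set of (subject, relation, object) triples. The film and person names, the relation labels, and the `alignment` mapping are all hypothetical and do not come from Amazon's data; the mapping stands in for whatever an entity alignment system would produce.

```python
# Two hypothetical movie graphs stored as (subject, relation, object) triples.
graph_a = {
    ("Jane Doe", "acted_in", "Night Train"),
    ("John Roe", "directed", "Night Train"),
}
graph_b = {
    ("Doe, Jane", "acted_in", "Night Train (2019)"),
    ("Doe, Jane", "acted_in", "Harbor Lights"),
}


def nodes(triples):
    """Collect every node that appears as a subject or an object."""
    return {s for s, _, _ in triples} | {o for _, _, o in triples}


# Naive merge: entities spelled differently end up as separate nodes.
merged = graph_a | graph_b
print(len(nodes(merged)))  # 6 nodes for only 4 real-world entities

# Hypothetical alignment output: graph_b names mapped onto graph_a names.
alignment = {"Doe, Jane": "Jane Doe", "Night Train (2019)": "Night Train"}
aligned = graph_a | {
    (alignment.get(s, s), r, alignment.get(o, o)) for s, r, o in graph_b
}
print(len(nodes(aligned)))  # 4 nodes once duplicates are resolved
```

Without the alignment step, the merged graph would treat the same actor and film as unrelated nodes, which is the kind of error described above.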
Amazon’s proposed system is a graph neural network, in which each node is converted to a fixed-length vector representation, or embedding, that captures information about attributes useful for entity alignment. The network considers a central node and the nodes near it. For each of these nodes, it produces a new embedding that consists of the node’s first embedding concatenated with the sum of its immediate neighbors’ first embeddings. The network then produces a new embedding for the central node, which consists of that node’s embedding concatenated with the sum of the secondary embeddings of its immediate neighbors.
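The two aggregation rounds can be sketched in a few lines of Python. This is an illustration only, not Amazon's implementation: the toy graph, the two-dimensional starting vectors, and the function names are hypothetical, and the learned weights and nonlinearities a real graph neural network would apply are left out. The article does not say which of the central node's embeddings is used in the last step, so the sketch assumes it is the secondary one.

```python
def vec_sum(vectors, dim):
    """Element-wise sum of equal-length vectors (all zeros if there are none)."""
    total = [0.0] * dim
    for vec in vectors:
        for i, x in enumerate(vec):
            total[i] += x
    return total


def aggregate(embeddings, neighbors, nodes):
    """One round: a node's own embedding concatenated with the sum of its
    immediate neighbors' embeddings."""
    dim = len(next(iter(embeddings.values())))
    return {
        n: embeddings[n] + vec_sum([embeddings[m] for m in neighbors[n]], dim)
        for n in nodes
    }


# Hypothetical toy graph: a film linked to an actor and a director.
neighbors = {
    "film": ["actor", "director"],
    "actor": ["film"],
    "director": ["film"],
}
first = {
    "film": [1.0, 0.0],
    "actor": [0.0, 1.0],
    "director": [2.0, 2.0],
}

# Round one: secondary embeddings for the central node and its neighbors.
secondary = aggregate(first, neighbors, ["film", "actor", "director"])

# Round two (assumption: starts from the central node's secondary embedding).
final_center = aggregate(secondary, neighbors, ["film"])["film"]

print(secondary["film"])  # [1.0, 0.0, 2.0, 3.0]
print(final_center)       # [1.0, 0.0, 2.0, 3.0, 2.0, 3.0, 2.0, 0.0]
```

Because each round concatenates rather than averages, the central node's final vector keeps its own attributes separate from what it learned about its one-hop and two-hop surroundings.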
The researchers report that in tests involving the integration of two Amazon movie databases, their system improved upon the best-performing of 10 baseline systems by 10% on a metric called area under the precision-recall curve (PRAUC), which summarizes the trade-off between precision (the share of predicted matches that are correct) and recall (the share of true matches that are found). Furthermore, compared with a baseline system called DeepMatcher, which was specifically designed with scalability in mind, the Amazon system reduced training time by 95%.
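For readers unfamiliar with the metric, the sketch below computes average precision, one common step-wise estimate of the area under the precision-recall curve. The article does not say which estimator the researchers used, and the scores and labels here are made-up values for four candidate entity pairs, not results from the study. The function assumes the scores are distinct.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve.

    scores: model confidence that each candidate pair is a match.
    labels: 1 if the pair is a true match, 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank candidate pairs from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)

    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            precision = true_pos / rank
            # Each true match found raises recall by 1 / total_pos.
            area += precision * (1 / total_pos)
    return area


# Made-up example: four candidate pairs, two of them true matches.
scores = [0.9, 0.8, 0.6, 0.4]
labels = [1, 0, 1, 0]
print(round(average_precision(scores, labels), 4))  # 0.8333
```

A score of 1.0 would mean every true match is ranked above every non-match; the false match ranked second in this example is what pulls the value down.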