A team of researchers at Facebook AI Research recently developed an algorithm, other-play (OP), that constructs strategies achieving high returns when paired with agents they weren't trained alongside (a form of zero-shot coordination). If the claims pan out, the work could greatly improve the performance of autonomous cars, which must coordinate zero-shot with unfamiliar drivers and obstacles on the road.
The researchers studied Markov games: games in which what happens next depends only on the current state, a set of variables summarizing the game's history, and the players' actions. (That state might be the current play in a repeated game, for instance, or some summary of the recent sequence of plays.) The games are partially observable, and their players, AI-driven agents, share a reward conditioned on the joint actions they take and the game state. Their common goal is to maximize the expected return.
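Concretely, and reconstructing the objective in our own notation rather than quoting the paper: if $J(\pi_a, \pi_b)$ denotes the expected return when player 1 follows policy $\pi_a$ and player 2 follows $\pi_b$, self-play simply maximizes $J(\pi, \pi)$. Other-play instead maximizes the return averaged over the game's symmetries $\Phi$, the relabelings of states and actions that leave payoffs unchanged:

$$
\pi^{\mathrm{OP}} = \arg\max_{\pi} \; \mathbb{E}_{\phi \sim \Phi} \, J\!\left(\pi, \phi(\pi)\right)
$$

Intuitively, an OP policy cannot rely on conventions that break under an arbitrary relabeling its partner might have adopted.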
In experiments, the team applied OP, which coordinates agents using the structure of the problem itself rather than arbitrary action labels, to a lever game in which an agent must coordinate with an unknown stranger by choosing one of 10 levers. They report that OP agents coordinated successfully both with their training partners and, zero-shot, with other independently trained OP agents at test time. By contrast, self-play agents, which discover strategies by playing the game with copies of themselves, achieved higher rewards during training but failed to coordinate with other, independently trained self-play agents.
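The lever game makes the failure mode easy to simulate. The sketch below is our own toy reconstruction, not the researchers' code; it assumes the payoff structure described in the paper, in which nine levers pay 1.0 and one distinguished lever pays 0.9, and both players score only if they pull the same lever.

```python
import random

# (Assumed payoffs) nine interchangeable levers pay 1.0; one odd lever pays 0.9.
PAYOFFS = [1.0] * 9 + [0.9]

def payoff(a, b):
    """Shared reward: nonzero only when both agents pull the same lever."""
    return PAYOFFS[a] if a == b else 0.0

def train_self_play(rng):
    """Self-play converges to any maximal-payoff lever; ties broken arbitrarily."""
    best = max(PAYOFFS)
    return rng.choice([i for i, p in enumerate(PAYOFFS) if p == best])

def train_other_play():
    """Under OP, the nine 1.0 levers are interchangeable under relabeling,
    so the only convention that survives symmetrization is the unique 0.9 lever."""
    counts = {}
    for p in PAYOFFS:
        counts[p] = counts.get(p, 0) + 1
    return next(i for i, p in enumerate(PAYOFFS) if counts[p] == 1)

rng = random.Random(0)
trials = 100_000
# Cross-play: pair two independently trained agents and average the reward.
sp_cross = sum(payoff(train_self_play(rng), train_self_play(rng))
               for _ in range(trials)) / trials
op_cross = payoff(train_other_play(), train_other_play())
print(f"self-play cross-play return ~ {sp_cross:.3f}")  # near 1/9: agents rarely agree
print(f"other-play cross-play return = {op_cross:.1f}")
```

Two independent self-play pairs agree only about one time in nine, so their cross-play return collapses, while every OP agent independently settles on the one lever that symmetry cannot disguise.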
The researchers next applied OP to the cooperative card game Hanabi. In Hanabi, each player holds a hand of cards (five, in the two- and three-player versions) that they cannot see but their teammates can. Each turn, a player must (1) reveal the suit or number of cards in another player’s hand, (2) discard a card, or (3) play a card that’s either a “1” in a suit that hasn’t been started or the next number sequentially in a suit that has. The team’s score is the sum of the highest card values played in each suit, a goal more challenging than it sounds. Revealing information consumes one of eight available information tokens, which can only be replenished by discarding a card or successfully playing a “5” of any suit. Meanwhile, a failed card play consumes one of three available fuse tokens; when all three are gone, the game ends.
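The token economy above is the core resource constraint of the game. Here is a minimal sketch of that bookkeeping, with class and method names of our own invention (this is not the researchers' code or any official Hanabi implementation):

```python
# Minimal sketch of Hanabi's token bookkeeping as described above:
# 8 information tokens for hints, 3 fuse tokens for failed plays.
class HanabiTokens:
    MAX_INFO = 8

    def __init__(self):
        self.info = self.MAX_INFO  # spent by hints, restored by discards and "5"s
        self.fuses = 3             # a failed play burns one; at 0 the game ends

    def hint(self):
        if self.info == 0:
            raise ValueError("no information tokens left; must discard or play")
        self.info -= 1

    def discard(self):
        self.info = min(self.MAX_INFO, self.info + 1)

    def play(self, successful, card_value):
        if successful:
            if card_value == 5:  # completing a suit refunds an information token
                self.info = min(self.MAX_INFO, self.info + 1)
        else:
            self.fuses -= 1
        return self.fuses > 0  # False once the fuses run out

t = HanabiTokens()
for _ in range(3):
    t.hint()
print(t.info)  # 5 tokens remain after three hints
```

Because hints are scarce, players lean on shared conventions to squeeze extra meaning out of each one, which is exactly where self-play agents and humans diverge.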
According to the researchers, OP improved cross-play by eliminating the “inhuman” conventions that emerge in self-play, which are often difficult (or impossible) for humans to interpret. (Without OP, for example, a self-play agent might use a hint about a certain color as an arbitrary signal to discard one card, while an independently trained partner interprets the same hint as an instruction to play another.)
In a final experiment, the researchers paired OP-trained agents with human players in Hanabi; all 20 players were recruited from a board game club, and none were experts. The team reports that the agents significantly outperformed the state-of-the-art self-play agent, winning 15 of the 20 per-seed comparisons and tying in 2. “These results do not suggest that OP will work in every zero-shot coordination [setting] where AI agents need to cooperate with humans,” wrote the study’s coauthors. “However, they are encouraging and suggest that OP is a fruitful research direction for the important problem of human-AI coordination.”
The researchers are careful not to claim that OP is a silver bullet for all zero-shot coordination problems. However, they say that it represents an “exciting new research direction for those interested in moving deep reinforcement learning beyond two-player, zero-sum environments to ones involving coordination and cooperation.”
“We have shown that a simple expansion of self-play … can construct agents that are better able to zero-shot coordinate with partners they have not seen before,” they said.