In August, Tencent announced it had developed an AI system capable of defeating teams of pros in a five-on-five match in Honor of Kings (or Arena of Valor, depending on the region). This was a noteworthy achievement: Honor of Kings occupies the video game subgenre known as multiplayer online battle arena games (MOBAs), which are incomplete information games in the sense that players are unaware of the actions chosen by other players. The endgame, then, isn’t merely an Honor of Kings AI that achieves superhuman performance, but insights that might be used to develop systems capable of solving some of society’s toughest challenges.
A paper published this week peels back the layers of Tencent’s technique, which the coauthors describe as “highly scalable.” They claim its novel strategies enable it to explore the game map “efficiently,” with an actor-critic architecture that self-improves over time.
As the researchers point out, real-time strategy games like Honor of Kings require highly complex action control compared with traditional board games and Atari games. Their environments also tend to be more complicated (Honor of Kings has 10^600 possible states and 10^18000 possible actions) and the objectives more complex on the whole. Agents must not only learn to plan, attack, and defend but also to control skill combos and to induce and deceive opponents, all while contending with hazards like creeps and fully automated turrets.
Tencent’s architecture consists of four modules: a Reinforcement Learning (RL) Learner, an Artificial Intelligence (AI) Server, a Dispatch Module, and a Memory Pool.
The AI Server, which runs on a single processor core thanks to some clever compression, dictates how the AI model interacts with objects in the game environment. It generates episodes via self-play, and based on the features it extracts from the game state, it predicts players’ actions and forwards them to the game core for execution. The game core then returns the next state and the corresponding reward value, or the value that spurs the model toward certain Honor of Kings goals.
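The loop described here (extract features, predict an action, hand it to the game core, receive the next state and a reward) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyGameCore`, `extract_features`, `predict_action`, and `generate_episode` are hypothetical names, and the "game" is a one-dimensional walk toward a goal, not anything resembling Honor of Kings or Tencent's actual code.

```python
import random


class ToyGameCore:
    """Hypothetical stand-in for the game core: a 1-D walk toward a goal."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0

    def step(self, action):
        # action is -1 or +1; the reward is 1.0 only when the goal is reached
        self.position += action
        done = self.position == self.goal
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def extract_features(state, goal):
    # Toy feature: signed distance from the current position to the goal
    return goal - state


def predict_action(features, rng, explore):
    # Toy policy: step toward the goal, with occasional random exploration
    if rng.random() < explore:
        return rng.choice([-1, 1])
    return 1 if features > 0 else -1


def generate_episode(seed=0, explore=0.1, max_steps=50):
    """Run one episode and record (features, action, reward) per step."""
    rng = random.Random(seed)
    core = ToyGameCore()
    state, episode = 0, []
    for _ in range(max_steps):
        features = extract_features(state, core.goal)
        action = predict_action(features, rng, explore)
        state, reward, done = core.step(action)
        episode.append((features, action, reward))
        if done:
            break
    return episode
```

In the real system the recorded samples would be handed off for training rather than returned to the caller; the sketch stops at collecting them.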
As for the Dispatch Module, it’s bundled with several AI Servers on the same machine, and it collects data samples consisting of rewards, features, action probabilities, and more before compressing and sending them to Memory Pools. The Memory Pool — which is also a server — supports samples of various lengths and data sampling based on the generated time, and it implements a circular queue structure that performs storage operations in a data-efficient fashion.
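A circular queue of the kind mentioned here keeps memory use fixed by letting the newest sample overwrite the oldest. The following is a minimal sketch, not Tencent's implementation: the class name `CircularMemoryPool` is hypothetical, and it samples uniformly at random as a simplification, whereas the paper's pool reportedly samples based on generation time.

```python
import random


class CircularMemoryPool:
    """Fixed-capacity ring buffer: the newest sample overwrites the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.write_index = 0  # next slot to fill or overwrite
        self.size = 0         # number of slots currently holding a sample

    def push(self, sample):
        self.slots[self.write_index] = sample
        self.write_index = (self.write_index + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement (a simplifying assumption)
        stored = self.slots[:self.size]
        return rng.sample(stored, min(batch_size, self.size))
```

Because storage is preallocated and writes only move an index, pushing a sample costs constant time no matter how long training runs.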
Lastly, the RL Learner, a distributed training environment, accelerates policy updates with the aforementioned actor-critic approach. Multiple RL Learners fetch data in parallel from Memory Pools, with which they communicate using shared memory. One mechanism (target attention) helps with enemy target selection, while another (long short-term memory, or LSTM, a recurrent neural network architecture capable of learning long-term dependencies) teaches hero players skill combos critical to inflicting “severe” damage.
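For readers unfamiliar with actor-critic methods, the core idea is that a critic estimates how good a state is, and the actor is nudged toward actions that turned out better than the critic expected. The sketch below shows a generic one-step version of that idea. It is not the objective used in the paper, which this article does not spell out, and the function name and the discount factor `gamma=0.99` are assumptions made for illustration.

```python
import math


def actor_critic_losses(action_prob, reward, value, next_value, gamma=0.99):
    """Generic one-step actor-critic losses for a single transition.

    action_prob: probability the actor assigned to the action it took
    value, next_value: the critic's estimates for the current and next state
    """
    # Advantage: how much better the outcome was than the critic predicted
    advantage = reward + gamma * next_value - value
    # Actor term: minimizing it raises the probability of actions with
    # positive advantage and lowers it for negative advantage
    policy_loss = -math.log(action_prob) * advantage
    # Critic term: squared error of the value estimate
    value_loss = advantage ** 2
    return policy_loss, value_loss
```

A training framework would average these terms over a batch and differentiate them with respect to the network parameters, treating the advantage in the actor term as a constant.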
The Tencent researchers’ system encodes image features and game state information such that each unit and enemy target is represented numerically. An action mask cleverly incorporates prior knowledge of experienced human players, preventing the AI from attempting to traverse physically “forbidden” areas of game maps (like challenging terrain).
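One common way to apply an action mask is to set the scores of disallowed actions to negative infinity before the softmax, so they receive exactly zero probability. The sketch below assumes that approach; the function name `masked_softmax` and the three-action example are hypothetical, and the paper may apply its mask differently.

```python
import math


def masked_softmax(logits, legal):
    """Softmax over action scores, with illegal actions forced to zero."""
    if not any(legal):
        raise ValueError("at least one action must be legal")
    # Illegal actions get -inf, so exp() maps them to exactly 0.0
    masked = [x if ok else float("-inf") for x, ok in zip(logits, legal)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Example: the second action (say, walking into a wall) is forbidden
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
```

The remaining probability mass is renormalized across the legal actions, so the agent never samples a forbidden move and wastes no exploration on it.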
In experiments, the paper’s coauthors ran the framework across a total of 600,000 cores and 1,064 graphics cards (a mixture of Nvidia Tesla P40s and Nvidia V100s), which crunched 16,000 features containing unconcealed unit attributes and game information. Training one hero required 48 graphics cards and 18,000 processor cores at a speed of about 80,000 samples per second per card. And collectively for every day of training, the system accumulated the equivalent of 500 years of human experience.
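To put those figures in perspective, a quick back-of-the-envelope calculation helps. The inputs below come straight from the numbers above; the 365-day year is an assumption, and the variable names are purely illustrative.

```python
# Per-hero training throughput
GPUS_PER_HERO = 48
SAMPLES_PER_SECOND_PER_GPU = 80_000
samples_per_second = GPUS_PER_HERO * SAMPLES_PER_SECOND_PER_GPU

# Experience gathered relative to wall-clock time
YEARS_OF_EXPERIENCE_PER_DAY = 500
DAYS_PER_YEAR = 365  # assumption: ignores leap years
speedup_over_real_time = YEARS_OF_EXPERIENCE_PER_DAY * DAYS_PER_YEAR

print(f"{samples_per_second:,} samples per second per hero")
print(f"{speedup_over_real_time:,}x faster than real-time play")
```

In other words, training a single hero consumes roughly 3.84 million samples every second, and the system as a whole gathers experience about 182,500 times faster than a human playing in real time.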
The AI’s Elo score, derived from a system for calculating the relative skill levels of players in zero-sum games, unsurprisingly increased steadily with training, the coauthors note. It became relatively stable within 80 hours, according to the researchers, and within just 30 hours it began to defeat the top 1% of human Honor of Kings players.
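For context, the standard Elo formulas are simple enough to sketch. The version below uses the conventional 400-point scale and a K-factor of 32 from chess; both are assumptions here, since the article does not say which parameters the researchers used.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=32):
    """Return new ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Zero-sum: whatever A gains, B loses
    return rating_a + delta, rating_b - delta
```

With two evenly matched players at 1500, a win moves the winner to 1516 and the loser to 1484. Beating a much stronger opponent yields a larger gain, which is why a steadily rising score signals real improvement.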
The system executes actions via the AI model every 133 milliseconds, or about the response time of a top-amateur player. Five professional players were invited to play against it, including “QGhappy.Hurt,” “WE.762,” “TS.NuanYang,” “QGhappy.Fly,” and “eStarPro.Cat,” as well as a “diversity” of players attending the ChinaJoy 2019 conference in Shanghai between August 2 and August 5.
The researchers note that despite eStarPro.Cat’s prowess with mage-type heroes, the AI achieved five kills per game but was killed itself only 1.33 times on average. In public matches, its win rate was 99.81% over 2,100 matches, and five of the eight AI-controlled heroes managed a 100% win rate.
The Tencent researchers say that they plan to make both their framework and algorithms open source in the near future, toward the goal of fostering research on complex games like Honor of Kings. They’re far from the only ones who plan to or who have already done so: DeepMind’s AlphaStar beat 99.8% of human StarCraft 2 players, while OpenAI’s OpenAI Five framework defeated a professional team twice in public matches.