A preprint paper coauthored by scientists at Facebook AI Research describes Rewarding Impact-Driven Exploration (RIDE), an intrinsic reward method that encourages AI-driven agents to take actions that significantly change their environment. The researchers say that it outperforms state-of-the-art methods on hard exploration tasks in procedurally generated worlds, a sign it might be a candidate for devices like robot vacuums that must often navigate new environments.
As the researchers explain, agents trained with reinforcement learning, where the goal is to spur an agent to complete tasks via systems of rewards, learn to act in new environments through trial and error. But many environments of interest, particularly those closer to real-world problems, don’t provide a steady stream of rewards for agents to learn from, so agents may need many episodes before they come across any reward at all.
The researchers’ proposed solution, RIDE, drives agents to try out actions that have a significant impact on the environment.
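The article does not spell out how "impact" is scored. As a rough illustration, the sketch below follows the formulation as we understand it from the RIDE paper: the reward is the size of the change between the learned representations of two consecutive states, scaled down by how often the new state has already been visited in the current episode. The embeddings and state labels here are hypothetical hand-picked values; in the actual method the representation would come from a trained encoder.

```python
import math
from collections import Counter


def impact_reward(emb_prev, emb_next, episodic_visits):
    """Illustrative impact-driven reward (not the authors' code).

    Returns the L2 distance between two consecutive state embeddings,
    divided by the square root of the number of times the new state has
    been visited in the current episode (counting this visit).
    """
    if episodic_visits < 1:
        raise ValueError("episodic_visits must include the current visit")
    change = math.sqrt(sum((b - a) ** 2 for a, b in zip(emb_prev, emb_next)))
    return change / math.sqrt(episodic_visits)


# Hypothetical episode: (previous embedding, next embedding, label of next state).
trajectory = [
    ((0.0, 0.0), (3.0, 4.0), "door_open"),  # large change, first visit
    ((3.0, 4.0), (3.0, 4.0), "door_open"),  # action changed nothing
    ((3.0, 4.0), (0.0, 0.0), "start"),      # large change, first visit
    ((0.0, 0.0), (3.0, 4.0), "door_open"),  # same change, third visit
]

visits = Counter()  # reset at the start of every episode
rewards = []
for emb_prev, emb_next, state_key in trajectory:
    visits[state_key] += 1
    rewards.append(impact_reward(emb_prev, emb_next, visits[state_key]))

print(rewards)
```

In this toy episode, the action that changes nothing earns no reward, and repeating the same high-impact transition pays less on each return visit within the episode, which discourages the agent from toggling between the same two states.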
The team evaluated RIDE in procedurally generated environments from the open source tool MiniGrid, where the world is a partially observable grid and each tile in the grid contains at most one object of a discrete color (a wall, door, key, ball, box, or goal). Separately, they tasked it with navigating levels in VizDoom, a Doom-based AI research platform for reinforcement learning. While VizDoom is visually more complex than MiniGrid, they’re both challenging domains in the sense that the chance of randomly stumbling upon extrinsic rewards is extremely low.
The researchers report that, compared with baseline algorithms, RIDE considers certain states to be “novel” or “surprising” even after long periods of training and after seeing similar states in the past or learning to almost perfectly predict the next state in a subset of the environment. As a consequence, its intrinsic rewards don’t diminish during training, and agents manage to distinguish actions that lead to novel or surprising states from those that do not, which keeps them from becoming trapped in some parts of the state space.
“RIDE has a number of desirable properties,” wrote the study’s coauthors. “It attracts agents to states where they can affect the environment, it provides a signal to agents even after training for a long time, and it is conceptually simple as well as compatible with other intrinsic or extrinsic rewards and any deep [reinforcement learning] algorithm … Furthermore, RIDE explores procedurally generated environments more efficiently than other exploration methods.”
The researchers leave improvements to RIDE to future work, such as making use of symbolic information to measure the agent’s impact or considering longer-term effects of the agent’s actions. They also hope to investigate algorithms that can distinguish between desirable and undesirable types of impact, effectively constraining the agent to act safely and avoid distractions.