Reinforcement learning agents (AI that’s progressively spurred toward goals via rewards or punishments) form the foundation of self-driving cars, dexterous robots, and drug discovery systems. But because they’re predisposed to explore unfamiliar states, they’re susceptible to what’s called the safe exploration problem, wherein they stumble into unsafe states while exploring (a mobile robot driving into a ditch, say).
That’s why researchers at Alphabet’s DeepMind investigated, in a paper, a two-phase method for reward modeling that’s applicable to environments in which agents don’t know where unsafe states might be. The researchers say their approach not only successfully trains a reward model to detect unsafe states without visiting them, but can also correct reward hacking (in which agents exploit loopholes in the reward specification) before the agent is deployed, even in new and unfamiliar environments.
Interestingly, their work comes shortly after the release of San Francisco-based research firm OpenAI’s Safety Gym, a suite of tools for developing AI that respects safety constraints while training and for comparing algorithms’ “safety,” measured by the extent to which they avoid mistakes while learning. Safety Gym similarly targets reinforcement learning agents with “constrained reinforcement learning,” a paradigm that requires AI systems to make trade-offs to achieve defined outcomes.
The DeepMind team’s approach encourages agents to explore a range of states through hypothetical behaviors generated by two systems: a generative model of initial states and a forward dynamics model, both trained on data like random trajectories or safe expert demonstrations. A human supervisor labels the behaviors with rewards, and the agents interactively learn policies to maximize their rewards. Only after the agents have successfully learned to predict rewards and unsafe states are they deployed to perform desired tasks.
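To make the idea concrete, here is a minimal Python sketch (not DeepMind's implementation) of how a hypothetical behavior could be rolled out inside learned models instead of the real environment. The functions `sample_initial_state`, `predict_next_state`, and `simulated_supervisor_label` are hypothetical, hand-written stand-ins for the generative model, the forward dynamics model, and the human supervisor, and the one-dimensional state is an assumption made for brevity.

```python
import random


def sample_initial_state(rng):
    # Hypothetical stand-in for a learned generative model of initial states.
    return rng.uniform(-1.0, 1.0)


def predict_next_state(state, action):
    # Hypothetical stand-in for a learned forward dynamics model.
    return state + 0.1 * action


def synthesize_trajectory(actions, rng):
    """Roll out a hypothetical trajectory using only the two models above."""
    state = sample_initial_state(rng)
    trajectory = []
    for action in actions:
        next_state = predict_next_state(state, action)
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


def simulated_supervisor_label(transition, unsafe_threshold=1.5):
    # Stand-in for the human supervisor (assumption): a transition is
    # "unsafe" if it ends too far from the origin, otherwise "neutral".
    _, _, next_state = transition
    return "unsafe" if abs(next_state) > unsafe_threshold else "neutral"


rng = random.Random(0)
trajectory = synthesize_trajectory([1.0] * 10, rng)

# Labeled transitions like these would become training data for a reward model.
labeled = [(t, simulated_supervisor_label(t)) for t in trajectory]
```

Nothing in this sketch calls a real environment: every state comes from the models, which is what would let a supervisor label unsafe states without the agent ever visiting them.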
As the researchers point out, the key idea is the active synthesis of hypothetical behaviors from scratch to make them as informative as possible, without interacting with the environment directly. The DeepMind team calls it reward query synthesis via trajectory optimization, or ReQueST, and explains that it generates four types of hypothetical behaviors in total. The first type maximizes the uncertainty of an ensemble of reward models (to elicit labels for behaviors with the highest information value). The second maximizes predicted rewards (to surface behaviors for which the reward model might be incorrectly predicting high rewards, which is to say reward hacking), while the third minimizes predicted rewards (to add potentially unsafe behaviors to the training data so the reward model can learn about unsafe states). As for the fourth category of behavior, it maximizes the novelty of trajectories so as to encourage exploration regardless of predicted rewards.
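The four objectives can be illustrated with a toy Python sketch. It is a simplification under stated assumptions: it scores a fixed list of candidate trajectories rather than running the trajectory optimization the method's name refers to, it uses variance across ensemble members as the uncertainty measure, and it uses distance to previously seen states as the novelty measure. The ensemble, candidates, and helper names are all hypothetical.

```python
import statistics


def predict_rewards(ensemble, trajectory):
    """Total predicted reward of a trajectory under each ensemble member."""
    return [sum(model(state) for state in trajectory) for model in ensemble]


def uncertainty(ensemble, trajectory):
    # Disagreement between ensemble members, measured as variance (assumption).
    return statistics.pvariance(predict_rewards(ensemble, trajectory))


def mean_reward(ensemble, trajectory):
    return statistics.mean(predict_rewards(ensemble, trajectory))


def novelty(trajectory, seen_states):
    # Sum of each state's distance to its nearest previously seen state.
    return sum(min(abs(s - seen) for seen in seen_states) for s in trajectory)


def pick_query(candidates, score):
    """Return the candidate trajectory with the highest score."""
    return max(candidates, key=score)


# Toy ensemble: the third member disagrees with the others once |state| >= 2.
ensemble = [
    lambda s: s,
    lambda s: s,
    lambda s: s if abs(s) < 2.0 else -s,
]
seen_states = [0.0, 1.0]
candidates = [
    [0.0, 1.0],
    [1.5, 1.9],
    [2.0, 2.1],
    [-1.0, -1.5],
    [-1.9, 1.9],
]

queries = {
    "max_uncertainty": pick_query(candidates, lambda t: uncertainty(ensemble, t)),
    "max_reward": pick_query(candidates, lambda t: mean_reward(ensemble, t)),
    "min_reward": pick_query(candidates, lambda t: -mean_reward(ensemble, t)),
    "max_novelty": pick_query(candidates, lambda t: novelty(t, seen_states)),
}
```

In this toy setup each objective selects a different candidate, which is the point of using several objectives: together they surface behaviors that a single criterion would miss.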
Finally, once the reward model reaches a satisfactory state, a planning-based agent is deployed, one that leverages model-predictive control (MPC) to pick actions optimized for the learned rewards. Unlike model-free reinforcement learning algorithms, which learn through trial and error, MPC enables agents to avoid unsafe states by using the dynamics model to anticipate the consequences of their actions.
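A minimal sketch of planning with MPC, assuming a simple "random shooting" planner (a common MPC variant chosen here for illustration, not necessarily the planner used in the paper): sample many action sequences, imagine each one with the dynamics model, score it with the reward model, and execute only the first action of the best sequence. The models `predict_next_state` and `predicted_reward`, the goal at 5.0, and the unsafe region beyond 6.0 are all hypothetical.

```python
import random


def predict_next_state(state, action):
    # Hypothetical learned dynamics model: a 1D position shifted by the action.
    return state + action


def predicted_reward(state):
    # Hypothetical learned reward model: goal at 5.0, unsafe beyond 6.0.
    if state > 6.0:
        return -100.0
    return -abs(state - 5.0)


def plan_action(state, rng, horizon=5, num_candidates=200):
    """Random-shooting MPC: return the first action of the best imagined plan."""
    best_return, best_action = float("-inf"), 0.0
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        imagined, total = state, 0.0
        for action in actions:
            imagined = predict_next_state(imagined, action)
            total += predicted_reward(imagined)
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action


rng = random.Random(0)
state = 0.0
visited = []
for _ in range(30):
    # For simplicity the "real" environment here is identical to the model.
    state = predict_next_state(state, plan_action(state, rng))
    visited.append(state)
```

Because each plan is evaluated in imagination before anything is executed, a plan that would enter the unsafe region is penalized and discarded without the agent ever going there. In practice the learned models are imperfect, so this guarantee is only as good as the dynamics and reward models themselves.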
“To our knowledge, ReQueST is the first reward modeling algorithm that safely learns about unsafe states and scales to training neural network reward models in environments with high-dimensional, continuous states,” wrote the coauthors of the study. “So far, we have only demonstrated the effectiveness of ReQueST in simulated domains with relatively simple dynamics. One direction for future work is to test ReQueST in 3D domains with more realistic physics and other agents acting in the environment.”