For this journal club, I wanted to move from the big-picture worries about AGI safety toward something more concrete. Two papers pair up nicely for that: “Concrete Problems in AI Safety,” which lays out a tidy list of practical safety challenges, and “AI Safety Gridworlds” from DeepMind, which turns those challenges into tiny game-like environments we can actually test agents in.
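The first paper steps away from doomsday scenarios and asks a simpler question: what can go wrong with the systems we’re building right now? It groups the answers into five concrete problems: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.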
DeepMind’s paper is a direct response to that list. Instead of arguing about safety in the abstract, they built a suite of simple benchmark environments, open on GitHub in the same spirit as ImageNet or the Arcade Learning Environment. Reinforcement learning agents move around a grid no bigger than 10x10, choosing from four actions: left, right, up, and down. The environments are deliberately small and tractable, yet each one isolates a single interesting safety problem.
The clever twist is that every environment has two scoring functions. There’s the reward function R that the agent sees and tries to maximize, and a separate, hidden safety performance function P that we use to judge whether the agent actually behaved safely. The gap between those two is exactly where safety problems live. The environments fall into two families.
The first family is specification problems. These are cases where the reward function and the safety performance function are not aligned, so the agent can rack up reward while still doing the wrong thing.
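To make the gap between R and P concrete for myself, I wrote a minimal sketch. It is not one of the published environments and does not use DeepMind's code: the layout, the fragile "vase" cell, the class name `ToyGridworld`, and the specific numbers (step cost of 1, goal bonus of 50, hidden penalty of 10) are all my own assumptions for illustration. The agent only ever sees the visible reward, while the evaluator also tracks a hidden score that penalizes breaking the vase.

```python
from dataclasses import dataclass, field

# Hypothetical toy layout, not one of the published environments.
# A = agent start, G = goal, V = fragile vase, . = free cell
LAYOUT = [
    "A.V.G",
    ".....",
]

MOVES = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}


@dataclass
class ToyGridworld:
    pos: tuple = (0, 0)
    visible_reward: int = 0      # R: what the agent observes and maximizes
    hidden_performance: int = 0  # P: what the evaluator scores
    broken: set = field(default_factory=set)
    done: bool = False

    def step(self, action):
        dr, dc = MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        # Moves that would leave the grid keep the agent in place.
        if 0 <= r < len(LAYOUT) and 0 <= c < len(LAYOUT[0]):
            self.pos = (r, c)

        reward = -1   # assumed step cost
        penalty = 0   # only the hidden score sees this
        cell = LAYOUT[self.pos[0]][self.pos[1]]
        if cell == "V" and self.pos not in self.broken:
            self.broken.add(self.pos)
            penalty = -10  # assumed side-effect penalty
        if cell == "G":
            reward += 50   # assumed goal bonus
            self.done = True

        self.visible_reward += reward
        self.hidden_performance += reward + penalty
        return reward  # the agent never receives the penalty


def run(actions):
    """Play a fixed action sequence and return (R, P)."""
    env = ToyGridworld()
    for action in actions:
        env.step(action)
        if env.done:
            break
    return env.visible_reward, env.hidden_performance


direct = run(["right"] * 4)  # shortest path, walks through the vase
careful = run(["down", "right", "right", "right", "right", "up"])  # detour

print("direct  (R, P):", direct)
print("careful (R, P):", careful)
```

In this toy setup the direct route earns the higher visible reward, yet the careful detour earns the higher hidden score. A pure reward maximizer has no reason to prefer the detour, which is the kind of mismatch this family of environments is meant to expose.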
The second family is robustness problems. Here the reward function and the safety performance function agree, yet problems still arise from how the agent learns or generalizes.
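A rough way to picture a generalization failure, even when reward and safety performance are the same number, is an agent that has effectively memorized one layout. The sketch below is my own simplification, not the paper's environment or code: the layouts, the lava cell `L`, the reward values, and the fixed action list standing in for an overfit policy are all assumptions made for illustration.

```python
MOVES = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

# Hypothetical layouts: A = start, G = goal, L = lava, . = free cell
TRAIN_LAYOUT = [
    "A...G",
    ".....",
    ".LLL.",
]
TEST_LAYOUT = [  # same task, but the lava has moved
    "A.L.G",
    ".....",
    ".....",
]


def find_start(layout):
    """Return the (row, col) of the start cell 'A'."""
    for r, row in enumerate(layout):
        c = row.find("A")
        if c != -1:
            return (r, c)
    raise ValueError("layout has no start cell 'A'")


def rollout(layout, actions):
    """Play a fixed action sequence; here reward and performance coincide."""
    pos = find_start(layout)
    total = 0
    for action in actions:
        dr, dc = MOVES[action]
        r, c = pos[0] + dr, pos[1] + dc
        # Moves that would leave the grid keep the agent in place.
        if 0 <= r < len(layout) and 0 <= c < len(layout[0]):
            pos = (r, c)
        total -= 1  # assumed step cost
        cell = layout[pos[0]][pos[1]]
        if cell == "L":
            return total - 50  # assumed lava penalty, episode ends
        if cell == "G":
            return total + 50  # assumed goal bonus, episode ends
    return total


# A stand-in for a policy that overfit to the training layout.
memorized_route = ["right"] * 4

train_return = rollout(TRAIN_LAYOUT, memorized_route)
test_return = rollout(TEST_LAYOUT, memorized_route)

print("train return:", train_return)
print("test return: ", test_return)
```

Nothing about the objective changed between the two layouts. The route that was optimal in training simply walks into lava once the layout shifts, so the failure comes from generalization and not from a misleading reward.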
What I appreciate about the gridworlds is how they make fuzzy worries testable. When standard agents run in these environments, they often score well on the visible reward while quietly failing the hidden safety measure, which is the whole point. It shows that unsafe behavior isn’t some far-off science fiction problem; it falls right out of ordinary reward maximization.
A few themes stood out in our discussion. Many of these problems push us away from trying to control an agent after the fact and toward helping it learn what we actually value in the first place. Robustness starts to look less like a single fix and more like a subgoal worth building in from the start. And the authors are upfront that these toy grids are only a beginning. The hope is to grow the suite toward richer, more realistic settings, eventually 3D worlds with real physics. More than once we came back to parenting analogies, since a lot of this really is about how you guide a learner toward good behavior when you can’t spell out every rule in advance.