AI Safety Gridworlds

Here’s the outline for a journal paper presentation and discussion. I hope to return and fill it in with complete sentences later, if I can find some more time.

Concrete Problems in AI Safety

Avoiding Negative Side Effects
Avoiding Reward Hacking
Scalable Oversight
Safe Exploration
Robustness to Distributional Shift

AI Safety Gridworlds

Response to paper Concrete Problems in AI Safety
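Test suite of benchmark problems in shared environments
Open source on GitHub, a common benchmark in the spirit of ImageNet or the Arcade Learning Environment (Atari)
Reinforcement learning agents from DeepMind
Gridworlds of at most 10x10 cells, with action set A = {left, right, up, down}
Complex enough to be interesting, but simple enough to be tractable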
Reward function R vs a hidden Safety Performance function P
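To make the split between R and P concrete, here is a minimal sketch of a toy gridworld in which the agent only ever sees the reward, while a hidden performance score is tracked on the side. Everything in it is an assumption made for illustration: the class name `ToyGridworld`, the layout, the "X" unsafe cell, and the numeric values are hypothetical and are not the API or the environments of the published suite.

```python
from dataclasses import dataclass

# Action set A = {left, right, up, down} as (row, col) offsets.
ACTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

# Hypothetical layout: '#' wall, ' ' floor, 'G' goal, 'X' unsafe cell.
LAYOUT = (
    "#####",
    "# XG#",
    "#   #",
    "#####",
)
START = (1, 1)


@dataclass
class ToyGridworld:
    """Toy environment with an observed reward R and a hidden performance P."""
    pos: tuple = START
    hidden_performance: float = 0.0  # P: never returned by step()

    def step(self, action):
        dr, dc = ACTIONS[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if LAYOUT[r][c] != "#":  # walls block movement
            self.pos = (r, c)
        cell = LAYOUT[self.pos[0]][self.pos[1]]
        reward = -1.0            # assumed step cost
        if cell == "G":
            reward += 50.0       # assumed goal reward
        # P equals the reward plus a penalty the agent never observes.
        self.hidden_performance += reward
        if cell == "X":
            self.hidden_performance -= 10.0  # assumed hidden penalty
        return self.pos, reward, cell == "G"


def run(actions):
    """Play a fixed action sequence; return (total reward, hidden performance)."""
    env = ToyGridworld()
    total_reward = 0.0
    for action in actions:
        _, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward, env.hidden_performance


shortcut = run(["right", "right"])               # crosses the unsafe cell
detour = run(["down", "right", "right", "up"])   # walks around it
print(shortcut, detour)
```

In this toy setup the shortcut earns more reward than the detour but scores lower on the hidden performance, which is the kind of gap the specification environments are built around.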

Specification Problem Environments

When the reward function and the safety performance function are not aligned.

Safe Interruptibility

How can we design agents that neither seek nor avoid interruptions?

Off-Switch Environment

Avoiding Side Effects

How can we get agents to minimize effects unrelated to their main objectives, especially those that are irreversible or difficult to reverse?

Irreversible Side Effects Environment

Absent Supervisor

How can we make sure an agent does not behave differently depending on the presence or absence of a supervisor?

Absent Supervisor Environment

Reward Gaming

How can we build agents that do not try to introduce or exploit errors in the reward function in order to get more reward?

Boat Race & Tomato Watering Environments
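The boat race idea can be sketched with a one-dimensional ring instead of a grid. This is a simplified illustration, not the environment from the paper: the track length, checkpoint positions, reward values, and the function name `rollout` are all assumptions chosen so the arithmetic is easy to follow. The reward pays for entering a checkpoint in the clockwise direction, while the hidden performance counts net clockwise progress.

```python
TRACK_LENGTH = 12             # assumed number of cells on the ring
CHECKPOINTS = {1, 4, 7, 10}   # assumed checkpoint cells
CHECKPOINT_REWARD = 3         # assumed reward for a clockwise checkpoint entry
STEP_COST = 1                 # assumed cost per move


def rollout(moves, start=0):
    """Return (total reward, net clockwise progress) for a list of moves.

    Each move is +1 (clockwise) or -1 (counter-clockwise).
    """
    pos, total_reward, progress = start, 0, 0
    for move in moves:
        pos = (pos + move) % TRACK_LENGTH
        total_reward -= STEP_COST
        # The reward only checks the direction of entry, not real progress.
        if move == 1 and pos in CHECKPOINTS:
            total_reward += CHECKPOINT_REWARD
        progress += move  # hidden performance: net clockwise steps
    return total_reward, progress


lapping = rollout([1] * 12)        # one full clockwise lap
shuttling = rollout([1, -1] * 6)   # step onto a checkpoint and back, repeatedly
print(lapping, shuttling)
```

With these assumed numbers, shuttling back and forth over one checkpoint collects more reward than completing a lap while making no progress at all, which is the reward gaming behavior the question above asks how to avoid.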

Robustness Problem Environments

When the reward function and the safety performance function agree, but problems still arise.

Self-modification

How can we design agents that behave well in environments that allow self-modification?

Whisky & Gold Environment

Distributional shift

How do we ensure that an agent behaves robustly when its test environment differs from the training environment?

Lava World Environment

Robustness to Adversaries

How does an agent detect and adapt to friendly and adversarial intentions present in the environment?

Friend or Foe Environment

Safe exploration

How can we build agents that respect the safety constraints not only during normal operation, but also during the initial learning period?

Island Navigation Environment
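One simple way to think about respecting a constraint during learning is to mask unsafe actions while exploring. The sketch below is my own illustration and not a solution proposed in the paper: it assumes the agent is told which cells are water, and the island layout and the helper names `safe_actions` and `explore` are hypothetical.

```python
import random

ACTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

# Hypothetical island: 'W' water, ' ' land, 'G' goal.
ISLAND = (
    "WWWWWW",
    "W   GW",
    "W  WWW",
    "W    W",
    "WWWWWW",
)


def is_water(pos):
    return ISLAND[pos[0]][pos[1]] == "W"


def safe_actions(pos):
    """Actions that do not move the agent into water (assumes water is known)."""
    return [
        name
        for name, (dr, dc) in ACTIONS.items()
        if not is_water((pos[0] + dr, pos[1] + dc))
    ]


def explore(start, steps, rng, masked):
    """Random exploration; returns True if the agent never entered water."""
    pos = start
    for _ in range(steps):
        choices = safe_actions(pos) if masked else list(ACTIONS)
        dr, dc = ACTIONS[rng.choice(choices)]
        pos = (pos[0] + dr, pos[1] + dc)
        if is_water(pos):
            return False  # constraint violated, episode ends
        if ISLAND[pos[0]][pos[1]] == "G":
            break         # reached the goal
    return True


print(explore((3, 1), 200, random.Random(0), masked=True))
```

Masking only works here because the constraint is handed to the agent up front. When the agent has to discover which cells are unsafe, the problem becomes much harder, and that is the case the question above is really about.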

Conclusions & Discussion

Solutions to environments
Unfair specification problems
Robustness as a subgoal
Reward learning & specification
Outlook: test suite, 3D with physics, diverse, realistic
Parenting analogies

Resources
