Related work

This page is a non-exhaustive map for readers who already know the safety and games literature; it is not a systematic survey. The technical report (summary) on this site carries the main measurable claims, and the FAQ entry on novelty & scope overlaps with the introduction below.

Introduction

PSDG is not offered as the first small environment to expose safety or specification failures, nor as the first exact benchmark built around games. The contribution is narrower and, in some ways, more operational: several familiar concerns are concentrated in one small, human-readable, exactly solved, reproducible environment with a shared oracle. PSDG is best read next to adjacent traditions rather than as a freestanding break from them.

1. Practical AI safety problem framing

Concrete Problems in AI Safety — Amodei, Olah, Steinhardt, Christiano, Schulman, Mané (2016).

That paper helped shape the modern practical safety agenda around five problem clusters: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift. Its role is largely conceptual: a vocabulary for how capable systems can still go wrong under a stated objective. It is not, however, a single small exact game with a shared oracle and pinned deployment protocols.

2. Small environments for safety failures

AI Safety Gridworlds — Leike, Martic, Krakovna, Ortega, Everitt, Lefrancq, Orseau, Legg (2017, DeepMind).

A closer precedent in spirit is AI Safety Gridworlds: compact RL environments illustrating problems such as safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, and robustness to self-modification, distributional shift, and adversaries. Each environment uses a performance function hidden from the agent to score "intended" behavior, so observed reward can diverge from true performance. That divergence is kin to PSDG's theme of proxy versus latent structure.

The main difference is the shape of the artifact: Gridworlds is a suite of toy worlds for training-time behavior; PSDG is one small, tabletop-checkable, fully specified game with exact values and explicit deployment rows (static vs re-solving, Exchange timing).

A further distinction is the evaluation substrate. Gridworlds scores agents against hidden objectives; it does not, by design, center a queryable oracle that returns exact game-theoretic value, optimal actions, and per-move regret at arbitrary nodes. PSDG's oracle is a shared ruler that answers not only "did it win?" but also how far each decision fell short of ground truth and which protocol was in force, so many disagreements can be traced to code and seeds rather than argued by narrative alone.
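To make the oracle idea concrete, here is a minimal sketch of what a query returning value, optimal actions, and per-move regret could look like. It is an illustration under stated assumptions, not PSDG's oracle: the game is a hypothetical stand-in (a subtraction game on one pile), and the names `value` and `query` are invented for this example.

```python
from functools import lru_cache

# Hypothetical stand-in game, NOT the PSDG rules: players alternate removing
# 1 or 2 stones from a pile, and whoever takes the last stone wins.
ACTIONS = (1, 2)

def legal(pile):
    """Actions available at a non-empty pile."""
    return [a for a in ACTIONS if a <= pile]

@lru_cache(maxsize=None)
def value(pile):
    """Exact value for the player to move: +1 for a win, -1 for a loss."""
    if pile == 0:
        return -1  # the previous player took the last stone
    return max(-value(pile - a) for a in legal(pile))

def query(pile):
    """Oracle query at a non-terminal node (pile > 0)."""
    # Value of each action from the mover's point of view.
    q = {a: -value(pile - a) for a in legal(pile)}
    v = max(q.values())
    return {
        "value": v,
        "optimal": [a for a, x in q.items() if x == v],
        # Regret of an action: how far it falls short of the exact value.
        "regret": {a: v - x for a, x in q.items()},
    }

print(query(4))
```

At a pile of 4 the query reports a won position, a single optimal action (take 1), and a regret of 2 for taking 2 instead, since that move turns a win (+1) into a loss (-1). The point of the sketch is only the shape of the answer: every move gets an exact shortfall, not just a final win or loss.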

3. Specification gaming and proxy failure

Specification gaming: the flip side of AI ingenuity — Krakovna, Uesato, Mikulik, Rahtz, Everitt, Kumar, Kenton, Leike, Legg (Google DeepMind blog, 2020).

Specification gaming is behavior that satisfies the letter of an objective without achieving its intent. That line of work is usually presented as a gallery of examples across domains. PSDG sits in the same lineage while aiming at a different packaging: one compact, exact place where mismatch can be checked against oracle truth. See also the Mortal vs Oracle parable on this site.

4. Game-theoretic and benchmarking infrastructure

OpenSpiel: A Framework for Reinforcement Learning in Games — Lanctot et al. (2019).

OpenSpiel is infrastructure: many games, algorithms, and evaluation patterns—including simultaneous and imperfect-information settings. PSDG is not a generic multi-game framework; it is one deliberately small environment aimed at representation, irrevocable commitment, and deployment brittleness with pinned claims (e.g. empirical snapshot, game theory).

5. What PSDG appears to add

Against that background, the novelty looks combinatorial: a tight concentration of known concerns rather than a claim to have invented a wholly new failure class.

The citations above do not, taken alone, supply this project’s combination: an exact oracle that returns value, optimal actions, and regret under a published embedding, plus protocol-variation rows that separate optimization under a proxy from deployment-safe realized play. That oracle is what lets PSDG pin many statements to mathematics and reproducible runs, not only to learned behavior trends.

Five properties co-occur here in one benchmark:

Small and human-readable — the object is inspectable without a large codebase.
Exactly solved — claims can be tied to oracle truth, not only to sample paths.
Latent commitment structure — a visible “board photo” need not summarize the state the rules actually care about (rules in brief).
Deployment-protocol variation — ex ante value vs ex post outcome under static vs re-solving and timing choices (deployment gap).
Reproducible — seeded suites; disputes can be narrow and technical.

Together, that supports a sharp counterexample to a strong sufficiency story: exact value under one embedding does not, by itself, imply deployment-safe play under another explicit protocol—a theme the technical report states with numbers.
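The gap between ex ante value and ex post outcome can be sketched on a toy game. The example below is a hypothetical illustration, not PSDG's rules or its protocol rows: it reuses a one-pile subtraction game, and `static_plan`, `play`, and the fixed-reply opponent are assumptions made for this sketch. A "static" player solves once against a modeled opponent and freezes its moves; a "re-solving" player queries exact values at every state it actually reaches.

```python
from functools import lru_cache

# Hypothetical toy, NOT the PSDG rules: players alternate removing 1 or 2
# stones from a pile; whoever takes the last stone wins.

@lru_cache(maxsize=None)
def value(pile):
    """Exact value for the player to move: +1 for a win, -1 for a loss."""
    if pile == 0:
        return -1
    return max(-value(pile - a) for a in (1, 2) if a <= pile)

def best_action(pile):
    """An optimal action at this pile (ties go to the smaller move)."""
    return max((a for a in (1, 2) if a <= pile), key=lambda a: -value(pile - a))

def static_plan(pile, assumed_reply):
    """Solve once at the root against a modeled opponent; freeze our moves."""
    plan = []
    while pile > 0:
        a = best_action(pile)
        plan.append(a)
        pile -= a
        if pile > 0:
            pile -= min(assumed_reply, pile)
    return plan

def play(pile, our_move, actual_reply):
    """Realized outcome for us: +1 if we take the last stone, else -1."""
    step = 0
    while True:
        pile -= min(our_move(pile, step), pile)
        if pile == 0:
            return 1
        pile -= min(actual_reply, pile)
        if pile == 0:
            return -1
        step += 1

plan = static_plan(7, assumed_reply=1)
static = lambda pile, step: plan[step] if step < len(plan) else 1
resolve = lambda pile, step: best_action(pile)

print("ex ante value at root:", value(7))
print("static, opponent as modeled:", play(7, static, actual_reply=1))
print("static, opponent deviates:", play(7, static, actual_reply=2))
print("re-solving, opponent deviates:", play(7, resolve, actual_reply=2))
```

In this toy the root is exactly won under both protocols, yet the frozen plan loses once the opponent departs from the model, while re-solving keeps the win. That is only an analogy for the pattern described above; the technical report is the place for PSDG's actual protocols and numbers.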

Conclusion

PSDG is best read as a small exact benchmark where several familiar problems are jointly visible under one oracle and one reproducible protocol. Nearest neighbors include practical safety framing, Safety Gridworlds, specification-gaming expositions, and multi-game frameworks such as OpenSpiel. What reads as distinctive is the concentration: proxy misspecification, state aliasing, and deployment brittleness in a single checkable game—and an oracle that makes many claims falsifiable with modest tooling.

On this site

Home · Technical report (summary) · FAQ — novelty & scope · PSDG for ML · PSDG for AI safety · PSDG for game theory
