Appearance
PSDG for AI Safety — At a Glance
On this site: Home · AI safety (full) · FAQ · Parable · Empirical snapshot
Even an exact solver can lose to a worse opponent in a game with no hidden information and no mid-game randomness — because irrevocable draft commitments fix a payoff-relevant future that the visible board does not yet make legible.
After setup, play is **deterministic** and **fully observed**—so the failure is hard to blame on luck, noise, or ambiguous labeling alone.
Learn basic play in about 3 minutes — Watch on YouTube. Tiebreak / Immortal takes a few more minutes — demo & script.
The result: On the published 5,000-game blunder suite, strictly worse play still defeats an oracle-derived static policy ~5.7%–~8.5% of the time. How the table runs the midgame swap matters: lock in an advance plan vs update after the realized branch, and in turn vs at once at that decision. The three headline pins land near ~8.5%, ~6.9%, and ~5.7%. Reproducible from published seeds and an exact Python solver. See: empirical snapshot — full table with protocol labels (P) where the site pins them formally.

The Playmat

The complete board after initial setup; deterministic play begins here.
The mechanism: In PSDG, irrevocable draft commitments fix a future that is not yet legible from the visible board. A policy that optimizes the salient signal (current scoring) can converge to the wrong objective while remaining blind to the latent structure that actually determines the winner. The failure is structural — it survives perfect optimization, perfect rule knowledge, and full observability.
Why this matters to AI safety research:
- Proxy misspecification is measurable. The parable shows a Q-learning agent converging to Q = 1.00 on a visible reward and then losing catastrophically when a latent tiebreaker activates.
- Deployment brittleness is protocol-dependent. Freezing an ex ante plan (static Exchange) yields an 8.5% loss rate; re-solving on the realized node cuts it to 5.7%. The difference is what you committed to, not how well you optimized.
- Human-in-the-loop does not automatically escape the bind. An overseer shares the learner's difficulty: the visible board aliases distinct positions in the true game tree. Care alone is not enough without retrograde clarity comparable to the oracle.
- More optimization sharpens the mistake, not fixes it. The Mortal agent is provably optimal on the training objective. The problem is the objective, not the optimizer.
Deeper Dive:
- Parable (~60 seconds) — the minimal case: optimal on the proxy, catastrophic on the true rule
- Q-learning / bandit demo — runnable code, exact regret numbers
- AI safety — full analysis — oversight bind, postmortem problem, three handles
- Empirical snapshot — the 5,000-game blunder table with protocol splits
- Solver + benchmarks (GitHub) — MIT license, reproduce the numbers
If you take one thing:
If your overseer shares the agent’s representation of the world, what exactly are they checking?
Detailed report: Technical report (summary)
