PSDG for AI Safety — At a Glance

Even an exact solver can lose to a worse opponent in a game with no hidden information and no mid-game randomness — because irrevocable draft commitments fix a payoff-relevant future that the visible board does not yet make legible.

After setup, play is **deterministic** and **fully observed**—so the failure is hard to blame on luck, noise, or ambiguous labeling alone.

Learn basic play in about 3 minutes — Watch on YouTube. Tiebreak / Immortal takes a few more minutes — demo & script.

The result: On the published 5,000-game blunder suite, strictly worse play still defeats an oracle-derived static policy roughly 5.7% to 8.5% of the time. The rate depends on how the table runs the midgame swap, along two axes: whether the plan is locked in in advance or updated after the realized branch, and whether that decision is made in turn or at once. The three headline pins land near 8.5%, 6.9%, and 5.7%. The results are reproducible from published seeds and an exact Python solver. See the empirical snapshot for the full table, with protocol labels (P) where the site pins them formally.
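To give the percentages a sense of scale, the short Python sketch below converts the three approximate headline rates into rough game counts. It assumes, without confirmation from the published table, that each rate is a fraction of the same 5,000 games; the names `SUITE_SIZE` and `HEADLINE_RATES` are illustrative and do not come from the site's solver.

```python
# Assumption: every headline rate is measured over the same 5,000-game suite.
SUITE_SIZE = 5_000
HEADLINE_RATES = (0.085, 0.069, 0.057)  # approximate pins quoted in the text

# Rough number of games in which the weaker side wins, per headline rate.
approx_losses = [round(rate * SUITE_SIZE) for rate in HEADLINE_RATES]

# Spread between the highest and lowest headline rate.
highest, lowest = max(HEADLINE_RATES), min(HEADLINE_RATES)
absolute_gap_pp = (highest - lowest) * 100       # in percentage points
relative_drop = (highest - lowest) / highest     # as a share of the highest rate

for rate, games in zip(HEADLINE_RATES, approx_losses):
    print(f"{rate:.1%} of {SUITE_SIZE} games is about {games} games")
print(f"gap: {absolute_gap_pp:.1f} percentage points, {relative_drop:.0%} relative")
```

Because the source rates are rounded, the counts are only indicative; the empirical snapshot remains the authority for exact figures.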

PSDG playmat (empty board)

The Playmat

Six gray dice in the draft pool row after setup

The complete board after initial setup; deterministic play begins here.

The mechanism: In PSDG, irrevocable draft commitments fix a future that is not yet legible from the visible board. A policy that optimizes the salient signal (current scoring) can converge to the wrong objective while remaining blind to the latent structure that actually determines the winner. The failure is structural — it survives perfect optimization, perfect rule knowledge, and full observability.

Why this matters to AI safety research:

  • Proxy misspecification is measurable. The parable shows a Q-learning agent converging to Q = 1.00 on a visible reward and then losing catastrophically when a latent tiebreaker activates.
  • Deployment brittleness is protocol-dependent. Freezing an ex ante plan (static Exchange) yields an 8.5% loss rate; re-solving on the realized node cuts it to 5.7%. The difference is what you committed to, not how well you optimized.
  • Human-in-the-loop does not automatically escape the bind. An overseer shares the learner's difficulty: the visible board aliases distinct positions in the true game tree. Careful oversight alone is not enough unless the overseer has retrograde clarity comparable to the oracle's.
  • More optimization sharpens the mistake rather than fixing it. The Mortal agent is provably optimal on the training objective. The problem is the objective, not the optimizer.
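The points above can be made concrete with a deliberately tiny sketch. It is a toy, not the PSDG rules, the parable's agent, or the site's solver: a hypothetical one-decision game in which a latent commitment (0 or 1) is fixed at setup, the visible board is identical in both cases, and the action names `bank` and `hold` are invented for illustration. A tabular learner trained on the visible reward converges cleanly, yet its greedy policy loses whenever the latent commitment favors the other action.

```python
# Toy illustration only: NOT the PSDG rules or the published solver.
ACTIONS = ("bank", "hold")  # hypothetical action names


def visible_reward(action):
    """Proxy signal: the score shown on the board right after acting."""
    return 1.0 if action == "bank" else 0.0


def true_outcome(action, latent):
    """Return +1 for a win, -1 for a loss; the latent commitment decides which action wins."""
    winning_action = "bank" if latent == 0 else "hold"
    return 1 if action == winning_action else -1


def train_on_proxy(steps=200, alpha=0.5):
    """One-step tabular Q-learning on the visible reward.

    There is a single observation, so both latent states are aliased.
    """
    q = {action: 0.0 for action in ACTIONS}
    for step in range(steps):
        action = ACTIONS[step % len(ACTIONS)]  # deterministic round-robin exploration
        q[action] += alpha * (visible_reward(action) - q[action])
    return q


q = train_on_proxy()
greedy = max(q, key=q.get)

# Evaluate the greedy proxy policy against each latent commitment.
results = {latent: true_outcome(greedy, latent) for latent in (0, 1)}

print(f"Q(bank) = {q['bank']:.2f}, Q(hold) = {q['hold']:.2f}, greedy = {greedy}")
print("outcome by latent commitment:", results)
```

In this sketch, running more training steps only tightens the Q-values around the proxy; it cannot change which action the learner prefers, because nothing in its observation separates the two latent states. An overseer who sees the same single observation has no more to go on than the learner does.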


If you take one thing:

If your overseer shares the agent’s representation of the world, what exactly are they checking?

Detailed report: Technical report (summary)