PSDG Technical Report — Summary
PSDG — Philosopher's Stone Dice Game. An exact solver can lose to a worse opponent — not from noise, randomness or hidden information, but because the rules decouple placement from possession. The act of choosing is long over by the time its consequences are realized. PSDG is the smallest game where this gap is exactly measurable.
A children's dice game that should not exist
In one sentence: PSDG is a small, exact, checkable environment where realized outcomes under a published benchmark protocol can still favour a suboptimal opponent, not from noise or hidden information but from commitment structure (the irrevocable draft, static vs re-solving play at the Exchange, and Exchange timing). The B-win rate rises above the optimal-vs-optimal baseline only under static principal-line commitment, not in the re-solving row (see snapshot).
Philosopher's Stone Dice Game is a two-player tabletop game — six dice, a small mat, roughly 10–15 minutes, ages 8 and up. After a random setup, every decision is deterministic with perfect information. The state space is small enough for exhaustive enumeration. A perfect solver exists (reference implementation in Python, with independent reimplementations used for parity on the seeded suites where published).
In a standard blunder benchmark — B's last draft pick is off the solver's principal line, six board dice, 5,000 seeded games — a suboptimal B still appears in the win column about 6–9% of the time depending on protocol: how A implements "optimality" at the Exchange (static principal-line commitment vs re-solving on the realized crucibles) and whether the Exchange is run sequentially or simultaneously. Exact rates appear in the table below; interpret the re-solving row with the note under that table.
That is not a violation of classical minimax for unconstrained optimal play from the initial position. It is a measurable gap between exact value under a fixed embedding and realized outcomes when play leaves the principal storyline and the deployment rules differ (what A commits to at the Exchange after a deviation, and Exchange timing). Readers who expect "perfect solver ⇒ never loses to a weaker player" are usually picturing one notion of optimality; the benchmark makes the protocol explicit and counts ex post results under each stated protocol.
Researcher note · novelty & scope
PSDG is not presented as an entirely new catalog of failure modes. Proxy misspecification, latent structure, and deployment brittleness are each familiar in isolation. What is distinctive is that all three live in one small, exactly solved, reproducible environment with a shared oracle—the parable for proxy / representation stress, the full game for latent commitment structure, and the seeded benchmarks for measurable deployment brittleness—so their interaction is concrete and checkable rather than spread across separate papers, noisy demos, or toy setups. That yields a tight conceptual counterexample (e.g. exact value does not imply deployment-safe play) and an exact diagnostic benchmark.
One implication may be novel as a demonstrated result: the failure survives not only perfect optimization but perfect rule knowledge and complete visibility—the conditions under which human-in-the-loop oversight is often supposed to work. The problem is structural, not a limitation of attention, scale, or compute.
Public artifact. PSDG is, to our knowledge, the first public, compact, perfect-information (after setup), exactly solved environment in which static vs re-solving (and related blunder / deployment) win-rate splits under the published benchmark protocol can be reproduced from published seeds and an exact reference implementation: oracle-grounded values and optimal play, not approximate equilibria or learned policies. The phenomenon is not new in the abstract; what is distinctive is a clean, checkable, public measurement tied to exact play. Corrections welcome.
The same framing appears on Home — In brief.
PSDG sits at the intersection of game theory, ML evaluation, and alignment. These are not separate stories, but different routes into the same benchmark under the same oracle—complementary lenses, not fragmented subprojects (ML · AI safety · Game theory; Home — Audience routes).
Adjacent literature: Related work — how PSDG sits next to Concrete Problems, Safety Gridworlds, specification gaming, and OpenSpiel.
ML pillar map: PSDG for ML — Research pillars, which covers how the benchmark stresses data, objectives, deployment, oversight, and related pillars.
What makes this possible
PSDG is built on three interlocking mechanisms.
Irrevocable early commitments with delayed consequences. During the draft, each player Twists a die to set both a top value (scored in Phase 1) and a facing value (which becomes the new top after the Tumble, scored in Phase 2). These commitments are locked in before the Exchange, before Phase 2 scoring, and before tiebreaker activation. They cannot be revised.
Latent rules that activate conditionally. The Poisoned Gift has eligibility constraints that depend on duplicate patterns across four dice. The Immortal tiebreaker depends on the Red Crystal's facing, a value set at random during setup that can be irrelevant throughout normal play until it decides the game. Phase 2 scoring depends on facings committed during the draft. None of these features are secret. All are visible. But they are structurally invisible to any system that encodes state as "what scores now" alone. The Exchange is where deployment (static vs re-solving) bites after a deviation, and Gift eligibility couples crucible patterns that look good in Phase 1 to a narrower set of legal gifts later. Longer framing: AI safety — Poisonous System Gift.
Simultaneous Exchange (canonical v1.13) changes information structure at one node. Under simultaneous reveal, equilibrium analysis can involve mixed strategies at the Gift; sequential variants change who observes what before acting. That is game-theoretic structure at the Exchange — not the same claim as "the solver is not deterministic." The project's exact solver still computes values and legal continuations under the stated embedding; the empirical splits isolate how much deployment and timing move outcomes when opponents play off the principal line.
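Why a simultaneous node can call for mixing is easiest to see on a generic 2×2 zero-sum matrix. The sketch below is not the PSDG Gift: the payoff matrices are hypothetical, `solve_2x2_zero_sum` is an illustrative helper, and the closed-form indifference solution applies only to 2×2 zero-sum games without a pure saddle point.

```python
from fractions import Fraction

def solve_2x2_zero_sum(m):
    """Solve a 2x2 zero-sum game in which the row player maximises m[i][j].

    Returns (p, value): p is the row player's probability on row 0.
    """
    (a, b), (c, d) = m
    # Pure saddle point: an entry that is the minimum of its row and the
    # maximum of its column; then neither player gains by mixing.
    for i in range(2):
        for j in range(2):
            v = m[i][j]
            if v == min(m[i]) and v == max(m[0][j], m[1][j]):
                return Fraction(1 - i), Fraction(v)
    # No saddle point: choose p so the column player is indifferent,
    # i.e. p*a + (1-p)*c == p*b + (1-p)*d.
    p = Fraction(d - c, a - b - c + d)
    return p, p * a + (1 - p) * c

# Hypothetical payoffs (+1 = row player wins, -1 = loses), not PSDG data.
print(solve_2x2_zero_sum([[1, -1], [-1, 1]]))  # (Fraction(1, 2), Fraction(0, 1))
print(solve_2x2_zero_sum([[3, 1], [4, 2]]))    # (Fraction(0, 1), Fraction(2, 1))
```

The first matrix has no pure equilibrium, so any deterministic choice is exploitable once the opponent knows it; the second has a dominant row, so no mixing is needed. Exact `Fraction` arithmetic keeps the equilibrium probabilities free of rounding, in the same spirit as exact solving.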
Commitment before legibility (still deterministic, still perfect information)
After setup, PSDG is deterministic and perfect-information in the textbook sense: there are no mid-game draws and no private faces—both players can see the full configuration. What can still feel like “luck,” “gambling,” or even poker to newcomers is not hidden randomness after the roll; it is rule-staged legibility: Twists fix facings that matter for Phase 2, eligibility, and Immortal before the Exchange and Tumble make the payoff of those commitments obvious from a coarse “tops now” summary. Players must commit when the consequences of that commitment are not yet salient—evaluative pressure without smuggling in incomplete information in the formal sense.
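A minimal sketch of the Twist-then-Tumble commitment, assuming only standard die geometry: a facing is a side face, so it can be neither the top nor the face opposite it (Rule of 7). `TwistedDie` and `legal_facings` are hypothetical names, and the sketch deliberately does not model what the new facing becomes after a Tumble or how Phase 1 and Phase 2 points are totalled.

```python
from dataclasses import dataclass

def legal_facings(top: int) -> list[int]:
    """Side faces a Twist can place as the facing, given the top (Rule of 7)."""
    return [f for f in range(1, 7) if f not in (top, 7 - top)]

@dataclass(frozen=True)
class TwistedDie:
    """Hypothetical record of one die after a Twist: both values fixed at once."""
    top: int      # visible now, scored in Phase 1
    facing: int   # committed now, becomes the top after the Tumble (Phase 2)

    def __post_init__(self):
        if self.top not in range(1, 7) or self.facing not in legal_facings(self.top):
            raise ValueError(f"illegal Twist: top={self.top}, facing={self.facing}")

    def tumble(self) -> int:
        """The committed facing becomes the new top; it cannot be revised."""
        return self.facing

hand = [TwistedDie(6, 2), TwistedDie(5, 3)]
print([d.top for d in hand])       # [6, 5]  what Phase 1 sees
print([d.tumble() for d in hand])  # [2, 3]  what Phase 2 sees
```

The frozen dataclass mirrors the rule that a Twist cannot be revised: both numbers are fixed in one act, long before the Tumble makes the facing matter.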
Read narrowly, “uncertainty staged by the rules” means which branch of the fully public tree matters and what belongs in the state for the true objective—not that PSDG falls outside the usual extensive-form object class. The canonical simultaneous Exchange is still modeled as a simultaneous-move node in a perfect-information game (synchronized choice, with its own equilibrium analysis)—not as ex post hidden cards. Poker remains a useful metaphor for psychological commitment pressure, not a claim that PSDG instantiates beliefs over private information after setup. The benchmark is built so that setup is the only chance node, then exact analysis applies to a game whose strategic difficulty comes from representation, timing, and protocol—see also Home — Ready, fire, aim and FAQ — Is PSDG “really” perfect information?.
Three layers, each independently meaningful
The project separates three threads. They stack, but upper layers are not prerequisites for lower ones.
Layer 1 — Representation failure (the parable). A 3-armed bandit with a tiebreaker. The agent trains on visible top-face reward, converges to Q = 1.00, picks The Lure (the highest top), and loses when the tiebreaker scores the bottom face. The Rule of 7 (opposite faces sum to 7) means the best training pick is the worst deployment pick. No Exchange. No simultaneous moves. No draft. Just proxy reward versus latent structure.
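The parable's core can be sketched as an epsilon-greedy bandit trained on the visible top and then judged on the bottom face. Only "The Lure" comes from the parable; the other arm names, the face values, and the normalisation of rewards by 6 (chosen so the Lure's estimate reaches 1.00) are illustrative assumptions.

```python
import random

# Hypothetical arms: only "The Lure" is named in the parable.
TOPS = {"The Lure": 6, "Middle": 4, "Low": 2}

def train_on_top_face(tops, steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy, sample-average bandit trained on the visible top (proxy)."""
    rng = random.Random(seed)
    q = {arm: 0.0 for arm in tops}
    pulls = {arm: 0 for arm in tops}
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.choice(list(tops))      # explore
        else:
            arm = max(q, key=q.get)           # exploit the proxy estimate
        reward = tops[arm] / 6                # assumed normalisation: top 6 -> 1.00
        pulls[arm] += 1
        q[arm] += (reward - q[arm]) / pulls[arm]
    return q

q = train_on_top_face(TOPS)
pick = max(q, key=q.get)
bottoms = {arm: 7 - top for arm, top in TOPS.items()}   # Rule of 7
print(pick, round(q[pick], 2))          # The Lure 1.0
print(min(bottoms, key=bottoms.get))    # The Lure: the worst bottom face
```

Because bottom = 7 - top for every arm, the Rule of 7 reverses the ranking exactly: more training only makes the learner more certain of the arm that deploys worst.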
Layer 2 — Deployment fragility (full game, sequential Exchange). Everything in Layer 1, plus: draft commitments were optimized for a storyline where the opponent follows the principal line. When the opponent deviates, those commitments are locked into a realized tree that may no longer match the ex ante plan. In the standard 5,000-game blunder suite, static A (principal-line Exchange commitment) loses to B 8.5% of the time when the Exchange is sequential — so the "frozen commitment gets punished" story does not require simultaneity at the Gift. Re-solving at the Exchange reduces B's win rate to 5.7% (287/5000); that row is not “minimax refuted”—see the 5.7% note below (B wins confined to B-win roots; 287 < 399 optimal-vs-optimal B wins).
Layer 3 — Information structure (full game, simultaneous vs sequential Exchange). Everything in Layers 1 and 2, plus: simultaneous reveal at the Gift defines a different information and equilibrium node than sequential play. Holding static commitment at the Exchange fixed, the suite reports 8.5% B wins (sequential) vs 6.9% (simultaneous) — an empirical contrast in timing / information structure, not the whole deployment story.
Important: the 5.7% re-solving row is reported for simultaneous or sequential Exchange in the standard table — the same count applies to both timings. So 5.7% is not "the simultaneity floor"; it is the blunder-suite rate when A re-solves at the Exchange under the published protocol, with Exchange timing not load-bearing for that headline number. Simultaneity sharpens some contrasts (e.g. 8.5% vs 6.9% when static); it is not the essence of the representation or deployment theses.
Pairwise reading of the standard blunder table (six dice, B blunders last draft):
| Solver mode | Exchange | B wins | Rate |
|---|---|---|---|
| Re-solving (optimal at Exchange) | Simul. or sequential | 287 | 5.7% |
| Static (A commits from principal line) | Sequential (B best-responds) | 427 | 8.5% |
| Static (A commits from principal line) | Simultaneous (B plays Nash) | 347 | 6.9% |
The 5.7% row: re-solving at the Exchange. That figure is optimal Exchange play on the realized crucibles after B’s last-draft blunder (the standard table uses one count for simultaneous or sequential Exchange). Draft minimax already priced all legal last twists, so a random off-line twist is no improvement in value for B. These B wins should therefore not come from openings scored A-win or draw at the root under the same embedding. Empirically, 284/287 pair with root value == -1 (a B-win root) in a full cross-join (see verify_blunder_root_value_crossjoin.py and output/blunder_resolving_vs_root_value_5000_2026-04-05.txt); the remaining 3 seeds mismatch because of a principal-line quirk. Compared with the 399 B wins under optimal-vs-optimal play on the same suite, 287 leaves 112 games where the blunder cost B a win. Irrevocable earlier draft picks still matter for which world you are in, but this row is not the same phenomenon as static A, where B wins can exceed the optimal-vs-optimal rate.
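The cross-join idea can be sketched as a join on seed between blunder-suite winners and oracle root values. This is not verify_blunder_root_value_crossjoin.py; the input formats and the sign convention (-1 for a B-win root, as the note above implies; +1 for A-win is assumed) are hypothetical, and the toy data below is not the published run, where the split is 284 versus 3.

```python
from collections import Counter

def b_wins_by_root_value(outcomes, root_values):
    """Cross-join blunder-suite winners with oracle root values, keyed by seed.

    outcomes:    seed -> "A" | "B" | "draw"   (hypothetical format)
    root_values: seed -> -1 | 0 | +1          (assumed: -1 means a B-win root)
    Returns a Counter of root values over the games that B won.
    """
    missing = outcomes.keys() - root_values.keys()
    if missing:
        raise KeyError(f"no root value for seeds {sorted(missing)}")
    return Counter(root_values[s] for s, w in outcomes.items() if w == "B")

# Toy data, not the published run.
outcomes = {42: "B", 43: "A", 44: "B", 45: "draw"}
root_values = {42: -1, 43: 1, 44: 0, 45: 0}
tally = b_wins_by_root_value(outcomes, root_values)
print(tally[-1], sum(tally.values()) - tally[-1])   # 1 1
```

Failing loudly on a missing seed keeps the join honest: a silently dropped seed would inflate the share of B wins that look "explained" by the root value.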
Comparisons should state explicitly which rows differ in only one knob: e.g. 8.5% vs 6.9% isolates sequential vs simultaneous Exchange with static A, while 6.9% vs 5.7% compares static simultaneous play with re-solving, noting that the 5.7% row pools both Exchange timings.
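The table's rates and their offsets from the optimal-vs-optimal baseline are plain arithmetic on the published counts; a short sketch makes the single-knob comparisons explicit. The row labels and helper names are illustrative.

```python
N_GAMES = 5000
BASELINE_B_WINS = 399  # optimal-vs-optimal B wins on the same suite

# B-win counts from the standard blunder table.
ROWS = {
    "re-solving (simul. or sequential)": 287,
    "static, sequential": 427,
    "static, simultaneous": 347,
}

def pct(count, n=N_GAMES):
    return round(100 * count / n, 2)

def summarise(rows, baseline=BASELINE_B_WINS):
    """Rate per row and signed difference from the optimal-vs-optimal count."""
    return {label: (pct(wins), wins - baseline) for label, wins in rows.items()}

for label, (rate, delta) in summarise(ROWS).items():
    print(f"{label:<36}{rate:5.2f}%  {delta:+d} vs baseline")

# Single-knob contrast: Exchange timing with static A held fixed.
print(ROWS["static, sequential"] - ROWS["static, simultaneous"])  # 80
```

In aggregate, only the static sequential row sits above the 399-game baseline (+28); static simultaneous (-52) and re-solving (-112) sit below it.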
Tiebreak depth (optimal-vs-optimal, 5k suite)
Each row in benchmark_5000_6d.json stores tb_depth on the principal line: 0 = winner after Phase 2 (no Immortal); 1–3 = deepest Immortal step present in the score breakdown. A small TB2 slice is expected: many post–Phase-2 ties resolve at TB1; many others require the full triple tumble. All 938 draws have tb_depth 3 (still tied after all three steps).
| tb_depth | Games | % of 5,000 | Meaning (principal line) |
|---|---|---|---|
| 0 | 3588 | 71.76% | Decided after Phase 2 — no Immortal |
| 1 | 346 | 6.92% | Tiebreak includes TB1 only (resolved there) |
| 2 | 75 | 1.50% | Reaches TB2 in breakdown (resolved there) |
| 3 | 991 | 19.82% | Full triple in breakdown — 938 draws + 53 wins decided only at TB3 |
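A sketch of how the tiebreak-depth table could be tallied from per-game rows. It assumes benchmark_5000_6d.json holds a top-level list of per-game objects, each with a `tb_depth` field; the field name comes from the text above, but the file layout is an unverified assumption, and `load_rows` / `tb_depth_table` are hypothetical helpers. The demo uses synthetic rows that mirror the published counts.

```python
import json
from collections import Counter

def load_rows(path):
    """Assumes the file's top level is a list of per-game objects (unverified)."""
    with open(path) as fh:
        return json.load(fh)

def tb_depth_table(rows):
    """Count games per tb_depth (0-3) and express each as % of all games."""
    counts = Counter(row["tb_depth"] for row in rows)
    total = sum(counts.values())
    return {d: (counts[d], round(100 * counts[d] / total, 2)) for d in sorted(counts)}

# Synthetic rows that mirror the published counts, for illustration only.
rows = ([{"tb_depth": 0}] * 3588 + [{"tb_depth": 1}] * 346
        + [{"tb_depth": 2}] * 75 + [{"tb_depth": 3}] * 991)
for depth, (games, share) in tb_depth_table(rows).items():
    print(depth, games, f"{share:.2f}%")
```

If the layout assumption holds, `tb_depth_table(load_rows("benchmark_5000_6d.json"))` would give a direct check of the four rows above.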
The intelligence threshold (random legal vs blunder suite)
A separate stress test pits the solver against a uniformly random legal opponent (Node driver optimal_vs_random_legal.js in the internal checkout under private/psdg/benchmark/—not in the public psdg tree, which has no private/ directory). One published batch: 10,000 games, six dice, seeds 42–10041 — A wins 9965 (99.65%), 31 draws, B wins 4; all four B wins came from openings where the opening oracle value already favoured B (0 wins from positions where A was winning or drawing under optimal play). See the log in psdg: benchmark/output/optimal_vs_random_legal_batch10000_seed42.txt.
The blunder suite is a different embedding: B is not uniform random; emphasis is on principal-line deviation and Exchange / deployment choices. The contrast is not "noise vs intelligence" in the abstract — it is which benchmark and which protocol. Any pop summary should name both the random-legal script and the blunder script rather than collapsing them.
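The random-legal protocol reduces to a uniform choice over legal moves plus a seeded batch loop. This Python sketch is not the Node driver optimal_vs_random_legal.js; `play_game` is a hypothetical hook standing in for a full seeded game, and `toy_game` is a stand-in with no PSDG rules.

```python
import random
from collections import Counter
from typing import Callable, Sequence

def random_legal_move(legal_moves: Sequence, rng: random.Random):
    """Uniform over legal moves. Pass an ordered sequence, not a set, so that
    a given seed always reproduces the same choice."""
    return rng.choice(legal_moves)

def run_batch(play_game: Callable[[int], str], first_seed=42, n_games=10_000):
    """Tally outcomes over consecutive seeds (42..10041 by default); play_game
    is a hypothetical hook returning "A", "B" or "draw" for one seeded game."""
    return Counter(play_game(seed) for seed in range(first_seed, first_seed + n_games))

# Toy stand-in for a real game: B "wins" only if it happens to pick "c".
def toy_game(seed: int) -> str:
    rng = random.Random(seed)
    return "B" if random_legal_move(["a", "b", "c"], rng) == "c" else "A"

tally = run_batch(toy_game)
print(sum(tally.values()))   # 10000
```

Seeding a fresh `random.Random` per game, rather than sharing one generator, means any single seed can be replayed in isolation, which is what makes per-seed logs checkable.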
Implications for ML
Unlike chess, the board is not the state. Two identical-looking boards can require completely different optimal play because the facings — committed during the draft — encode the Phase 2 future. Any learner that represents state as visible top values alone will alias positions with different optimal continuations.
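A hypothetical two-die illustration of the aliasing: each die is a (top, facing) pair with a legal facing, and the boards share tops but not facings. Whether these particular facings lead to different optimal play is not computed here; the point is only that a top-only key cannot represent any difference the facings make.

```python
# Each die is (top, facing); facings are adjacent to their tops (not equal, not 7 - top).
board_1 = ((6, 2), (5, 3))
board_2 = ((6, 5), (5, 1))

def visible_state(board):
    """What a 'tops now' encoder sees."""
    return tuple(top for top, _ in board)

def full_state(board):
    """Tops plus committed facings: enough to tell the two boards apart."""
    return board

value_table = {visible_state(board_1): "estimate learned on board_1"}
print(visible_state(board_2) in value_table)        # True: board_2 reuses board_1's entry
print(full_state(board_1) == full_state(board_2))   # False
print([f for _, f in board_1], [f for _, f in board_2])  # [2, 3] [5, 1]
```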
The reward signal emphasises the wrong features. Phase 1 gold is visible, immediate, and high-frequency. Phase 2 consequences, eligibility constraints, and tiebreaker activation are delayed, conditional, and low-frequency. A gradient-based learner will converge on the proxy unless the objective or representation is fixed.
Credit assignment is corrupted by adversarial injection. An agent's final score reflects both its own draft choices and the opponent's Exchange gift — an object it did not select, constrained by rules that can force harmful transfers. The learning signal does not decompose cleanly into "what I caused" versus "what was done to me." The solver sidesteps this through exhaustive evaluation over the tree; a learning agent must infer structure from entangled outcomes.
The oracle is the ruler, not the opponent. The research use of the solver is not to invite agents to "beat minimax" in a solved game. It is to measure imperfect play: per-move regret, legality rate, optimal action rate, and outcome splits under different protocols. Imperfect play against exact truth is the point.
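A minimal sketch of scoring imperfect play against exact values, assuming the oracle supplies, for each decision, the value of every legal move from the mover's perspective. The record format and `oracle_metrics` are hypothetical, as is the choice of denominators noted in the docstring.

```python
def oracle_metrics(records):
    """Score a player's moves against exact oracle values.

    records: list of (chosen_move, values), where values maps every legal move
    to its exact value for the mover (hypothetical format). A chosen move that
    is absent from values counts as illegal.
    Rates use all moves as the denominator; regret is averaged over legal moves.
    """
    legal, optimal, regrets = 0, 0, []
    for chosen, values in records:
        if chosen not in values:
            continue
        legal += 1
        regret = max(values.values()) - values[chosen]
        regrets.append(regret)
        if regret == 0:
            optimal += 1
    n = len(records)
    return {
        "legality_rate": legal / n,
        "optimal_action_rate": optimal / n,
        "mean_regret": sum(regrets) / len(regrets) if regrets else 0.0,
    }

records = [
    ("x", {"x": 1, "y": 0}),   # optimal
    ("y", {"x": 1, "y": 0}),   # legal, regret 1
    ("z", {"x": 0, "y": -1}),  # illegal
]
print(oracle_metrics(records))
```

Keeping illegal moves out of the regret average avoids inventing a value for a move the rules never allow; they are penalised through the legality rate instead.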
Implications for AI safety
Three handles (detail: AI safety — Three layers, three handles): misspecification → fix the objective / representation; deployment → fix the protocol; irrevocable draft → no post-hoc patch inside the rules—accept the cost or choose better ex ante.
Proxy misspecification is measurable. The Mortal agent converges to Q = 1.00 on the training objective and loses catastrophically at deployment. This is not a thought experiment; it is a runnable demo with exact numbers.
More optimisation sharpens failure, not safety. Any method optimising the wrong objective can converge more tightly to the wrong policy. The problem is the objective (and representation), not insufficient computation alone.
Frozen deployment is fragile. Even a policy derived from the exact oracle can be exploited when the opponent deviates from the expected path under a static Exchange implementation — 8.5% (sequential) and 6.9% (simultaneous) in the standard table are concrete rates, not vibes.
Misreported state causes non-monotonic harm. An illegal-upgrade stress test — where one draft pick is misreported as a 6 — shows modest impact at four board dice but a large win-rate drop at six dice when subsystems couple. This is relevant when thinking about observation tampering and reward hacking.
Human-in-the-loop (HIL) does not automatically escape the structural bind. PSDG stress-tests oversight under favourable conditions: public rules, perfect information after setup, no hidden mechanics, tabletop scale. An overseer still shares the learner’s difficulty when the visible tableau aliases distinct points in the full extensive form—certifying “safe to commit here” needs retrograde clarity comparable to the solver, not only attention. Detail: AI safety — Human oversight.
Detectors and postmortems still owe oracle-grade bookkeeping. A monitor that would reliably flag “ready–fire–aim” commitment traps without re-embedding counterfactual structure may be unbuildable in the strong specification (reliable, domain-general, oracle-free); the suite operationalises the honest baseline by measuring against the oracle, not by pretending a shallow rule substitutes for it. Post hoc investigation of the final board is no automatic rescue either: every die may remain visible, yet explaining why a draft was wrong still pulls toward branches not realized—same gap as live detection, framed as forensics. Detail: AI safety — PSDG detectors without the oracle, Postmortems and the final board.
The game diagnoses the problem. It does not solve it. The Oracle wins not because it is smarter in a folk sense but because it is given the full spec. How to obtain structural fidelity from data alone remains an open problem.
Implications for game theory
Static commitment versus re-solving is measurable. The gap between 8.5% / 6.9% (static, by Exchange timing) and 5.7% (re-solving) quantifies part of what it costs to freeze Exchange play after deviation vs re-optimising at the realized node.
Information structure is measurable. The 8.5% vs 6.9% split (static A; sequential vs simultaneous Exchange) isolates Exchange timing holding static commitment fixed.
Off-equilibrium play is first-class. Classical analysis often fixes an equilibrium path. PSDG asks what happens when realised play leaves that path and commitments are or are not revised. The blunder suite makes this empirical rather than purely narrative.
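The static vs re-solving distinction can be shown on a hypothetical one-ply Exchange: exact values for A of each gift it could give, in two realized states. The states, gift names, and values are illustrative, and the toy ignores B's reply and the simultaneous variant.

```python
# Hypothetical exact values for A (+1 win, 0 draw, -1 loss) of each gift,
# in two possible realized states.
EXCHANGE_VALUES = {
    "principal line": {"gift_x": 1, "gift_y": 0},
    "after B's blunder": {"gift_x": -1, "gift_y": 1},
}

def best_gift(values):
    return max(values, key=values.get)

# Static: choose once, on the principal line, and keep that choice.
STATIC_PLAN = best_gift(EXCHANGE_VALUES["principal line"])

def static_value(state):
    return EXCHANGE_VALUES[state][STATIC_PLAN]

# Re-solving: optimise again at the state that actually occurred.
def resolving_value(state):
    return max(EXCHANGE_VALUES[state].values())

for state in EXCHANGE_VALUES:
    print(state, static_value(state), resolving_value(state))
# principal line 1 1
# after B's blunder -1 1
```

By construction, re-solving at a node can never do worse there than any fixed choice, since it takes the maximum; what the blunder suite adds is how often that gap is actually realized across seeded games.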
Summary
PSDG simultaneously illustrates proxy misspecification, latent commitment structure, deployment brittleness, the structural limits of human oversight, and why reliable forensics or monitoring without oracle-grade counterfactual bookkeeping is a hard problem—all in one tiny, exactly solved, empirically benchmarked game.
PSDG is a deterministic dice game with an exact solver and seeded benchmarks. A blundering opponent still appears in the win column at rates that depend on protocol (about 6–9% in the standard three-row table), while optimal-vs-optimal play on the same suite is the baseline (~8.0% B wins, 399/5000). In aggregate, only the static sequential row (8.5%) shows B winning more often than that baseline; the static simultaneous row (6.9%) and the re-solving row (5.7%) are below it. A random legal opponent is a different stress test (very high solver win rate). The project treats three layers as separable: representation (parable), deployment (static vs re-solve; draft irreversibility), and Exchange information structure (sequential vs simultaneous; Nash-style node when simultaneous). Together these make PSDG a compact, checkable benchmark for proxy misspecification, commitment brittleness, and the gap between optimisation under a proxy and structural understanding of the true game. Clairvoyance here means full retrograde clarity over the tree: not mysticism, but complete foresight over legal continuations and payoffs.
AI safety conclusion
A compressed version of the AI safety page — conclusion: the salient safety question is not only how smart the system is or who watches it, but what happens when commitment precedes knowledge—when something irreversible is fixed before the governing structure is fully knowable.
PSDG-shaped pockets (irrevocable early moves, phase-dependent observables, entangled simultaneous choices) appear in many real domains as rhymes, not as identities. Misspecification and deployment admit levers; the time-arrow part often comes down to foresight or accepting residual risk once commitments are locked. The ~5.7% re-solving row counts B wins that survive a random last-draft blunder, coming mostly or entirely from openings the oracle already scores as B-favoured; it is not a second “tax” on top of minimax in the same sense as the static rows (8.5% / 6.9%) measured against ~8.0%.
Background — independent researcher
PSDG was designed by Rob McCormack, an independent researcher. The work began during Covid, motivated in part by public discussion of AI risk. An earlier project — Entropy Checkers — explored related themes through adversarial reward inversion but lacked an exact solver strong enough to make claims easy for third parties to verify. PSDG was built to close that gap: a game small enough to solve exhaustively, with tooling so that the solver produces numbers and benchmarks produce counts others can re-run.
Closing stakes (informal)
If a children’s dice game — six dice, about ten minutes, fully checkable after setup — can host a benchmark where exact solving still leaves a wedge between optimal-vs-optimal play and static principal-line deployment at the Exchange (8.5% / 6.9% B wins vs ~8.0% on the same suite), then many deployed systems that share those ingredients — proxy objectives, irrevocable early commitments, frozen continuations when the world deviates — are carrying an analogous risk at scale, in environments where you cannot exhaustively close the verification gap. (Numbers are embedding-specific; see the canonical table.)
The uncomfortable part isn’t that PSDG is complicated. It’s that it’s simple. The state space is tiny. The rules fit on one site page. There’s a perfect oracle. And the failure still shows up — not at the margins of some enormous system, but in a game an eight-year-old can play.
The standard response to alignment concerns is usually “we need more scale, better training, stronger optimization.” PSDG includes cases where more optimization on the wrong target tightens the mistake (parable); and deployment splits (static vs re-solving; Exchange timing) are not cured by “more search” alone on a frozen protocol — much of the issue is upstream of the optimizer: what you represent, what you reward, and what you committed to before the realized path was known. Some of that is design you can change; some is irrevocability inside the spec (three handles).
So, if you take the measured results seriously, they rhyme with most systems that optimize a proxy objective, freeze a policy before deployment, or represent state by what is visibly salient rather than what is structurally relevant. The benchmark does not by itself prove universal claims about any particular production stack; it does offer a cheap place to test, and potentially disprove, the comforting story that scale and gradients automatically wash this class of problem out.
Rules: v1.13 (Rules)
Blunder benchmark: 5,000-game seeded suite (seeds 42–5041, six dice) — canonical table and commentary on Home — Empirical snapshot.
Random legal baseline: Node driver optimal_vs_random_legal.js (dev monorepo) — example output benchmark/output/optimal_vs_random_legal_batch10000_seed42.txt in psdg.
