PSDG for AI safety research
If you want the visceral version of the misspecification thesis first (Q = 1.00, The Lure, tiebreaker by Rule of 7), read the standalone Mortal vs Oracle parable—a tiny toy with no Exchange and no simultaneous moves. For commands, algorithm note, and “why not full Q-learning,” use Mortal vs Oracle — Q-learning / bandit demo. Then return here for benchmarks and protocol detail.
PSDG is a controlled micro-benchmark for objective robustness under strategic structure: policies can look competent on surface features while failing catastrophically when latent rules determine outcomes.
It does not claim to prove AGI risk, or argue against scaling or compute in general. It does make several standard design assumptions falsifiable with numbers when paired with the project’s exact oracle—a checkable place where proxy, protocol, and representation gaps show up as tables and regret.
Intended use of the benchmark, in one line: the load-bearing point is not “AI safety is hard in general,” but: the moment of commitment is structurally richer than the moment of observation—and most systems, including careful human-in-the-loop arrangements, behave as if those were the same. The project is built to make that gap legible and measurable under a pinned oracle, not to collapse into a vague “things are hard” story. See the empirical snapshot and conclusion below.
At a glance (what the rest of the page measures):
- Structural gap: strong optimization on salient signals does not guarantee fidelity to what actually governs payoff when latent rules or phases turn on; some failure modes are oracle-measurable when state is aliased.
- Protocol: what stays frozen, what is re-solved, and who observes what before acting changes exploitability even in this small, exact game.
Specialists can map those measurements to their own questions about urgency and real-world rhymes.
Epistemic trap (in a good sense)
PSDG is easy to underestimate: short rules, a physical mat, and a published exact solver invite the same failure mode as the Mortal choosing The Lure in the Mortal vs Oracle parable—optimizing on what looks salient (visible tops, Phase 1 gold, win/loss) while latent structure (Twist commitments, Phase 2 after the Tumble, Immortal tiebreakers, gift eligibility) does the real work. The point is not to mock that instinct; it is to make the gap measurable. The game is a diagnostic: overconfidence shows up in oracle regret, protocol splits, and stress tests—not as a rhetorical dunk.
Ready, fire, aim is in the rules, not just the poster
The ready, fire, aim diagram on the home page is not a decorative label. Under Rules (v1.13), information becomes legible in a fixed order: you Twist in the draft (irrevocable facings) before Tumble and the Exchange have fully unpacked what those commitments are worth. Randomness in PSDG is only the opening roll; after that, the “hard” part is the extensive form and irrevocable choices—not hidden information. The design keeps the game small and exactly solved on purpose, so the phenomenon in question is not swamped by noise. That is the same temporal spine behind the oversight bind and the “detector” bind: certifying what matters at a commitment can require counterfactual structure comparable to the oracle’s bookkeeping.
Human oversight does not escape the bind
A common safety assumption is that a human in the loop catches what the system misses. PSDG stress-tests that idea under favorable conditions: the rules are public, play after setup is perfect-information, the game is small enough for a tabletop, and, unlike many deployments, there is no built-in time pressure, no fatigue, and no hidden rule beyond what a careful person can track.
The structural failure mode is still there. An overseer watching an agent draft faces the same core difficulty as the learner: knowing when naïvely identical visible layouts correspond to different positions in the full extensive form—because of commitments, eligibility, and phases that only bite later. That is what the oracle encodes under the project’s pinned definitions. Without analysis equivalent to that embedding at decision time, overseer and agent see the same salient tops, the same Phase 1 optics, and the same pull to treat the visible tableau as enough. The rules are not secret; the consequences are—and the distinctions that govern payoff need not be readable from the observation at the moment of commitment. See In brief.
This is a structural challenge to a common oversight story, not a claim that humans are merely inattentive. The usual objections—too slow, distracted, or unscalable—assume a sufficiently careful human could intervene if logistics allowed. PSDG isolates a setting where care is not enough without retrograde clarity comparable to the solver: certifying “no costly alias at this commitment” is not delivered by the observation alone—whoever is watching. That aligns the oversight question with the representation and commitment–knowledge gap the suite makes measurable (empirical snapshot, parable).
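The aliasing point in this section can be made concrete with a minimal sketch. It assumes a toy representation that is not part of the PSDG codebase: `Node`, `observe`, and `costly_aliases` are hypothetical names, and every value is a made-up placeholder rather than solver output. The sketch groups positions by what an observer sees and keeps only the groups whose oracle values disagree, which is the case neither agent nor overseer can resolve from the observation alone.

```python
from collections import defaultdict
from typing import NamedTuple


class Node(NamedTuple):
    """A toy position: what is visible plus what earlier commitments fixed."""
    visible_tops: tuple[int, ...]  # what an observer sees on the mat
    hidden_commitment: str         # hypothetical stand-in for facings/eligibility
    oracle_value: float            # made-up value, NOT a PSDG solver output


def observe(node: Node) -> tuple[int, ...]:
    """The 'board photo': keeps the visible tops and drops everything else."""
    return node.visible_tops


def costly_aliases(nodes: list[Node]) -> dict[tuple[int, ...], list[Node]]:
    """Group nodes by observation; keep groups whose oracle values disagree."""
    groups: dict[tuple[int, ...], list[Node]] = defaultdict(list)
    for node in nodes:
        groups[observe(node)].append(node)
    return {
        obs: members
        for obs, members in groups.items()
        if len({m.oracle_value for m in members}) > 1
    }


# Illustrative positions only; the numbers carry no PSDG meaning.
toy_nodes = [
    Node((6, 5, 4), "facing_a", 1.0),
    Node((6, 5, 4), "facing_b", -1.0),  # same photo, different position
    Node((3, 3, 2), "facing_a", 0.0),
]

aliases = costly_aliases(toy_nodes)
for obs, members in aliases.items():
    print(obs, [(m.hidden_commitment, m.oracle_value) for m in members])
```

Any policy or overseer that acts on `observe(node)` alone must treat both `(6, 5, 4)` positions identically, so it is wrong on at least one of them whenever their best actions differ.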
Postmortems and the final board
Much safety engineering assumes postmortems work: after a failure, careful investigation reconstructs what mattered so the next design is safer. That culture makes sense when causes leave traceable evidence in the final artifact. PSDG shows a complementary difficulty: nothing need be hidden once play stops—dice stay on the table, facings and scores are plain—yet understanding why a draft commitment was oracle-wrong still pulls you toward the counterfactual tree: which other facings and gifts were legal, how eligibility and Phase 2 would have unfolded, what the Tumble would have produced on paths not taken. The terminal configuration does not encode those branches. A reviewer staring at the end state sees much of what misled a proxy-trained agent in the first place—salient tops and metal score—not the full commitment accounting the oracle was built to respect.
Formal game theory has always treated evaluation as incomplete without the extensive form; backward induction literally assumes the tree is available. That is usually fine in textbook settings where the model is the tree. Applied safety talk often behaves as if painstaking after-the-fact review of observed outcomes could stand in for that object when the true problem is large, social, or only partly instrumented. PSDG makes the gap small and checkable: the published solver names where commitment bit, and regret is measurable against that standard—while a narrative that stays at “what we see on the mat at game’s end” underdetermines the same diagnosis. That is not an argument against postmortems; it is a reminder that some failure classes need oracle-equivalent structure in the investigation, not only a sharper photograph of the last frame. See also § PSDG detectors without the oracle and In brief.
PSDG detectors without the oracle
One hoped-for fix is a monitor or detector that flags “ready–fire–aim” situations, where irreversible choices have already narrowed the tree so that later local re-optimization cannot undo the damage. The version most safety narratives implicitly want is lightweight (inexpensive to run and easy to deploy) while still being reliable, domain-general across PSDG positions, and oracle-free: no embedded full lookahead and no counterfactual re-solve on demand.
Inside PSDG, the property you want is not a simple function of the current visible tableau. It depends on which Draft twists and gifts actually occurred, how those commitments propagate through Phase 2, eligibility, and Immortal structure, and whether the realized node is already past a point where the payoff-relevant distinction lives in counterfactual branches you did not take. In that setting, certifying “this is already a commitment trap” for a given node is informally very close to establishing the structural fact the failure class is defined on—whether recovery from here is ruled out because upstream choices have already fixed the decisive distinctions. A monitor that really answers that question tends to recreate oracle-grade bookkeeping: not necessarily rerunning full minimax on every tick, but carrying the same commitment / eligibility / tiebreak distinctions the oracle encodes. Without that, you are left with heuristic screens (duplicate pressure, tiebreak exposure, similar warning signs) that will miss cases and raise false alarms.
That is not a formal impossibility proof for every architecture—but it does challenge the easy hope that a standalone, surface-feature rule can reliably separate “still fixable” from “already baked in” for free. For the strong specification above—reliable, general, oracle-free—such a detector may simply be unbuildable: recognizing the trap can require much of the same structural foresight that not falling into it is meant to represent. The easy safety story “add monitoring” does not automatically dissolve that representation cost.
Rice-shaped barrier (informal analogy). Rice’s theorem (1953) states that every non-trivial semantic property of programs, meaning a property of what a program computes rather than of how its text is written, is undecidable: no single algorithm decides it for all programs. “Is this commitment already a trap?” in PSDG is semantic in that sense: it rides on consequences in the extensive form and unrealized branches, not on a coarse “board photo.” It is also non-trivial: some positions are traps, others are not. PSDG is finite and solved, so the oracle decides trap-hood, and Rice does not apply as a theorem here. Still, the epistemic barrier mirrors the Rice pattern: hoping that a lightweight monitor can read truth about behaviour-defined predicates off surface summaries repeats the leap that Rice showed to be impossible in the general Turing-machine setting. PSDG makes a narrow version of that gap measurable; Rice explains why the difficulty is often structural, not incidental.
The published suite does not run a dedicated “detector benchmark,” but it operationalises the honest comparison: measure behaviour against the oracle and read off regret and outcome splits. PSDG is a place where the gap between “board photo” salience and full structural state is measured, not assumed away.
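As a rough illustration of what "measure behaviour against the oracle and read off regret" can mean in practice, here is a minimal sketch. It assumes the oracle exposes, for each decision, a value for every legal action from the acting player's point of view; the function names `regret` and `summarize`, the action labels, and all numbers are hypothetical and are not taken from the published suite.

```python
def regret(oracle_values: dict[str, float], chosen: str) -> float:
    """Oracle value of the best legal action minus that of the chosen action."""
    return max(oracle_values.values()) - oracle_values[chosen]


def summarize(decisions: list[tuple[dict[str, float], str]]) -> dict[str, float]:
    """Aggregate per-decision regret into a few headline statistics."""
    regrets = [regret(values, chosen) for values, chosen in decisions]
    return {
        "mean_regret": sum(regrets) / len(regrets),
        "max_regret": max(regrets),
        # Share of decisions where the chosen action was not oracle-optimal.
        "error_rate": sum(r > 0 for r in regrets) / len(regrets),
    }


# Made-up decisions: (oracle value per legal action, action actually chosen).
toy_decisions = [
    ({"twist_a": 1.0, "twist_b": 0.0}, "twist_a"),   # oracle-optimal choice
    ({"twist_a": 1.0, "twist_b": -1.0}, "twist_b"),  # costly choice
    ({"twist_a": 0.0, "twist_b": 0.0}, "twist_b"),   # oracle is indifferent
    ({"twist_a": 0.5, "twist_b": 1.0}, "twist_a"),   # mildly suboptimal
]

report = summarize(toy_decisions)
print(report)
```

A detector can be scored the same way: compare its "trap" flags with the positions where regret is already unavoidable under the oracle, and report misses and false alarms separately.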
Useful partial detectors may still exist, with explicit coverage limits and hybrid designs (human + tool, staged escalation, etc.). The claim is only that some detector ambitions fail for reasons intrinsic to what is being detected, not only because no one has implemented a clever enough heuristic yet.
Poisonous System Gift
Readers sometimes file the Poisoned Gift under “a neat strategic wrinkle.” The research-relevant point is narrower: part of the hurt is institutional—duplicate-driven eligibility can force a gift you would not pick if the rules let you choose freely. So “play well on the mat” can create the duplicate pattern that locks in a bad Exchange option later. That is system structure, not only outplaying the opponent—and it connects proxy emphasis on visible scoring to brittleness when a frozen policy meets an off-path configuration (see the home snapshot). Mechanical detail: Rules — Poisonous System Gift.
Draft, Twist, and Tumble as infrastructure
As physical moves, twisting a die to a legal (top, facing) pair and rotating the mat 90° for the Tumble are not exotic. Used together in v1.13, they implement a two-phase commitment: draft Twists fix facings before the Tumble promotes them to new tops for Phase 2—delayed legibility of what those twists were “for.” You could stage a similar spine with face-down cards or other deferred-reveal machinery; here, dice are just a compact implementation. The distinctive claims of the project remain representation, deployment protocol, and oracle-backed measurement, not the props.
Why the Gift is load-bearing for the full-game artifact
The parable already isolates proxy vs latent payoff with no Exchange. The full game adds the Poisoned Gift and the Exchange as the node where several threads become operational at once:
- Deployment fragility — the published static vs re-solving contrast is exercised at the Gift after play may have left the principal storyline.
- Loss of facing control — the opponent sets the facing on the die you receive; commitments interact with someone else’s manipulation.
- Eligibility narrows the feasible set — duplicate patterns across crucibles restrict which die you must gift under Rules.
- Coupling proxy-shaped pressure to later constraint — Phase‑1‑salient scoring encourages patterns that raise duplicate exposure, which feeds eligibility—not “magic,” but path dependence: early optimisation under a surrogate can increase exposure to binding rules that the surrogate underweights.
Removing the Gift would remove most of what makes v1.13’s Exchange-centric benchmarks meaningful—it would not remove the parable’s misspecification demo or every possible toy for deployment—but within this ruleset it is the hinge that turns separate commitments into joint institutional pressure.
Rules as a second axis (not only “the opponent”)
In many adversarial games the rules behave like a neutral referee. PSDG adds something closer to institutional coupling: eligibility keys off patterns you helped create on your own crucible. That is not “the dice hate you”; it is public, deterministic constraint activated by your configuration history—harder to dismiss as mere tactical cleverness by B, and not fixable by bluffing the rulebook.
Two calibrations: (1) Rich games everywhere contain self-inflicted zugzwang; the contrast worth stressing here is eligibility / institutional feasibility tightening because of scoring-driven duplicate structure, not “only PSDG has traps.” (2) An optimiser that embeds eligibility and Exchange under the true objective can anticipate this in principle; the benchmark stress-tests partial models, frozen protocols, and proxy-shaped training, not omniscience.
Informal dual-axis rhymes (illustrative only)
One pedagogical filter—not a taxonomy—is domains where two pressures overlap: an opponent or counterpart exploits you while rules or infrastructure constrain your next moves based on patterns your earlier “successful” behaviour helped create. Illustrative rhymes include regulated markets (competitive pressure plus compliance coupling), medicine (clinical trajectory plus eligibility rules), or execution venues with position-dependent constraints—not claims of formal identity with PSDG.
Mechanical detail remains on Rules — Poisonous System Gift and Game theory.
Simultaneous Exchange and the safety thesis
Some readers worry the research stands or falls on simultaneous reveal at the Poisoned Gift. It does not.
- The headline static-A numbers already include a sequential Exchange row. In the standard 5,000-game blunder suite, B wins 8.5% when A uses a static principal line and the Exchange is run sequentially (B can best-respond to A’s gift with full information)—higher than the 6.9% when the Exchange is simultaneous (Nash-style subgame for B). So “remove simultaneity” does not defang the commitment / off-path story; empirically, that variant can amplify exploitation of a frozen line.
- Draft commitments are irrevocable regardless of Exchange timing. By the time B blunders on the last draft pick, A has already locked in facings and eligibility profiles tuned to a storyline that may no longer hold. Re-solving at the Exchange helps; un-picking the draft does not.
- Core ML / robustness pressures are elsewhere: proxy emphasis on Phase 1 visible gold; conditional eligibility constraints; tiebreakers that activate rarely and depend on features (e.g. crystal facings) that look irrelevant until they decide the game; state aliasing when histories are collapsed to “the board photo.” None of that is defined by whether the gift is revealed in one beat or two.
- Misreported state (illegal-upgrade stress test) is independent of Exchange timing.
What simultaneity adds is a clean game-theoretic story at one node: mixed strategies, information structure, simultaneous-move equilibrium—material for game theory, not a prerequisite for the safety or ML diagnosis.
Map (one screen): without simultaneous Exchange you still have wrong representation and deployment fragility; you only lose the simultaneous-node variant of protocol. See Game theory — Representation, deployment, and simultaneity.
Three layers, three handles
Same three threads as the compact map, but oriented toward what you can change:
| Layer | Handle |
|---|---|
| Misspecification (proxy vs latent structure — parable, Phase‑1 emphasis) | Fix the objective (and usually the representation) so training tracks what actually governs outcomes. |
| Deployment fragility (static vs re-solve, Exchange timing — snapshot) | Fix the commitment protocol: when to recompute, what stays frozen, who observes what before acting. |
| Temporal irrevocability (draft Twists already locked; “un-picking” is not a legal move) | Accept the cost — or invest in foresight before those commitments land. There is no post-hoc patch inside v1.13 that restores the branch you did not take. |
The first two rows are engineering and institutional levers you can iterate in benchmarks and deployment. The third is structural: once Twists are set, the game’s arrow of time is not negotiable; the only “fix” is to have chosen differently ex ante, or to change the rules themselves.
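The deployment-fragility row can be illustrated with a two-ply toy tree. This is a sketch under stated assumptions, not the project's solver: the tree, the move labels, and the payoffs are invented for illustration, and the functions `principal_line`, `static_payoff`, and `resolve_payoff` are hypothetical names. B moves first, then A replies; the static protocol replays the reply from the minimax line, while the re-solving protocol recomputes the reply at the node actually reached.

```python
# Payoffs to A at the leaves of a two-ply toy tree: B moves, then A replies.
# Made-up numbers for illustration only; they are not PSDG values.
TREE: dict[str, dict[str, float]] = {
    "on_path": {"x": 0.0, "y": -1.0},
    "off_path": {"x": -1.0, "y": 1.0},
}


def best_reply(b_move: str) -> str:
    """A's payoff-maximizing reply at the node B actually reached."""
    replies = TREE[b_move]
    return max(replies, key=lambda reply: replies[reply])


def principal_line() -> tuple[str, str]:
    """Minimax line: B picks the branch that minimizes A's best payoff."""
    b_move = min(TREE, key=lambda b: max(TREE[b].values()))
    return b_move, best_reply(b_move)


def static_payoff(b_move: str) -> float:
    """Frozen protocol: A replays the principal-line reply whatever B did."""
    _, frozen_reply = principal_line()
    return TREE[b_move][frozen_reply]


def resolve_payoff(b_move: str) -> float:
    """Re-solving protocol: A recomputes its reply at the realized node."""
    return TREE[b_move][best_reply(b_move)]


for b_move in TREE:
    print(b_move, "static:", static_payoff(b_move), "re-solve:", resolve_payoff(b_move))
```

On the principal line the two protocols agree. When B deviates, the frozen reply turns B's mistake into a loss for A, while re-solving collects the gain. Neither protocol can revisit moves made before this node, which is the third row of the table.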
What PSDG stress-tests
- Proxy objectives — Strong optimization on visible heuristics does not imply good game-theoretic value when hidden structure (tiebreakers, exchange eligibility, etc.) matters.
- “Reward is enough” — The mortal vs oracle / Phase‑1 vs Phase‑2 story exhibits an agent that can be optimally wrong: excellent on the training objective, broken under deployment dynamics that activate different latent variables.
- Benchmark blindness — Benchmarks that do not stress full rule depth can give misleading capability signals; PSDG ties “looking good” to oracle-measured regret and outcome stats.
- Capability vs alignment — More optimization on the wrong target sharpens failure, not safety; the environment is sized so that story is measurable, not rhetorical.
- Solver ≠ deployment-safe policy — Even static lines from an exact solver can be exploited when opponents play off path—see game theory and the static vs re-solving / 8.5% vs 6.9% vs ~8.0% optimal splits on the home empirical snapshot (read the 5.7% re-solving row as re-solving at the Gift, not as draft minimax failing).
- Misreported state / spec violations — An illegal-upgrade counterfactual (oracle vs oracle while A’s second pick is recorded as a 6) shows non-monotonic harm: modest distortion at four board dice, large backfire at six—relevant when thinking about observation tampering or reward hacking that changes the true strategic problem the other player (or monitor) optimizes. Illegal-upgrade stress test.
- Aliasing, proxy, and credit — Three structural failure shapes the benchmarks make legible: state aliasing (visible board not enough — home), proxy misspecification (salient Phase‑1 reward vs latent rules), and adversarial credit assignment (final payoff entangles your draft with the opponent’s Exchange gift — ML).
(A longer “pillars” write-up in the repository maps these to explicit design assumptions.)
Empirical baselines (summary)
| Phenomenon | Reported figure (public suite) |
|---|---|
| B wins after last-draft blunder with re-solving at Exchange (subset of B-win roots; 287 < 399 optimal-vs-optimal) | ~5.7% (287/5000) |
| Static principal line, sequential exchange | 8.5% B wins |
| Static principal line, simultaneous exchange | 6.9% B wins |
| Illegal upgrade (A’s 2nd pick misreported as 6), oracle vs oracle | ~+1.4 pp A wins @ 4 dice; ~−21.2 pp @ 6 dice (detail) |
Interpretation: Commitment and timing modulate how badly “exact but frozen” play can be punished—relevant when thinking about locked-in goals vs continual reassessment in deployed systems. Do not read the table as “simultaneous Exchange is the load-bearing hazard”—see § Simultaneous Exchange and the safety thesis.
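The percentages in the table can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted on this page: the 5,000-game suite size, the 287 and 399 counts, and the 8.5% and 6.9% rates. The exact counts behind 8.5% and 6.9% are not given here, so those two are entered as percentages rather than reconstructed; the helper name `pct` is an assumption.

```python
GAMES = 5000  # suite size stated in the text


def pct(count: int, games: int = GAMES) -> float:
    """Convert a win count into a percentage of games."""
    return 100 * count / games


# Counts quoted in the table.
resolve_rate = pct(287)  # B wins with re-solving at the Exchange
optimal_rate = pct(399)  # the optimal-vs-optimal count quoted in the same row

# Rates quoted only as percentages; exact counts are not given in the text.
static_sequential = 8.5
static_simultaneous = 6.9

# Differences in percentage points relative to the re-solving row.
gaps = {
    "static sequential vs re-solving": static_sequential - resolve_rate,
    "static simultaneous vs re-solving": static_simultaneous - resolve_rate,
    "optimal-vs-optimal vs re-solving": optimal_rate - resolve_rate,
}

print(f"re-solving: {resolve_rate:.2f}%  optimal-vs-optimal: {optimal_rate:.2f}%")
for label, gap in gaps.items():
    print(f"{label}: {gap:+.2f} pp")
```

All rows come from the same 5,000-game suite, so the gaps are paired differences and should not be read as comparisons between independent samples.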
AI safety conclusion
PSDG stacks conditions this benchmark deliberately tries to make clean: fully specified rules, perfect information after setup, exact reference values under a pinned embedding, and no further chance mid-game—about as favorable as a published toy benchmark gets for isolating structure from ambient noise.
Even there, deployment under one protocol loses to a suboptimal opponent on seeds where deployment under another protocol fares differently; the 8.5% / 6.9% / 5.7% split on the empirical snapshot pins the phenomenon. That is not an argument about whether to pursue AI in general. It is evidence that perfect tooling and perfect training are not, by themselves, sufficient for safe deployment when the protocol (what stays frozen versus recomputed at critical nodes) is left implicit, and that the residual gap is structural (which continuation rule is wired in), not a mere remediation backlog. The open prompt is what changes when the field accepts that baseline instead of treating it as edge-case noise.
In stochastic real-world domains, the same structural bleed is not averaged away but often buried in variance, which makes PSDG’s clean lower bound uncomfortable to dismiss as a toy artifact. FAQ § 22 states the objection in full.
The safety question is not only how capable the system is, or who audits it. One load-bearing formulation is: what happens when commitment precedes knowledge?—when choices lock in before consequences are fully evaluable under the rules that will matter.
Many real domains contain PSDG-shaped pockets: irrevocable early commitments, phase-dependent observables (what is salient shifts when a latent regime turns on), simultaneous or badly sequenced moves where payoffs entangle multiple actors. Illustrative rhymes include negotiation, supply chains, clinical pathways, and strategic interaction—not claims that those domains are formally identical to PSDG.
The open question is whether the field treats that as data before scaling systems whose failures look like plan vs reality rather than noise—and whether “oversight” is scoped to catch structural commitment–knowledge gaps, not only capability gaps. For a focused argument that human observers inherit the same structural bind at commitment—not only practical limits—see § Human oversight.
Related entry points
- How to cite this (one-liner) — canonical copy lives on the FAQ only.
- FAQ § 22 — “doesn't noise average away the PSDG gap?” objection.
- § Poisonous System Gift — institutional eligibility, Exchange hinge, proxy coupling (system vs opponent axis).
- § Human oversight — oversight under ideal conditions; structural vs practical limits.
- § Postmortems and the final board — why end-state forensics need not reconstruct commitment errors without oracle-grade counterfactuals.
- § PSDG detectors without the oracle — oracle-free detectors in the strong sense; heuristic limits; Rice-flavoured FAQ note →.
- § AI safety conclusion — commitment vs knowledge; rhymes with real domains; open question for the field.
- Mortal vs Oracle parable — The Lure / Q = 1.00 / tiebreaker story (toy model; no Exchange).
- Home — overview and empirical snapshot.
- ML / evaluation — metrics, suites, and what to report against the oracle.
- Game theory — formal structure behind commitment and off-equilibrium play.
- Illegal-upgrade stress test — misreported draft, scale-sensitive backfire.
Repository: alignment essays, rules version, and parable artifacts (e.g. mortal–oracle / The Lure) will deep-link from here when synced.
