PSDG for ML research


Where to get the solver and benchmarks: this static site does not bundle them as downloads; clone the public repository instead (git clone https://github.com/Rob-McCormack/psdg.git). Canonical rules: RULES.md. Overview: Home, section “Solver, benchmarks, and GitHub.”

PSDG is a small but non-trivial two-player game intended for reproducible evaluation of policies and learners when you care about alignment to true optimality, not only benchmark score.

Where the novelty is (and isn’t). If you only open solver.py, you see optimal vs optimal—a verification artifact, not the thesis. In a compact perfect-information game, nobody beats the exact oracle; “train RL to out-minimax minimax” is the wrong story. The research use of the oracle is as a ruler: measure humans, agents, and deployment policies that blunder, alias state, chase proxies, or freeze ex ante lines—then report regret, illegality, and outcome splits (e.g. empirical snapshot, blunder sweep). Imperfect play against exact truth is the point.

Sharpened diagnosis. Oracle value and regret are defined under the project’s embedding and benchmark protocols—principal line, equilibrium at the Exchange, static vs re-solving, simultaneous vs sequential timing—not as informal “strong play.” (Pinned definitions.) When agents still chase a visible proxy, alias the true extensive form, or break under commitment in a game this compact and checkable, the failure is hard to dismiss as scale or environmental noise; it points at representation, objective, and deployment you can measure and revise.

Research pillars: what PSDG stresses

One-page map: how a small exact game bears on common ML research stacks—not “PSDG replaces your lab,” but where the benchmark is agnostic, where it pushes back, and where it both helps and complicates evaluation. Wording is project-scoped; transfer to other domains is always a separate claim.

Pillar	What PSDG does
Architecture	Family-agnostic; summarisation still bites. The benchmark does not hinge on picking a model family—failure routes through state, objective, and protocol. In long-horizon play, faithfully tracking facings, eligibility, and rare conditional rules pressures compression; without a sufficient state feed, capacity lands in the wrong summary statistics.
Data & training regimes	Challenges. Trajectory corpora lie if the schema drops facings and commitments; more rollouts of the same schema can make a wrong policy more confident, not more aligned.
Compute & scaling	Challenges (wrong target + lossy interface). Training compute on a misspecified proxy tightens confidently wrong behaviour (parable: Q = 1.00, The Lure). Inference-time compute on a fixed lossy state amplifies fluent reasoning about the wrong object—not a free repair without structural change or oracle access.
Representation & state	Strongly challenges. A coarse board snapshot aliases distinct histories; a Markov summary faithful to v1.13 must carry draft commitments, facings, and eligibility-relevant structure—not only “what scores now.” Made concrete: a tops-only aliasing exhibit where identical-tops positions need opposite Gifts (maximal +1 → −1 swing) — and a trained tabular learner pays it: a full-state Q-learner generalises (100% win) while a tops-only one loses across openings, at both draft and Exchange (trained demonstration).
Objectives & reward	Strongly challenges. Phase‑1 metal is a salient, high-frequency proxy that can underweight Phase‑2 scoring, tiebreak logic, and Exchange / gift structure. Shown empirically: a terminal-reward learner wins ~98% vs a weak opponent yet stays oracle-suboptimal, and training against a stronger opponent removes that regret. Objective failure is fixable; representation failure is not (trained demonstration).
Training methods	Challenges (naive baselines). Gradient learning on the wrong objective exhibits high training score / deployment failure; the same bottleneck applies when RLHF, SFT, or post-training rewards inherit Phase‑1 salience without full-rule structure—no universal fix is prescribed here.
Evaluation & benchmarking	Strengthens and complicates. An exact oracle enables ground-truth regret and pinned (P) reports; surface metrics (e.g. wins vs weak baselines without pinned state and protocol) can mislead on this failure class—“looks capable” uncouples from robust deployment under the true rules.
Deployment protocols & interfaces	Strongly challenges. Protocol (P) (static vs re-solving, sequential vs simultaneous Exchange) is load-bearing and measured on fixed seeds—for example ~8.5% (static, sequential), ~6.9% (static, simultaneous), ~5.7% (re-solving) B wins in the standard six-dice blunder suite (snapshot). The deployment surface often does implicit safety work that benchmarks rarely credit.
Tools & agents	Challenges. Broad tool access does not fix an agent that does not know when to refresh state or re-plan under protocol change; orchestration discipline (what is frozen, what is recomputed, what observation is trusted) matters alongside tool quality.
Safety & oversight	Challenges naive defaults; reinforces discipline. Human-in-the-loop and human feedback share the agent’s representation bind when labels or scores emphasize salient Phase‑1 cues—“more oversight” without structural sufficiency does not dissolve the gap (oversight bind).
Monitoring & detection	Challenges (strong ambitions). Reliable, oracle-free, domain-general monitors for commitment traps tend to recreate oracle-grade bookkeeping here (detectors); closed rules make PSDG an honest stress test, not an argument that monitoring is pointless. Postmortems face the same sufficiency ceiling (final board).

Discussion (wording some taxonomies omit): mainstream ML stacks often under-specify protocol (P)—what stays frozen versus recomputed after deviation—relative to architecture or dataset size. PSDG is a compact existence proof that those choices are numerically load-bearing once play leaves the principal storyline; transfer to other domains remains a separate empirical claim.

Why this is different from “another Gym”

Exact ground truth after the roll: once dice and board are fixed, the game is fully deterministic. No mid-episode hidden sampling.
Oracle, not just a referee: you can query value, optimal action sets, legality, and regret (delta) from states off the principal line—so you report per-decision error, not smoothed win rate alone.
Seeded suites: e.g. 5,000 games, six dice, random crystals, seeds 42–5041, with baselines checked across independent implementations.

That setup is closer to “tabular MDP with a known V*” than to opaque sims—except the policy space and latent structure are rich enough that obvious heuristics fail when tiebreaker and exchange mechanics matter.
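As a minimal sketch of what per-decision reporting could look like, the toy below solves a three-node tree by minimax and then reads off value, optimal action sets, and regret at any node, including one off the principal line. The tree, the node names, and the helper functions are hypothetical stand-ins for illustration; they are not PSDG rules and not the repository's solver API.

```python
# Toy oracle on a hand-built tree. Hypothetical example: this is NOT PSDG and
# NOT the repository's solver API. Leaves hold the payoff to player A.
TREE = {
    "root": {"x": "n1", "y": "n2"},  # A to move
    "n1": {"l": +1, "r": -1},        # B to move
    "n2": {"l": 0, "r": 0},          # B to move
}
MOVER = {"root": "A", "n1": "B", "n2": "B"}


def value(node):
    """Exact value for A under optimal play by both sides (minimax)."""
    if isinstance(node, int):
        return node
    child_values = [value(child) for child in TREE[node].values()]
    return max(child_values) if MOVER[node] == "A" else min(child_values)


def regret(node, action):
    """Value the mover gives up by playing `action` instead of an optimal move."""
    sign = 1 if MOVER[node] == "A" else -1  # B minimises A's payoff
    return sign * (value(node) - value(TREE[node][action]))


def optimal_actions(node):
    """All actions with zero regret at `node`."""
    return [a for a in TREE[node] if regret(node, a) == 0]


print(value("root"))            # 0
print(optimal_actions("root"))  # ['y']
print(regret("root", "x"))      # 1
print(regret("n1", "l"))        # 2, queried off the principal line
```

The last query is the point of the exercise: node n1 is never reached under optimal play, yet the oracle still prices a mistake there, which a win-rate summary cannot do.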

Deterministic play vs stochastic-looking experience

The referee adds no mid-episode chance after setup, but an agent that aliases extensive-form histories—e.g. treating only Phase 1 tops as state—can see the same observation–action pair lead to different payoffs. There is an enumerated, hand-checkable instance: two Exchange positions with identical tops require opposite optimal Gifts, and the wrong one flips a provable +1 win to a −1 loss (tops-only aliasing worked example; ~79% of random openings show such a conflict). That is non-Markov / POMDP-like experience from a deterministic environment: the randomness is in the abstraction, not in new rolls. Frozen principal-line deployment can feel like noise when the opponent goes off-path, because the policy never reconditions on that branch (deployment gap). Full Q&A with table pins: FAQ § 24.
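The aliasing argument can be sketched with a small enumeration. The three states and their oracle values below are invented for illustration (they are not PSDG positions), and the uniform weighting over states is an assumption; the site's published floors come from its own enumeration and may weight states differently. The idea is only that one action must serve every state sharing an observation, so the best achievable regret under a lossy encoder can be computed without training anything.

```python
from collections import defaultdict

# Hypothetical toy states, NOT PSDG positions: state = (tops, facing).
# ORACLE_Q[state][gift] is the exact value to the mover in {-1, 0, +1}.
ORACLE_Q = {
    ((3, 5), "north"): {"gift_a": +1, "gift_b": -1},
    ((3, 5), "south"): {"gift_a": -1, "gift_b": +1},  # same tops, opposite optimum
    ((2, 4), "north"): {"gift_a": +1, "gift_b": +1},
}


def full(state):
    return state      # keeps facings


def tops_only(state):
    return state[0]   # drops facings, so the first two states collide


def aliasing_floor(encoder):
    """Lowest mean regret any deterministic policy on encoder(state) can reach.

    Assumes uniform weight over the listed states.
    """
    groups = defaultdict(list)
    for state in ORACLE_Q:
        groups[encoder(state)].append(state)

    total = 0
    for states in groups.values():
        actions = ORACLE_Q[states[0]].keys()
        # One action must serve every state sharing this observation.
        total += min(
            sum(max(ORACLE_Q[s].values()) - ORACLE_Q[s][a] for s in states)
            for a in actions
        )
    return total / len(ORACLE_Q)


print(aliasing_floor(full))       # 0.0
print(aliasing_floor(tops_only))  # 2/3: one +1 -> -1 swing spread over three states
```

The floor is a property of the encoder and the oracle values alone, which is why it does not move with learner, opponent, or training budget.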

Trained-agent demonstration: learning the wrong state

Full page: Learning the wrong state — standalone explanation with all tables and scope. This is the in-context summary.

The aliasing claim above is structural (oracle enumeration). This section reports what happens when an actual learner is trained on PSDG — the result that separates learning ability from state-representation adequacy.

Protocol (pinned, scoped). Tabular Q-learning for player A, terminal reward only ({−1, 0, +1}), no oracle labels during training; the oracle audits the greedy policy afterward for regret. Two observation encoders: full (carries tops, facings, ownership — Markov for v1.13) vs tops-only (facings dropped — deliberately aliased). Two fixed opponents B: random-legal and optimal (oracle). Fixed crystal family; converged budget (5×10⁵ episodes); regret in oracle win/loss units (±1). This is a tabular learning demonstration on a seeded opening family, not a claim that “all tops-only agents lose.” (Reproducible scripts in the public ml/ folder; see Reproduce below.)
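To make the shape of this protocol concrete, here is a deliberately tiny sketch: tabular Q-learning with terminal reward only, run once per observation encoder, then audited on every state. It is a one-step toy (effectively a contextual bandit), not PSDG and not the scripts in ml/; the states, rewards, hyperparameters, and function names are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

# Hypothetical one-step toy, NOT PSDG: state = (tops, facing); the terminal
# reward in {-1, +1} depends on the facing that the tops-only view drops.
REWARD = {
    ((3, 5), "north"): {"gift_a": +1, "gift_b": -1},
    ((3, 5), "south"): {"gift_a": -1, "gift_b": +1},
    ((2, 4), "north"): {"gift_a": +1, "gift_b": +1},
}
STATES = list(REWARD)
ACTIONS = ["gift_a", "gift_b"]


def full(state):
    return state


def tops_only(state):
    return state[0]


def train(encoder, episodes=5000, alpha=0.1, epsilon=0.2, seed=42):
    """Tabular Q-learning with terminal reward only and no oracle labels."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        state = rng.choice(STATES)
        obs = encoder(state)
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(obs, a)])
        reward = REWARD[state][action]
        # Episodes end after one move, so the target is the reward itself.
        q[(obs, action)] += alpha * (reward - q[(obs, action)])
    return q


def greedy_win_rate(q, encoder):
    """Audit the greedy policy on every state after training."""
    wins = 0
    for state in STATES:
        obs = encoder(state)
        action = max(ACTIONS, key=lambda a: q[(obs, a)])
        wins += REWARD[state][action] == +1
    return wins / len(STATES)


print(greedy_win_rate(train(full), full))            # 1.0
print(greedy_win_rate(train(tops_only), tops_only))  # 2/3, whatever the budget
```

In this toy the tops-only result does not depend on the seed or the episode count: whichever gift the greedy policy settles on for the shared observation, it wins one of the two colliding states and loses the other.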

1 — Win rate hides regret; the objective failure is reward-limited (fixable). Against random-legal B, the full-state learner wins ~98% (vs baselines: random-A 60.8%, grab-the-6 78.0%) yet is not oracle-optimal — across 10 seeds only 1/10 plays the optimal opening draft and mean draft regret is ~0.49. More training does not fix it: the suboptimal draft still wins ~98% against a weak opponent, so terminal reward gives no gradient toward optimal drafting. Swap in an optimal opponent and that regret vanishes — the full-state learner plays perfectly (100% win, draft regret 0). The proxy/objective failure is opponent-limited, i.e. fixable by fixing the objective.

2 — The representation failure is irreducible (training-proof at the floor). Two layers. (a) By oracle enumeration the tops-only view has a non-zero aliasing floor — Exchange 0.021, Draft 0.0097 (worked example) — a property of the input, independent of learner/opponent/budget. (b) A trained tops-only agent vs optimal B: the raw loss is budget-sensitive (6% at 500k → ~1–2% by 1–4M), so it must not be quoted as a fixed “5.5%”. But the gap to full-state never closes and never stabilises: at 8× the converged budget, 0/5 seeds reach the full-state’s clean optimum (solved-seed count runs 3/5 → 1/5 → 0/5), while full-state is 0 regret / 100% win at every seed and budget.

vs optimal B	win	loss	total regret	note
full (Markov)	100%	0%	0.00	every seed & budget
tops-only @ 500k	94.0%	6.0%	0.106	converged budget
tops-only @ 4M (8×)	97.9%	2.1%	0.086	0/5 seeds solved

3 — It generalizes, and it is broader than the Exchange. Across a seeded suite of A-win openings, the full-state learner wins 100% on every opening, while tops-only converts wins → losses on 15/16 openings with a structural floor. The trained excess regret (mean ~0.13) exceeds the structural floor (mean ~0.05): optimizing around a lossy representation makes the damage worse than the irreducible minimum. And the cost is not confined to the Exchange — control openings with zero Exchange-aliasing floor still lose, through draft-stage aliasing. This is confirmed two independent ways: the trained excess regret on controls is overwhelmingly in the draft (+0.168 vs +0.026 Exchange), and a learner-independent enumeration finds a structural draft floor > 0 on every control opening. Across 200 random 6-dice openings this structural draft floor appears in 96% of them (98% with the published fixed crystals; 97–99% among A-win openings), with a small reported zero-floor class — so it generalizes well beyond the curated set. Dropping facings induces self-blindness across the game, not at one trick node.

Why this matters. PSDG cleanly dissociates two failure modes that surface metrics conflate: an agent can be fully able to learn the game (full-state → perfect, everywhere) yet fail because its state abstraction is inadequate (tops-only → losses, everywhere). “Train more” fixes the first but not the second; “fix the objective” works for the objective failure; “fix the representation” is the only handle for the representation failure. (AI safety: three handles.) We tested “train more” to 8× the budget: the loss number shrinks but the gap to full-state never closes (0/5 seeds solved). The training-proof anchor is the enumerated floor, not a loss rate: a bigger learner (even AlphaZero) cannot beat a floor that lives in the input (is this just undertraining? / would AlphaZero help?).

Reproduce — public ml/ (clone and run): sweep_matrix.py (observation × opponent), robustness_budget.py (budget sweep to 8×), step3_cross_opening.py (cross-opening, logs the draft/Exchange split), structural_floors_cross.py (learner-independent draft vs Exchange floors), floor_distribution.py (draft-floor distribution across random openings), aliasing_exchange.py (enumerated Exchange floor, no training). The convergence sweep sweep_phase5.py and the full handoff STATUS.md remain in the development tree (private/psdg/ml/).

Credit assignment under adversarial injection

In PSDG, an agent’s final outcome reflects both its own draft choices and the opponent’s Exchange gift—a facing it did not pick, chosen under rules that can force harmful transfers (Poisonous System Gift, Rules — Exchange). The learning signal does not decompose cleanly into “what my decisions caused” versus “what was done to me.” That is credit assignment corruption from adversarial injection into the payoff, not from environmental noise.

The solver does not face this problem the same way: it evaluates all continuations and knows the entangling rule. A learning agent must infer structure from entangled outcomes alone—an extra difficulty beside wrong objectives, state aliasing, and frozen deployment lines.
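One way to picture the entanglement is to try splitting a payoff table into a "my draft" part and a "their gift" part and see what is left over. The 2×2 table below is hypothetical and far simpler than a PSDG Exchange; it is chosen so that the additive split explains nothing, which is the extreme case of the problem described above.

```python
from statistics import mean

# Hypothetical payoffs to A, NOT PSDG: rows are A's draft choices,
# columns are the gift B injects afterwards.
PAYOFF = {
    ("draft_0", "gift_0"): +1, ("draft_0", "gift_1"): -1,
    ("draft_1", "gift_0"): -1, ("draft_1", "gift_1"): +1,
}
DRAFTS = ["draft_0", "draft_1"]
GIFTS = ["gift_0", "gift_1"]

grand = mean(PAYOFF.values())
draft_effect = {d: mean(PAYOFF[(d, g)] for g in GIFTS) - grand for d in DRAFTS}
gift_effect = {g: mean(PAYOFF[(d, g)] for d in DRAFTS) - grand for g in GIFTS}

# Whatever the additive model "mine + theirs" cannot explain.
residual = {
    (d, g): PAYOFF[(d, g)] - (grand + draft_effect[d] + gift_effect[g])
    for d in DRAFTS for g in GIFTS
}

# Against an adversarial B, each draft is worth its worst column.
worst_case = {d: min(PAYOFF[(d, g)] for g in GIFTS) for d in DRAFTS}

print(draft_effect)  # both 0: no draft looks better on average
print(residual)      # equals PAYOFF: everything is interaction
print(worst_case)    # both -1 once B picks the gift
```

A learner that sees only final outcomes from this table receives no marginal signal about its own draft; the solver avoids the issue because it evaluates the joint table directly.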

Metrics you can standardize

Metric	Role
Legal-move rate	Basic competence
Optimal / principal-line rate	Proximity to oracle policy
Per-move regret (oracle delta)	Fine-grained mistake size
Episode outcome vs seeded suite	Reproducible aggregate performance
Blunder / static vs re-solving splits	Robustness to implementation of “optimality” (see game theory)
Hand heuristic vs oracle (pilot)	One pinned rule (e.g. face‑6 when legal) vs optimal A — see § pilot · FAQ
Illegal-upgrade counterfactual	Off-manifold state vs verifier; scale-sensitive backfire (note)

Optimal-vs-optimal baseline on the main suite: A wins 73.3%, B 8.0%, draws 18.8% (3663 / 399 / 938 games).
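The baseline percentages follow directly from the quoted counts; a short check, assuming one-decimal rounding (the helper name is illustrative):

```python
def outcome_split(a_wins, b_wins, draws):
    """Return (A win %, B win %, draw %) rounded to one decimal place."""
    total = a_wins + b_wins + draws
    return tuple(round(100 * n / total, 1) for n in (a_wins, b_wins, draws))


# Counts quoted above for the optimal-vs-optimal baseline.
split = outcome_split(3663, 399, 938)
print(split)  # (73.3, 8.0, 18.8)
```

The three rounded figures sum to 100.1 rather than 100.0; that is a rounding artefact, since the counts themselves total 5000.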

Empirical hook (robustness)

A constrained blunder setting shows a suboptimal player beating the re-solving optimal player ~5.7% of the time (284/5000)—under both simultaneous and sequential Exchange embeddings in that row of the standard table. So an alignment-relevant fragility is visible without pinning everything on simultaneity.
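For readers who want an uncertainty band around the 284/5000 figure, the sketch below computes a Wilson score interval. Treating the 5000 seeded games as independent draws from an opening distribution is an assumption made here for illustration; the suite itself is a fixed seed list, and the site does not report an interval.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


low, high = wilson_interval(284, 5000)
print(f"{284 / 5000:.2%} with interval {low:.2%} to {high:.2%}")
```

Under that independence assumption the band comes out at roughly 5.1% to 6.4%, so the re-solving row sits clearly above zero.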

Static commitment to a precomputed principal line does worse under sequential exchange (8.5% B wins) than under simultaneous play with Nash responses (6.9%)—so information and timing, not only skill gaps, move the numbers.

Illegal-upgrade counterfactual (oracle vs oracle; A’s second draft pick misreported as top 6) is a complementary stress test: at four board dice the headline win rate moves only slightly, but draws turn into losses; at six dice A’s win rate drops by about 21 percentage points because eligibility, Exchange, and scoring couple—see Illegal-upgrade stress test.

Hand-authored heuristic pilot (face‑6)

The site’s headline suites (blunder, optimal vs random legal) answer different questions than “how bad is one pinned full-game heuristic?” To close that honesty gap without pretending to benchmark all naive play, we ran a small pilot on the first 500 six-dice rows of the public JSON (benchmark_5000_6d.json, seeds 42–541).

Protocol (pinned): A = optimal draft each turn from the current node. B = facing 6 toward the player when legal on every draft twist (fallback when only tops 1 and 6 remain; fixed tie-order among tops when several admit facing 6). Exchange: A unrestricted; B’s gift facings follow the same facing‑6‑when‑legal rule (then usual maximin over A’s legals and B’s restricted set).

Outcomes: A wins 454 / 500 (90.8%), draws 42 (8.4%), B wins 4 (0.8%). On the same 500 openings, oracle root values are 377 / 99 / 24 (+1 / 0 / −1), so this single rule lets optimal A convert many draw and B-win roots into A wins. This is not a replacement for the blunder or deployment benchmarks, but it is a concrete reminder that coarse twist heuristics can be crushed by an unrestricted optimal opponent.
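The size of that conversion can be read off the quoted counts. The arithmetic below gives only the net shift per outcome class; it cannot say which individual openings changed class, because the aggregate figures do not carry that information.

```python
# Figures quoted above for the 500-opening pilot, as (A wins, draws, B wins).
oracle_roots = (377, 99, 24)  # root values +1 / 0 / -1
pilot_result = (454, 42, 4)   # outcomes against the face-6 heuristic

shift = tuple(got - root for got, root in zip(pilot_result, oracle_roots))
print(shift)  # (77, -57, -20)

# The pilot uses the first 500 seeds of the suite that starts at seed 42.
pilot_seeds = range(42, 42 + 500)
print(pilot_seeds[0], pilot_seeds[-1])  # 42 541
```

Since at most 377 of the 454 A wins can come from A-win roots, at least 77 openings whose root value was a draw or a B win ended as A wins against this heuristic.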

Reproduce: private/psdg/benchmark/pilot_heuristic_facing6_vs_oracle.py (internal dev tree). FAQ entry with the same numbers and cross-links: § 19.

Open ML questions

Can RL / search / LLM agents discover latent structure (exchange constraints, tiebreak logic) when training reward emphasizes visible “Phase 1” features? (Partial empirical answer for a tabular learner: with a sufficient (full) state it does and generalises; with a tops-only state it cannot, paying an irreducible cost at draft and Exchange — trained demonstration. Open beyond tabular / one opening family.)
When outcomes entangle draft choices with an opponent-configured gift, can methods recover anything like disentangled credit—or only fit a joint statistic? (§ Credit assignment under adversarial injection.)
Do methods that improve PSDG regret transfer to other verifiable games or spec-sensitive tasks?

Related entry points

Research pillars — what PSDG stresses — architecture through monitoring; deployment rates pinned to the snapshot.
Mortal vs Oracle parable — proxy reward vs deployment regime change (minimal story)
Home — project overview and empirical snapshot.
AI safety framing — why misspecification and commitment matter for alignment claims.
Game theory framing — what equilibrium vs re-solving vs sequential timing implies.
Fixed-board blunder sweep — when static Exchange play loses to re-solving.
Tops-only aliasing (worked example) — identical tops, opposite optimal Gifts; representation insufficiency made exact.
Learning the wrong state (trained-agent demonstration) — full-state generalises (100%); tops-only loses across openings, at both draft and Exchange; objective failure fixable, representation failure not.
Illegal-upgrade stress test — misreported draft vs oracle opponent.
Optimal vs random legal B — outcome split when A is optimal and B is uniform random legal (includes draw vs opening oracle).
FAQ § 19 — heuristic pilot — same face‑6 pilot numbers and protocol pins.

Repository docs (solver API, benchmark spec, state encoding) live in the GitHub tree with the code; this site summarizes them—see Solver & GitHub.
