PSDG for ML research

Where to get the solver and benchmarks: they are not bundled as downloads from this static site—you clone the public repository (git clone https://github.com/Rob-McCormack/psdg.git). Canonical rules: RULES.md. Overview: Home — Solver, benchmarks, and GitHub.


PSDG is a small but non-trivial two-player game intended for reproducible evaluation of policies and learners when you care about alignment to true optimality, not only benchmark score.

Where the novelty is (and isn’t). If you only open solver.py, you see optimal vs optimal—a verification artifact, not the thesis. In a compact perfect-information game, nobody beats the exact oracle; “train RL to out-minimax minimax” is the wrong story. The research use of the oracle is as a ruler: measure humans, agents, and deployment policies that blunder, alias state, chase proxies, or freeze ex ante lines—then report regret, illegality, and outcome splits (e.g. empirical snapshot, blunder sweep). Imperfect play against exact truth is the point.

Sharpened diagnosis. Oracle value and regret are defined under the project’s embedding and benchmark protocols—principal line, equilibrium at the Exchange, static vs re-solving, simultaneous vs sequential timing—not as informal “strong play.” (Pinned definitions.) When agents still chase a visible proxy, alias the true extensive form, or break under commitment in a game this compact and checkable, the failure is hard to dismiss as scale or environmental noise; it points at representation, objective, and deployment you can measure and revise.


Research pillars: what PSDG stresses

One-page map: how a small exact game bears on common ML research stacks. The point is not “PSDG replaces your lab,” but to show where the benchmark is agnostic, where it pushes back, and where it both helps and complicates evaluation. Wording is project-scoped; transfer to other domains is always a separate claim.

| Pillar | What PSDG does |
| --- | --- |
| Architecture | Family-agnostic; summarisation still bites. The benchmark does not hinge on picking a model family: failure routes through state, objective, and protocol. In long-horizon play, faithfully tracking facings, eligibility, and rare conditional rules pressures compression; without a sufficient state feed, capacity lands in the wrong summary statistics. |
| Data & training regimes | Challenges. Trajectory corpora lie if the schema drops facings and commitments; more rollouts of the same schema can make a wrong policy more confident, not more aligned. |
| Compute & scaling | Challenges (wrong target + lossy interface). Training compute on a misspecified proxy tightens confidently wrong behaviour (parable: Q = 1.00, The Lure). Inference-time compute on a fixed lossy state amplifies fluent reasoning about the wrong object; it is not a free repair without structural change or oracle access. |
| Representation & state | Strongly challenges. A coarse board snapshot aliases distinct histories; a Markov summary faithful to v1.13 must carry draft commitments, facings, and eligibility-relevant structure, not only “what scores now.” |
| Objectives & reward | Strongly challenges. Phase‑1 metal is a salient, high-frequency proxy that can underweight Phase‑2 scoring, tiebreak logic, and Exchange / gift structure. |
| Training methods | Challenges (naive baselines). Gradient learning on the wrong objective exhibits high training score / deployment failure; the same bottleneck applies when RLHF, SFT, or post-training rewards inherit Phase‑1 salience without full-rule structure. No universal fix is prescribed here. |
| Evaluation & benchmarking | Strengthens and complicates. An exact oracle enables ground-truth regret and pinned (P) reports; surface metrics (e.g. wins vs weak baselines without pinned state and protocol) can mislead on this failure class: “looks capable” uncouples from robust deployment under the true rules. |
| Deployment protocols & interfaces | Strongly challenges. Protocol (P) (static vs re-solving, sequential vs simultaneous Exchange) is load-bearing and measured on fixed seeds: for example ~8.5% (static, sequential), ~6.9% (static, simultaneous), ~5.7% (re-solving) B wins in the standard six-dice blunder suite (snapshot). The deployment surface often does implicit safety work that benchmarks rarely credit. |
| Tools & agents | Challenges. Broad tool access does not fix an agent that does not know when to refresh state or re-plan under protocol change; orchestration discipline (what is frozen, what is recomputed, what observation is trusted) matters alongside tool quality. |
| Safety & oversight | Challenges naive defaults; reinforces discipline. Human-in-the-loop and human feedback share the agent’s representation bind when labels or scores emphasize salient Phase‑1 cues: “more oversight” without structural sufficiency does not dissolve the gap (oversight bind). |
| Monitoring & detection | Challenges (strong ambitions). Reliable, oracle-free, domain-general monitors for commitment traps tend to recreate oracle-grade bookkeeping here (detectors); closed rules make PSDG an honest stress test, not an argument that monitoring is pointless. Postmortems face the same sufficiency ceiling (final board). |

Discussion (wording some taxonomies omit): mainstream ML stacks often under-specify protocol (P)—what stays frozen versus recomputed after deviation—relative to architecture or dataset size. PSDG is a compact existence proof that those choices are numerically load-bearing once play leaves the principal storyline; transfer to other domains remains a separate empirical claim.


Why this is different from “another Gym”

  • Exact ground truth after the roll: once dice and board are fixed, the game is fully deterministic. No mid-episode hidden sampling.
  • Oracle, not just a referee: you can query value, optimal action sets, legality, and regret (delta) from states off the principal line—so you report per-decision error, not smoothed win rate alone.
  • Seeded suites: e.g. 5,000 games, six dice, random crystals, seeds 42–5041, with baselines checked across independent implementations.

That setup is closer to “tabular MDP with a known V*” than to opaque sims—except the policy space and latent structure are rich enough that obvious heuristics fail when tiebreaker and exchange mechanics matter.
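As a rough illustration of the "known V*" framing, the sketch below computes an optimal action set and a per-decision regret from a table of exact action values. The table, the action names, and both helper functions are hypothetical stand-ins and not the repository's solver API; only the +1 / 0 / −1 value convention and the seed range 42–5041 come from this page.

```python
from typing import Dict, Set

# Hypothetical oracle output for one state: action -> exact value for the mover.
# Values use the +1 / 0 / -1 (win / draw / loss) convention; action names are made up.
oracle_values: Dict[str, float] = {"twist_a": 1.0, "twist_b": 0.0, "twist_c": -1.0}


def optimal_actions(values: Dict[str, float]) -> Set[str]:
    """Return every action whose exact value equals the best available value."""
    best = max(values.values())
    return {action for action, value in values.items() if value == best}


def regret(values: Dict[str, float], action: str) -> float:
    """Oracle delta: value given up by playing `action` instead of an optimal action."""
    if action not in values:
        raise ValueError(f"{action!r} is not legal in this state")
    return max(values.values()) - values[action]


# Seeds 42..5041 inclusive, i.e. one seed per game in the 5,000-game suite.
SUITE_SEEDS = range(42, 5042)
```

With a table like this for every visited state, a report can list the regret of each decision rather than only the episode result.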


Deterministic play vs stochastic-looking experience

The referee adds no mid-episode chance after setup, but an agent that aliases extensive-form histories—e.g. treating only Phase 1 tops as state—can see the same observation–action pair lead to different payoffs. That is non-Markov / POMDP-like experience from a deterministic environment: the randomness is in the abstraction, not in new rolls. Frozen principal-line deployment can feel like noise when the opponent goes off-path, because the policy never reconditions on that branch (deployment gap). Full Q&A with table pins: FAQ § 24.
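The aliasing effect can be reproduced in a few lines. The sketch below uses a made-up deterministic payoff rule (it is not PSDG's scoring) over a hidden state of `(top, facing)`; the names `payoff` and `outcomes_by_key` are illustrative assumptions. Keyed on the full state, every state-action pair has exactly one payoff; keyed on the top alone, the same observation-action pair maps to several payoffs even though nothing random happens.

```python
from collections import defaultdict
from itertools import product


def payoff(top: int, facing: int, action: int) -> int:
    """Toy deterministic rule (NOT PSDG's): the result depends on the facing, not the top."""
    return 1 if (facing + action) % 2 == 0 else -1


def outcomes_by_key(key_fn):
    """Collect the set of payoffs seen for each (observation, action) pair."""
    seen = defaultdict(set)
    for top, facing, action in product(range(1, 4), range(1, 3), range(2)):
        seen[(key_fn(top, facing), action)].add(payoff(top, facing, action))
    return dict(seen)


# Full state: the observation keeps both the top and the facing.
full_state = outcomes_by_key(lambda top, facing: (top, facing))

# Aliased state: the observation keeps only the top, as in "Phase 1 tops as state".
tops_only = outcomes_by_key(lambda top, facing: top)

print(max(len(v) for v in full_state.values()))  # distinct payoffs per key, full state
print(max(len(v) for v in tops_only.values()))   # distinct payoffs per key, aliased state
```

The environment is identical in both runs; only the observation function changes, which is where the apparent stochasticity comes from.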


Credit assignment under adversarial injection

In PSDG, an agent’s final outcome reflects both its own draft choices and the opponent’s Exchange gift—a facing it did not pick, chosen under rules that can force harmful transfers (Poisonous System Gift, Rules — Exchange). The learning signal does not decompose cleanly into “what my decisions caused” versus “what was done to me.” That is credit assignment corruption from adversarial injection into the payoff, not from environmental noise.

The solver does not face this problem the same way: it evaluates all continuations and knows the entangling rule. A learning agent must infer structure from entangled outcomes alone—an extra difficulty beside wrong objectives, state aliasing, and frozen deployment lines.
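A toy version of the entanglement, under loudly stated assumptions: the payoff is modelled as a draft value plus the most harmful legal gift effect, and all numbers, action names, and the `outcomes` helper are invented for illustration (the real Exchange rules are in RULES.md). Two different hidden worlds produce the same observable outcome per draft action, so outcome data alone cannot say how much was caused by the draft and how much by the gift.

```python
from typing import Dict


def outcomes(draft_value: Dict[str, int],
             gift_effect: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Total payoff per draft action when the opponent picks the most harmful legal gift."""
    return {
        action: draft_value[action] + min(gift_effect[action].values())
        for action in draft_value
    }


# World 1: draft "x" is strong on its own but invites a punishing gift.
world_1 = outcomes(
    draft_value={"x": 2, "y": 1},
    gift_effect={"x": {"g1": -2, "g2": 0}, "y": {"g1": 0, "g2": 1}},
)

# World 2: draft "x" is weak on its own and the gift is harmless.
world_2 = outcomes(
    draft_value={"x": 0, "y": 2},
    gift_effect={"x": {"g1": 0, "g2": 1}, "y": {"g1": -1, "g2": 0}},
)

print(world_1, world_2)
```

A learner that sees only totals fits the joint statistic; separating the two terms needs knowledge of the gift rule, which is what the solver has and the learner lacks.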


Metrics you can standardize

| Metric | Role |
| --- | --- |
| Legal-move rate | Basic competence |
| Optimal / principal-line rate | Proximity to oracle policy |
| Per-move regret (oracle delta) | Fine-grained mistake size |
| Episode outcome vs seeded suite | Reproducible aggregate performance |
| Blunder / static vs re-solving splits | Robustness to implementation of “optimality” (see game theory) |
| Hand heuristic vs oracle (pilot) | One pinned rule (e.g. face‑6 when legal) vs optimal A; see § pilot · FAQ |
| Illegal-upgrade counterfactual | Off-manifold state vs verifier; scale-sensitive backfire (note) |
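The first three metrics can be aggregated from a per-decision log. The following is a minimal sketch that assumes a hypothetical record shape (`Decision` with a `legal` flag and an oracle `regret`); the repository's actual log schema may differ, and how illegal moves are scored is a reporting choice made here only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Decision:
    """Hypothetical log record for one move."""
    legal: bool
    regret: Optional[float]  # oracle delta; None when the move was illegal


def summarize_decisions(decisions: List[Decision]) -> Dict[str, Optional[float]]:
    """Aggregate legal-move rate, optimal rate, and mean regret over legal moves."""
    if not decisions:
        raise ValueError("need at least one decision")
    legal = [d for d in decisions if d.legal]
    optimal = [d for d in legal if d.regret == 0]
    mean_regret = sum(d.regret for d in legal) / len(legal) if legal else None
    return {
        "legal_move_rate": len(legal) / len(decisions),
        "optimal_rate": len(optimal) / len(decisions),
        "mean_regret_legal": mean_regret,
    }


log = [Decision(True, 0.0), Decision(True, 1.0), Decision(False, None), Decision(True, 0.0)]
summary = summarize_decisions(log)
print(summary)
```

Counting illegal moves in the denominator of the optimal rate penalises them twice in spirit (not legal, not optimal); a report should state whichever convention it pins.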

Optimal-vs-optimal baseline on the main suite: A wins 73.3%, B 8.0%, draws 18.8% (3663 / 399 / 938 games).
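The percentages follow directly from the game counts; a quick check, using only the figures quoted above (the variable names are illustrative):

```python
# Game counts for the optimal-vs-optimal baseline, as quoted in the text.
counts = {"A wins": 3663, "B wins": 399, "draws": 938}

total = sum(counts.values())

# Share of each outcome, rounded to one decimal place as reported.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total, shares)
```

Because each share is rounded separately, the three reported percentages sum to 100.1 rather than 100.0; the underlying counts sum exactly to the 5,000-game suite.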


Empirical hook (robustness)

A constrained blunder setting shows a suboptimal player beating the re-solving optimal player ~5.7% of the time (287/5000)—under both simultaneous and sequential Exchange embeddings in that row of the standard table. So an alignment-relevant fragility is visible without pinning everything on simultaneity.
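To gauge how much of a figure like 287/5000 could be sampling variation, one option is a Wilson score interval. This is a sketch under an assumption the page does not make: it treats the 5,000 seeded games as independent draws from a larger population of openings. The function name and the 95% level (`z = 1.96`) are choices made here, not part of the benchmark spec.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# B wins against the re-solving optimal player in the blunder setting quoted above.
low, high = wilson_interval(287, 5000)
print(f"{100 * low:.2f}% to {100 * high:.2f}%")
```

On fixed seeds the count itself is exactly reproducible, so the interval speaks only to how far the rate might move on a different seed range.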

Static commitment to a precomputed principal line does worse under sequential exchange (8.5% B wins) than under simultaneous play with Nash responses (6.9%)—so information and timing, not only skill gaps, move the numbers.

Illegal-upgrade counterfactual (oracle vs oracle; A’s second draft pick misreported as top 6) is a complementary stress test: at four board dice the headline win rate moves only slightly, but draws turn into losses; at six dice A’s win rate drops by about 21 percentage points because eligibility, Exchange, and scoring couple—see Illegal-upgrade stress test.


Hand-authored heuristic pilot (face‑6)

The site’s headline suites (blunder, optimal vs random legal) answer different questions than “how bad is one pinned full-game heuristic?” To close that honesty gap without pretending to benchmark all naive play, we ran a small pilot on the first 500 six-dice rows of the public JSON (benchmark_5000_6d.json, seeds 42–541).

Protocol (pinned): A = optimal draft each turn from the current node. B = facing 6 toward the player when legal on every draft twist (fallback when only tops 1 and 6 remain; fixed tie-order among tops when several admit facing 6). Exchange: A unrestricted; B’s gift facings follow the same facing‑6‑when‑legal rule (then usual maximin over A’s legals and B’s restricted set).

Outcomes: A wins 454 / 500 (90.8%), draws 42 (8.4%), B wins 4 (0.8%). On the same 500 opens, oracle root values are 377 / 99 / 24 (+1 / 0 / −1), so this single rule lets optimal A convert many draw and B-win roots into A wins. The pilot is not a replacement for blunder or deployment benchmarks; it is a concrete reminder that coarse twist heuristics can be crushed by an unrestricted optimal opponent.
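The size of the shift can be read off the two triples. The sketch below assumes both are ordered A win / draw / B win and scored +1 / 0 / −1 from A's side; the `summarize` helper is illustrative.

```python
from typing import Dict


def summarize(a_wins: int, draws: int, b_wins: int) -> Dict[str, float]:
    """Win share and mean game value for A, scoring +1 / 0 / -1."""
    n = a_wins + draws + b_wins
    return {
        "n": n,
        "a_win_pct": 100 * a_wins / n,
        "mean_value_for_a": (a_wins - b_wins) / n,
    }


played = summarize(454, 42, 4)        # optimal A vs the face-6 heuristic B
oracle_roots = summarize(377, 99, 24)  # exact root values on the same 500 opens

extra_a_wins = 454 - 377
print(played, oracle_roots, extra_a_wins)
```

Under these assumptions A's mean value rises from about 0.71 at the oracle roots to 0.90 in play, with 77 more A wins than optimal play by both sides would give.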

Reproduce: private/psdg/benchmark/pilot_heuristic_facing6_vs_oracle.py (internal dev tree). FAQ entry with the same numbers and cross-links: § 19.


Open ML questions

  • Can RL / search / LLM agents discover latent structure (exchange constraints, tiebreak logic) when training reward emphasizes visible “Phase 1” features?
  • When outcomes entangle draft choices with an opponent-configured gift, can methods recover anything like disentangled credit—or only fit a joint statistic? (§ Credit assignment under adversarial injection.)
  • Do methods that improve PSDG regret transfer to other verifiable games or spec-sensitive tasks?

Related entry points

Repository docs (solver API, benchmark spec, state encoding) live in the GitHub tree with the code; this site summarizes them—see Solver & GitHub.