How to read PSDG

What PSDG is (and isn’t)

If you’re trying to place this project in a familiar category, here are three it almost fits — and why none is quite right.

PSDG sits in an unusual methodological position. Reviewers — human and AI — tend to translate it into one of three buckets; each captures part of the work while missing what the apparatus is actually doing.

Bucket 1: A benchmark for trained agents

Read this way, PSDG looks underdeveloped. Where are learned baselines, scaling curves, comparisons to other agents?

But PSDG is not trying to demonstrate that some particular trained system fails. The game is small enough to be solved exactly, so the oracle is the measurement instrument, not a placeholder for empirical work yet to come. Adding generic trained agents on the same toy often trades exact regret and splits for approximation; for the narrow question posed here, that can be a methodological step sideways or down, not automatically up. For a short answer to “why no flagship trained baseline yet?”, see FAQ § 13.

Bucket 2: A game-theoretic result

Read this way, PSDG can sound like a restatement of familiar ideas: off-path play, subgame re-solving, commitment, equilibrium refinements.

Those threads are real, and PSDG draws on them. What PSDG adds is the combination: a deterministic two-player game with perfect information after setup, pinned legality, draft-time commitments that coarse summaries under-read, conditional eligibility, and an explicit deployment-protocol wedge, all in one compact apparatus where splits are measurable to integer counts on published seeds. The claim is not “no one ever said commitment matters”; it is that this bundle, which is oracle-grounded, public, and seeded, is the packaged instrument.

Bucket 3: An AI safety analogy

Read this way, PSDG looks under-evidenced: what does a dice game prove about frontier models?

Nothing, directly. PSDG is a controlled isolation of a structural failure pattern — not a demonstration about deployed systems. The modest extrapolation is methodological: if the gap is exactly measurable under ideal conditions (deterministic play after setup, full observability, pinned rules, shared oracle), analogous gaps are plausible where noise, scale, and incomplete specs dominate — and usually harder to detect. Whether and how that scales is outside what this benchmark settles.


Between the buckets

PSDG’s central claim sits between those readings. It measures how restricting re-conditioning at deployment (for example, static play vs re-solving at the Exchange, or sequential vs simultaneous timing among pinned rows) changes realised outcomes against ground truth. That is a structural fact about protocol, and one that often disappears inside approximation and learning variance.
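To make the idea of a protocol wedge concrete, here is a minimal sketch on a hypothetical three-move toy. It does not use PSDG's rules, the Exchange, or its seeds; the payoff table and the names `PAYOFF`, `resolving_value`, and `static_value` are assumptions made for illustration only. The agent moves, the opponent replies, and the agent moves again. Under the static protocol the agent fixes both of its moves before seeing the reply; under the re-solving protocol it picks its second move after seeing the reply. Both values are computed exactly, so the gap is an integer rather than an estimate.

```python
# Hypothetical toy, NOT PSDG's rules: the agent moves, the opponent replies,
# then the agent moves again. Integers are payoffs to the agent, indexed as
# PAYOFF[first_move][opponent_reply][second_move].
PAYOFF = {
    "L": {"l": {"x": 4, "y": 0}, "r": {"x": 0, "y": 4}},
    "R": {"l": {"x": 2, "y": 1}, "r": {"x": 2, "y": 3}},
}


def resolving_value(payoff):
    """Exact value when the agent chooses its second move after the reply."""
    return max(
        min(max(leaves.values()) for leaves in replies.values())
        for replies in payoff.values()
    )


def static_value(payoff):
    """Exact value when both agent moves are fixed before the reply is seen.

    Assumes every reply offers the same set of second moves.
    """
    best = None
    for replies in payoff.values():
        second_moves = next(iter(replies.values())).keys()
        for second in second_moves:
            # Worst case over opponent replies for this fixed two-move plan.
            worst = min(leaves[second] for leaves in replies.values())
            if best is None or worst > best:
                best = worst
    return best


oracle = resolving_value(PAYOFF)   # ground truth with re-conditioning
static = static_value(PAYOFF)      # best achievable without re-conditioning
wedge = oracle - static            # exact integer gap caused by the protocol
print(oracle, static, wedge)       # prints: 4 2 2
```

In this toy the rules, payoffs, and opponent are identical under both protocols; only the agent's ability to re-condition differs, and that alone costs two points. Because both numbers come from exhaustive enumeration, nothing in the gap can be attributed to approximation or training variance.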

The exact solver is not “the baseline we’ll swap out once RL catches up.” It is the microscope.

None of the three buckets alone captures this, because the project isn’t only a benchmark, only a theorem, or only an analogy — it’s an instrument. A fair evaluation asks:

  1. Whether the instrument is built correctly (rules + solver + seeds match).
  2. Whether the measured wedge is real under the stated embedding.
  3. Whether conclusions stay within what the apparatus actually demonstrates.

Next: Home — In brief · FAQ — trained agents / oracle harness · FAQ — novelty & scope · Technical report (summary) · Empirical snapshot