
Mortal vs Oracle — a one-minute parable


No prior game knowledge required. This page is the story behind the benchmarks; the full tabletop rules are optional context.


The Parable — What It Does and What It Doesn't

The parable is not the game. It is one lesson extracted from the game.

Three dice. One pick. One tiebreaker. That's all.

The training phase rewards the top face. The Mortal picks the highest top. It scores perfectly. Q = 1.00. It has solved the task it was given.

Then the tiebreaker fires. Now the bottom face decides. Under the Rule of 7 (opposite faces of a standard die sum to 7), the highest top sits on the lowest bottom. The Mortal's perfect pick is now the worst pick. It loses.

That is the entire parable. Nothing else.
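The whole story fits in a few lines of Python. What follows is a minimal sketch, not the repository script: the function names are hypothetical, the three tops are assumed to be distinct, and the tiebreaker is assumed to award the win to the higher bottom face.

```python
def bottom(top: int) -> int:
    """Rule of 7 on a standard die: opposite faces sum to 7."""
    return 7 - top

def mortal_pick(tops):
    """Optimal for the training reward: index of the highest top face."""
    return max(range(len(tops)), key=lambda i: tops[i])

def oracle_pick(tops):
    """Knows the tiebreaker scores bottoms: index of the highest bottom face."""
    return max(range(len(tops)), key=lambda i: bottom(tops[i]))

def tiebreak_winner(tops, pick_a, pick_b):
    """Assumed tiebreaker: the pick with the higher bottom face wins."""
    return pick_a if bottom(tops[pick_a]) > bottom(tops[pick_b]) else pick_b

tops = [6, 4, 2]               # three dice, distinct tops (illustrative values)
mortal = mortal_pick(tops)     # picks the 6
oracle = oracle_pick(tops)     # picks the 2, whose bottom is 5
winner = tiebreak_winner(tops, mortal, oracle)
print("Mortal bottom:", bottom(tops[mortal]), "Oracle bottom:", bottom(tops[oracle]))
print("Tiebreaker winner:", "Oracle" if winner == oracle else "Mortal")
```

Nothing in `mortal_pick` is buggy. It simply never calls `bottom`, because nothing in training required it to.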


What "hidden" means here

The bottom face is not secret. Anyone can look at a die. "Hidden" means the training reward never asked the agent to care about it. So the agent never built an internal representation of it. The information exists in the world. It does not exist in the agent's model of the world.

That is the only kind of "hidden" that matters.


What Q = 1.00 means

It means the agent found the best possible action for the reward it was given. It is provably optimal. There is no better algorithm, no smarter search, no deeper network that would choose differently — because the reward says "pick the highest top" and the agent picks the highest top.

The failure is not in the optimization. The failure is in the objective.

Optimization made the failure sharper, not safer. The Mortal was not "too dumb"; it was exactly right about the wrong objective.
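The same point can be written as a short worked argument. The notation is an assumption for illustration only: t_a is the top face of arm a, b_a its bottom face, the tops are distinct, and the training reward is taken to be 1 for the highest top and 0 otherwise.

```latex
\begin{align*}
a^{\mathrm{Mortal}} &= \arg\max_{a} t_a
  && \text{(optimal for the training reward, } Q = 1.00\text{)} \\
b_a &= 7 - t_a
  && \text{(Rule of 7)} \\
\arg\max_{a} t_a &= \arg\min_{a}\,(7 - t_a) = \arg\min_{a} b_a
  && \text{(best top is worst bottom)}
\end{align*}
```

Under these assumptions the action that maximizes the proxy is exactly the action that minimizes the tiebreaker score, so better optimization of the proxy cannot help.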


Two separate claims, not one

Claim 1 — the parable: An agent can be optimal at the trained objective and still fail catastrophically when a latent rule activates. No Exchange. No simultaneous moves. No draft. Just proxy reward versus true structure.

Claim 2 — the full game: When you add the draft, the Poisoned Gift, the Exchange, and the commitment structure of PSDG, a second failure appears: frozen plans break when opponents deviate. This is about deployment, protocol, and timing — see the empirical snapshot, AI safety, and game theory.

The parable proves Claim 1 in isolation. The full game adds Claim 2 on top. They are related but they are not the same argument.


One sentence

The parable shows that a perfect score on the wrong objective is still the wrong answer.


The Oracle wins not because it is smarter but because it was given the right knowledge. We do not yet know how to get that knowledge from data alone—that is the honest open problem. PSDG diagnoses that gap; it does not pretend to close it.

Framing: the lesson is not "today's algorithms are too weak." Stacking PPO, then DQN, then model-based RL would suggest the fix is more machinery. Here the fix would be structural fidelity—a representation that preserves the distinctions the real rules use—even when the proxy reward never touches them.

In the story above, the tempting high top is The Lure; the Oracle deliberately picks another arm so that when the tiebreaker scores bottoms, it wins.


Commit first, evaluate later (full PSDG)

The bandit toy strips the game to one beat: training reward versus regime change at deployment. The tabletop game makes the same shape explicit. As a rough contrast with chess (not a theorem): chess is closer to "act, then read the new position before the next deep commitment"; PSDG is closer to "lock in Twists and gifts, then let Phase 2 and tiebreakers unpack what those commitments meant." Commitment precedes full evaluation because the rules stage when information counts, not because play is careless.

Any agent or human who treats "what I see now" as the whole story will alias positions the rules distinguish. Metaphor: both players aim at each other, but they are also both up against the machinery of the rules—the spec punishes a missing world model. Same gloss with a bit more rules context: Home — Core ideas.
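Aliasing can be made concrete with the bandit toy rather than the full game. In this sketch the `Position` type, the phase labels, and both key functions are hypothetical names introduced for illustration. The idea is that a snapshot of visible tops maps two positions to the same key even though the rules score them differently.

```python
from typing import NamedTuple, Tuple

class Position(NamedTuple):
    tops: Tuple[int, ...]   # visible top faces
    phase: str              # hypothetical label: "training" or "tiebreaker"

def snapshot_key(pos: Position):
    """State as seen by an agent that treats 'what I see now' as the whole story."""
    return pos.tops

def rules_key(pos: Position):
    """State that keeps the distinction the rules actually use."""
    return (pos.tops, pos.phase)

def best_arm(pos: Position) -> int:
    """Assumed scoring: tops count in training, bottoms (7 - top) in the tiebreaker."""
    if pos.phase == "tiebreaker":
        return max(range(len(pos.tops)), key=lambda i: 7 - pos.tops[i])
    return max(range(len(pos.tops)), key=lambda i: pos.tops[i])

train = Position((6, 4, 2), "training")
tie = Position((6, 4, 2), "tiebreaker")

print("aliased under snapshot_key:", snapshot_key(train) == snapshot_key(tie))
print("distinct under rules_key:", rules_key(train) != rules_key(tie))
print("best arms differ:", best_arm(train) != best_arm(tie))
```

Any policy keyed on `snapshot_key` must return the same arm for both positions, so it is wrong in at least one of them.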


Not the full PSDG — and that is the point

This parable is a tiny bandit + tiebreaker (mortal-vs-oracle-parable.py / .html in the repository). It has no draft, no Poisoned Gift, and no simultaneous (or sequential) Exchange—by design, to isolate proxy reward vs latent structure that only matters when a conditional phase activates.

That absence is a feature. The "perfect score, still lose" story does not depend on simultaneous Nash or Exchange timing. Full PSDG adds those layers—and splits like 5.7% / 8.5% / 6.9%—on top of the same lesson: what you optimize ≠ what determines the outcome under deployment.


Thesis in one line

Higher Q on the proxy is not the same as a correct world model. Convergence does not imply the agent has learned the rules that actually govern the outcome when the environment shifts phase. More optimization on the wrong target tightens the mistake.

We use world-model accuracy / structural fidelity here, not "causality" in the Pearl sense. The bottom face is fixed by the Rule of 7; the Mortal never needed it for training, so it never built a representation of it.


Relation to the rest of the site

Layer | What it adds
This parable | Misspecification with minimal machinery; no Exchange.
Q-learning / bandit demo | Runnable code, algorithm note, repo paths.
AI safety | Robustness, commitment, deployment; simultaneous vs sequential (separate thread).
ML | Metrics, regret, oracle evaluation.
Game theory | Equilibrium and information structure.
Empirical snapshot | Seeded counts on the full game.

Run the demo (repository)

Published walkthrough (commands + thesis): Mortal vs Oracle — Q-learning / bandit demo.

Development tree only: Same caveat as on that page—the psdg public repo does not contain private/psdg/. You need the internal checkout that builds this site.

Terminal (from the root of that checkout): python3 private/psdg/qlearning/mortal-vs-oracle-parable.py

Browser: open private/psdg/qlearning/mortal-vs-oracle-parable.html from that checkout (interactive).

Companion README: private/psdg/qlearning/mortal-vs-oracle-parable.md

Algorithm note: The Mortal uses bandit-style averaging on visible arms, not full TD Q-learning with bootstrapping—the qualitative point is identical: optimal at the stated reward, wrong when the true scoring rule activates variables the reward ignored.
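To illustrate what bandit-style averaging looks like, here is a minimal sketch. It is not the code in mortal-vs-oracle-parable.py. The function name, the epsilon-greedy exploration, and the deterministic 0/1 reward (1 for the highest top, 0 otherwise) are all assumptions made for this example.

```python
import random

def train_mortal(tops, episodes=2000, epsilon=0.1, seed=0):
    """Sample-average bandit over visible arms; returns the value estimate per arm."""
    rng = random.Random(seed)
    best_top = max(tops)
    q = [0.0] * len(tops)   # value estimate per arm
    n = [0] * len(tops)     # pull count per arm
    for _ in range(episodes):
        if rng.random() < epsilon:
            arm = rng.randrange(len(tops))                    # explore
        else:
            arm = max(range(len(tops)), key=lambda i: q[i])   # exploit
        # Assumed training reward: only the top face matters.
        reward = 1.0 if tops[arm] == best_top else 0.0
        n[arm] += 1
        # Incremental sample average. No next-state term, so no bootstrapping.
        q[arm] += (reward - q[arm]) / n[arm]
    return q

q = train_mortal([2, 4, 6])
greedy_arm = max(range(len(q)), key=lambda i: q[i])
print("estimates:", q, "greedy arm:", greedy_arm)
```

The update uses only the observed reward, so there is no temporal-difference target. The estimate for the highest-top arm settles at 1.0 under this reward, and nothing in `q` encodes the bottom face.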