Mortal vs Oracle — a one-minute parable

One lesson, pulled from the game. No rules knowledge needed.

The parable

Three dice. One pick. One reward.

A learner, the Mortal, is trained on one reward: the top face. It samples the dice, learns which scores highest, and locks onto the largest top. For that reward it is provably optimal (Q = 1.00 on the proxy): no cleverer algorithm would choose differently.

Then deployment scores the bottom face instead. By the Rule of 7, opposite faces sum to seven — so the ranking inverts exactly: the largest top carries the smallest bottom, the smallest top the largest. The learner's provably-optimal pick is now, by construction, the single worst die on the board.
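The inversion can be checked in a few lines. This is a minimal sketch, not the game's own code: the three top values and the die labels A, B, C are hypothetical, and the tops are assumed to be distinct so that each ranking has a unique first and last place.

```python
RULE_OF_7 = 7  # opposite faces of a standard die sum to seven


def bottom(top: int) -> int:
    """Return the face opposite the given top face."""
    return RULE_OF_7 - top


# Hypothetical board: three dice with distinct top faces (illustrative values).
tops = {"A": 6, "B": 4, "C": 1}
bottoms = {die: bottom(t) for die, t in tops.items()}

# Rank the dice best-first under each scoring rule.
rank_by_top = sorted(tops, key=tops.get, reverse=True)
rank_by_bottom = sorted(bottoms, key=bottoms.get, reverse=True)

# The learner optimizes the training reward, so it picks the largest top.
learner_pick = rank_by_top[0]

print("ranking by top:   ", rank_by_top)
print("ranking by bottom:", rank_by_bottom)
print("learner's pick:", learner_pick, "scores", bottoms[learner_pick], "at deployment")
```

Because the bottom is 7 minus the top, sorting by bottom is sorting by top in reverse, so the die ranked first in training is ranked last at deployment for any choice of distinct tops.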

The Oracle had been told the bottom would count, so it went to the far end — the smallest top, the largest bottom. It doesn't merely beat the Mortal; it wins by the widest margin on the board, the full spread between the highest and lowest top. And notice what it passed over: the middle die. An agent that had only learned to distrust the top reward might grab the middle — enough to beat the Mortal, but only by a little. The Oracle isn't avoiding the proxy; it's optimizing the rule that actually scores, and that carries it all the way to the extreme. It wins not because it is smarter, but because it was given the right knowledge.
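The margins follow directly from the Rule of 7. As a short worked derivation, write the three tops as t_max > t_mid > t_min; these symbols are notation introduced here for illustration, not taken from the game's rules text, and the tops are assumed distinct.

```latex
\begin{align*}
\text{deployment score of a die with top } t &= 7 - t \\[4pt]
\text{Oracle} - \text{Mortal} &= (7 - t_{\min}) - (7 - t_{\max}) = t_{\max} - t_{\min} \\[4pt]
\text{middle die} - \text{Mortal} &= (7 - t_{\mathrm{mid}}) - (7 - t_{\max}) = t_{\max} - t_{\mathrm{mid}} \\[4pt]
t_{\max} - t_{\mathrm{mid}} &< t_{\max} - t_{\min} \quad \text{since } t_{\mathrm{mid}} > t_{\min}
\end{align*}
```

The sevens cancel, so the Oracle's margin is exactly the spread of the tops, and an agent that settles for the middle die beats the Mortal by strictly less.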

That is the whole parable.

What it shows

"Hidden" doesn't mean secret. The bottom face is in plain view. It is hidden only in that the training reward never asked the agent to care about it, so the agent never built a representation of it. The information was in the world, not in the model.
Q = 1.00 is not the failure. The agent found the best possible action for the reward it was given. The failure is in the objective, not the optimization — and more optimization on the wrong target only tightens the mistake.
The fix isn't more optimization. More search or a bigger model only reaches the proxy's best answer faster — it can't surface a rule the reward never exposed. What's missing is structural fidelity: a representation that preserves the distinctions the real rules use, even when the reward never touches them. How to learn that from data alone is the honest open problem; PSDG names the gap, it does not pretend to close it.

One line

A perfect score on the wrong objective is still the wrong answer.

Two points of wording

The second phase is a regime change, not a tiebreaker. Nothing is tied — the agent's pick was the unique best on the proxy and it still loses. (The full six-dice game does have a real tiebreaker, for genuinely equal scores. That is a different mechanism; don't conflate them.)
There is no turn order here. It's a bandit: each agent simply picks the die it values. "Who goes first" doesn't apply — the learner loses to its own objective, not to a move order.

What it isn't

This is the toy stripped to one beat: trained reward versus a rule change at deployment. No draft, no gift, no Exchange — by design, to isolate the point. The full PSDG game adds those layers and a second, separate failure: frozen plans break when the opponent deviates (see AI safety, game theory). Related, but not the same argument.
