Mortal vs Oracle — Q-learning / bandit demo

This page is the technical companion to the one-minute parable: runnable code, algorithm note, and why “more Q-learning” does not fix the failure mode.

What this demo is

The repository ships a minimal three-armed bandit plus a deployment regime change: training rewards only top faces; a tiebreaker later scores bottom faces (via the Rule of 7 on a d6: opposite faces sum to 7, so the bottom face is 7 minus the top face). The Mortal agent's value estimates converge to the optimum for the training reward. It then loses catastrophically when the true scoring rule uses structure the reward never emphasized.
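To make the regime change concrete, here is a minimal sketch of the two scoring rules. It is not the shipped demo: the arm names, the fixed top faces in ARM_TOP_FACES, and the functions training_reward and deployment_score are hypothetical, and the real tiebreaker logic may differ. The only fact relied on is the Rule of 7 (opposite faces of a standard d6 sum to 7).

```python
def bottom_face(top: int) -> int:
    """Rule of 7: opposite faces of a standard d6 sum to 7."""
    if top not in range(1, 7):
        raise ValueError("a d6 face must be between 1 and 6")
    return 7 - top


# Hypothetical arms (assumption): each arm is a die resting with a fixed top face.
ARM_TOP_FACES = {"A": 6, "B": 4, "C": 1}


def training_reward(arm: str) -> int:
    """Training regime (assumed): only the visible top face is rewarded."""
    return ARM_TOP_FACES[arm]


def deployment_score(arm: str) -> int:
    """Deployment regime (assumed): the hidden bottom face is what counts."""
    return bottom_face(ARM_TOP_FACES[arm])


best_in_training = max(ARM_TOP_FACES, key=training_reward)
best_at_deployment = max(ARM_TOP_FACES, key=deployment_score)

for arm in ARM_TOP_FACES:
    print(arm, "train:", training_reward(arm), "deploy:", deployment_score(arm))
```

Under these assumptions, the arm that looks best in training (top face 6) is the one that scores lowest once bottom faces count.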

Algorithm note: The reference Mortal uses bandit-style incremental averaging on visible arms, not full TD Q-learning with bootstrapping and a discount factor γ. The qualitative conclusion is the same: better optimization on the proxy tightens the wrong policy. Full DQN / PPO on the same reward would still miss the latent variable until the objective or representation changes. For details, see the companion design notes (mortal-vs-oracle-parable.md in the development tree; not in the public psdg repo).
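The incremental-averaging update mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the reference Mortal: run_mortal, the epsilon-greedy exploration, the parameter defaults, and the optional Gaussian noise are all hypothetical choices. The update itself is the standard sample-average rule Q ← Q + (r − Q) / n, with no bootstrapping and no discount factor.

```python
import random


def run_mortal(arm_rewards, episodes=2000, epsilon=0.1, noise=0.0, seed=0):
    """Epsilon-greedy bandit with sample-average value estimates.

    arm_rewards maps each arm name to its mean proxy (training) reward.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    q = {arm: 0.0 for arm in arm_rewards}  # value estimate per arm
    n = {arm: 0 for arm in arm_rewards}    # pull count per arm
    arms = list(arm_rewards)

    for _ in range(episodes):
        if rng.random() < epsilon:
            arm = rng.choice(arms)          # explore
        else:
            arm = max(q, key=q.get)         # exploit the current estimate
        reward = arm_rewards[arm] + rng.gauss(0.0, noise)
        n[arm] += 1
        # Incremental average: no next-state term, no discount factor.
        q[arm] += (reward - q[arm]) / n[arm]
    return q


q_values = run_mortal({"A": 6.0, "B": 4.0, "C": 1.0})
print("greedy arm on the proxy:", max(q_values, key=q_values.get))
```

The estimates depend only on the rewards the agent actually receives, so nothing in this loop can pick up a variable the training reward never exposes.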

Thesis (one screen)

Higher Q on the proxy is not the same as a correct world model. Convergence on the stated reward does not imply the agent has learned the rules that govern the outcome when the environment shifts phase.

The Oracle wins because it encodes Rule of 7 and the tiebreaker—not because the Mortal is “too dumb.” The open problem is structural fidelity (what to represent), not “more episodes on the same score.”
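One way to see why this is structural rather than a matter of training budget is the short check below. It assumes, hypothetically, that each of three arms shows a distinct top face, that a fully converged Mortal picks the highest top face, and that deployment is scored purely on the bottom face. The names mortal_pick and oracle_pick are illustrative and not taken from the demo.

```python
from itertools import combinations


def bottom_face(top: int) -> int:
    """Rule of 7: opposite faces of a standard d6 sum to 7."""
    return 7 - top


def mortal_pick(tops):
    """Converged proxy-greedy choice (assumed): index of the highest top face."""
    return max(range(len(tops)), key=lambda i: tops[i])


def oracle_pick(tops):
    """Choice that encodes the Rule of 7: index of the highest bottom face."""
    return max(range(len(tops)), key=lambda i: bottom_face(tops[i]))


# Check every set of three distinct top faces on a d6.
mortal_always_worst = all(
    bottom_face(tops[mortal_pick(tops)]) == min(bottom_face(t) for t in tops)
    and mortal_pick(tops) != oracle_pick(tops)
    for tops in combinations(range(1, 7), 3)
)
print("proxy-optimal arm is always deployment-worst:", mortal_always_worst)
```

Because the bottom face is 7 minus the top face, ranking by top face is exactly the reverse of ranking by bottom face. In this sketch, a more accurate estimate of the proxy only makes the Mortal more reliably choose the arm that scores worst at deployment.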

Run the demo (from a repository checkout)

Development tree only: The public psdg artifact has no private/ directory and does not ship this demo. The commands below assume the internal repository whose layout includes private/psdg/qlearning/ at the root of that checkout. This site does not host those files as downloads. Overview: Home — Solver, benchmarks, and GitHub.

Terminal (from the root of that development checkout):

```bash
python3 private/psdg/qlearning/mortal-vs-oracle-parable.py
```

Browser: from the same checkout, open private/psdg/qlearning/mortal-vs-oracle-parable.html locally.

Design notes (Markdown): private/psdg/qlearning/mortal-vs-oracle-parable.md — same tree as above.

How this relates to full PSDG

Piece	Role
This demo	Smallest “optimal on proxy → fail at deployment” story; no draft, Exchange, or simultaneity.
Parable	Same story in prose for a general audience.
AI safety	Benchmarks, commitment, and protocol detail on the full game.
Home — Ready, fire, aim	How commit-before-evaluation shows up in the tabletop rules.

The full game adds Twist / Exchange / Phase 2 layers on top of the same lesson: what you see first is not everything that will count.
