RecyclingRobot · replay

event-log playback
Follow: Bold #2 (reckless), Bold #5 (lucky), Timid #7 (cautious)

The MDP — Sutton & Barto, Fig. 3.3

The token marks the followed robot. Searching drops it onto an action node, and the environment then resolves the outcome along a probabilistic branch (α/β). Recharging climbs back to high.
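The dynamics described above can be sketched as a small simulator. This is a minimal sketch, not the page's own implementation: only the rescue reward (−15) comes from this page. The values of `ALPHA`, `BETA`, `R_SEARCH`, and `R_WAIT`, along with the `bold` and `timid` policy functions, are hypothetical placeholders chosen for illustration, and the transition structure follows the textbook recycling-robot example.

```python
import random

# Only RESCUE_REWARD is taken from the page; every other constant is an assumption.
ALPHA = 0.8            # assumed P(battery stays high | high, search)
BETA = 0.6             # assumed P(battery stays low | low, search)
R_SEARCH = 3.0         # assumed reward for a search turn
R_WAIT = 1.0           # assumed reward for a wait turn
RESCUE_REWARD = -15.0  # from the page: a rescue costs 15


def step(state, action, rng):
    """Resolve one turn; return (next_state, reward, rescued)."""
    if action == "wait":
        return state, R_WAIT, False
    if action == "recharge":
        if state != "low":
            raise ValueError("recharge is only available on low battery")
        return "high", 0.0, False
    if action == "search":
        if state == "high":
            # alpha branch: stay high; otherwise drop to low
            nxt = "high" if rng.random() < ALPHA else "low"
            return nxt, R_SEARCH, False
        if rng.random() < BETA:
            # beta branch: the battery holds out
            return "low", R_SEARCH, False
        # battery depleted: the robot is rescued and recharged to high
        return "high", RESCUE_REWARD, True
    raise ValueError(f"unknown action: {action}")


def bold(state):
    """Hypothetical Bold policy: always search."""
    return "search"


def timid(state):
    """Hypothetical Timid policy: search on high, recharge on low."""
    return "search" if state == "high" else "recharge"


def rollout(policy, turns, seed):
    """Run one robot for a number of turns; return (total_reward, rescues)."""
    rng = random.Random(seed)
    state, total, rescues = "high", 0.0, 0
    for _ in range(turns):
        state, reward, rescued = step(state, policy(state), rng)
        total += reward
        rescues += int(rescued)
    return total, rescues
```

Under these assumptions the Timid policy can never be rescued, because it never searches on low battery, while the Bold policy trades a higher per-turn reward for occasional −15 dips.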

Cumulative reward — Bold vs Timid

One line per robot, drawn as it is logged. Bold runs hot and high-variance: the dips are rescues (−15). Timid hugs a steady, safe slope.
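Each plotted line is a running sum of one robot's per-turn rewards. A minimal sketch of that arithmetic follows, assuming the log reduces to a plain list of rewards per robot. The sample rewards are made up, the helper names `cumulative_curve` and `count_rescues` are hypothetical, and only the −15 rescue value comes from this page.

```python
from itertools import accumulate

RESCUE_REWARD = -15.0  # from the page; identifies a rescue turn


def cumulative_curve(rewards):
    """Running total of per-turn rewards, one point per logged turn."""
    return list(accumulate(rewards))


def count_rescues(rewards):
    """Number of turns whose reward equals the rescue penalty."""
    return sum(1 for r in rewards if r == RESCUE_REWARD)


# Made-up log for one robot: two searches, a rescue, then another search.
log = [3.0, 3.0, -15.0, 3.0]
curve = cumulative_curve(log)   # [3.0, 6.0, -9.0, -6.0]
rescues = count_rescues(log)    # 1
```

The third point shows the kind of dip the chart attributes to a rescue: the total falls from 6 to −9 in a single turn.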

Legend: Bold policy, Timid policy, high battery, low battery, rescue (−15)