REINFORCEMENT LEARNING SIMULATION

▸ Building the gridworld & placing goal / traps…
▸ Initializing the Q-value table to 0…
▸ Loading the ε-greedy policy (explore ↔ exploit)
▸ Setting the Bellman update: Q ← Q + α[r + γ·maxQ′ − Q]
▸ Seeding the deterministic RNG (mulberry32)…
▸ Ready — Online. ✅
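
The setup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the page's own implementation: the function names, the four-action set, and the terminal-state handling are assumptions, and Python's seeded random.Random stands in for the mulberry32 generator the page names.

```python
import random

ACTIONS = ("up", "down", "left", "right")  # assumed action set for a grid


def make_q_table(n_states, n_actions=len(ACTIONS)):
    """Q-value table with every entry initialized to 0."""
    return [[0.0] * n_actions for _ in range(n_states)]


def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return q_row.index(max(q_row))  # ties go to the lowest action index


def q_update(q, s, a, r, s_next, alpha, gamma, done=False):
    """Apply Q <- Q + alpha * [r + gamma * max Q' - Q] in place."""
    # Assumption: no bootstrapping from a terminal state (goal or trap).
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]


rng = random.Random(42)  # stand-in for the page's seeded mulberry32 generator
q = make_q_table(n_states=4)
action = epsilon_greedy(q[0], epsilon=0.0, rng=rng)  # greedy on an all-zero row
new_value = q_update(q, s=0, a=3, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

With α = 0.5 and an all-zero table, a reward of 1 moves Q(0, right) halfway from 0 toward the target of 1, so new_value is 0.5. Seeding the generator makes every run reproduce the same sequence of exploratory moves.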

Reinforcement Learning · Q-learning
trial & error · reward · optimal policy
Learning progress
🎮 Simple grid
Episode
Episode reward
Steps this episode
Exploration ε
Success rate
Best value
Notes
Reinforcement learning: an agent acts in an environment, receives a reward, then adjusts its behavior to maximize cumulative reward. No one teaches it the right move; it learns by trial and error over many episodes. The same idea underlies AI systems that play Go (AlphaGo) and Atari games.
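
The act, reward, adjust loop can be sketched on a toy problem. The example below is an assumption-laden illustration rather than the simulation's code: it uses a 5-cell corridor instead of a grid, invented reward values (+1 at the goal, -0.01 per step), and an exploration rate ε that decays each episode down to a floor of 0.05.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal (assumed layout)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Environment: move one cell, bump into the left wall, reward at the goal."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == N_STATES - 1
    reward = 1.0 if done else -0.01   # assumed reward values
    return nxt, reward, done


def train(episodes=300, alpha=0.5, gamma=0.9, max_steps=50, seed=0):
    rng = random.Random(seed)                   # seeded for reproducible runs
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-values start at 0
    epsilon, history = 1.0, []
    for _episode in range(episodes):
        state, total, done = 0, 0.0, False
        for _step in range(max_steps):
            # Trial: explore at random or exploit the current best guess.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = q[state].index(max(q[state]))
            nxt, reward, done = step(state, action)
            # Error correction: move Q toward reward plus discounted future value.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state, total = nxt, total + reward
            if done:
                break
        history.append((total, done))           # episode reward, goal reached?
        epsilon = max(0.05, epsilon * 0.98)     # assumed decay schedule
    return q, history


q, history = train()
policy = [row.index(max(row)) for row in q[:-1]]           # best move per cell
recent_success = sum(done for _, done in history[-50:]) / 50
```

After training, the greedy policy should point right in every non-goal cell, and the success rate over the last 50 episodes should be close to 1. Nothing in the loop tells the agent which way the goal lies; the preference for moving right emerges only from the rewards it collects.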
Pick a "Scenario" to change the environment (traps · slippery · cliff…) · 🔀 new grid · ↺ relearn from scratch · click a structure/concept for details · bright cell = high value, arrow = best move.
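
One plausible reading of "bright cell = high value, arrow = best move" is that a cell's value is its largest Q-value and its arrow is the action holding that maximum. The sketch below assumes that reading; the action order, the normalization to a 0..1 brightness, and the sample Q-values are all hypothetical.

```python
ARROWS = ("↑", "↓", "←", "→")  # assumed action order: up, down, left, right


def cell_value(q_row):
    """Value of a cell: the Q-value of its best action."""
    return max(q_row)


def best_arrow(q_row):
    """Arrow for the best action; ties go to the first action in ARROWS."""
    return ARROWS[q_row.index(max(q_row))]


def brightness(q):
    """Scale cell values to 0..1 so the highest-value cell is brightest."""
    values = [cell_value(row) for row in q]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                  # avoid dividing by zero on a flat table
    return [(v - lo) / span for v in values]


def render(q, width):
    """Text rendering of the greedy policy, one row of arrows per grid row."""
    rows = [q[i:i + width] for i in range(0, len(q), width)]
    return "\n".join(" ".join(best_arrow(r) for r in row) for row in rows)


# Hypothetical Q-table for a 2x2 grid, one row of four Q-values per cell.
q = [
    [0.0, 0.2, 0.0, 0.5],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.0],
]
picture = render(q, width=2)
levels = brightness(q)
```

Here picture is two rows of arrows, "→ ↓" over "→ ↑", and the second cell, with value 0.9, gets brightness 1.0. The last cell shows ↑ only because its Q-values are all tied at 0.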
Chart: reward & exploration ε per episode (series: smoothed reward, exploration ε)