Attention & Transformer: how an LLM understands & predicts words

ATTENTION · TRANSFORMER SIMULATION

▸ Cutting the sentence into tokens & loading the vocabulary…

▸ Looking up an embedding for each token

▸ Producing Query · Key · Value for self-attention

▸ Computing softmax(Q·Kᵀ/√d) weights · multiple heads

▸ Calibrating temperature & the next-token sampler…

▸ Ready — Online. ✅
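As a rough sketch of the steps listed above, the following pure-Python example tokenizes a sentence, looks up an embedding per token, produces Query, Key and Value vectors, and computes the softmax(Q·Kᵀ/√d) weights. Everything in it is an assumption made for illustration: whitespace tokenization, randomly initialized embeddings and projection matrices (a trained model learns these), the dimension d = 4, and the function name attention. It runs a single head with no positional encoding and no causal mask; multiple heads would repeat the same computation with separate projection matrices.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rng, rows, cols):
    """Hypothetical stand-in for a learned projection matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def attention(tokens, d=4, seed=0):
    """Single-head self-attention; returns (weights, outputs)."""
    rng = random.Random(seed)
    # Toy embedding table: one random vector per distinct token.
    embedding = {
        tok: [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for tok in sorted(set(tokens))
    }
    w_q = random_matrix(rng, d, d)
    w_k = random_matrix(rng, d, d)
    w_v = random_matrix(rng, d, d)

    x = [embedding[tok] for tok in tokens]
    queries = [matvec(w_q, v) for v in x]
    keys = [matvec(w_k, v) for v in x]
    values = [matvec(w_v, v) for v in x]

    weights, outputs = [], []
    for q in queries:
        # Scaled dot product of this query with every key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[c] for w, v in zip(row, values)) for c in range(d)]
        )
    return weights, outputs


tokens = "the cat sat on the mat".split()
weights, outputs = attention(tokens)
for tok, row in zip(tokens, weights):
    print(f"{tok:>4}", [round(w, 2) for w in row])
```

Each printed row sums to 1. Because this sketch has no positional encoding, both occurrences of "the" produce identical rows.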

Model state

💬 Attention & word prediction

Tokens

Query token

Next guess

Top probability

Temperature T

Attention heads

Notes

An LLM like ChatGPT reads a sentence as a sequence of tokens. Each token ‘attends’ to the other tokens to pick up context, and the model then predicts the next token by assigning a probability to every candidate in its vocabulary. This is a simplified intuition of attention, not the full architecture.
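Written out, this intuition corresponds to the standard scaled dot-product formulation. The symbols are notation chosen here, not taken from the simulation: q_i, k_j and v_j are the query, key and value vectors of tokens i and j, d is their dimension, α_ij is how strongly token i attends to token j, o_i is the resulting context vector, and z_w is the model's output score (logit) for vocabulary entry w.

```latex
\alpha_{ij} = \frac{\exp\!\left(q_i \cdot k_j / \sqrt{d}\right)}
                   {\sum_{m} \exp\!\left(q_i \cdot k_m / \sqrt{d}\right)},
\qquad
o_i = \sum_{j} \alpha_{ij}\, v_j,
\qquad
P(\text{next} = w) = \frac{\exp(z_w)}{\sum_{u} \exp(z_u)}
```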


Pick a ‘Scenario’ to change the view (attention links · matrix · multi-head · text generation · temperature · long-range context) · drag Temperature to see the distribution sharpen (low T) or flatten (high T) · click a concept for details
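The effect of the Temperature control can be reproduced with a few lines of Python. This is a minimal sketch, assuming the usual convention of dividing the logits by T before the softmax; the candidate tokens, their logits, and the function names are made up for illustration and do not come from the simulation.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(candidates, logits, temperature, seed=None):
    """Draw one candidate according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    rng = random.Random(seed)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical candidates and logits for the next token.
candidates = ["mat", "sofa", "roof", "moon"]
logits = [3.0, 2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top probability {max(probs):.2f}")

print("sampled:", sample_next_token(candidates, logits, 1.0, seed=0))
```

The top probability is highest at T = 0.5 and lowest at T = 2.0: a low temperature concentrates probability on the leading candidate, while a high temperature spreads it across the alternatives.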


Top-token probability & temperature over time (series: top probability · temperature T)