llm-basics

How an LLM turns a prompt into the next token.

Every animated demo on this page is a real forward pass through a small open-weight model (distilgpt2 or TinyLlama). All the intermediates are captured — tokens, embeddings, residual streams, attention patterns, logits, probabilities — and rendered as a step-by-step slideshow. No magic, no hand-waved arrows.

Built directly on HuggingFace transformers, so every layer is observable.
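If you want to poke at the same intermediates yourself, here is a minimal sketch (not the site's actual instrumentation code) that asks transformers to return them all in one forward pass:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

inputs = tok("January, February, March,", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, output_attentions=True)

print(inputs.input_ids)                # tokens
print(out.hidden_states[0].shape)      # embeddings: residual stream before layer 0
print(len(out.hidden_states))          # 7 = embeddings + one snapshot after each of the 6 layers
print(out.attentions[0].shape)         # attention pattern, layer 0: (batch, heads, seq, seq)
print(out.logits[0, -1].shape)         # next-token logits: (50257,)
print(torch.softmax(out.logits[0, -1], dim=-1).max())   # highest next-token probability
```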

1 · Input → 2 · Tokenize → 3 · Embed → 4 · Layers → 5 · Logits → 6 · Softmax → 7 · Sample → 8 · Loop → 9 · Done
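The whole loop fits in a few lines. A hedged sketch of the nine steps above with distilgpt2 (not the page's animation code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

ids = tok("January, February, March,", return_tensors="pt").input_ids  # 1-2. input, tokenize
for _ in range(5):                                                      # 8. loop
    with torch.no_grad():
        logits = model(ids).logits[0, -1]      # 3-5. embed, run the layers, read logits
    probs = torch.softmax(logits, dim=-1)      # 6. softmax
    next_id = torch.multinomial(probs, 1)      # 7. sample one token id
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))                      # 9. done
```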
Animated walk-throughs
9 steps · 6 transformer layers

distilgpt2 · sequence completion

prompt: "January, February, March,"

A 6-layer GPT-2 base model. Watch the residual stream evolve layer-by-layer and the LM head pick "April" out of 50,257 candidates by similarity.
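"By similarity" is meant literally: GPT-2 ties the LM head to the token-embedding matrix, so each logit is a dot product between the final residual-stream vector and one vocab row. A small sketch of that last step (the " April" answer is what the demo shows; your top-5 ordering may vary slightly):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

ids = tok("January, February, March,", return_tensors="pt").input_ids
with torch.no_grad():
    hs = model(ids, output_hidden_states=True).hidden_states

final = hs[-1][0, -1]                  # final residual-stream vector at the last position
emb = model.transformer.wte.weight     # (50257, 768) token-embedding rows, tied with the LM head
logits = final @ emb.T                 # one similarity score per vocab token

top = torch.topk(logits, 5)
for tid, score in zip(top.indices, top.values):
    print(repr(tok.decode(int(tid))), float(score))
```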

9 steps · 22 transformer layers

TinyLlama · chat Q&A

prompt: "Name three primary colors."

A chat-tuned Llama with RoPE positional encoding and untied LM-head weights. Same pipeline — different scale, different architectural choices.
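The only extra step for a chat model is wrapping the question in the model's chat template before tokenizing. A minimal sketch, assuming the TinyLlama/TinyLlama-1.1B-Chat-v1.0 checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # assumed checkpoint; any chat-tuned TinyLlama works the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

messages = [{"role": "user", "content": "Name three primary colors."}]
ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(ids, max_new_tokens=40, do_sample=False)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```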

Inside one transformer block
NEW · 6 substeps · 1 layer · 1 head

distilgpt2 · attention head + FFN

layer 3 / head 0 — prompt: "January, February, March,"

Zoom into one transformer block. Watch a real attention head compute Q · K‑transpose, morph through softmax, then mix V. Next, the FFN expands 768 → 3072, applies GELU, and contracts back to 768. Real numbers, animated.
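To replay that computation outside the animation, you can hook the block's fused QKV projection and redo head 0 by hand. A sketch (assumes distilgpt2's 12 heads of width 64; not the page's capture code):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
ids = tok("January, February, March,", return_tensors="pt").input_ids

qkv = {}
def grab(module, inputs, output):                 # forward hook on layer 3's fused QKV projection
    qkv["out"] = output
hook = model.transformer.h[3].attn.c_attn.register_forward_hook(grab)
with torch.no_grad():
    model(ids)
hook.remove()

q, k, v = qkv["out"].split(768, dim=-1)           # (1, seq, 768) each
head0 = lambda t: t.reshape(1, -1, 12, 64)[0, :, 0]   # pick head 0: (seq, 64)
q0, k0, v0 = head0(q), head0(k), head0(v)

scores = q0 @ k0.T / math.sqrt(64)                # Q · K-transpose, scaled by sqrt(head_dim)
mask = torch.tril(torch.ones_like(scores)).bool() # causal mask: no looking ahead
weights = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
mixed = weights @ v0                              # mix V: one 64-dim vector per position

ffn = model.transformer.h[3].mlp                  # the FFN half: c_fc 768 -> 3072, GELU, c_proj 3072 -> 768
print(weights.shape, mixed.shape)
```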

NEW · 6 substeps · 1 layer · 1 head

TinyLlama · attention head + FFN (SwiGLU)

layer 11 / head 0 — prompt: "Name three primary colors."

Same deep-dive on TinyLlama. Bigger numbers (2048 → 5632 SwiGLU FFN), grouped-query attention (32 query heads, 4 KV heads), the SiLU activation curve. RoPE rotation explained for the chosen head.
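The SwiGLU shape is easy to reproduce standalone. A sketch with random weights at TinyLlama's dimensions (in the real model these are the layer's gate_proj, up_proj, and down_proj):

```python
import torch
import torch.nn.functional as F

d_model, d_ff = 2048, 5632                        # TinyLlama's hidden width and FFN width

gate = torch.nn.Linear(d_model, d_ff, bias=False)   # gate_proj
up   = torch.nn.Linear(d_model, d_ff, bias=False)   # up_proj
down = torch.nn.Linear(d_ff, d_model, bias=False)   # down_proj

x = torch.randn(1, d_model)                       # a stand-in residual-stream vector
y = down(F.silu(gate(x)) * up(x))                 # SwiGLU: SiLU(gate(x)) times up(x), projected back down
print(y.shape)                                    # torch.Size([1, 2048])
```

Grouped-query attention doesn't change this half of the block; it only lets the 32 query heads share 4 sets of K/V projections.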

Embedding-space explorer
PCA scatter · 50,257 tokens

distilgpt2 — semantic geometry

vocab × hidden = 50,257 × 768

2D projection of every vocab token's embedding row. Months cluster, days cluster, animals cluster — without any explicit semantics, just learned co-occurrence.
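The scatter itself is just a dimensionality reduction of the embedding matrix. A minimal sketch with scikit-learn PCA (the page's projection settings may differ):

```python
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

emb = model.transformer.wte.weight.detach().numpy()   # (50257, 768) embedding matrix
xy = PCA(n_components=2).fit_transform(emb)           # (50257, 2) scatter coordinates

for word in [" January", " February", " Monday", " Tuesday", " cat", " dog"]:
    token_id = tok.encode(word)[0]                    # each of these is a single GPT-2 token
    print(word, xy[token_id])
```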

PCA scatter · 32,000 tokens

TinyLlama — semantic geometry

vocab × hidden = 32,000 × 2048

Same idea on a different model. The clusters look different: TinyLlama uses a SentencePiece tokenizer with its own separately trained vocabulary.