The model's embedding matrix is a giant table: one row per token in its vocabulary,
and each row is a vector of fixed length, the model's embedding dimension. When the tokenizer turns your
prompt into integer ids, the model uses each id as a row index into this table to
pull out one row. That's literally it — no math yet. Just a lookup.
Embedding matrix stats: shape (vocab size × embedding dim) · total memory at fp32 · one row per vocabulary entry (a sample of rows shown above)
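For a concrete sense of scale, here is the arithmetic with GPT-2's published dimensions (50,257 tokens × 768 dims), which may differ from the model shown here. At fp32, every value costs 4 bytes.

```python
# GPT-2-sized example (50,257 x 768 at fp32); substitute your model's numbers.
vocab_size, embed_dim, bytes_per_value = 50_257, 768, 4
size_mib = vocab_size * embed_dim * bytes_per_value / 2**20
print(f"{size_mib:.0f} MiB")  # ~147 MiB for the embedding table alone
```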
Section 2 · The neighborhood (PCA · 2D)
Plotting a sample of tokens after squashing the full embedding space down to 2
dimensions via PCA. The two axes are the directions of greatest
spread across the full vocabulary (PC1 and PC2, each capturing a share of
the total variance). Tokens with similar meanings often end up near each other;
this is learned structure, not designed.
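A sketch of that projection using scikit-learn, with random vectors standing in for real embedding rows (the sizes and data are placeholders, not this page's actual values):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for real embedding rows: 1,000 "tokens", 64 dims each.
# Real embeddings have learned structure, so PC1/PC2 capture much more
# variance than they would for random data like this.
embeddings = rng.standard_normal((1_000, 64)).astype(np.float32)

pca = PCA(n_components=2)
coords = pca.fit_transform(embeddings)  # (1000, 2): one (PC1, PC2) point per token
pc1, pc2 = pca.explained_variance_ratio_ * 100
print(f"PC1 = {pc1:.1f}%, PC2 = {pc2:.1f}% of total variance")
```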