LLM · Transformers · Inference · 2025

LLM Inference — From Tokens to Claude

How does typing a prompt lead to an answer? This guide walks through every step — tokenization, attention, KV cache, sampling — from raw text to real-time token generation. No assumptions. All interactive.

0
The Big Picture — What Actually Happens

One sentence: an LLM is a function that takes a sequence of tokens and outputs a probability distribution over the next token.

That's it. Everything else — attention, layers, KV cache, sampling — is either how that function is built or how it's run efficiently. When you type a message to Claude, here's what happens:

End-to-End Inference Pipeline
1
Tokenize

Text → integer IDs. "Hello world" → [9906, 1917]. No model yet.

2
Embed

Each token ID → a dense vector of 4096–8192 floats. Lookup table.

3
N × Transformer

Stack of 32–128 identical blocks, each with attention + FFN. Runs on GPU.

4
Sample

Final vector → logits over vocab → softmax → pick next token.

5
Repeat

Append the new token to the context and run steps 2–4 again. Stop at EOS (end-of-sequence) or a length limit.

1
Tokenization — Text Is Not What the Model Sees

The model never sees characters or words. It sees integers.

A tokenizer converts raw text into a sequence of integer IDs from a fixed vocabulary, typically 32k–128k entries. Modern LLMs use subword algorithms such as Byte-Pair Encoding (BPE) or a unigram language model, often through libraries like SentencePiece or tiktoken. These learn to split text into subword units that balance coverage and efficiency.

How BPE Works

Starting from individual bytes/characters, BPE repeatedly finds the most frequent adjacent pair and merges it into a new token:

# Initial character split:
"tokenization" → [t,o,k,e,n,i,z,a,t,i,o,n]
# After BPE merges (GPT-4 tokenizer):
→ ["token", "ization"] → [9263, 2065]   # 2 tokens, not 12

Common English words → 1 token. Rare words/names → 2–4 tokens. Code → varies. Whitespace is baked into tokens (e.g. GPT-2-style vocabularies store " hello" as Ġhello, where Ġ marks the leading space).
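The merge loop described above can be sketched in a few lines of Python. This is a toy trainer, assuming a single short string as the whole corpus and characters rather than bytes as the starting alphabet; production tokenizers add pre-tokenization, byte-level fallbacks, and corpus-wide frequency counts. The function names are chosen here for illustration.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges

tokens, merges = train_bpe("low lower lowest", num_merges=3)
print(merges)   # learned merge rules, in order
print(tokens)   # the text re-segmented with those rules
```

Each merge shortens the token sequence, and joining the tokens always reproduces the original text, which is why BPE is lossless.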

Why It Matters for Inference

  • Context length is measured in tokens, not characters.
  • A 200,000-token context window ≈ 150,000 words ≈ a full novel.
  • Pricing, latency, and memory all scale with token count.
  • Non-English text is often less efficient: e.g. a Chinese character can cost more than one token, depending on the tokenizer (newer, larger vocabularies narrow the gap).
  • Some models tokenize numbers digit by digit ("12345" = 5 tokens) while others use multi-digit chunks; either way, the split affects arithmetic ability.
$$\text{tokens} \approx \frac{\text{words}}{0.75} \quad (\text{rough English rule of thumb})$$
Interactive tokenizer (widget): type text to see it split into colored tokens, with token count, character count, and characters per token. Vocab size: ~100K.
2
Embeddings — Integers → High-Dimensional Vectors

An embedding is just a learned lookup table.

The model has a matrix $W_E \in \mathbb{R}^{|V| \times d}$ where $|V|$ is vocab size and $d$ is the model dimension (e.g. 4096). Looking up token $i$ just fetches row $i$. There's nothing magical — it's a trainable array that gets updated via gradient descent until semantically similar tokens have similar vectors.

$$\mathbf{x}_i = W_E[\text{token}_i] \in \mathbb{R}^d \quad \text{(embedding lookup for token } i\text{)}$$

Positional Encoding — Where Am I In The Sequence?

Attention has no built-in sense of order. Without positional encoding, "dog bites man" = "man bites dog." Two main approaches:

Sinusoidal (original Transformer):
$$\text{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \text{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

Fixed, not learned. Different frequencies encode different scales. Can generalize to unseen lengths.
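As a minimal sketch of the sinusoidal formula, the function below computes the encoding for one position. It assumes an even `d_model`; the function name is chosen here for illustration.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for position `pos`."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)       # even index: sine
        pe[2 * i + 1] = math.cos(angle)   # odd index: cosine
    return pe

pe0 = sinusoidal_pe(0, 8)   # position 0: every sine is 0, every cosine is 1
pe5 = sinusoidal_pe(5, 8)
```

Low-index pairs oscillate quickly with position and high-index pairs slowly, which is what "different frequencies encode different scales" means in practice.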

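RoPE (Rotary Position Embedding; used in LLaMA and most open-weight LLMs):
$$\mathbf{q}_m \cdot \mathbf{k}_n = (R_m \mathbf{q}) \cdot (R_n \mathbf{k}) = \mathbf{q}^\top R_{n-m}\,\mathbf{k} \quad \text{(position enters only through } m-n\text{)}$$

Encodes relative distance by rotating Q and K vectors with a position-dependent rotation $R_m$. It extends to long contexts better than absolute encodings and is used in most modern open LLMs; closed models such as Claude and GPT-4 have not published their position encoding.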
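The relative-position property can be checked numerically. The sketch below assumes the common formulation in which consecutive pairs of dimensions are rotated with frequencies $10000^{-2i/d}$; the vectors `q` and `k` are arbitrary illustrative values.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * theta_i."""
    d = len(vec)
    out = [0.0] * d
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3) at very different absolute positions.
score_near = dot(rope_rotate(q, 10), rope_rotate(k, 7))
score_far = dot(rope_rotate(q, 1003), rope_rotate(k, 1000))
print(abs(score_near - score_far) < 1e-9)   # the two scores agree
```

Because only the offset matters, the same query-key pair scores identically wherever it appears in the sequence.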

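Input to the First Layer

With an additive position encoding (sinusoidal or learned), the full input to the first transformer layer is:

$$\mathbf{h}_i^{(0)} = W_E[\text{token}_i] + \text{PE}(i) \in \mathbb{R}^d$$

With RoPE nothing is added at this stage: $\mathbf{h}_i^{(0)} = W_E[\text{token}_i]$, and position enters later, when Q and K are rotated inside each attention layer.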

For a sequence of $n$ tokens, this gives a matrix $H^{(0)} \in \mathbb{R}^{n \times d}$. This flows through all $L$ transformer layers.

# Illustrative dimensions for a large model. Anthropic has not published
# Claude's architecture, so treat these as placeholders in the range of open models.
vocab_size  = 128_256
d_model     = 8_192     # embedding dimension
n_heads     = 64        # attention heads
d_head      = 128       # d_model / n_heads
n_layers    = 96        # transformer blocks
max_context = 200_000   # tokens
3
Attention — How Tokens Talk to Each Other

Attention lets every token look at every other token and decide how much to "attend" to each.

This is the key insight of the Transformer: long-range dependencies are handled directly — a token at position 1 can directly influence a token at position 10,000. In an RNN, information had to travel 10,000 steps.

Scaled Dot-Product Attention — Step by Step

Given the hidden state matrix $H \in \mathbb{R}^{n \times d}$, three projections create Q (queries), K (keys), and V (values):

$$Q = H W_Q, \quad K = H W_K, \quad V = H W_V \qquad W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$$
Q
Query — "What am I looking for?"

Each token projects to a query vector: its "search intent." The current token asks: which other tokens are relevant to me?

K
Key — "What do I contain?"

Each token projects to a key vector: its "content label." Matching query against keys determines relevance scores.

V
Value — "What do I contribute?"

Each token projects to a value vector: the actual information to be passed. High-attention tokens contribute more.

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Why divide by √d_k?

Without the scaling, the dot products $QK^T$ grow large in magnitude as $d_k$ increases — because they're sums of $d_k$ terms. Large values push softmax into extremely peaked (near-one-hot) distributions where gradients vanish. Dividing by $\sqrt{d_k}$ keeps the variance at ~1 regardless of dimension.
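The attention formula can be written out directly for small matrices. This is a minimal, unoptimized sketch in plain Python (nested lists instead of GPU tensors); the function names and the example values are chosen here for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with one output row per query."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                  # one query
K = [[1.0, 0.0], [0.0, 1.0]]      # two keys; the first matches the query
V = [[10.0, 0.0], [0.0, 10.0]]    # two values
out = attention(Q, K, V)
print(out)   # the output leans toward the first value vector
```

The output is a weighted average of the value rows, with more weight on the value whose key best matches the query.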

Interactive Attention Heatmap — click a token to see what it attends to
Click any token in the sentence to see its attention pattern. Each head learns different relationships — syntactic, semantic, coreference, etc.

Multi-Head Attention — Run it H times in parallel

Instead of one attention function, use $H$ separate heads with smaller dimension $d_k = d/H$. Each head can specialize — one might attend to syntax, another to coreference, another to proximity. Outputs are concatenated and projected:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\,W_O$$

where $\text{head}_i = \text{Attention}(X W_i^Q, X W_i^K, X W_i^V)$ and $X$ is the layer's input (the hidden-state matrix from above). Attention is the only place in the block where information moves between token positions.

Causal (Masked) Attention for Inference

During training, the model predicts all positions in parallel. To prevent token $i$ from "seeing" future tokens $j > i$, a causal mask sets future attention scores to $-\infty$ before softmax:

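$$A_{ij} = \begin{cases} \dfrac{\exp\!\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j' \leq i} \exp\!\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)} & j \leq i \\ 0 & j > i \end{cases}$$

During decode (generating one token at a time) this holds automatically, because future tokens don't exist yet. Prefill still applies the mask, since all prompt positions are processed in parallel.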
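A small sketch of the causal mask: future scores are set to $-\infty$, so they receive exactly zero weight after softmax. The function name and the score matrix are illustrative.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], restricted to j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(row)
        exps = [math.exp(s - m) for s in row]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

W = causal_attention_weights([[0.0, 5.0, 5.0],
                              [1.0, 1.0, 9.0],
                              [2.0, 2.0, 2.0]])
for row in W:
    print(row)
```

The first token can only attend to itself, the second splits its weight over the first two positions, and the large scores above the diagonal have no effect.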

4
The Transformer Block — The Repeating Unit

Every layer applies the same pattern: Norm → Attention → Residual → Norm → FFN → Residual

Single Transformer Block — data flow diagram

Layer Norm — Why Before Attention?

Modern LLMs use Pre-LayerNorm, typically with RMSNorm (as in LLaMA): normalize the input before attention, not after. This is more stable during training at large scale.

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2 + \epsilon}} \cdot \gamma$$

$\gamma$ is a learned per-dimension scale. Unlike LayerNorm, RMSNorm skips the mean subtraction and the bias term, which makes the normalization step itself cheaper with little or no quality loss in practice.
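The RMSNorm formula translates almost line for line into code. A minimal sketch for a single vector; the `eps` default is an assumed value.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Scale x to unit root-mean-square, then apply the learned gain gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, gamma)]

y = rms_norm([3.0, -4.0], gamma=[1.0, 1.0])
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print(rms_after)   # very close to 1
```

Note that the sign and relative size of each component are preserved; only the overall magnitude is normalized.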

Feed-Forward Network — Where "Knowledge" Is Stored

After attention (which mixes tokens), a per-token FFN re-processes each position independently. Modern models use SwiGLU:

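$$\text{FFN}(x) = \big(\text{SiLU}(x W_1) \odot (x W_3)\big) W_2, \qquad \text{SiLU}(z) = z \cdot \sigma(z)$$

A classic (non-gated) FFN is typically 4× wider than the embedding dimension: if $d=4096$, then $d_{ff}=16384$. SwiGLU models usually shrink this to roughly $\tfrac{8}{3}d$ (e.g. 11008 in LLaMA 7B) so that the three weight matrices hold about as many parameters as the original two. Research shows "factual recall" is strongly associated with FFN weights, which act as key-value memories.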

This is where "Paris is the capital of France" lives.
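A minimal sketch of a SwiGLU feed-forward layer for one token vector, assuming the gate uses SiLU (the Swish activation that gives SwiGLU its name) and omitting biases. The tiny weight matrices are made-up values for illustration only.

```python
import math

def silu(z):
    """SiLU / Swish: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    """Row vector x (length m) times matrix W (m rows, n columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W1, W3, W2):
    gate = [silu(z) for z in matvec(x, W1)]      # gating branch
    up = matvec(x, W3)                           # linear branch
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise product
    return matvec(hidden, W2)                    # project back to d dimensions

x = [1.0, 2.0]                                   # d = 2
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]          # d x d_ff, with d_ff = 3
W3 = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]          # d x d_ff
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # d_ff x d
y = swiglu_ffn(x, W1, W3, W2)
```

The layer is applied to every position independently, so no information crosses between tokens here.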

Residual Connections — The Secret to Deep Training

Each sub-layer (attention and FFN) adds its output back to the residual stream: $\mathbf{h}' = \mathbf{h} + \text{SubLayer}(\text{Norm}(\mathbf{h}))$. This lets gradients flow directly backward through all layers, enabling training of 100+ layer networks.

Residual stream intuition

Think of the residual stream as a "shared whiteboard" that every layer can read from and write to. Early layers write syntactic features, middle layers write semantic features, late layers write task-specific features. Each layer adds incremental refinements rather than replacing the representation.

5
Autoregressive Generation — One Token at a Time

The model generates text by running inference in a loop — each new token becomes part of the input for the next step.

There are two distinct phases:

P
Prefill Phase

Process the entire prompt in parallel. All $n$ input tokens run through all $L$ layers simultaneously using the full attention matrix. Cost: $O(n^2 \cdot L)$. This is fast because GPUs excel at parallel matrix multiplication. Fills the KV cache.

D
Decode Phase

Generate tokens one at a time. Each new token requires one forward pass through all $L$ layers. Cost per token: $O(n \cdot L)$ using the KV cache. This is memory-bandwidth bound on GPU — the bottleneck for long generations.

Autoregressive Generation — step through token by token
Each step: the last token in context is the query. The model produces logits over the full vocabulary. We sample one token and append it.
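The decode loop itself is short. In the sketch below, `toy_logits` is a hypothetical stand-in for the full transformer forward pass, and greedy selection is used in place of sampling to keep the output deterministic.

```python
EOS = 0
VOCAB_SIZE = 5

def toy_logits(context):
    """Hypothetical stand-in for a transformer forward pass.

    Favors (last_token + 1) mod VOCAB_SIZE, so generation counts upward
    and eventually wraps around to EOS (token 0).
    """
    favored = (context[-1] + 1) % VOCAB_SIZE
    return [5.0 if t == favored else 0.0 for t in range(VOCAB_SIZE)]

def generate(prompt, max_new_tokens=16):
    context = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_logits(context)                                   # forward pass
        next_token = max(range(VOCAB_SIZE), key=lambda t: logits[t])   # greedy pick
        context.append(next_token)                                     # feed it back in
        if next_token == EOS:
            break
    return context

print(generate([1, 2]))   # [1, 2, 3, 4, 0]
```

Swapping the greedy pick for a sampling step changes the output distribution but not the structure of the loop.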
6
KV Cache — The Engineering That Makes Inference Practical

Without the KV cache, every decode step would recompute all keys and values from scratch — O(n²) per token instead of O(n).

The KV cache stores the Key and Value matrices from the attention computation for all previously processed tokens. Since token $i$'s keys and values never change once computed (causal masking means future tokens don't affect past representations), they can be reused indefinitely.
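To make the reuse concrete, here is a sketch of one attention head decoding with and without a cache. It assumes a single layer whose inputs are fixed per token; the weight matrices are small made-up values and the class name `KVCache` is chosen here.

```python
import math

def matvec(x, W):
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    """Attention output for one query over all stored keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [-0.2, 0.3]]
W_V = [[1.0, 2.0], [0.0, -1.0]]

def decode_without_cache(hidden_states):
    """Recompute K and V for every token, then attend from the last one."""
    keys = [matvec(h, W_K) for h in hidden_states]
    values = [matvec(h, W_V) for h in hidden_states]
    return attend(matvec(hidden_states[-1], W_Q), keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def decode_step(self, h_new):
        """Project only the new token; reuse cached K/V for earlier tokens."""
        self.keys.append(matvec(h_new, W_K))
        self.values.append(matvec(h_new, W_V))
        return attend(matvec(h_new, W_Q), self.keys, self.values)

hidden = [[1.0, 0.0], [0.5, -1.0], [0.2, 0.7]]
cache = KVCache()
cached_outputs = [cache.decode_step(h) for h in hidden]
recomputed = decode_without_cache(hidden)
```

Both paths give the same output for the last token, but the cached path does one K/V projection per step instead of one per token in the context.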

KV Cache — compute vs. cache comparison
Orange = compute (matrix multiply). Green = read from cache (memory bandwidth only). Without cache, every step recomputes the entire context.

KV Cache Memory

Storing one token's keys and values takes $2 \times d_k \times H_{kv}$ values per layer, where $H_{kv}$ is the number of K/V heads (equal to $H$ without GQA). For a 70B model with a 128K context:

$$\text{KV size} = 2 \times d_k \times H_{kv} \times L \times n_\text{tokens} \times \text{bytes per value}$$
# LLaMA-3 70B KV cache estimate
n_layers   = 80
n_kv_heads = 8         # GQA (grouped query)
d_head     = 128
seq_len    = 128_000
bytes_per  = 2         # bfloat16
kv_gb = 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per / 1e9
# ≈ 42 GB per sequence, about 30% of the ~140 GB needed for the bf16 weights

GQA — Grouped Query Attention

Modern LLMs use Grouped Query Attention (GQA) or Multi-Query Attention (MQA) to shrink KV cache size. Instead of $H$ separate K/V heads, multiple query heads share one K/V head:

| Method | K/V Heads | KV Cache | Quality |
|---|---|---|---|
| MHA | H = Q heads | 100% | Best |
| GQA | H/G groups | ~13% | ≈ MHA |
| MQA | 1 shared | H× less | Slight drop |

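LLaMA 3 and Mistral use GQA; closed models such as Claude and Gemini have not published these details. With 8 K/V heads shared across 64 query heads, the cache is 8× smaller, so the same memory holds an 8× longer context.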
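The effect of sharing K/V heads can be checked with the same arithmetic as the cache estimate above. The sketch assumes a LLaMA-3-70B-like shape (80 layers, 64 query heads, head dimension 128) at 128K tokens in bfloat16; the helper name is chosen here.

```python
def kv_cache_gb(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """2 (K and V) x layers x KV heads x head dim x tokens x bytes, in GB."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value / 1e9

shape = dict(n_layers=80, d_head=128, seq_len=128_000)

mha = kv_cache_gb(n_kv_heads=64, **shape)   # one K/V head per query head
gqa = kv_cache_gb(n_kv_heads=8, **shape)    # 8 query heads share each K/V head
mqa = kv_cache_gb(n_kv_heads=1, **shape)    # all query heads share one K/V head

print(round(mha, 1), round(gqa, 1), round(mqa, 1))   # 335.5 41.9 5.2
```

With full multi-head attention the cache alone would exceed the memory of several GPUs, which is why grouped sharing is attractive for long contexts.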

7
Sampling — Turning Logits into Tokens

The model outputs a vector of 100K+ raw scores (logits). How you convert those to the next token is a design choice with huge impact on quality and diversity.

Sampling Strategies — interactive probability distribution
Watch how temperature sharpens/flattens the distribution, top-p cuts off the tail, and top-k retains only the highest-k candidates.

Temperature Scaling

Divide logits by temperature $T$ before softmax:

$$p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$
  • T → 0: argmax (greedy). Always picks most likely token. Deterministic but repetitive.
  • T = 1: unmodified model probabilities.
  • T > 1: more uniform, more random, more "creative." Can generate nonsense.
  • The Anthropic API defaults to T = 1; values closer to 0 are recommended for analytical tasks such as code and math.
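The temperature formula can be tried on a three-token example. A minimal sketch, assuming $T > 0$ (greedy decoding at $T = 0$ is handled separately as an argmax); the logits are illustrative.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, T=0.5)   # more peaked
base = softmax_with_temperature(logits, T=1.0)    # unmodified
flat = softmax_with_temperature(logits, T=2.0)    # closer to uniform

print(sharp[0] > base[0] > flat[0])   # True
```

The ranking of tokens never changes with temperature; only how much probability the top choices take from the rest.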

Top-p and Top-k Filtering

Top-k: keep only the $k$ highest-scoring tokens and mask the rest (set their logits to $-\infty$) before softmax, which renormalizes over the survivors. Simple but ignores the shape of the distribution.

Top-p (nucleus sampling): sort tokens by probability; keep the smallest set whose cumulative probability exceeds $p$. Adapts to the distribution — in confident cases (one clear answer) keeps 1–2 tokens; in uncertain cases keeps many.

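$$\text{nucleus} = \{w_{(1)}, \ldots, w_{(m)}\}, \qquad m = \min\Big\{ m' : \sum_{r=1}^{m'} p_{w_{(r)}} \geq p \Big\}$$

where $w_{(1)}, w_{(2)}, \ldots$ are the tokens sorted by decreasing probability.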

A common default is T=1 with top-p ≈ 0.9–0.95, though settings vary by model and provider.
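Both filters can be sketched over an already-normalized probability list. Filtering probabilities and renormalizing, as done here, is equivalent to masking logits before softmax. The function names and the example distribution are illustrative.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

def sample(filtered, rng=random):
    """Draw one token ID from a {token_id: probability} mapping."""
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
nucleus = top_p_filter(probs, p=0.85)   # keeps tokens 0, 1, 2
top2 = top_k_filter(probs, k=2)         # keeps tokens 0, 1
next_token = sample(nucleus)
```

With a peaked distribution top-p keeps very few tokens, while top-k always keeps exactly `k` regardless of how confident the model is.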

8
Context Window — What Limits How Much the Model Can "See"

What is the context window?

The maximum number of tokens the model can process in one forward pass — the size of the "working memory." Everything outside this window is invisible to the model.

Growing the context window is hard because:

  • Memory: KV cache grows linearly with context.
  • Compute: Attention is $O(n^2)$ in the context length.
  • Position generalization: Models trained on short sequences don't automatically work on longer ones.

RoPE + YaRN — Extending Context

RoPE encodes position by rotating each pair of Q/K dimensions by an angle $m\,\theta_i$, where $m$ is the position and $\theta_i = b^{-2i/d}$ is the frequency of pair $i$. The base $b$ (10,000 in the original formulation) controls how fast the angles rotate per step.

RoPE scaling tricks:

  • Linear interpolation: scale positions down by $s$ to fit larger context into trained range.
  • NTK-aware: scale different frequency components differently.
  • YaRN: ramp function that scales high-freq dims less aggressively. LLaMA 3.1 uses a similar frequency-dependent scaling to reach 128K context.

Claude 3.5 achieves 200K context; Gemini 1.5 achieves 1M tokens.

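| Model | Context Window | Position Encoding | KV Cache Size (est.) |
|---|---|---|---|
| GPT-4o | 128K tokens | RoPE (assumed; not published) | ~26 GB @ 128K (rough guess) |
| Claude 3.5 Sonnet | 200K tokens | RoPE (assumed; not published) | ~42 GB @ 200K (rough guess) |
| Gemini 1.5 Pro | 1M tokens | ALiBi + sparse (unconfirmed; not published) | ~200 GB @ 1M (rough guess) |
| LLaMA 3.1 405B | 128K tokens | RoPE with frequency scaling | ~66 GB @ 128K (bf16; 126 layers, 8 KV heads, d_head 128) |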
9
Scaling Laws — Bigger = Better, But How?

Chinchilla scaling laws (Hoffmann et al. 2022): for a given compute budget, train a model with N parameters on D ≈ 20N tokens.

Before Chinchilla, models were undertrained (too many parameters, too few tokens). Chinchilla showed that compute-optimal training scales model size and data in roughly equal proportion.

$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_\infty \quad \alpha \approx 0.34,\; \beta \approx 0.28$$
Scaling laws — loss vs. model size at fixed compute
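For each compute budget, there's an optimal model size. Too large = undertrained. Too small = capacity-limited. Chinchilla found that models such as GPT-3 and Gopher were substantially undertrained: Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) at a similar compute budget.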
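As a worked example of the 20-tokens-per-parameter rule, the sketch below splits a FLOP budget into model size and data. It assumes the widely used approximation $C \approx 6ND$ for training compute, which is not stated in the text above; combined with $D = 20N$ this gives $N = \sqrt{C/120}$.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget using C ~ 6*N*D (assumed) and D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget equal to 6 * 70e9 params * 1.4e12 tokens.
n_params, n_tokens = chinchilla_optimal(5.88e23)
print(f"{n_params:.3g} params, {n_tokens:.3g} tokens")
```

Under these assumptions the budget recovers roughly 70B parameters and 1.4T tokens. Doubling both requires four times the compute, since cost grows with the product $ND$.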
10
Current Trends — What Makes 2024–2025 LLMs Fast

⚡ Flash Attention

Reorders attention computation to avoid materializing the full $n \times n$ attention matrix in GPU main memory (HBM). It processes Q, K, and V in tiles that fit in fast on-chip SRAM and accumulates the softmax incrementally. 2–4× speedup on long sequences, with the same output up to floating-point rounding.

IO-aware FlashAttn-3

🔮 Speculative Decoding

A small "draft" model generates 4–8 tokens cheaply; the large model verifies all of them in a single forward pass using parallel attention. If accepted, get multi-token throughput at near 1-token latency. 2–3× speedup.

Medusa SpecTr

🧩 Mixture of Experts (MoE)

Replace each FFN layer with $E$ expert FFNs. A router selects the top-k experts per token (top-2 in Mixtral), so only $k$ of the $E$ experts run for each token: massive parameter count with moderate FLOP cost. Mixtral and Gemini 1.5 are MoE models; GPT-4 is reported to be one, though OpenAI has not confirmed it.

Mixtral 8×7B Switch Transformer

📦 Quantization

Reduce weights from FP32/BF16 to INT8 or INT4. 4-bit quantization (GPTQ, AWQ, GGUF) gives ~4× memory reduction relative to 16-bit weights, usually with only a small quality drop on most tasks. A 70B model then needs about 35 GB for its weights, which fits on a single 48 GB GPU.

GGUF/llama.cpp AWQ GPTQ

🔄 Continuous Batching

Rather than waiting for all requests in a batch to finish, new requests are inserted into the running batch as soon as a slot opens. Dramatically improves GPU utilization for variable-length generations. Used by vLLM, TensorRT-LLM.

vLLM PagedAttention

🧠 Extended Thinking / Chain of Thought

Models like Claude 3.7, o1/o3, and DeepSeek-R1 generate internal reasoning tokens before the final answer. These "thinking" tokens improve math, coding, and reasoning by letting the model explore and self-correct before committing.

Claude 3.7 OpenAI o3 DeepSeek-R1
Speculative Decoding — draft → verify loop
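The draft-then-verify loop can be sketched with greedy acceptance: a drafted token is kept only if the target model would have chosen the same token. Here `draft_next` and `target_next` are hypothetical toy functions standing in for the two models. A real system scores all drafted positions in one batched forward pass of the target model and, when sampling, uses a probabilistic acceptance rule instead of exact matching.

```python
def speculative_step(context, draft_next, target_next, k=4):
    """One draft-then-verify round with greedy acceptance."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. The target model checks each proposed position in order.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)    # correct the first mismatch and stop
            return context + accepted
        accepted.append(tok)

    # 3. Everything matched: the verification pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return context + accepted

def target_next(ctx):
    return (ctx[-1] + 1) % 10            # toy "large" model: count upward

def draft_next(ctx):
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 7 else nxt        # toy draft: wrong whenever the answer is 7

print(speculative_step([3], draft_next, target_next, k=4))   # [3, 4, 5, 6, 7]
```

The result is identical to what the target model would produce alone; the draft only changes how many target passes are needed to get there.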
11
How Claude (and GPT) Work End-to-End

Behind the chat interface is just a carefully formatted token sequence passed through the inference loop.

The Full Context Structure

# What Claude "sees" as tokens (simplified; the real special tokens differ):
<system>
You are Claude, made by Anthropic. Today is May 31, 2026. Be helpful.
</system>
<human>
What is attention in transformers?
</human>
<assistant>   # ← model starts generating here
Attention is a mechanism that...

All of this is flattened into one token sequence. There is no separate "system" processing — it's just context, with learned patterns from training that cause the model to treat <system> content as instructions.

Multi-Turn Conversations

Each new user turn simply appends to the same context window. The model never "remembers" previous turns — it re-reads the entire conversation history every time. This is why:

  • Long conversations use more tokens (and cost more).
  • Models can "forget" if context exceeds the window.
  • You can manipulate the model by injecting context early.

The "memory" in commercial products (like Claude Projects) is built by a separate system that selectively injects relevant past conversations into the context window.

Tool Use / Function Calling

When Claude uses a tool (web search, code execution), it works like this:

  1. Model generates a special <tool_call> token sequence specifying tool name + JSON arguments.
  2. Inference is paused. The tool runs in the real world (search, Python, database).
  3. Tool output is appended to context as a <tool_result>. Inference resumes.
  4. Model now "sees" the tool result and generates the final answer conditioned on it.
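The four steps above can be sketched as a loop around the model. Everything here is hypothetical scaffolding: `fake_model` stands in for the LLM, `TOOLS` is a made-up registry, and the tag format mirrors the simplified tags used in this section rather than any real API.

```python
import json

def fake_model(context):
    """Hypothetical stand-in for the LLM: requests a tool once, then answers."""
    if "<tool_result>" not in context:
        call = {"name": "add", "arguments": {"a": 2, "b": 3}}
        return "<tool_call>" + json.dumps(call) + "</tool_call>"
    result = context.split("<tool_result>")[-1].split("</tool_result>")[0]
    return f"The answer is {result}."

TOOLS = {"add": lambda a, b: a + b}      # hypothetical tool registry

def run_with_tools(prompt, max_rounds=5):
    context = prompt
    for _ in range(max_rounds):
        output = fake_model(context)     # step 1: model generates
        context += output
        if not output.startswith("<tool_call>"):
            return output                # step 4: final answer
        # Step 2: inference is paused while the tool runs.
        payload = output[len("<tool_call>"):-len("</tool_call>")]
        call = json.loads(payload)
        result = TOOLS[call["name"]](**call["arguments"])
        # Step 3: append the result so the next pass can condition on it.
        context += f"<tool_result>{result}</tool_result>"
    raise RuntimeError("tool loop did not finish")

print(run_with_tools("What is 2 + 3?"))   # The answer is 5.
```

From the model's side nothing special happens: the tool result is just more context, exactly like the system prompt or earlier turns.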

Streaming — How Tokens Appear One at a Time

When you see Claude typing character by character, it's because the server streams each token over HTTP (Server-Sent Events or WebSocket) as soon as it's sampled, rather than waiting for the full response. Each token takes ~30–80ms on production hardware (decode phase). The "typing" appearance is not theatrics — it's the actual decode loop.

# Python: Claude streaming with Anthropic SDK
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain KV cache"}],
) as stream:
    for text in stream.text_stream:
        # each `text` is a small chunk, roughly 1–3 tokens, as they're decoded
        print(text, end="", flush=True)

The Complete Picture

Text → Tokenize (BPE, ~100K vocab) → Embed (lookup table + RoPE) → $L\times$ [RMSNorm → Multi-Head Attention → Residual → RMSNorm → SwiGLU FFN → Residual] → Logits → Sample (T + top-p) → append token → repeat.

The KV cache makes decode $O(n)$ instead of $O(n^2)$. Speculative decoding, Flash Attention, GQA, and MoE are engineering innovations that keep this loop fast at scale. The "intelligence" emerges from training on trillions of tokens — the inference machinery just runs the resulting function.

All articles written with the help of Claude. Source credited to Dr. Jon Barron.