LLM Inference — From Tokens to Claude
How does typing a prompt produce an answer? Covers everything from scratch:
tokenization (BPE), embeddings, RoPE positional encoding, scaled dot-product
attention, multi-head attention, transformer blocks (RMSNorm, SwiGLU, residuals),
autoregressive decoding, KV cache, GQA, sampling strategies (temperature, top-p),
context windows, scaling laws, and current trends — speculative decoding, MoE,
FlashAttention, quantization. All interactive.
Transformers
Attention
KV Cache
Sampling
Speculative Decoding
Claude