
Video Encoding & Decoding

How raw pixels become a compressed bitstream: I, P, and B frames from first principles, GOP structure, quantization, rate-distortion trade-offs, and how FFmpeg ties it all together.

A 1080p video at 30 fps has roughly 62 million pixels per second (1920 × 1080 × 30). At 8 bits per sample in YCbCr 4:2:0 (12 bits per pixel on average), that's about 0.75 Gbps of raw data, and about 1.5 Gbps as 24-bit RGB: impractical to stream over the internet or to store efficiently. Video codecs solve this by exploiting two fundamental redundancies: spatial redundancy within a frame and temporal redundancy across frames. Understanding how they do this is the goal of this article.
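The raw-data arithmetic can be checked in a few lines. This is a minimal sketch; the function name `raw_bitrate_bps` and its parameters are illustrative, not part of any codec API.

```python
def raw_bitrate_bps(width, height, fps, bits_per_sample=8, samples_per_pixel=1.5):
    """Uncompressed video bitrate in bits per second.

    samples_per_pixel: 3.0 for RGB or YCbCr 4:4:4, 2.0 for 4:2:2, 1.5 for 4:2:0.
    """
    return width * height * fps * samples_per_pixel * bits_per_sample


pixels_per_second = 1920 * 1080 * 30
print(pixels_per_second)                                # 62208000
print(raw_bitrate_bps(1920, 1080, 30) / 1e6)            # 746.496 (Mbps, 4:2:0)
print(raw_bitrate_bps(1920, 1080, 30, 8, 3.0) / 1e9)    # 1.492992 (Gbps, RGB)
```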

Codecs Covered

The concepts here apply to all modern codecs: H.264/AVC, H.265/HEVC, H.266/VVC, VP9, and AV1. The terminology (I/P/B frames, GOP, QP) is universal. Implementation details (block sizes, transform types) differ by codec but the architecture is the same.

1. From Pixels to YCbCr — The Starting Point

Before any compression happens, raw video is typically converted from RGB to YCbCr. This is not a codec decision — it exploits human perception: our eyes are far more sensitive to brightness (luminance) than to color (chrominance).

$$\begin{bmatrix}Y\\Cb\\Cr\end{bmatrix} = \begin{bmatrix}0.299 & 0.587 & 0.114\\-0.169 & -0.331 & 0.500\\0.500 & -0.419 & -0.081\end{bmatrix}\begin{bmatrix}R\\G\\B\end{bmatrix} + \begin{bmatrix}0\\128\\128\end{bmatrix}$$

The 4:2:0 chroma subsampling format stores Cb and Cr at half the spatial resolution in both dimensions — one chroma sample per 2×2 luma block. This alone cuts data by 50% before any codec compression, with minimal perceived quality loss.
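A minimal sketch of both steps, assuming full-range 8-bit input and simple 2×2 averaging for the chroma planes (real pipelines may use different filters and chroma siting). The function names are illustrative.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one full-range 8-bit RGB pixel using the matrix above."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128
    cr = 0.500 * r - 0.419 * g - 0.081 * b + 128

    def clamp(v):
        return max(0, min(255, round(v)))

    return clamp(y), clamp(cb), clamp(cr)


def subsample_420(plane):
    """Average each 2x2 block of a chroma plane (even width and height assumed)."""
    h, w = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1]
              + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


print(rgb_to_ycbcr(255, 255, 255))        # (255, 128, 128): white has neutral chroma
print(rgb_to_ycbcr(255, 0, 0))            # (76, 85, 255): red pushes Cr to its maximum
print(subsample_420([[10, 20], [30, 40]]))  # [[25]]
```

A frame of W×H pixels then carries W×H luma samples plus two chroma planes of (W/2)×(H/2) samples each, which is where the 1.5 samples per pixel comes from.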

2. Block Partitioning — Dividing the Frame

Every codec works on rectangular blocks rather than individual pixels. The entire frame is partitioned into a grid of blocks, and each block is encoded independently (but potentially referencing other blocks for prediction).

Why Blocks?

Working in blocks enables: (1) the 2D DCT to be computed on manageable-size tiles; (2) motion compensation to match blocks rather than individual pixels; (3) rate-distortion optimization to make per-block coding decisions. Larger blocks are more efficient for smooth regions; smaller blocks capture fine details. Modern codecs use quad-tree partitioning to choose adaptively per region.
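The adaptive choice can be sketched as a recursive split. Real encoders decide each split by rate-distortion cost; the version below substitutes a simple variance threshold as a stand-in, and the names `quadtree`, `min_size`, and `threshold` are assumptions made for illustration.

```python
def variance(block):
    vals = [v for row in block for v in row]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def quadtree(frame, x, y, size, min_size=4, threshold=25.0):
    """Return (x, y, size) leaf blocks covering the square region at (x, y)."""
    block = [row[x:x + size] for row in frame[y:y + size]]
    # Smooth regions stay as one large block; busy regions are split in four.
    if size <= min_size or variance(block) <= threshold:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree(frame, x + dx, y + dy, half, min_size, threshold)
    return leaves


# A flat 16x16 frame with one bright 4x4 detail in the top-left corner.
frame = [[200 if (x < 4 and y < 4) else 100 for x in range(16)] for y in range(16)]
leaves = quadtree(frame, 0, 0, 16)
print(len(leaves))   # 7: four 4x4 blocks around the detail, three 8x8 blocks elsewhere
print(leaves[:4])    # [(0, 0, 4), (4, 0, 4), (0, 4, 4), (4, 4, 4)]
```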

3. The Full Encoding Pipeline

A video encoder is a sequence of tightly coupled modules. Here is the complete forward path for a single block:

1. Prediction: generate a predicted block using either intra (spatial) or inter (temporal) prediction. Subtract it from the original to get the residual.
2. Transform (DCT/DST): apply a 2D block transform to the residual. This compacts energy into a few low-frequency coefficients.
3. Quantization: divide each transform coefficient by a quantization step size and round to an integer. This is the only lossy step.
4. Entropy Coding (CABAC/CAVLC): losslessly compress the quantized coefficients using context-adaptive arithmetic or variable-length coding.
5. Reconstruction (encoder-side decode): dequantize, inverse-transform, add back the prediction. This is the reference frame for future inter predictions, so encoder and decoder must be in sync.
6. In-loop Filtering: a deblocking filter (plus SAO in HEVC, and CDEF and loop restoration in AV1) reduces blocking artifacts on the reference frames.

Key Insight — The Encoder Decodes Too

Step 5 is subtle but critical. The encoder must reconstruct the frame exactly as the decoder will, using only information already in the bitstream. Any arithmetic mismatch (for example, floating-point rounding that differs between implementations) would cause the prediction reference to drift, accumulating errors across frames. This is why H.264 and later standards specify the transform, inverse transform, and dequantization in exact integer arithmetic.
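Drift is easiest to see in one dimension. The sketch below is a toy DPCM coder, not a video codec: each sample is predicted from the previous one and only the quantized difference is transmitted. The function name and the `closed_loop` switch are illustrative assumptions.

```python
def dpcm_decode_output(samples, qstep, closed_loop=True):
    """Return what the decoder reconstructs from quantized sample differences.

    closed_loop=True:  the encoder predicts from its own reconstruction (step 5).
    closed_loop=False: the encoder predicts from the original sample,
                       which the decoder never sees.
    """
    enc_ref = 0
    dec_ref = 0
    decoded = []
    for s in samples:
        level = round((s - enc_ref) / qstep)   # the only value transmitted
        dec_ref += level * qstep               # decoder-side reconstruction
        decoded.append(dec_ref)
        enc_ref = dec_ref if closed_loop else s
    return decoded


samples = list(range(3, 31, 3))   # a ramp: 3, 6, ..., 30
for closed in (True, False):
    out = dpcm_decode_output(samples, qstep=4, closed_loop=closed)
    worst = max(abs(a - b) for a, b in zip(samples, out))
    print("closed" if closed else "open", "loop worst error:", worst)
# closed loop worst error: 2
# open loop worst error: 10
```

With the closed loop the error stays within half a quantization step; with the open loop each frame's rounding error is added on top of the previous one.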

4. I Frames — Intra-Coded, Self-Contained

An I-frame (Intra-coded frame) is encoded using only information within that frame itself — no reference to any other frame. It is the starting point of every Group of Pictures and enables random access (seeking) in a video stream.

4.1 Intra Prediction

Rather than predicting from other frames, intra prediction uses already-encoded neighboring blocks within the same frame — blocks to the left and above (in raster scan order). This exploits local spatial correlation.

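H.264 intra prediction modes (4×4 blocks): nine modes in total, namely Vertical (0), Horizontal (1), DC (2), Diagonal Down-Left (3), Diagonal Down-Right (4), Vertical-Right (5), Horizontal-Down (6), Vertical-Left (7), and Horizontal-Up (8).

H.265/HEVC expands this to 35 intra modes (33 angular directions + DC + Planar). AV1 supports 56 directional angles (8 nominal directions, each with 7 fine angle offsets) and adds non-directional modes such as DC, Paeth, and smooth predictors, as well as recursive filtering modes.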

$$\text{Intra Residual} = \text{Original Block} - \text{Intra Predicted Block}$$
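As a rough illustration, here are three of the simplest modes (vertical, horizontal, DC) and a mode choice by smallest SAD. Real encoders evaluate more modes and usually pick by rate-distortion cost; the function names and the `top`/`left` neighbor arrays are assumptions for this sketch.

```python
def intra_predict(top, left, mode):
    """Predict an n x n block from the reconstructed row above and column to the left."""
    n = len(top)
    if mode == "vertical":        # copy the row above downward
        return [list(top) for _ in range(n)]
    if mode == "horizontal":      # copy the left column rightward
        return [[left[y]] * n for y in range(n)]
    if mode == "dc":              # flat block at the rounded mean of all neighbors
        dc = (sum(top) + sum(left) + n) // (2 * n)
        return [[dc] * n for _ in range(n)]
    raise ValueError(mode)


def residual(block, pred):
    return [[b - p for b, p in zip(brow, prow)] for brow, prow in zip(block, pred)]


def best_intra_mode(block, top, left):
    def sad(mode):
        r = residual(block, intra_predict(top, left, mode))
        return sum(abs(v) for row in r for v in row)
    return min(("vertical", "horizontal", "dc"), key=sad)


# Vertical stripes continue the pattern of the row above, so vertical wins.
top = [10, 50, 10, 50]
left = [10, 10, 10, 10]
block = [[10, 50, 10, 50] for _ in range(4)]
print(best_intra_mode(block, top, left))                      # vertical
print(residual(block, intra_predict(top, left, "vertical")))  # all zeros
```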

4.2 2D DCT on the Residual

The residual block (typically 4×4, 8×8, or 16×16) is transformed with the 2D Discrete Cosine Transform (Type-II). For an $N \times N$ block $f[x,y]$:

$$F[u,v] = \frac{2}{N}\,C_u\,C_v \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f[x,y]\cos\!\left(\frac{(2x+1)u\pi}{2N}\right)\cos\!\left(\frac{(2y+1)v\pi}{2N}\right)$$

where $C_k = \tfrac{1}{\sqrt{2}}$ if $k=0$, else $C_k = 1$. The top-left coefficient $F[0,0]$ is the DC coefficient, which is proportional to the mean value of the block ($F[0,0] = N \times$ mean). All others are AC coefficients (oscillatory components). Natural images have most energy in the low-frequency (top-left) region; high-frequency coefficients are small and often zero after quantization.
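A direct, unoptimized transcription of the formula, together with its inverse, is shown below as a sketch. It uses floating point for readability; production codecs use fast integer approximations instead. The function names are illustrative.

```python
import math


def _c(k):
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct2d(f):
    """2D DCT-II of an N x N block, term by term as in the formula above."""
    n = len(f)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (f[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = (2 / n) * _c(u) * _c(v) * s
    return out


def idct2d(F):
    """Inverse of dct2d (the transform is orthonormal)."""
    n = len(F)
    out = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            s = 0.0
            for u in range(n):
                for v in range(n):
                    s += (_c(u) * _c(v) * F[u][v]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[x][y] = (2 / n) * s
    return out


flat = [[10] * 4 for _ in range(4)]
F = dct2d(flat)
print(round(F[0][0], 6))   # 40.0: N times the block mean; every AC term is ~0
```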


5. P Frames — Predictive, Forward Reference

A P-frame (Predictively coded frame) can reference blocks from one or more previously decoded frames in the past. Most blocks are encoded as a motion vector (where to find the matching block in the reference) plus a residual (the difference after prediction). Blocks with a good match have near-zero residuals and cost almost nothing to encode.

5.1 Motion Estimation (Encoder)

The encoder searches for the best matching block in the reference frame within a search window. Given a current block at position $(x,y)$, it finds the displacement $(\Delta x, \Delta y)$ minimizing some distortion metric — typically Sum of Absolute Differences (SAD) or Sum of Squared Errors (SSE):

$$(\hat{\Delta x}, \hat{\Delta y}) = \arg\min_{\Delta x, \Delta y} \sum_{i,j} \left| \text{Current}[i,j] - \text{Ref}[i+\Delta y,\, j+\Delta x] \right|$$

The motion vector $\mathbf{v} = (\hat{\Delta x}, \hat{\Delta y})$ is encoded (typically as a difference from a predicted MV). Sub-pixel motion estimation — searching at half-pixel or quarter-pixel positions using interpolation — dramatically improves prediction quality for a small bitrate overhead.
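An exhaustive integer-pixel search that minimizes SAD can be sketched as follows. It omits sub-pixel refinement and motion-vector prediction, and the helper names (`sad`, `full_search`, `frame_with_square`) are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of cur at (bx, by) and ref displaced by (dx, dy)."""
    return sum(abs(cur[by + i][bx + j] - ref[by + dy + i][bx + dx + j])
               for i in range(n) for j in range(n))


def full_search(cur, ref, bx, by, n=4, search_range=3):
    """Try every displacement in the window; return (best_sad, dx, dy)."""
    h, w = len(ref), len(ref[0])
    best = (float("inf"), 0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= by + dy and by + dy + n <= h
                      and 0 <= bx + dx and bx + dx + n <= w)
            if not inside:
                continue   # candidate block would fall outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if cost < best[0]:
                best = (cost, dx, dy)
    return best


def frame_with_square(x0, y0, size=12):
    """Black frame containing one textured 4x4 object at (x0, y0)."""
    f = [[0] * size for _ in range(size)]
    for i in range(4):
        for j in range(4):
            f[y0 + i][x0 + j] = 100 + 4 * i + j
    return f


ref = frame_with_square(3, 3)
cur = frame_with_square(5, 4)               # object moved right 2, down 1
print(full_search(cur, ref, bx=5, by=4))    # (0, -2, -1)
```

The vector points from the current block back to where the same content sat in the reference, hence the negative signs.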

5.2 Motion Compensation (Decoder)

The decoder doesn't search — it just uses the transmitted motion vector to fetch the reference block:

$$\text{Reconstructed Block} = \text{Ref}[y+\Delta y,\, x+\Delta x] + \text{IDCT}(\text{dequantized residual coefficients})$$
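The decoder side of the same operation is short. This sketch assumes an integer-pixel motion vector and a residual that has already been dequantized and inverse-transformed; `motion_compensate` is an illustrative name.

```python
def motion_compensate(ref, bx, by, mv, residual):
    """Rebuild the block at (bx, by): fetch ref at the MV offset, add residual, clip to 8 bits."""
    dx, dy = mv
    n = len(residual)
    return [[max(0, min(255, ref[by + dy + i][bx + dx + j] + residual[i][j]))
             for j in range(n)]
            for i in range(n)]


ref = [[10 * y + x for x in range(6)] for y in range(6)]
residual = [[1, -1], [0, 2]]
print(motion_compensate(ref, bx=2, by=2, mv=(1, -1), residual=residual))
# [[14, 13], [23, 26]]
```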
Asymmetry: Encoding is Expensive, Decoding is Fast

Motion estimation is computationally expensive: exhaustive block matching over a search range of $\pm W$ pixels evaluates $O(W^2)$ candidate positions per block, and each candidate costs one distortion computation over the whole block. Real-time encoders use fast heuristics: Three-Step Search, Diamond Search, or learned search strategies.
Motion compensation (decoder side) is just a memory fetch at the transmitted offset (plus interpolation for sub-pixel vectors) followed by an IDCT, which is orders of magnitude cheaper.


6. B Frames — Bidirectional Reference

A B-frame (Bidirectionally predicted frame) can reference frames from both the past and the future. It is the most compression-efficient frame type because it has more prediction candidates to choose from.

6.1 Reference Frame Structure

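For a B-frame at time $t$, let $R_-$ be a past reference and $R_+$ be a future reference. Each block can be predicted in one of three ways: forward from $R_-$ only, backward from $R_+$ only, or bidirectionally as the average of both motion-compensated blocks:

$$\text{Pred}[y,x] = \tfrac{1}{2}\left(R_-[y+\Delta y_-,\, x+\Delta x_-] + R_+[y+\Delta y_+,\, x+\Delta x_+]\right)$$

where $(\Delta x_-, \Delta y_-)$ and $(\Delta x_+, \Delta y_+)$ are the motion vectors into the past and future references.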

B-Frame Decoding Order ≠ Display Order

Because B-frames reference the future, the encoder must delay transmission. If the display order is I B B P, the bitstream order is I P B B — the encoder transmits P before B so the decoder has both references available before decoding the B-frames. This introduces a decode buffer delay, which is why low-latency streaming (live video, gaming) avoids B-frames.
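The reordering rule can be sketched in a few lines, assuming the classic pattern in which B-frames are not themselves references and each run of B-frames depends on the next I or P anchor. The function name `decode_order` is illustrative.

```python
def decode_order(display_types):
    """Map display-order frame types to the order in which they must be decoded."""
    order, pending_b = [], []
    for idx, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(idx)        # must wait for the next anchor
        else:
            order.append(idx)            # anchor (I or P) goes first
            order.extend(pending_b)      # then the B-frames that needed it
            pending_b = []
    order.extend(pending_b)              # trailing B-frames with no later anchor
    return order


types = list("IBBPBBP")
print([types[i] + str(i) for i in decode_order(types)])
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```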

Why do B-frames compress better than P-frames?
B-frames have access to the future reference, which often gives a much better match for regions that are occluded in the past reference, newly appearing objects, or motion blur. The bidirectional average prediction also naturally handles objects moving at sub-block resolution: neither forward nor backward prediction alone captures it as well as their average. In practice B-frames are often credited with reducing bitrate by 10–40% vs P-frames for the same quality, depending on content.

Can B-frames reference other B-frames?
In MPEG-2 and earlier standards, B-frames cannot be reference frames. H.264/AVC and later codecs (H.265/HEVC, H.266/VVC) allow B-frames to be used as references, which enables hierarchical B-frames (B-pyramids), the structure behind the RA (Random Access) coding configuration. In such a reference pyramid, B-frames at coarser temporal scales are decoded first and used as references for finer-scale B-frames.

What is a "skip" block?
A skip block is one where the encoder signals that the entire block, both motion vector and residual, is to be inferred from context (typically the median-predicted MV with zero residual). It costs almost zero bits and is extremely common in static regions. P-skip uses forward prediction; B-skip uses direct mode. In mostly static content, skip blocks can account for a large share of all macroblocks (figures of 30–60% are often quoted).

7. GOP — Group of Pictures

A Group of Pictures (GOP) is the sequence of frames between two consecutive I-frames. Its structure determines the compression efficiency, random access granularity, and error resilience of the video.

7.1 GOP Parameters

7.2 Open vs. Closed GOP

| Property | Open GOP | Closed GOP |
| --- | --- | --- |
| B-frames reference across boundary? | Yes | No |
| Random access at I-frame? | Need prior frames | Clean cut |
| Compression efficiency | Higher | Lower |
| Adaptive bitrate (HLS/DASH)? | Harder | Required |

GOP Size Trade-offs

Smaller GOP (e.g., N=15): better error recovery and random access, worse compression. Used for broadcast TV and streaming where seeking matters.
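Larger GOP (e.g., N=120): better compression (more temporal prediction coverage), poor seekability. Used for offline encoding such as archival files, where seeking granularity matters less. Note that some delivery formats, Blu-ray among them, cap the GOP length regardless.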
Keyframe interval in practice: adaptive streaming (HLS/DASH) requires segment boundaries to align with I-frames. With 2-second segments at 30fps, GOP size = 60 is common.
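To make the trade-offs concrete, the sketch below prints the display-order frame types of one GOP and computes the keyframe interval needed for segment alignment. It assumes a fixed pattern with a constant number of B-frames between anchors; real encoders place frame types adaptively, and both function names are illustrative.

```python
def gop_pattern(gop_size, b_frames):
    """Display-order frame types for one GOP: an I-frame, then P anchors separated by B-frames."""
    types = []
    for i in range(gop_size):
        if i == 0:
            types.append("I")
        elif i % (b_frames + 1) == 0:
            types.append("P")
        else:
            types.append("B")
    return "".join(types)


def keyframe_interval(fps, segment_seconds):
    """GOP size that puts exactly one I-frame at the start of every segment."""
    return round(fps * segment_seconds)


print(gop_pattern(12, 2))         # IBBPBBPBBPBB
print(keyframe_interval(30, 2))   # 60
```

In this pattern the trailing B-frames need the next GOP's I-frame as their future reference, which is exactly the open-GOP case from the table in 7.2; a closed GOP would end on a P-frame instead.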

8. Quantization — The Only Lossy Step

Quantization maps continuous transform coefficients to a discrete set of integer levels. It is the sole source of quality loss in a standard video codec. Everything else — prediction, transform, entropy coding — is lossless (within the codec's precision).

8.1 Scalar Quantization

Each DCT coefficient $F[u,v]$ is quantized independently:

$$\hat{F}[u,v] = \text{round}\!\left(\frac{F[u,v]}{Q_{\text{step}}(u,v)}\right), \qquad \tilde{F}[u,v] = \hat{F}[u,v] \times Q_{\text{step}}(u,v)$$

where $Q_{\text{step}}(u,v)$ is the quantization step size for coefficient $(u,v)$, $\hat{F}$ is the quantized (encoded) value, and $\tilde{F}$ is the dequantized (reconstructed) value. The quantization error is $F[u,v] - \tilde{F}[u,v] \in \left(-\tfrac{Q_{\text{step}}}{2}, \tfrac{Q_{\text{step}}}{2}\right]$.
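A minimal sketch of the quantize and dequantize pair with a single flat step size. Note that Python's `round` rounds exact halves to the nearest even integer, whereas real encoders use their own rounding offsets; the function names are illustrative.

```python
def quantize(coeffs, qstep):
    """Map each coefficient to an integer level (the value that gets entropy coded)."""
    return [[round(c / qstep) for c in row] for row in coeffs]


def dequantize(levels, qstep):
    """Reconstruct approximate coefficients from the integer levels."""
    return [[lv * qstep for lv in row] for row in levels]


coeffs = [[40.0, 7.0], [-3.0, 1.0]]
levels = quantize(coeffs, qstep=4)
print(levels)                        # [[10, 2], [-1, 0]]
print(dequantize(levels, qstep=4))   # [[40, 8], [-4, 0]]
```

Every reconstruction error here is at most half a step (2), and the smallest coefficient has been zeroed out entirely.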

8.2 Quantization Parameter (QP)

In H.264/HEVC, the step size is controlled by a single Quantization Parameter QP $\in [0, 51]$:

$$Q_{\text{step}} \approx 2^{(QP - 4)/6}$$

Every increase of 6 in QP doubles the step size, halving the precision and, as a rough rule of thumb, halving the bitrate. Bitrate therefore falls roughly exponentially with QP: QP=18 gives roughly 4× the bitrate of QP=30 on the same content.
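The approximation is a one-liner, sketched here to show the doubling behavior; `qstep_from_qp` is an illustrative name, and the real standards define the step sizes through integer tables rather than this closed form.

```python
def qstep_from_qp(qp):
    """Approximate H.264/HEVC quantization step size for a given QP."""
    return 2 ** ((qp - 4) / 6)


print(qstep_from_qp(4))     # 1.0
print(qstep_from_qp(10))    # 2.0
print(qstep_from_qp(28))    # 16.0
print(round(qstep_from_qp(30) / qstep_from_qp(18), 6))   # 4.0: 12 QP steps = two doublings
```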

8.3 Quantization Matrix — Perceptual Weighting

Real encoders don't have to use a flat $Q_{\text{step}}$ across all DCT coefficients. They can apply a quantization matrix (also called weight matrix or scaling list) that coarsely quantizes high-frequency coefficients (bottom-right of the DCT block) and finely quantizes low-frequency ones (top-left), matching human contrast sensitivity:

$$Q_{\text{step}}(u,v) = Q_{\text{base}}(QP) \times W[u,v], \quad W[u,v] \uparrow \text{ as } u+v \uparrow$$

8.4 Rate-Distortion Optimization (RDO)

A codec's encoder doesn't just minimize distortion — it minimizes a weighted combination of distortion $D$ and bitrate cost $R$:

$$J = D + \lambda \cdot R$$

The Lagrange multiplier $\lambda$ controls the trade-off. For H.264/HEVC it is approximated as $\lambda \approx 0.85 \times 2^{(QP-12)/3}$. The encoder evaluates each coding decision (partition size, prediction mode, QP) by this joint cost function and picks the minimum-cost option. This is why encoding is so slow: the space of candidate coding choices grows combinatorially, and each candidate requires a trial transform, quantization, and bit estimate. Practical encoders prune this search with heuristics rather than evaluating it exhaustively.
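The effect of $\lambda$ on a mode decision can be sketched with made-up numbers. The candidate distortions and bit counts below are invented for illustration and held fixed across QPs (in a real encoder they would change with QP too); `rd_lambda` and `choose_mode` are hypothetical names.

```python
def rd_lambda(qp):
    """Lagrange multiplier from the approximation above."""
    return 0.85 * 2 ** ((qp - 12) / 3)


def choose_mode(candidates, qp):
    """candidates: (name, distortion_sse, rate_bits). Return the one minimizing J = D + lambda * R."""
    lam = rd_lambda(qp)
    return min(candidates, key=lambda c: c[1] + lam * c[2])


candidates = [
    ("skip",        900,   1),   # nearly free, but inaccurate
    ("inter_16x16", 300,  40),   # moderate bits, moderate error
    ("intra_4x4",   120, 220),   # accurate, but expensive
]
for qp in (12, 24, 36):
    print(qp, choose_mode(candidates, qp)[0])
# 12 intra_4x4
# 24 inter_16x16
# 36 skip
```

As QP rises, bits become more expensive relative to distortion, so the decision shifts toward cheaper modes.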

Perceptual Rate-Distortion

Standard RD optimization uses PSNR-aligned distortion (SSE in pixel domain), which doesn't always match human perception. Modern encoders (x265 psy-rd, SVT-AV1) add perceptual terms — preserving film grain, texture energy, and contrast — at the cost of slightly higher bitrate. Psychovisual optimization is a large area of active research.

9. Entropy Coding — Losslessly Packing Coefficients

After quantization, many coefficients are zero (especially high-frequency ones). Entropy coding compresses this sparse representation losslessly.

9.1 Zig-Zag Scanning

The 2D block of quantized coefficients is scanned into a 1D sequence in zig-zag order — starting from the DC coefficient (top-left) and traversing diagonals that alternate direction. This groups the non-zero low-frequency coefficients at the start, followed by a long tail of zeros that compress efficiently as a run-length encoded "EOB" (End of Block) token.
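The scan order can be generated by sorting positions by anti-diagonal and alternating the direction on each one. This is a sketch of the classic zig-zag pattern only (newer codecs use other scan orders), and the `"EOB"` marker is a stand-in for whatever end-of-block signaling a given codec uses.

```python
def zigzag_order(n):
    """(row, col) positions of an n x n block in zig-zag order."""
    def key(rc):
        r, c = rc
        d = r + c                           # index of the anti-diagonal
        return (d, r if d % 2 else c)       # odd diagonals run downward, even ones upward
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def scan_with_eob(levels):
    """Flatten a block in zig-zag order and replace the trailing zeros with one EOB marker."""
    seq = [levels[r][c] for r, c in zigzag_order(len(levels))]
    while seq and seq[-1] == 0:
        seq.pop()
    return seq + ["EOB"]


levels = [[10, 2, 0, 0],
          [-1, 0, 0, 0],
          [ 1, 0, 0, 0],
          [ 0, 0, 0, 0]]
print(zigzag_order(4)[:6])     # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
print(scan_with_eob(levels))   # [10, 2, -1, 1, 'EOB']
```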

9.2 CAVLC vs CABAC

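| Method | Used In | Approach | Efficiency | Complexity |
| --- | --- | --- | --- | --- |
| CAVLC | H.264 (all profiles; the only option in Baseline) | Context-adaptive VLC tables | Good | Low |
| CABAC | H.264 Main/High, H.265/HEVC | Context-adaptive binary arithmetic coding | ~10–15% better | High |

AV1 does not use CABAC itself; it uses a related context-adaptive, multi-symbol arithmetic coder.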

CABAC (Context-Adaptive Binary Arithmetic Coding) first binarizes each syntax element into a sequence of binary symbols (bins), then codes each bin with a probability that is maintained and updated by a context model. The arithmetic coder achieves near-optimal entropy without needing symbol tables. The context model adapts to local statistics, giving strong compression of the coefficient patterns.
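The benefit of an adaptive probability model can be estimated without building a full arithmetic coder. The sketch below is not CABAC: it uses a simple frequency-count model (CABAC uses table-driven state machines) and sums the ideal code length $-\log_2 p$ that an arithmetic coder would approach. The function name is illustrative.

```python
import math


def adaptive_cost_bits(bits):
    """Ideal coded size, in bits, of a binary sequence under an adaptive count-based model."""
    counts = [1, 1]            # start with no preference between 0 and 1
    total = 0.0
    for b in bits:
        p = counts[b] / (counts[0] + counts[1])   # current estimate for this symbol
        total += -math.log2(p)                    # ideal cost of coding it
        counts[b] += 1                            # adapt to what was just seen
    return total


# 100 flags of which only 5 are set, e.g. "is this coefficient non-zero?"
flags = [0] * 95 + [1] * 5
print(round(adaptive_cost_bits(flags), 1))   # about 33 bits instead of 100
```

The more skewed the statistics within a context, the further the cost drops below one bit per symbol, which is something a variable-length code with whole-bit codewords cannot do.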

10. The Decoding Pipeline

Decoding is the strict inverse of encoding, and crucially, it is deterministic — the decoder receives all decisions from the bitstream and never needs to search or optimize.

1. Bitstream Parse & NAL Unit Extraction: split the compressed bitstream into Network Abstraction Layer units (parameter sets, slices, SEI messages).
2. Entropy Decode (CABAC/CAVLC): recover quantized coefficients, motion vectors, partition sizes, and prediction modes for each block.
3. Dequantization + Inverse Transform (IDCT): multiply coefficients by $Q_{\text{step}}$ to recover approximated residuals, then apply the 2D IDCT.
4. Prediction Reconstruction: for I-blocks, apply intra prediction from already-decoded neighbors. For P/B-blocks, fetch reference block(s) from the decoded frame buffer using motion vectors.
5. Residual Addition: add the IDCT residual to the predicted block to reconstruct the decoded block.
6. In-loop Filtering: the deblocking filter removes block boundary artifacts. HEVC adds SAO (Sample Adaptive Offset); AV1 adds CDEF and LR (loop restoration) filters.
7. Decoded Picture Buffer (DPB): completed frames are stored as references for future inter prediction (and, when B-frames are present, held until they can be output in display order).
8. Output & Display Reorder: frames are output in display order (which may differ from decode order due to B-frames), followed by color space conversion YCbCr → RGB for display.

Example (first step of decoding a single P-frame block): the NAL unit is parsed, and the block partition (16×16 MB), prediction mode (inter/P), motion vector MV=(+8,+3), and quantized coefficient array are extracted.

11. FFmpeg — Putting It All Together

FFmpeg is the open-source Swiss army knife for multimedia: a framework, command-line tool, and library that wraps libavcodec (codecs), libavformat (containers), libavfilter (filters), and libswscale (color conversion). Understanding how it maps to the concepts above makes its vast option space manageable.

11.1 Basic Encoding

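Shell: H.264 encode
# -c:v libx264     use the x264 (H.264) encoder
# -crf 23          Constant Rate Factor (quality knob, loosely analogous to QP)
# -preset medium   speed/quality trade-off (ultrafast → veryslow)
# -c:a copy        copy the audio stream unchanged
# (Comments sit above the command: a comment after a trailing backslash breaks the line continuation.)
ffmpeg -i input.mp4 \
  -c:v libx264 \
  -crf 23 \
  -preset medium \
  -c:a copy \
  output.mp4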

11.2 CRF vs QP vs Target Bitrate

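| Mode | Flag | Behavior | Use Case |
| --- | --- | --- | --- |
| CRF | -crf 0–51 | Constant perceptual quality; bitrate varies per scene | Offline archival, quality-targeted encodes (file size is not predictable) |
| CQP | -qp 0–51 | Fixed QP per frame type; no rate control | Benchmarking, research |
| ABR | -b:v 4M | Average bitrate over the whole file | Approximate file size target |
| CBR | -b:v 4M -minrate 4M -maxrate 4M -bufsize 8M | Near-constant bitrate under VBV constraints (strict padding additionally needs -x264-params nal-hrd=cbr) | Broadcast, streaming ingest |
| 2-pass ABR | -b:v with -pass 1 / -pass 2 | First pass analyzes content, second pass distributes bits using those statistics | Best quality at a target file size |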

11.3 Controlling Frame Types and GOP

Shell: GOP and frame-type control
# Set GOP size to 60 (2 seconds at 30 fps), with up to 2 B-frames between references
# -g 60             GOP size (maximum keyframe interval)
# -keyint_min 60    minimum distance between keyframes
# -sc_threshold 0   disable scene-cut keyframes so the GOP length stays fixed
# -bf 2             maximum number of consecutive B-frames
# -b_strategy 1     adaptive B-frame placement
ffmpeg -i input.mp4 -c:v libx264 \
  -g 60 -keyint_min 60 -sc_threshold 0 \
  -bf 2 -b_strategy 1 \
  -crf 23 output.mp4

# Disable B-frames entirely (for low-latency streaming)
ffmpeg -i input.mp4 -c:v libx264 -tune zerolatency -bf 0 output.mp4

# Force all-intra (every frame is a keyframe): largest file, instant seek.
# All-intra is not lossless by itself; quality still depends on -crf / -qp.
ffmpeg -i input.mp4 -c:v libx264 -g 1 output.mp4

11.4 Per-Frame QP and Scene Detection

Shell: x264 QP offsets and scene-cut detection
# x264 QP ratios per frame type (I gets a lower QP = higher quality).
# x264's own parameter names are ipratio and pbratio. The default ipratio=1.40 corresponds to
# FFmpeg's i_qfactor ≈ 0.71 (its reciprocal); pbratio=1.30 corresponds to b_qfactor=1.30.
ffmpeg -i input.mp4 -c:v libx264 \
  -x264-params "ipratio=1.40:pbratio=1.30:scenecut=40" \
  -crf 23 output.mp4

# Scene-cut detection (scenecut threshold 40 = moderate sensitivity):
# x264 inserts an I-frame when the scene change score exceeds this value
# (an IDR frame when the minimum keyframe interval allows it).

11.5 HEVC (H.265) with FFmpeg

Shell: H.265/HEVC encode
# -c:v libx265   x265 encoder
# -crf 28        often cited as roughly the visual quality of x264 crf=23,
#                with bitrate savings quoted at up to ~50% (content dependent)
ffmpeg -i input.mp4 \
  -c:v libx265 \
  -crf 28 \
  -preset medium \
  -x265-params "keyint=60:bframes=4:b-adapt=2:ref=4" \
  output_hevc.mp4

# Hardware-accelerated HEVC (NVIDIA NVENC)
ffmpeg -i input.mp4 \
  -c:v hevc_nvenc \
  -rc vbr \
  -cq 28 \
  -b:v 0 \
  output_nvenc.mp4

11.6 Reading Codec Info and Frame Types

Shell: Inspect bitstream
# Print every frame's timestamp, packet size, and picture type
# (QP is not among the fields selected here)
ffprobe -v quiet -select_streams v:0 \
  -show_frames -show_entries \
  frame=pict_type,pkt_size,best_effort_timestamp_time \
  -of csv input.mp4 | head -30

# Example output (ffprobe prints fields in its own fixed order, not the order requested):
#   frame,0.000000,12345,I
#   frame,0.033333,3210,P
#   frame,0.066667,980,B

# Visualize GOP structure and bitrate distribution
ffprobe -v quiet -show_frames -select_streams v:0 \
  -print_format json input.mp4 | \
  python3 -c "
import json,sys
frames=json.load(sys.stdin)['frames']
for f in frames[:60]:
    t=f['pict_type'];s=int(f['pkt_size'])
    bar='█'*(s//500)
    print(f'{t} {bar} {s}B')
"

11.7 FFmpeg's Internal libavcodec Flow

When you call ffmpeg -c:v libx264, libavcodec hands each frame to libx264, which conceptually runs the following loop for every frame. This is heavily simplified: real encoders add lookahead, fast mode decisions, and many shortcuts.

Pseudocode: simplified encode loop (conceptual)
for frame in input_frames:
    # 1. Frame-type decision (I/P/B based on scene change score, GOP rules)
    frame_type = decide_frame_type(frame, prev_frames, gop_state)

    # 2. Rate control: decide QP for this frame, and derive the RDO lambda from it
    qp = rate_controller.get_qp(frame, frame_type, target_bitrate)
    lam = lambda_from_qp(qp)

    reconstructed_frame = empty_frame_like(frame)

    # 3. For each CTU/macroblock in the frame:
    for block in partition(frame):
        # 3a. Try all prediction modes allowed for this frame type
        best = None
        for mode in allowed_modes(frame_type):   # intra only for I-frames
            pred = predict(block, mode, dpb.reference_frames)
            residual = block - pred
            coeffs = dct2d(residual)
            q_coeffs = quantize(coeffs, qp, quant_matrix)
            bits = entropy_estimate(q_coeffs, mode)
            recon = idct2d(dequantize(q_coeffs, qp, quant_matrix)) + pred
            cost = sse(block, recon) + lam * bits  # RDO: J = D + lambda * R
            if best is None or cost < best.cost:
                best = Candidate(cost, mode, q_coeffs, recon)

        # 3b. Write best decision to bitstream
        entropy_code(best.mode, best.q_coeffs, bitstream)

        # 3c. Keep the reconstruction for use as a future reference
        reconstructed_frame[block.position] = best.recon

    # 4. In-loop filtering, then store the frame as a reference
    deblock_filter(reconstructed_frame)
    dpb.store(reconstructed_frame)

x264 Preset — What It Really Controls

The -preset flag (ultrafast → veryslow) changes ~20 internal x264 parameters simultaneously: search range, number of reference frames, subpel refinement iterations, B-frame lookahead depth, trellis quantization, mixed-refs, and more. ultrafast turns off most of these for maximum speed; veryslow enables all of them for maximum compression. At a fixed CRF, going from medium to veryslow typically saves another 5–15% bitrate at the cost of 10–30× encoding time.

11.8 Complete Real-World Streaming Encode

Shell: HLS adaptive bitrate ladder
# Generate a 3-rung HLS ladder: 1080p, 720p, 360p
# Each rung has a fixed 2-second GOP aligned to segment boundaries
# -crf carries a stream specifier (-crf:v:N) so each rung keeps its own value;
# without it, the last -crf given would apply to every video stream.
# With -crf set, -maxrate/-bufsize act as a cap on bitrate peaks (capped CRF).
ffmpeg -i input.mp4 \
  -filter_complex \
    "[0:v]split=3[v1][v2][v3]; \
     [v1]scale=1920:1080[v1out]; \
     [v2]scale=1280:720[v2out]; \
     [v3]scale=640:360[v3out]" \
  -map "[v1out]" -c:v:0 libx264 -crf:v:0 18 -preset fast \
    -g 60 -keyint_min 60 -sc_threshold 0 \
    -b:v:0 5000k -maxrate:v:0 5350k -bufsize:v:0 10000k \
  -map "[v2out]" -c:v:1 libx264 -crf:v:1 21 -preset fast \
    -g 60 -keyint_min 60 -sc_threshold 0 \
    -b:v:1 2800k -maxrate:v:1 2996k -bufsize:v:1 5600k \
  -map "[v3out]" -c:v:2 libx264 -crf:v:2 26 -preset fast \
    -g 60 -keyint_min 60 -sc_threshold 0 \
    -b:v:2 800k -maxrate:v:2 856k -bufsize:v:2 1600k \
  -map 0:a -c:a:0 aac -b:a:0 128k \
  -map 0:a -c:a:1 aac -b:a:1 128k \
  -map 0:a -c:a:2 aac -b:a:2 64k \
  -f hls -hls_time 2 -hls_playlist_type vod \
  -hls_segment_filename "seg_%v_%03d.ts" \
  -master_pl_name master.m3u8 \
  -var_stream_map "v:0,a:0 v:1,a:1 v:2,a:2" \
  stream_%v.m3u8

12. Summary — How Everything Connects

| Concept | What It Exploits | Key Parameter | FFmpeg Flag |
| --- | --- | --- | --- |
| YCbCr 4:2:0 | Human luma/chroma sensitivity | Chroma ratio | -pix_fmt yuv420p |
| I Frame | Spatial correlation within frame | Intra mode count | -g (interval) |
| P Frame | Temporal correlation (past) | Search range, ref count | -bf 0 to disable B |
| B Frame | Temporal correlation (past + future) | B-frame count M | -bf N |
| GOP | Error containment + seek granularity | GOP size N | -g N |
| DCT | Energy compaction (spatial frequency) | Block size | Codec-internal |
| Quantization | Perceptual irrelevance of fine detail | QP (0–51) | -crf / -qp |
| CABAC | Statistical redundancy of symbols | Context model depth | -profile:v main or high |
| RDO | Optimal bit allocation per block | λ (QP-derived) | -preset (trellis) |

The Fundamental Compression Loop

Every frame type follows the same pattern: predict → subtract → transform → quantize → entropy code. The differences are only in what prediction source is used. I-frames predict from spatial neighbors within the frame. P-frames predict from past reconstructed frames. B-frames predict from both past and future. The residual pipeline — DCT, quantization, CABAC — is identical for all three. This elegant uniformity is what makes the standard modular and extensible across generations of codecs.