← LearningMyself

Singular Value Decomposition

Every matrix is secretly three simple operations: rotate, stretch, rotate back. SVD reveals them — and unlocks image compression, PCA, pseudoinverses, and recommender systems.

A matrix transforms space — it stretches, rotates, and shears vectors. The Singular Value Decomposition (SVD) breaks any matrix into three interpretable pieces: a rotation in input space, a coordinate-wise scaling, and a rotation in output space. This factorization works for any real or complex matrix, square or rectangular, invertible or singular. It is arguably the most important matrix decomposition in applied mathematics.

1. The Factorization A = UΣVᵀ

For any real $m \times n$ matrix $A$ of rank $r$, the SVD states:

$$\underbrace{A}_{m\times n} = \underbrace{U}_{m\times m}\;\underbrace{\Sigma}_{m\times n}\;\underbrace{V^\top}_{n\times n}$$

Each factor has a precise meaning:

U — Left Singular Vectors

An $m\times m$ orthogonal matrix ($UU^\top = I$). Its columns $\mathbf{u}_1,\dots,\mathbf{u}_m$ are an orthonormal basis for $\mathbb{R}^m$ — they span the output space.

Σ — Singular Values

An $m\times n$ diagonal matrix. The diagonal entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0 = \sigma_{r+1} = \cdots$ are the singular values — always real and non-negative.

V — Right Singular Vectors

An $n\times n$ orthogonal matrix ($VV^\top = I$). Its columns $\mathbf{v}_1,\dots,\mathbf{v}_n$ are an orthonormal basis for $\mathbb{R}^n$ — they span the input space.

The Fundamental Relations

$A\mathbf{v}_i = \sigma_i \mathbf{u}_i$ — the matrix maps each right singular vector to $\sigma_i$ times the corresponding left singular vector. $A^\top \mathbf{u}_i = \sigma_i \mathbf{v}_i$ for the reverse direction.

Written as a sum of rank-1 outer products, the SVD is:

$$A = \sum_{i=1}^{r} \sigma_i\, \mathbf{u}_i \mathbf{v}_i^\top$$

Each term $\sigma_i \mathbf{u}_i \mathbf{v}_i^\top$ is a rank-1 matrix. The singular values $\sigma_i$ tell you how much each "layer" contributes — the first term is the most important, the last is the least. This is the foundation of low-rank approximation.

Existence & Uniqueness

SVD exists for every real or complex matrix — no symmetry, invertibility, or square shape required. The singular values are unique. The singular vectors are unique when all singular values are distinct; otherwise there is freedom within the eigenspaces of equal $\sigma_i$.

2. Geometry: Rotate → Stretch → Rotate

Any linear map sends a unit sphere to an ellipsoid. SVD tells you exactly how: first rotate the sphere by $V^\top$ (so the "interesting axes" align with the coordinate axes), then stretch each axis by the corresponding singular value via $\Sigma$, then rotate the ellipsoid into its final orientation via $U$. The singular values are the semi-axes lengths of the output ellipsoid.

In 2D: the unit circle maps to an ellipse. The interactive below makes every step concrete. Set $A = \bigl[\begin{smallmatrix}a & b \\ c & d\end{smallmatrix}\bigr]$ with the sliders, then click Play. The left panel (input space) is fixed — it shows the unit circle, the right singular vectors $\mathbf{v}_1$ (red) and $\mathbf{v}_2$ (blue), and a draggable yellow point p you can place anywhere. The right panel shows what happens to both the circle and your point through each transformation stage.

A
1.000.50
0.301.20
=
U
·
Σ
0.00
0.00
·
Vᵀ
Presets:

Notice: in Step 1 the unit circle stays circular — rotations preserve shape. Only in Step 2 does the stretching occur. The singular values $\sigma_1, \sigma_2$ are the semi-axis lengths of the final ellipse. When $\det(A) = 0$, one singular value collapses to zero, flattening the ellipse to a line.

3. Eigenvalues vs Singular Values

Eigenvalues and singular values measure fundamentally different things. Eigenvalues describe how $A$ acts on its own invariant subspaces — they can be negative or complex even for real matrices. Singular values measure the "stretching power" of $A$ in any direction — they are always real and non-negative.

3.1 The AᵀA Connection

Substitute $A = U\Sigma V^\top$ into $A^\top A$:

$$A^\top A = (U\Sigma V^\top)^\top(U\Sigma V^\top) = V\Sigma^\top U^\top U \Sigma V^\top = V\underbrace{(\Sigma^\top\Sigma)}_{\text{diag}(\sigma_i^2)}V^\top$$

This is the eigendecomposition of $A^\top A$! The columns of $V$ are eigenvectors of $A^\top A$, and the eigenvalues are $\lambda_i = \sigma_i^2$. Similarly, $AA^\top = U\Sigma\Sigma^\top U^\top$, so columns of $U$ are eigenvectors of $AA^\top$ with the same non-zero eigenvalues. Therefore:

The Key Relation

$\sigma_i(A) = \sqrt{\lambda_i(A^\top A)}$. Singular values are square roots of eigenvalues of $A^\top A$ (or equivalently $AA^\top$). Since $A^\top A$ is symmetric positive semi-definite, its eigenvalues are always $\ge 0$, making the square root well-defined.

For a symmetric PSD matrix: eigenvalues = singular values (both equal $\lambda_i \ge 0$). For a symmetric indefinite matrix: $\sigma_i = |\lambda_i|$ (singular values are absolute values of eigenvalues).

3.2 Why Eigenvalues Fail for General Matrices

Consider a 90° rotation matrix $R = \bigl[\begin{smallmatrix}0 & -1 \\ 1 & 0\end{smallmatrix}\bigr]$. Its eigenvalues are $\pm i$ (complex) — useless for understanding how much it stretches space. But its singular values are both $1$ (it preserves all lengths). SVD handles this correctly; eigendecomposition cannot.

Presets:

The visualization above shows three spectra simultaneously: eigenvalues of $A$ (which can be complex, shown on the Argand plane), eigenvalues of $A^\top A$ (always real and non-negative), and singular values of $A$ (= $\sqrt{\lambda_i(A^\top A)}$). Try the Rotation preset — the eigenvalues of $R$ are $\pm i$ (purely imaginary) yet the singular values are both exactly $1$.

4. Low-Rank Approximation — The Eckart–Young Theorem

The SVD's most powerful property: it gives the best possible low-rank approximation. Truncate after $k$ terms:

$$A_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top = U_k \Sigma_k V_k^\top$$

Eckart–Young–Mirsky theorem (1936): Among all rank-$k$ matrices $B$, the truncated SVD minimizes both the Frobenius norm error and the spectral norm error:

$$\|A - A_k\|_F^2 = \sum_{i=k+1}^{r}\sigma_i^2 \qquad \|A - A_k\|_2 = \sigma_{k+1}$$

The fraction of "energy" (variance) captured by the rank-$k$ approximation is $\sum_{i=1}^k \sigma_i^2 \,/\, \sum_{i=1}^r \sigma_i^2$. Natural signals (images, text, sensor data) typically have rapidly decaying singular values — a small $k$ captures most of the energy. The visualization below demonstrates this on a synthetic image generated with an offscreen canvas.

Original
Rank-1 Approximation
Computing SVD…

Watch how the first few singular values encode the coarse structure (overall brightness, main shapes), while later components add progressively finer detail. Most images are well-approximated by surprisingly few singular values — this is why JPEG uses a related transform (DCT) and why neural network weight matrices can often be compressed without much accuracy loss.

low_rank_approx.pyNumPy
import numpy as np

def low_rank_approx(A: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of A (Eckart-Young theorem)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def energy_fraction(A: np.ndarray, k: int) -> float:
    """Fraction of Frobenius energy captured by rank-k approximation."""
    _, s, _ = np.linalg.svd(A, full_matrices=False)
    return (s[:k]**2).sum() / (s**2).sum()

A = np.random.randn(100, 80)
A5 = low_rank_approx(A, k=5)
print(f"Error ||A - A5||_F / ||A||_F = {np.linalg.norm(A-A5)/np.linalg.norm(A):.4f}")
print(f"Energy captured at k=5: {energy_fraction(A, 5)*100:.1f}%")

5. PCA via SVD

Principal Component Analysis (PCA) finds the directions of maximum variance in a dataset. It is exactly equivalent to SVD of the centered data matrix.

5.1 The Derivation

Let $X \in \mathbb{R}^{n\times d}$ be $n$ data points in $d$ dimensions. Center the data: $\tilde{X} = X - \bar{X}$. The sample covariance matrix is $C = \tilde{X}^\top \tilde{X} / (n-1)$. PCA finds the eigendecomposition $C = Q\Lambda Q^\top$.

Now apply SVD to $\tilde{X} = U\Sigma V^\top$. Then:

$$C = \frac{\tilde{X}^\top \tilde{X}}{n-1} = \frac{V \Sigma^\top \Sigma V^\top}{n-1} = V \underbrace{\frac{\Sigma^2}{n-1}}_{\Lambda} V^\top$$

The right singular vectors $V$ are the PCA loading vectors (principal directions). The variance explained by the $k$-th principal component is $\lambda_k = \sigma_k^2 / (n-1)$. The scores (coordinates of data in PC space) are $\tilde{X}V = U\Sigma$.

SVD vs Eigendecomposition for PCA

Computing eigendecomposition of $C = \tilde{X}^\top\tilde{X}/(n-1)$ squares the condition number: if $A$ has condition number $\kappa$, then $A^\top A$ has $\kappa^2$. Applying SVD directly to $\tilde{X}$ is numerically more stable. For $n \ll d$ (more features than samples), SVD of the $n\times d$ matrix is also more efficient.

pca_via_svd.pyNumPy
import numpy as np

def pca_via_svd(X: np.ndarray, n_components: int):
    """PCA using SVD — numerically preferred over covariance eigendecomposition."""
    X_c = X - X.mean(axis=0)                 # center
    U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

    # Principal directions: rows of Vt (= columns of V)
    components = Vt[:n_components]            # (n_components, d)

    # Scores: projection of each sample onto each PC
    scores = X_c @ components.T              # (n, n_components)

    # Variance explained per PC
    explained_var = s[:n_components]**2 / (len(X) - 1)
    explained_ratio = explained_var / (s**2 / (len(X)-1)).sum()

    return scores, components, explained_ratio

# Example
rng = np.random.default_rng(42)
X = rng.multivariate_normal([0,0], [[4,3],[3,3]], size=200)
scores, pcs, ratios = pca_via_svd(X, n_components=2)
print(f"PC1 explains {ratios[0]*100:.1f}%, PC2 explains {ratios[1]*100:.1f}%")

5.2 Why the Columns of V Are the PCA Loading Vectors

The connection is one substitution away. Apply thin SVD to the centered data matrix $\tilde{X} = U\Sigma V^\top$ and substitute into the sample covariance:

$$C = \frac{\tilde{X}^\top \tilde{X}}{n-1} = \frac{(U\Sigma V^\top)^\top(U\Sigma V^\top)}{n-1} = \frac{V\Sigma \underbrace{U^\top U}_{=\,I} \Sigma V^\top}{n-1} = V \underbrace{\frac{\Sigma^2}{n-1}}_{\Lambda} V^\top$$

This is the eigendecomposition of $C$. The diagonal matrix $\Lambda = \Sigma^2/(n-1)$ contains the eigenvalues, and $V$ contains the eigenvectors — column $\mathbf{v}_k$ is the $k$-th principal direction. No extra computation needed: the SVD of $\tilde{X}$ delivers the PCA directions directly.

Three equivalent statements — same vector

1. $\mathbf{v}_k$ = $k$-th right singular vector of $\tilde{X}$  (SVD definition)

2. $\mathbf{v}_k$ = $k$-th eigenvector of $C = \tilde{X}^\top\tilde{X}/(n-1)$  (covariance eigendecomposition)

3. $\mathbf{v}_k$ = direction maximising explained variance, orthogonal to $\mathbf{v}_1,\dots,\mathbf{v}_{k-1}$  (PCA definition)

The PCA scores (coordinates of each data point in the PC basis) are $\tilde{X}V = U\Sigma$ — the left singular vectors scaled by the singular values. So one thin SVD of $\tilde{X}$ gives you both the principal directions ($V$, columns = PCs) and the projections ($U\Sigma$, rows = scores).

5.3 Application: Surface Normal Estimation in 3D Point Clouds

One of the most elegant uses of 3D PCA: given a neighbourhood of points sampled from a surface, find the surface normal — the direction perpendicular to the local tangent plane — using nothing but SVD.

Pick any point $\mathbf{p}$. Collect its $k$ nearest neighbours into a $k\times 3$ matrix $Q$. Center it: $\tilde{Q} = Q - \bar{Q}$. Compute the $3\times 3$ covariance $C = \tilde{Q}^\top\tilde{Q}/(k-1)$ and diagonalise via SVD (or equivalently eigendecomposition):

$$C = V\,\mathrm{diag}(\sigma_1^2,\,\sigma_2^2,\,\sigma_3^2)\,V^\top, \qquad \sigma_1 \ge \sigma_2 \ge \sigma_3 \ge 0$$

If the surface is locally smooth and nearly planar, the points spread in two directions (the tangent plane) and barely at all in the perpendicular direction (the normal). Consequently:

$$\sigma_1 \approx \sigma_2 \gg \sigma_3 \approx 0 \qquad \Longrightarrow \qquad \mathbf{v}_3 \;\text{is the surface normal}$$

Planarity score: $\lambda_p = 1 - \sigma_3/\sigma_1 \in [0,1]$. Near 1 means a clean planar patch; near 0 means a corner, edge, or noise-dominated cloud. The interactive below lets you tune both the noise level and the plane tilt — drag to orbit the 3D view.

Sign Ambiguity

SVD determines $\mathbf{v}_3$ only up to a sign flip ($\pm\mathbf{v}_3$ are both valid normals). In practice the orientation is resolved by requiring the normal to point toward the sensor (viewpoint consistency), or by propagating a consistent orientation through the graph of overlapping neighbourhoods (normal propagation in Open3D / PCL).

surface_normals.pyNumPy · scikit-learn
import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_normals(points: np.ndarray, k: int = 20):
    """Estimate surface normals via local PCA (SVD of neighbourhood matrix)."""
    n = len(points)
    normals   = np.zeros_like(points)
    planarity = np.zeros(n)

    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(points)
    _, indices = nbrs.kneighbors(points)

    for i in range(n):
        Q   = points[indices[i, 1:]]         # k neighbours (exclude self)
        Q_c = Q - Q.mean(axis=0)             # centre
        _, s, Vt = np.linalg.svd(Q_c, full_matrices=False)  # thin SVD
        normals[i]   = Vt[-1]                # last row of Vt = v₃ (min variance)
        planarity[i] = 1 - s[-1] / s[0] if s[0] > 1e-10 else 0.0

    return normals, planarity

# --- Example: half-sphere point cloud -----------------------------------------
rng   = np.random.default_rng(42)
phi   = rng.uniform(0, np.pi / 2, 500)
theta = rng.uniform(0, 2 * np.pi, 500)
pts   = np.stack([np.sin(phi) * np.cos(theta),
                  np.sin(phi) * np.sin(theta),
                  np.cos(phi)], axis=1)
pts  += rng.normal(0, 0.02, pts.shape)       # small noise

normals, planarity = estimate_normals(pts, k=15)
print(f"Mean planarity : {planarity.mean():.3f}")   # ≈ 0.93 for clean sphere
print(f"Normal at pt[0]: {normals[0]}")             # ≈ outward radial direction

6. Applications

6.1 Pseudoinverse and Least Squares

For a system $A\mathbf{x} = \mathbf{b}$ with no exact solution (overdetermined or underdetermined), the minimum-norm least-squares solution is $\mathbf{x}^* = A^+ \mathbf{b}$, where $A^+$ is the Moore–Penrose pseudoinverse:

$$A^+ = V\Sigma^+ U^\top \qquad \text{where } \Sigma^+_{ii} = \begin{cases}1/\sigma_i & \sigma_i > 0 \\ 0 & \sigma_i = 0\end{cases}$$

This gracefully handles rank-deficient matrices: directions in the null space of $A$ (where $\sigma_i = 0$) contribute nothing to $\mathbf{x}^*$, giving the minimum-norm solution among all least-squares solutions.

6.2 Condition Number

The condition number $\kappa(A) = \sigma_{\max} / \sigma_{\min}$ measures how much $A$ amplifies perturbations. A large $\kappa$ means a tiny change in $\mathbf{b}$ can cause a large change in the solution $\mathbf{x}$. An orthogonal matrix has $\kappa = 1$ (perfectly conditioned). A near-singular matrix has $\kappa \to \infty$.

6.3 Latent Semantic Analysis

Given a term–document matrix $T$ (rows = words, columns = documents, entries = TF-IDF weights), the truncated SVD $T_k = U_k\Sigma_k V_k^\top$ discovers latent topics. Documents with similar $\Sigma_k V_k^\top$ rows are semantically related even if they share no words. This is the foundation of modern embedding-based information retrieval.

6.4 Matrix Completion & Recommender Systems

Netflix ratings, user–item interactions, and sensor grids are naturally low-rank matrices with many missing entries. Algorithms like SVD++ and ALS factorize the observed entries into $U\Sigma V^\top$, then predict missing values from the low-rank completion. The singular values control the effective rank (complexity) of the learned model.

6.5 Noise Reduction via Truncation

A noisy observation $\tilde{A} = A + N$ (signal + noise) tends to have large singular values for the signal components and small singular values for noise (since Gaussian noise spreads its energy across all singular values equally). Hard-thresholding: set $\sigma_i = 0$ if $\sigma_i < \tau$, then reconstruct. This is the basis of optimal hard-threshold algorithms (Gavish–Donoho 2014).

applications.pyNumPy
import numpy as np

# ── Pseudoinverse ────────────────────────────────────────────────────────
def pseudoinverse(A: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > tol, 1.0 / s, 0.0)
    return (Vt.T * s_inv) @ U.T   # V · Σ⁺ · Uᵀ

# ── Condition number ─────────────────────────────────────────────────────
def condition_number(A: np.ndarray) -> float:
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[-1] if s[-1] > 1e-14 else float('inf')

# ── Noise reduction via singular value hard-thresholding ─────────────────
def denoise(A_noisy: np.ndarray, threshold: float) -> np.ndarray:
    U, s, Vt = np.linalg.svd(A_noisy, full_matrices=False)
    s_thresh = np.where(s > threshold, s, 0.0)
    return (U * s_thresh) @ Vt

# Example: 50×50 matrix = rank-5 signal + Gaussian noise
rng = np.random.default_rng(0)
A_true = rng.randn(50, 5) @ rng.randn(5, 50)   # rank-5
A_noisy = A_true + 0.5 * rng.randn(50, 50)

s_noisy = np.linalg.svd(A_noisy, compute_uv=False)
A_clean = denoise(A_noisy, threshold=s_noisy[5])  # keep top 5

print(f"Recovery error: {np.linalg.norm(A_clean - A_true)/np.linalg.norm(A_true):.4f}")
Summary

SVD = the complete story of a linear map: $A = U\Sigma V^\top$. Right singular vectors $V$ span input space, left singular vectors $U$ span output space, and singular values $\Sigma$ are the stretch factors. Every important tool — PCA, pseudoinverse, low-rank approximation, condition number — flows from this single factorization.

The singular values always tell the truth: how much a matrix stretches, how close it is to singular, how much energy is in each "mode." Eigenvalues only tell that story for symmetric matrices. For everything else, reach for SVD.