Deep Learning · From Scratch

Neural Networks from Scratch

Every matrix multiply, every gradient — implemented in raw PyTorch (no nn.Linear, no autograd) with batch dimensions, complexity proofs, and interactive visualizations.

Before you can use a framework's abstraction, you need to understand what it is actually doing. This article builds a complete neural network from first principles: a hand-written linear layer, a manual backpropagation pass, every major activation function, and every major loss function. Every operation carries its batch dimension and its $\mathcal{O}(\cdot)$ cost.

1. The Linear Layer

1.1 Forward Pass

A linear (fully-connected) layer computes an affine transformation. For a batch of $B$ inputs, each with $M$ features, mapped to $N$ outputs:

$$\underbrace{\mathbf{Y}}_{B\times N} = \underbrace{\mathbf{X}}_{B\times M}\ \underbrace{\mathbf{W}}_{M\times N} + \underbrace{\mathbf{b}}_{N}$$

The $+\mathbf{b}$ broadcasts along the batch axis — every sample in the batch gets the same bias added. Let's count the cost. Each output element $Y_{b,n} = \sum_{m=0}^{M-1} X_{b,m}\,W_{m,n} + b_n$ requires $M$ multiplications and $M$ additions. There are $B \times N$ output elements, so:

Complexity — Forward Pass

$B \times N$ outputs, each requiring $M$ multiply-adds $\Rightarrow$ $\mathbf{2BMN}$ FLOPs $\Rightarrow \mathcal{O}(BMN)$

For a typical hidden layer with $B=256,\,M=512,\,N=512$: roughly $\mathbf{134}$ million operations per layer.

linear_layer.pyPyTorch — no nn.Linear

import torch

class Linear:
    """Fully-connected layer — every operation written by hand."""

    def __init__(self, in_features: int, out_features: int):
        # Xavier (Glorot) Normal initialization:
        #   scale = sqrt(2 / (fan_in + fan_out))
        # Keeps variance of activations stable at init regardless of layer width.
        scale = (2.0 / (in_features + out_features)) ** 0.5
        self.W = torch.randn(in_features, out_features) * scale  # (M, N)
        self.b = torch.zeros(out_features)                        # (N,)
        # Gradient buffers — filled by .backward()
        self.dW: torch.Tensor | None = None
        self.db: torch.Tensor | None = None
        self._x: torch.Tensor | None = None   # input cache for backward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x   : (B, M)  — batch of B inputs, each M-dimensional
        out : (B, N)  — batch of B outputs, each N-dimensional

        Cost: O(B·M·N)  — one matrix multiply + bias broadcast
        """
        self._x = x                      # cache for backward pass
        return x @ self.W + self.b       # (B,M)@(M,N) + broadcast (N,) → (B,N)

1.2 Backward Pass — Deriving Every Gradient

Given $\frac{\partial L}{\partial \mathbf{Y}}$ (gradient of the scalar loss $L$ w.r.t. this layer's output, shape $B\times N$), we need three gradients: one for each parameter ($\mathbf{W}$, $\mathbf{b}$) and one to pass upstream ($\mathbf{X}$).

Gradient w.r.t. W. Write out a single element: $Y_{b,n} = \sum_m X_{b,m}\,W_{m,n}$. By the chain rule, $\frac{\partial L}{\partial W_{m,n}} = \sum_b \frac{\partial L}{\partial Y_{b,n}}\cdot X_{b,m}$. Stacking all $(m,n)$ pairs in matrix form:

$$\frac{\partial L}{\partial \mathbf{W}} = \mathbf{X}^\top \frac{\partial L}{\partial \mathbf{Y}} \quad \underbrace{(M\times B)\cdot(B\times N)}_{\text{shape }M\times N}$$

Cost: $(M\times B)$ times $(B\times N)$ matrix multiply $\Rightarrow \mathcal{O}(BMN)$.

Gradient w.r.t. b. Each $b_n$ contributes to every sample: $\frac{\partial L}{\partial b_n} = \sum_b \frac{\partial L}{\partial Y_{b,n}}$, i.e. sum the upstream gradient over the batch axis.

$$\frac{\partial L}{\partial \mathbf{b}} = \sum_{b=1}^{B} \frac{\partial L}{\partial \mathbf{Y}}_{b,:} \quad \text{shape }N \qquad \mathcal{O}(BN)$$

Gradient w.r.t. X. $\frac{\partial L}{\partial X_{b,m}} = \sum_n \frac{\partial L}{\partial Y_{b,n}}\cdot W_{m,n}$. In matrix form:

$$\frac{\partial L}{\partial \mathbf{X}} = \frac{\partial L}{\partial \mathbf{Y}}\,\mathbf{W}^\top \quad \underbrace{(B\times N)\cdot(N\times M)}_{\text{shape }B\times M} \qquad \mathcal{O}(BMN)$$

Operation	Shapes	FLOPs	Big-O
Forward `Y = X @ W + b`	(B,M)·(M,N)→(B,N)	2BMN + BN	O(BMN)
Backward `dW = Xᵀ @ dY`	(M,B)·(B,N)→(M,N)	2BMN	O(BMN)
Backward `db = dY.sum(0)`	(B,N)→(N,)	BN	O(BN)
Backward `dX = dY @ Wᵀ`	(B,N)·(N,M)→(B,M)	2BMN	O(BMN)

The backward pass costs $\approx 3\times$ the forward pass — two matrix multiplies for parameters, one for the input gradient. This $3\times$ rule holds for any matrix multiply.

linear_layer.py (continued)backward + SGD step

    def backward(self, dout: torch.Tensor) -> torch.Tensor:
        """
        dout : (B, N)  — dL/dY from the layer above

        Returns dL/dX : (B, M) — gradient to pass further upstream.
        Also stores dW and db for the optimizer step.
        """
        # dL/dW = X^T @ dL/dY       shape: (M,B)·(B,N) → (M,N)  O(BMN)
        self.dW = self._x.T @ dout

        # dL/db = sum over batch     shape: (N,)                   O(BN)
        self.db = dout.sum(dim=0)

        # dL/dX = dL/dY @ W^T       shape: (B,N)·(N,M) → (B,M)  O(BMN)
        return dout @ self.W.T

    def step(self, lr: float) -> None:
        """Vanilla SGD: move parameters opposite to their gradient."""
        self.W -= lr * self.dW
        self.b -= lr * self.db

2. Activation Functions

Stack two linear layers with no activation between them: $\mathbf{Y} = (\mathbf{X}\mathbf{W}_1)\mathbf{W}_2 = \mathbf{X}(\mathbf{W}_1\mathbf{W}_2)$. The product $\mathbf{W}_1\mathbf{W}_2$ is itself a single matrix — no matter how many layers you add, the network stays linear. Activation functions introduce non-linearity after each layer, giving the network the capacity to approximate any continuous function (Universal Approximation Theorem).

2.1 The Six Functions

activations.pyall from scratch

import torch
import math

# ── Sigmoid ──────────────────────────────────────────────────────────────
class Sigmoid:
    """σ(x) = 1 / (1 + e^{-x})    Range: (0, 1)"""
    def forward(self, x):
        self._s = 1.0 / (1.0 + torch.exp(-x))
        return self._s
    def backward(self, dout):
        # σ'(x) = σ(x)(1 − σ(x))   — reuse cached forward output
        return dout * self._s * (1.0 - self._s)

# ── Tanh ─────────────────────────────────────────────────────────────────
class Tanh:
    """tanh(x) = (e^x − e^{-x})/(e^x + e^{-x})   Range: (−1, 1)"""
    def forward(self, x):
        self._t = torch.tanh(x)       # uses stable fused kernel
        return self._t
    def backward(self, dout):
        # tanh'(x) = 1 − tanh²(x)
        return dout * (1.0 - self._t ** 2)

# ── ReLU ──────────────────────────────────────────────────────────────────
class ReLU:
    """max(0, x)    Range: [0, ∞)"""
    def forward(self, x):
        self._mask = x > 0            # boolean gate: True where x > 0
        return x * self._mask
    def backward(self, dout):
        # Gradient is 1 where x > 0, 0 elsewhere — gate the upstream gradient
        return dout * self._mask

# ── Leaky ReLU ────────────────────────────────────────────────────────────
class LeakyReLU:
    """max(αx, x)   Default α=0.01. Fixes the dying neuron problem."""
    def __init__(self, alpha: float = 0.01):
        self.alpha = alpha
    def forward(self, x):
        self._mask = x > 0
        return torch.where(self._mask, x, self.alpha * x)
    def backward(self, dout):
        grad = torch.where(self._mask,
                           torch.ones_like(dout),
                           torch.full_like(dout, self.alpha))
        return dout * grad

# ── ELU ───────────────────────────────────────────────────────────────────
class ELU:
    """x if x>0 else α(e^x − 1)   Smooth, negative saturation."""
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
    def forward(self, x):
        self._x = x
        pos = x * (x > 0)
        neg = self.alpha * (torch.exp(x.clamp(max=0)) - 1.0) * (x <= 0)
        return pos + neg
    def backward(self, dout):
        pos_grad = (self._x > 0).float()
        neg_grad = self.alpha * torch.exp(self._x.clamp(max=0)) * (self._x <= 0).float()
        return dout * (pos_grad + neg_grad)

# ── GELU (tanh approximation) ────────────────────────────────────────────
class GELU:
    """x·Φ(x)  ≈  0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
    Used in GPT, BERT, and most modern Transformers."""
    _C = math.sqrt(2.0 / math.pi)
    def forward(self, x):
        self._x = x
        inner = self._C * (x + 0.044715 * x ** 3)
        self._t = torch.tanh(inner)
        return 0.5 * x * (1.0 + self._t)
    def backward(self, dout):
        x = self._x
        inner = self._C * (x + 0.044715 * x ** 3)
        sech2 = 1.0 - self._t ** 2                         # sech²(inner)
        d_inner = self._C * (1.0 + 3 * 0.044715 * x ** 2)
        dgelu = 0.5 * (1.0 + self._t) + 0.5 * x * sech2 * d_inner
        return dout * dgelu

# ── Swish / SiLU ─────────────────────────────────────────────────────────
class Swish:
    """x·σ(x)   Self-gated, smooth everywhere. Used in EfficientNet, LaMDA."""
    def forward(self, x):
        self._x = x
        self._s = 1.0 / (1.0 + torch.exp(-x))
        return x * self._s
    def backward(self, dout):
        # d/dx [x·σ(x)] = σ(x) + x·σ(x)·(1 − σ(x)) = σ(x)(1 + x(1−σ(x)))
        grad = self._s * (1.0 + self._x * (1.0 - self._s))
        return dout * grad

2.2 Properties at a Glance

Function	Formula	Gradient	Range	Key issue
Sigmoid	1/(1+e⁻ˣ)	σ(1−σ) ≤ 0.25	(0,1)	Vanishing gradient, not zero-centred
Tanh	(eˣ−e⁻ˣ)/(eˣ+e⁻ˣ)	1−tanh²(x) ≤ 1	(−1,1)	Vanishing gradient (better than σ)
ReLU	max(0,x)	0 or 1	[0,∞)	Dying neuron (x<0 → dead forever)
Leaky ReLU	max(αx,x)	α or 1	(−∞,∞)	Fixes dying neuron
ELU	x or α(eˣ−1)	1 or f(x)+α	(−α,∞)	Smooth, negative saturation
GELU	x·Φ(x)	Φ(x)+xφ(x)	(−∞,∞)	State-of-art for Transformers
Swish	x·σ(x)	σ(1+x(1−σ))	(−∞,∞)	Smooth, self-gated

2.3 Interactive Activation Explorer

Drag the vertical line to inspect $f(x)$ and $f'(x)$ at any point. Shaded regions show where gradients become dangerously small (< 0.1) — this is where vanishing gradient lives.

Sigmoid

x0.00

f(x)0.500

f′(x)0.250

Saturated?no

— f(x) activation
— f′(x) gradient
■ gradient < 0.1

x range ±6

2.4 The Vanishing Gradient Problem

For sigmoid: $\sigma'(x) = \sigma(x)(1-\sigma(x)) \leq 0.25$ always. Through $L$ layers, the gradient magnitude is multiplied by $\sigma'$ at each step. With 10 layers of sigmoid activations, the gradient reaching the first layer is at most $0.25^{10} \approx 10^{-6}$ — effectively zero. The first layers learn nothing.

Dying ReLU

A ReLU neuron whose pre-activation is always negative will always output 0, and its gradient will always be 0 — it is permanently dead. This can happen after a large gradient update flips all inputs negative. Leaky ReLU, ELU, and GELU avoid this by passing a small (but non-zero) gradient for $x < 0$.

3. Assembling the MLP and Training with MSE

3.1 MSE Loss from Scratch

Mean Squared Error measures the average squared deviation between prediction and target:

$$L_{\text{MSE}} = \frac{1}{BN}\sum_{b=1}^{B}\sum_{n=1}^{N}\!\left(\hat{y}_{b,n} - y_{b,n}\right)^2$$

Gradient: $\frac{\partial L}{\partial \hat{\mathbf{Y}}} = \frac{2(\hat{\mathbf{Y}}-\mathbf{Y})}{BN}$. Note the division by $BN$ makes the loss scale-invariant to batch size — both the loss value and its gradient stay comparable across different $B$.

losses.pyMSE

class MSELoss:
    """Mean Squared Error — standard regression loss."""

    def forward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
        """
        y_pred, y_true : (B, N)  — batch of B predictions, N outputs each
        Returns scalar loss.
        Cost: O(B·N)
        """
        self._diff = y_pred - y_true      # (B, N) — cache for backward
        n_elements = y_pred.numel()       # B * N
        return (self._diff ** 2).sum() / n_elements

    def backward(self) -> torch.Tensor:
        """
        Returns dL/d(y_pred) : (B, N)
        Cost: O(B·N)
        """
        n_elements = self._diff.numel()
        return 2.0 * self._diff / n_elements

3.2 Full MLP

mlp.pycomplete from-scratch MLP

class MLP:
    """
    Multi-layer perceptron built from our hand-written Linear + Activation blocks.

    Usage:
        model = MLP(dims=[2, 64, 64, 1], activation='relu')
        y_pred = model.forward(x)   # x: (B, 2)
        loss   = criterion.forward(y_pred, y_true)
        dout   = criterion.backward()
        model.backward(dout)
        model.step(lr=1e-3)
    """

    def __init__(self, dims: list[int], activation: str = 'relu'):
        act_map = {
            'sigmoid': Sigmoid, 'tanh': Tanh, 'relu': ReLU,
            'leaky': LeakyReLU, 'elu': ELU, 'gelu': GELU, 'swish': Swish,
        }
        act_cls = act_map[activation]

        self.layers = []
        for i in range(len(dims) - 1):
            self.layers.append(Linear(dims[i], dims[i + 1]))
            # Add activation after every layer EXCEPT the last (output) layer
            if i < len(dims) - 2:
                self.layers.append(act_cls())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x : (B, dims[0])
        Sequentially applies all layers.
        Total cost: Σ_l O(B · d_l · d_{l+1})
        """
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, dout: torch.Tensor) -> None:
        """Backprop in reverse layer order. Same asymptotic cost as forward."""
        for layer in reversed(self.layers):
            dout = layer.backward(dout)

    def step(self, lr: float) -> None:
        """SGD update for all Linear layers."""
        for layer in self.layers:
            if isinstance(layer, Linear):
                layer.step(lr)

3.3 Complete Training Loop

train.pyfit a sine wave

import torch

# ── Data: learn f(x) = sin(2πx) ────────────────────────────────────────
torch.manual_seed(42)
N_total = 1000
x_all = torch.rand(N_total, 1) * 2 - 1          # uniform in [−1, 1]
y_all = torch.sin(2 * torch.pi * x_all)          # target

# ── Model & loss ────────────────────────────────────────────────────────
model    = MLP(dims=[1, 64, 64, 1], activation='relu')
criterion = MSELoss()

# ── Hyper-parameters ────────────────────────────────────────────────────
lr         = 1e-3
batch_size = 64
epochs     = 200

# ── Training loop ───────────────────────────────────────────────────────
for epoch in range(epochs):
    # Shuffle data each epoch
    idx = torch.randperm(N_total)
    x_shuf, y_shuf = x_all[idx], y_all[idx]

    epoch_loss = 0.0
    n_batches  = 0

    for start in range(0, N_total, batch_size):
        # ── Slice mini-batch ────────────────────────────────────────────
        xb = x_shuf[start : start + batch_size]   # (B, 1)
        yb = y_shuf[start : start + batch_size]   # (B, 1)

        # ── Forward pass ────────────────────────────────────────────────
        y_pred = model.forward(xb)                 # (B, 1)
        loss   = criterion.forward(y_pred, yb)     # scalar

        # ── Backward pass ───────────────────────────────────────────────
        dout = criterion.backward()                # (B, 1)
        model.backward(dout)                       # propagates through all layers

        # ── Parameter update ────────────────────────────────────────────
        model.step(lr)

        epoch_loss += loss.item()
        n_batches  += 1

    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1:3d} | MSE loss: {epoch_loss/n_batches:.6f}")

# Output:
# Epoch  20 | MSE loss: 0.148312
# Epoch  40 | MSE loss: 0.042891
# Epoch  60 | MSE loss: 0.018774
# ...
# Epoch 200 | MSE loss: 0.002103

3.4 Complexity of a Full Forward + Backward Pass

For an MLP with $L$ layers and widths $d_0, d_1, \ldots, d_L$, the full cost per batch is:

$$\text{Forward}: \sum_{l=0}^{L-1} \mathcal{O}(B\cdot d_l \cdot d_{l+1}) \qquad \text{Backward}: \approx 3\times\text{Forward cost}$$

Deeper is cheaper than wider: a 10-layer $[64\to64]$ network costs $10\times B\times64\times64 = 40.96B$ K ops; a 2-layer $[320\to320]$ network with the same parameter count costs $2\times B\times320\times320 = 204.8B$ K ops — $5\times$ more. Depth is computationally efficient.

3.5 Interactive: Network Forward Pass

Move the input sliders to see activations propagate through a live $2{\to}4{\to}4{\to}1$ network. Circle colour encodes the neuron's output value. Line thickness = |weight|, colour = sign.

Forward Pass — 2→4→4→1 ReLU Network

Layer Outputs

Layer 1—

Layer 2—

Output—

● active neuron
● inactive (ReLU=0)
— positive weight
— negative weight

Input x₁ 0.50

Input x₂ -0.30

4. Loss Functions

The loss function defines what "wrong" means. Different tasks require different geometry in the loss landscape. We cover five: MSE, MAE, Huber, Binary Cross-Entropy, and Categorical Cross-Entropy.

4.1 Mean Absolute Error (MAE / L1)

$$L_{\text{MAE}} = \frac{1}{BN}\sum_{b,n}|\hat{y}_{b,n}-y_{b,n}| \qquad \frac{\partial L}{\partial \hat{y}_{b,n}}=\frac{\text{sign}(\hat{y}_{b,n}-y_{b,n})}{BN}$$

MAE is robust to outliers — a single prediction 100 units off contributes linearly, not quadratically. But the gradient is constant regardless of how close you are to the target, which can cause instability very near convergence (it never "decelerates").

Use MAE when

Your targets have outliers (e.g. house prices — a few mansions skew the distribution). You want the model to predict the median rather than the mean.

Avoid MAE when

The gradient discontinuity at zero causes oscillation at the end of training. Not differentiable at exactly zero — requires a subgradient.

4.2 Huber Loss

$$L_\delta(\varepsilon)=\begin{cases}\tfrac{1}{2}\varepsilon^2 & |\varepsilon|\leq\delta \\ \delta\!\left(|\varepsilon|-\tfrac{\delta}{2}\right) & |\varepsilon|>\delta\end{cases}, \quad \varepsilon=\hat{y}-y$$

Huber is quadratic near zero (smooth gradient, fast convergence when close to target) and linear for large errors (robust to outliers). The $\delta$ hyperparameter controls the transition point. Gradient: $\varepsilon$ for $|\varepsilon|\leq\delta$, else $\delta\cdot\text{sign}(\varepsilon)$.

4.3 Binary Cross-Entropy (BCE)

For binary classification where the network output $\hat{p}\in(0,1)$ (after sigmoid) represents $P(y=1)$:

$$L_{\text{BCE}} = -\frac{1}{B}\sum_b\!\left[y_b\log\hat{p}_b + (1-y_b)\log(1-\hat{p}_b)\right]$$

Gradient w.r.t. the sigmoid output: $\frac{\partial L}{\partial \hat{p}} = \frac{\hat{p}-y}{\hat{p}(1-\hat{p})}$. Beautifully, when combined with sigmoid in the final layer, the gradient simplifies to just $\hat{p}-y$ — no division.

Numerical stability

Never compute $\log(\hat{p})$ when $\hat{p}$ can be 0. Always clamp: p.clamp(min=1e-7, max=1-1e-7). In practice you should compute BCE from logits (pre-sigmoid) using the log-sum-exp trick to avoid overflow.

4.4 Categorical Cross-Entropy + Softmax

For $K$-class classification. Softmax converts raw logits $\mathbf{z}\in\mathbb{R}^K$ to probabilities:

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} \qquad L_{\text{CE}} = -\sum_k y_k\log p_k$$

The combined Softmax + Cross-Entropy gradient is the cleanest result in all of deep learning:

$$\frac{\partial L}{\partial z_k} = p_k - y_k$$

where $y_k$ is the one-hot label. The gradient is just "prediction minus truth" — no logs, no divisions.

losses.pyall loss functions

import torch

# ── MAE / L1 ────────────────────────────────────────────────────────────
class MAELoss:
    def forward(self, y_pred, y_true):
        self._diff = y_pred - y_true
        return self._diff.abs().sum() / self._diff.numel()   # O(BN)
    def backward(self):
        return self._diff.sign() / self._diff.numel()         # O(BN)

# ── Huber ────────────────────────────────────────────────────────────────
class HuberLoss:
    def __init__(self, delta: float = 1.0):
        self.delta = delta
    def forward(self, y_pred, y_true):
        self._d = y_pred - y_true
        abs_d = self._d.abs()
        quad = 0.5 * self._d ** 2
        lin  = self.delta * (abs_d - 0.5 * self.delta)
        self._loss = torch.where(abs_d <= self.delta, quad, lin)
        return self._loss.mean()                               # O(BN)
    def backward(self):
        n = self._d.numel()
        # grad = d  when |d| ≤ δ,  else δ·sign(d)
        grad = torch.where(self._d.abs() <= self.delta,
                           self._d,
                           self.delta * self._d.sign())
        return grad / n                                        # O(BN)

# ── Binary Cross-Entropy ─────────────────────────────────────────────────
class BCELoss:
    """Expects y_pred ∈ (0,1) — apply Sigmoid to logits first."""
    def forward(self, y_pred, y_true):
        p = y_pred.clamp(1e-7, 1 - 1e-7)   # numerical safety
        self._p = p; self._y = y_true
        return -(y_true * torch.log(p) + (1 - y_true) * torch.log(1 - p)).mean()
    def backward(self):
        # dL/dp = (p − y) / (p(1−p)) / B
        B = self._p.shape[0]
        return (self._p - self._y) / (self._p * (1 - self._p)) / B

# ── Categorical Cross-Entropy with Softmax ───────────────────────────────
class SoftmaxCELoss:
    """
    Numerically stable: subtract max before exp (log-sum-exp trick).
    Expects logits (B, K) and integer class labels (B,).
    """
    def forward(self, logits, labels):
        # Stable softmax
        z = logits - logits.max(dim=1, keepdim=True).values
        exp_z = torch.exp(z)
        self._p = exp_z / exp_z.sum(dim=1, keepdim=True)    # (B, K)
        self._labels = labels
        B = logits.shape[0]
        log_p = torch.log(self._p.clamp(1e-9))
        # Gather log-prob of the correct class for each sample
        return -log_p[torch.arange(B), labels].mean()
    def backward(self):
        # dL/dz_k = p_k − y_k    (one-hot y, so y_k = 1 for true class)
        B = self._p.shape[0]
        grad = self._p.clone()
        grad[torch.arange(B), self._labels] -= 1.0
        return grad / B

4.5 Comparison of Loss Function Properties

Loss	Gradient at ε=0	Gradient at ε=3	Outlier robust?	Best for
MSE	0 (smooth)	6/BN (large)	No	Regression, when errors are Gaussian
MAE	±1/BN (jump)	±1/BN (same)	Yes	Median regression, heavy-tailed errors
Huber(δ=1)	0 (smooth)	δ/BN (clipped)	Yes	Best of both — default for robust regression
BCE	—	—	—	Binary classification (sigmoid output)
Softmax-CE	—	—	—	Multi-class classification (K-class)

4.6 Interactive: Loss Function Comparison

Drag the vertical line to inspect loss value and gradient for each function at any error magnitude. Use the Huber δ slider to see how the quadratic-to-linear transition shifts.

Loss at ε = 0.00

MSE—

MAE—

Huber—

BCE—

Gradient |dL/dε|

MSE—

MAE—

Huber—

BCE—

Huber δ 1.00

Conclusion

What we built from first principles:

Linear layer: forward $\mathbf{Y}=\mathbf{X}\mathbf{W}+\mathbf{b}$ in $\mathcal{O}(BMN)$; backward three matrix operations each in $\mathcal{O}(BMN)$; total backward $\approx3\times$ forward.
MLP: stack of linear layers + activations. Total cost $\sum_l \mathcal{O}(B d_l d_{l+1})$; deeper is cheaper than wider for the same parameter count.
Activations: all are element-wise $\mathcal{O}(BN)$, but their gradient shapes make all the difference. Sigmoid/tanh: vanishing gradient for $|x|>3$. ReLU: dead neurons for $x<0$. Leaky/ELU/GELU/Swish: fix both.
Losses: MSE for Gaussian errors; MAE/Huber when outliers exist; BCE for binary; Softmax-CE for multi-class. The combined Softmax-CE gradient $p_k-y_k$ is elegant and stable.

What to try next

Add momentum (SGD+M) or Adam — same backward pass, just smarter update rule. Add BatchNorm — normalises layer inputs, which stabilises training with deep networks. Explore weight decay (L2 regularisation) — just add $\lambda\,\mathbf{W}$ to the gradient at the step.

All visualizations built with Canvas 2D. Math rendered with KaTeX. Code examples are self-contained PyTorch (no torch.nn). Written with the help of Claude.