← LearningMyself

Neural Networks from Scratch

Every matrix multiply, every gradient — implemented in raw PyTorch (no nn.Linear, no autograd) with batch dimensions, complexity proofs, and interactive visualizations.

Before you can use a framework's abstraction, you need to understand what it is actually doing. This article builds a complete neural network from first principles: a hand-written linear layer, a manual backpropagation pass, every major activation function, and every major loss function. Every operation carries its batch dimension and its $\mathcal{O}(\cdot)$ cost.

1. The Linear Layer

1.1 Forward Pass

A linear (fully-connected) layer computes an affine transformation. For a batch of $B$ inputs, each with $M$ features, mapped to $N$ outputs:

$$\underbrace{\mathbf{Y}}_{B\times N} = \underbrace{\mathbf{X}}_{B\times M}\ \underbrace{\mathbf{W}}_{M\times N} + \underbrace{\mathbf{b}}_{N}$$

The $+\mathbf{b}$ broadcasts along the batch axis — every sample in the batch gets the same bias added. Let's count the cost. Each output element $Y_{b,n} = \sum_{m=0}^{M-1} X_{b,m}\,W_{m,n} + b_n$ requires $M$ multiplications and $M$ additions. There are $B \times N$ output elements, so:

Complexity — Forward Pass

$B \times N$ outputs, each requiring $M$ multiply-adds $\Rightarrow$ $\mathbf{2BMN}$ FLOPs $\Rightarrow \mathcal{O}(BMN)$

For a typical hidden layer with $B=256,\,M=512,\,N=512$: roughly $\mathbf{134}$ million operations per layer.

linear_layer.pyPyTorch — no nn.Linear
import torch

class Linear:
    """Fully-connected layer — every operation written by hand."""

    def __init__(self, in_features: int, out_features: int):
        # Xavier (Glorot) Normal initialization:
        #   scale = sqrt(2 / (fan_in + fan_out))
        # Keeps variance of activations stable at init regardless of layer width.
        scale = (2.0 / (in_features + out_features)) ** 0.5
        self.W = torch.randn(in_features, out_features) * scale  # (M, N)
        self.b = torch.zeros(out_features)                        # (N,)
        # Gradient buffers — filled by .backward()
        self.dW: torch.Tensor | None = None
        self.db: torch.Tensor | None = None
        self._x: torch.Tensor | None = None   # input cache for backward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x   : (B, M)  — batch of B inputs, each M-dimensional
        out : (B, N)  — batch of B outputs, each N-dimensional

        Cost: O(B·M·N)  — one matrix multiply + bias broadcast
        """
        self._x = x                      # cache for backward pass
        return x @ self.W + self.b       # (B,M)@(M,N) + broadcast (N,) → (B,N)

1.2 Backward Pass — Deriving Every Gradient

Given $\frac{\partial L}{\partial \mathbf{Y}}$ (gradient of the scalar loss $L$ w.r.t. this layer's output, shape $B\times N$), we need three gradients: one for each parameter ($\mathbf{W}$, $\mathbf{b}$) and one to pass upstream ($\mathbf{X}$).

Gradient w.r.t. W. Write out a single element: $Y_{b,n} = \sum_m X_{b,m}\,W_{m,n}$. By the chain rule, $\frac{\partial L}{\partial W_{m,n}} = \sum_b \frac{\partial L}{\partial Y_{b,n}}\cdot X_{b,m}$. Stacking all $(m,n)$ pairs in matrix form:

$$\frac{\partial L}{\partial \mathbf{W}} = \mathbf{X}^\top \frac{\partial L}{\partial \mathbf{Y}} \quad \underbrace{(M\times B)\cdot(B\times N)}_{\text{shape }M\times N}$$

Cost: $(M\times B)$ times $(B\times N)$ matrix multiply $\Rightarrow \mathcal{O}(BMN)$.

Gradient w.r.t. b. Each $b_n$ contributes to every sample: $\frac{\partial L}{\partial b_n} = \sum_b \frac{\partial L}{\partial Y_{b,n}}$, i.e. sum the upstream gradient over the batch axis.

$$\frac{\partial L}{\partial \mathbf{b}} = \sum_{b=1}^{B} \frac{\partial L}{\partial \mathbf{Y}}_{b,:} \quad \text{shape }N \qquad \mathcal{O}(BN)$$

Gradient w.r.t. X. $\frac{\partial L}{\partial X_{b,m}} = \sum_n \frac{\partial L}{\partial Y_{b,n}}\cdot W_{m,n}$. In matrix form:

$$\frac{\partial L}{\partial \mathbf{X}} = \frac{\partial L}{\partial \mathbf{Y}}\,\mathbf{W}^\top \quad \underbrace{(B\times N)\cdot(N\times M)}_{\text{shape }B\times M} \qquad \mathcal{O}(BMN)$$
OperationShapesFLOPsBig-O
Forward Y = X @ W + b(B,M)·(M,N)→(B,N)2BMN + BNO(BMN)
Backward dW = Xᵀ @ dY(M,B)·(B,N)→(M,N)2BMNO(BMN)
Backward db = dY.sum(0)(B,N)→(N,)BNO(BN)
Backward dX = dY @ Wᵀ(B,N)·(N,M)→(B,M)2BMNO(BMN)

The backward pass costs $\approx 3\times$ the forward pass — two matrix multiplies for parameters, one for the input gradient. This $3\times$ rule holds for any matrix multiply.

linear_layer.py (continued)backward + SGD step
    def backward(self, dout: torch.Tensor) -> torch.Tensor:
        """
        dout : (B, N)  — dL/dY from the layer above

        Returns dL/dX : (B, M) — gradient to pass further upstream.
        Also stores dW and db for the optimizer step.
        """
        # dL/dW = X^T @ dL/dY       shape: (M,B)·(B,N) → (M,N)  O(BMN)
        self.dW = self._x.T @ dout

        # dL/db = sum over batch     shape: (N,)                   O(BN)
        self.db = dout.sum(dim=0)

        # dL/dX = dL/dY @ W^T       shape: (B,N)·(N,M) → (B,M)  O(BMN)
        return dout @ self.W.T

    def step(self, lr: float) -> None:
        """Vanilla SGD: move parameters opposite to their gradient."""
        self.W -= lr * self.dW
        self.b -= lr * self.db

2. Activation Functions

Stack two linear layers with no activation between them: $\mathbf{Y} = (\mathbf{X}\mathbf{W}_1)\mathbf{W}_2 = \mathbf{X}(\mathbf{W}_1\mathbf{W}_2)$. The product $\mathbf{W}_1\mathbf{W}_2$ is itself a single matrix — no matter how many layers you add, the network stays linear. Activation functions introduce non-linearity after each layer, giving the network the capacity to approximate any continuous function (Universal Approximation Theorem).

2.1 The Six Functions

activations.pyall from scratch
import torch
import math

# ── Sigmoid ──────────────────────────────────────────────────────────────
class Sigmoid:
    """σ(x) = 1 / (1 + e^{-x})    Range: (0, 1)"""
    def forward(self, x):
        self._s = 1.0 / (1.0 + torch.exp(-x))
        return self._s
    def backward(self, dout):
        # σ'(x) = σ(x)(1 − σ(x))   — reuse cached forward output
        return dout * self._s * (1.0 - self._s)

# ── Tanh ─────────────────────────────────────────────────────────────────
class Tanh:
    """tanh(x) = (e^x − e^{-x})/(e^x + e^{-x})   Range: (−1, 1)"""
    def forward(self, x):
        self._t = torch.tanh(x)       # uses stable fused kernel
        return self._t
    def backward(self, dout):
        # tanh'(x) = 1 − tanh²(x)
        return dout * (1.0 - self._t ** 2)

# ── ReLU ──────────────────────────────────────────────────────────────────
class ReLU:
    """max(0, x)    Range: [0, ∞)"""
    def forward(self, x):
        self._mask = x > 0            # boolean gate: True where x > 0
        return x * self._mask
    def backward(self, dout):
        # Gradient is 1 where x > 0, 0 elsewhere — gate the upstream gradient
        return dout * self._mask

# ── Leaky ReLU ────────────────────────────────────────────────────────────
class LeakyReLU:
    """max(αx, x)   Default α=0.01. Fixes the dying neuron problem."""
    def __init__(self, alpha: float = 0.01):
        self.alpha = alpha
    def forward(self, x):
        self._mask = x > 0
        return torch.where(self._mask, x, self.alpha * x)
    def backward(self, dout):
        grad = torch.where(self._mask,
                           torch.ones_like(dout),
                           torch.full_like(dout, self.alpha))
        return dout * grad

# ── ELU ───────────────────────────────────────────────────────────────────
class ELU:
    """x if x>0 else α(e^x − 1)   Smooth, negative saturation."""
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
    def forward(self, x):
        self._x = x
        pos = x * (x > 0)
        neg = self.alpha * (torch.exp(x.clamp(max=0)) - 1.0) * (x <= 0)
        return pos + neg
    def backward(self, dout):
        pos_grad = (self._x > 0).float()
        neg_grad = self.alpha * torch.exp(self._x.clamp(max=0)) * (self._x <= 0).float()
        return dout * (pos_grad + neg_grad)

# ── GELU (tanh approximation) ────────────────────────────────────────────
class GELU:
    """x·Φ(x)  ≈  0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
    Used in GPT, BERT, and most modern Transformers."""
    _C = math.sqrt(2.0 / math.pi)
    def forward(self, x):
        self._x = x
        inner = self._C * (x + 0.044715 * x ** 3)
        self._t = torch.tanh(inner)
        return 0.5 * x * (1.0 + self._t)
    def backward(self, dout):
        x = self._x
        inner = self._C * (x + 0.044715 * x ** 3)
        sech2 = 1.0 - self._t ** 2                         # sech²(inner)
        d_inner = self._C * (1.0 + 3 * 0.044715 * x ** 2)
        dgelu = 0.5 * (1.0 + self._t) + 0.5 * x * sech2 * d_inner
        return dout * dgelu

# ── Swish / SiLU ─────────────────────────────────────────────────────────
class Swish:
    """x·σ(x)   Self-gated, smooth everywhere. Used in EfficientNet, LaMDA."""
    def forward(self, x):
        self._x = x
        self._s = 1.0 / (1.0 + torch.exp(-x))
        return x * self._s
    def backward(self, dout):
        # d/dx [x·σ(x)] = σ(x) + x·σ(x)·(1 − σ(x)) = σ(x)(1 + x(1−σ(x)))
        grad = self._s * (1.0 + self._x * (1.0 - self._s))
        return dout * grad

2.2 Properties at a Glance

FunctionFormulaGradientRangeKey issue
Sigmoid1/(1+e⁻ˣ)σ(1−σ) ≤ 0.25(0,1)Vanishing gradient, not zero-centred
Tanh(eˣ−e⁻ˣ)/(eˣ+e⁻ˣ)1−tanh²(x) ≤ 1(−1,1)Vanishing gradient (better than σ)
ReLUmax(0,x)0 or 1[0,∞)Dying neuron (x<0 → dead forever)
Leaky ReLUmax(αx,x)α or 1(−∞,∞)Fixes dying neuron
ELUx or α(eˣ−1)1 or f(x)+α(−α,∞)Smooth, negative saturation
GELUx·Φ(x)Φ(x)+xφ(x)(−∞,∞)State-of-art for Transformers
Swishx·σ(x)σ(1+x(1−σ))(−∞,∞)Smooth, self-gated

2.3 Interactive Activation Explorer

Drag the vertical line to inspect $f(x)$ and $f'(x)$ at any point. Shaded regions show where gradients become dangerously small (< 0.1) — this is where vanishing gradient lives.

Sigmoid
x0.00
f(x)0.500
f′(x)0.250
Saturated?no
f(x) activation
f′(x) gradient
gradient < 0.1

2.4 The Vanishing Gradient Problem

For sigmoid: $\sigma'(x) = \sigma(x)(1-\sigma(x)) \leq 0.25$ always. Through $L$ layers, the gradient magnitude is multiplied by $\sigma'$ at each step. With 10 layers of sigmoid activations, the gradient reaching the first layer is at most $0.25^{10} \approx 10^{-6}$ — effectively zero. The first layers learn nothing.

Dying ReLU

A ReLU neuron whose pre-activation is always negative will always output 0, and its gradient will always be 0 — it is permanently dead. This can happen after a large gradient update flips all inputs negative. Leaky ReLU, ELU, and GELU avoid this by passing a small (but non-zero) gradient for $x < 0$.

3. Assembling the MLP and Training with MSE

3.1 MSE Loss from Scratch

Mean Squared Error measures the average squared deviation between prediction and target:

$$L_{\text{MSE}} = \frac{1}{BN}\sum_{b=1}^{B}\sum_{n=1}^{N}\!\left(\hat{y}_{b,n} - y_{b,n}\right)^2$$

Gradient: $\frac{\partial L}{\partial \hat{\mathbf{Y}}} = \frac{2(\hat{\mathbf{Y}}-\mathbf{Y})}{BN}$. Note the division by $BN$ makes the loss scale-invariant to batch size — both the loss value and its gradient stay comparable across different $B$.

losses.pyMSE
class MSELoss:
    """Mean Squared Error — standard regression loss."""

    def forward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
        """
        y_pred, y_true : (B, N)  — batch of B predictions, N outputs each
        Returns scalar loss.
        Cost: O(B·N)
        """
        self._diff = y_pred - y_true      # (B, N) — cache for backward
        n_elements = y_pred.numel()       # B * N
        return (self._diff ** 2).sum() / n_elements

    def backward(self) -> torch.Tensor:
        """
        Returns dL/d(y_pred) : (B, N)
        Cost: O(B·N)
        """
        n_elements = self._diff.numel()
        return 2.0 * self._diff / n_elements

3.2 Full MLP

mlp.pycomplete from-scratch MLP
class MLP:
    """
    Multi-layer perceptron built from our hand-written Linear + Activation blocks.

    Usage:
        model = MLP(dims=[2, 64, 64, 1], activation='relu')
        y_pred = model.forward(x)   # x: (B, 2)
        loss   = criterion.forward(y_pred, y_true)
        dout   = criterion.backward()
        model.backward(dout)
        model.step(lr=1e-3)
    """

    def __init__(self, dims: list[int], activation: str = 'relu'):
        act_map = {
            'sigmoid': Sigmoid, 'tanh': Tanh, 'relu': ReLU,
            'leaky': LeakyReLU, 'elu': ELU, 'gelu': GELU, 'swish': Swish,
        }
        act_cls = act_map[activation]

        self.layers = []
        for i in range(len(dims) - 1):
            self.layers.append(Linear(dims[i], dims[i + 1]))
            # Add activation after every layer EXCEPT the last (output) layer
            if i < len(dims) - 2:
                self.layers.append(act_cls())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x : (B, dims[0])
        Sequentially applies all layers.
        Total cost: Σ_l O(B · d_l · d_{l+1})
        """
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, dout: torch.Tensor) -> None:
        """Backprop in reverse layer order. Same asymptotic cost as forward."""
        for layer in reversed(self.layers):
            dout = layer.backward(dout)

    def step(self, lr: float) -> None:
        """SGD update for all Linear layers."""
        for layer in self.layers:
            if isinstance(layer, Linear):
                layer.step(lr)

3.3 Complete Training Loop

train.pyfit a sine wave
import torch

# ── Data: learn f(x) = sin(2πx) ────────────────────────────────────────
torch.manual_seed(42)
N_total = 1000
x_all = torch.rand(N_total, 1) * 2 - 1          # uniform in [−1, 1]
y_all = torch.sin(2 * torch.pi * x_all)          # target

# ── Model & loss ────────────────────────────────────────────────────────
model    = MLP(dims=[1, 64, 64, 1], activation='relu')
criterion = MSELoss()

# ── Hyper-parameters ────────────────────────────────────────────────────
lr         = 1e-3
batch_size = 64
epochs     = 200

# ── Training loop ───────────────────────────────────────────────────────
for epoch in range(epochs):
    # Shuffle data each epoch
    idx = torch.randperm(N_total)
    x_shuf, y_shuf = x_all[idx], y_all[idx]

    epoch_loss = 0.0
    n_batches  = 0

    for start in range(0, N_total, batch_size):
        # ── Slice mini-batch ────────────────────────────────────────────
        xb = x_shuf[start : start + batch_size]   # (B, 1)
        yb = y_shuf[start : start + batch_size]   # (B, 1)

        # ── Forward pass ────────────────────────────────────────────────
        y_pred = model.forward(xb)                 # (B, 1)
        loss   = criterion.forward(y_pred, yb)     # scalar

        # ── Backward pass ───────────────────────────────────────────────
        dout = criterion.backward()                # (B, 1)
        model.backward(dout)                       # propagates through all layers

        # ── Parameter update ────────────────────────────────────────────
        model.step(lr)

        epoch_loss += loss.item()
        n_batches  += 1

    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1:3d} | MSE loss: {epoch_loss/n_batches:.6f}")

# Output:
# Epoch  20 | MSE loss: 0.148312
# Epoch  40 | MSE loss: 0.042891
# Epoch  60 | MSE loss: 0.018774
# ...
# Epoch 200 | MSE loss: 0.002103

3.4 Complexity of a Full Forward + Backward Pass

For an MLP with $L$ layers and widths $d_0, d_1, \ldots, d_L$, the full cost per batch is:

$$\text{Forward}: \sum_{l=0}^{L-1} \mathcal{O}(B\cdot d_l \cdot d_{l+1}) \qquad \text{Backward}: \approx 3\times\text{Forward cost}$$

Deeper is cheaper than wider: a 10-layer $[64\to64]$ network costs $10\times B\times64\times64 = 40.96B$ K ops; a 2-layer $[320\to320]$ network with the same parameter count costs $2\times B\times320\times320 = 204.8B$ K ops — $5\times$ more. Depth is computationally efficient.

3.5 Interactive: Network Forward Pass

Move the input sliders to see activations propagate through a live $2{\to}4{\to}4{\to}1$ network. Circle colour encodes the neuron's output value. Line thickness = |weight|, colour = sign.

Forward Pass — 2→4→4→1 ReLU Network
Layer Outputs
Layer 1
Layer 2
Output
active neuron
inactive (ReLU=0)
positive weight
negative weight

4. Loss Functions

The loss function defines what "wrong" means. Different tasks require different geometry in the loss landscape. We cover five: MSE, MAE, Huber, Binary Cross-Entropy, and Categorical Cross-Entropy.

4.1 Mean Absolute Error (MAE / L1)

$$L_{\text{MAE}} = \frac{1}{BN}\sum_{b,n}|\hat{y}_{b,n}-y_{b,n}| \qquad \frac{\partial L}{\partial \hat{y}_{b,n}}=\frac{\text{sign}(\hat{y}_{b,n}-y_{b,n})}{BN}$$

MAE is robust to outliers — a single prediction 100 units off contributes linearly, not quadratically. But the gradient is constant regardless of how close you are to the target, which can cause instability very near convergence (it never "decelerates").

Use MAE when

Your targets have outliers (e.g. house prices — a few mansions skew the distribution). You want the model to predict the median rather than the mean.

Avoid MAE when

The gradient discontinuity at zero causes oscillation at the end of training. Not differentiable at exactly zero — requires a subgradient.

4.2 Huber Loss

$$L_\delta(\varepsilon)=\begin{cases}\tfrac{1}{2}\varepsilon^2 & |\varepsilon|\leq\delta \\ \delta\!\left(|\varepsilon|-\tfrac{\delta}{2}\right) & |\varepsilon|>\delta\end{cases}, \quad \varepsilon=\hat{y}-y$$

Huber is quadratic near zero (smooth gradient, fast convergence when close to target) and linear for large errors (robust to outliers). The $\delta$ hyperparameter controls the transition point. Gradient: $\varepsilon$ for $|\varepsilon|\leq\delta$, else $\delta\cdot\text{sign}(\varepsilon)$.

4.3 Binary Cross-Entropy (BCE)

For binary classification where the network output $\hat{p}\in(0,1)$ (after sigmoid) represents $P(y=1)$:

$$L_{\text{BCE}} = -\frac{1}{B}\sum_b\!\left[y_b\log\hat{p}_b + (1-y_b)\log(1-\hat{p}_b)\right]$$

Gradient w.r.t. the sigmoid output: $\frac{\partial L}{\partial \hat{p}} = \frac{\hat{p}-y}{\hat{p}(1-\hat{p})}$. Beautifully, when combined with sigmoid in the final layer, the gradient simplifies to just $\hat{p}-y$ — no division.

Numerical stability

Never compute $\log(\hat{p})$ when $\hat{p}$ can be 0. Always clamp: p.clamp(min=1e-7, max=1-1e-7). In practice you should compute BCE from logits (pre-sigmoid) using the log-sum-exp trick to avoid overflow.

4.4 Categorical Cross-Entropy + Softmax

For $K$-class classification. Softmax converts raw logits $\mathbf{z}\in\mathbb{R}^K$ to probabilities:

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} \qquad L_{\text{CE}} = -\sum_k y_k\log p_k$$

The combined Softmax + Cross-Entropy gradient is the cleanest result in all of deep learning:

$$\frac{\partial L}{\partial z_k} = p_k - y_k$$

where $y_k$ is the one-hot label. The gradient is just "prediction minus truth" — no logs, no divisions.

losses.pyall loss functions
import torch

# ── MAE / L1 ────────────────────────────────────────────────────────────
class MAELoss:
    def forward(self, y_pred, y_true):
        self._diff = y_pred - y_true
        return self._diff.abs().sum() / self._diff.numel()   # O(BN)
    def backward(self):
        return self._diff.sign() / self._diff.numel()         # O(BN)

# ── Huber ────────────────────────────────────────────────────────────────
class HuberLoss:
    def __init__(self, delta: float = 1.0):
        self.delta = delta
    def forward(self, y_pred, y_true):
        self._d = y_pred - y_true
        abs_d = self._d.abs()
        quad = 0.5 * self._d ** 2
        lin  = self.delta * (abs_d - 0.5 * self.delta)
        self._loss = torch.where(abs_d <= self.delta, quad, lin)
        return self._loss.mean()                               # O(BN)
    def backward(self):
        n = self._d.numel()
        # grad = d  when |d| ≤ δ,  else δ·sign(d)
        grad = torch.where(self._d.abs() <= self.delta,
                           self._d,
                           self.delta * self._d.sign())
        return grad / n                                        # O(BN)

# ── Binary Cross-Entropy ─────────────────────────────────────────────────
class BCELoss:
    """Expects y_pred ∈ (0,1) — apply Sigmoid to logits first."""
    def forward(self, y_pred, y_true):
        p = y_pred.clamp(1e-7, 1 - 1e-7)   # numerical safety
        self._p = p; self._y = y_true
        return -(y_true * torch.log(p) + (1 - y_true) * torch.log(1 - p)).mean()
    def backward(self):
        # dL/dp = (p − y) / (p(1−p)) / B
        B = self._p.shape[0]
        return (self._p - self._y) / (self._p * (1 - self._p)) / B

# ── Categorical Cross-Entropy with Softmax ───────────────────────────────
class SoftmaxCELoss:
    """
    Numerically stable: subtract max before exp (log-sum-exp trick).
    Expects logits (B, K) and integer class labels (B,).
    """
    def forward(self, logits, labels):
        # Stable softmax
        z = logits - logits.max(dim=1, keepdim=True).values
        exp_z = torch.exp(z)
        self._p = exp_z / exp_z.sum(dim=1, keepdim=True)    # (B, K)
        self._labels = labels
        B = logits.shape[0]
        log_p = torch.log(self._p.clamp(1e-9))
        # Gather log-prob of the correct class for each sample
        return -log_p[torch.arange(B), labels].mean()
    def backward(self):
        # dL/dz_k = p_k − y_k    (one-hot y, so y_k = 1 for true class)
        B = self._p.shape[0]
        grad = self._p.clone()
        grad[torch.arange(B), self._labels] -= 1.0
        return grad / B

4.5 Comparison of Loss Function Properties

LossGradient at ε=0Gradient at ε=3Outlier robust?Best for
MSE0 (smooth)6/BN (large)NoRegression, when errors are Gaussian
MAE±1/BN (jump)±1/BN (same)YesMedian regression, heavy-tailed errors
Huber(δ=1)0 (smooth)δ/BN (clipped)YesBest of both — default for robust regression
BCEBinary classification (sigmoid output)
Softmax-CEMulti-class classification (K-class)

4.6 Interactive: Loss Function Comparison

Drag the vertical line to inspect loss value and gradient for each function at any error magnitude. Use the Huber δ slider to see how the quadratic-to-linear transition shifts.

Loss at ε = 0.00
MSE
MAE
Huber
BCE
Gradient |dL/dε|
MSE
MAE
Huber
BCE

Conclusion

What we built from first principles:

  1. Linear layer: forward $\mathbf{Y}=\mathbf{X}\mathbf{W}+\mathbf{b}$ in $\mathcal{O}(BMN)$; backward three matrix operations each in $\mathcal{O}(BMN)$; total backward $\approx3\times$ forward.
  2. MLP: stack of linear layers + activations. Total cost $\sum_l \mathcal{O}(B d_l d_{l+1})$; deeper is cheaper than wider for the same parameter count.
  3. Activations: all are element-wise $\mathcal{O}(BN)$, but their gradient shapes make all the difference. Sigmoid/tanh: vanishing gradient for $|x|>3$. ReLU: dead neurons for $x<0$. Leaky/ELU/GELU/Swish: fix both.
  4. Losses: MSE for Gaussian errors; MAE/Huber when outliers exist; BCE for binary; Softmax-CE for multi-class. The combined Softmax-CE gradient $p_k-y_k$ is elegant and stable.
What to try next

Add momentum (SGD+M) or Adam — same backward pass, just smarter update rule. Add BatchNorm — normalises layer inputs, which stabilises training with deep networks. Explore weight decay (L2 regularisation) — just add $\lambda\,\mathbf{W}$ to the gradient at the step.

All visualizations built with Canvas 2D. Math rendered with KaTeX. Code examples are self-contained PyTorch (no torch.nn). Written with the help of Claude.