Deep Learning · From Scratch
Neural Networks from Scratch
Every matrix multiply, every gradient — implemented in raw PyTorch (no nn.Linear, no autograd) with batch dimensions, complexity proofs, and interactive visualizations.
Before you can use a framework's abstraction, you need to understand what it is actually doing. This article builds a complete neural network from first principles: a hand-written linear layer, a manual backpropagation pass, every major activation function, and every major loss function. Every operation carries its batch dimension and its $\mathcal{O}(\cdot)$ cost.
1. The Linear Layer
1.1 Forward Pass
A linear (fully-connected) layer computes an affine transformation. For a batch of $B$ inputs, each with $M$ features, mapped to $N$ outputs:
The $+\mathbf{b}$ broadcasts along the batch axis — every sample in the batch gets the same bias added. Let's count the cost. Each output element $Y_{b,n} = \sum_{m=0}^{M-1} X_{b,m}\,W_{m,n} + b_n$ requires $M$ multiplications and $M$ additions. There are $B \times N$ output elements, so:
$B \times N$ outputs, each requiring $M$ multiply-adds $\Rightarrow$ $\mathbf{2BMN}$ FLOPs $\Rightarrow \mathcal{O}(BMN)$
For a typical hidden layer with $B=256,\,M=512,\,N=512$: roughly $\mathbf{134}$ million operations per layer.
import torch
class Linear:
"""Fully-connected layer — every operation written by hand."""
def __init__(self, in_features: int, out_features: int):
# Xavier (Glorot) Normal initialization:
# scale = sqrt(2 / (fan_in + fan_out))
# Keeps variance of activations stable at init regardless of layer width.
scale = (2.0 / (in_features + out_features)) ** 0.5
self.W = torch.randn(in_features, out_features) * scale # (M, N)
self.b = torch.zeros(out_features) # (N,)
# Gradient buffers — filled by .backward()
self.dW: torch.Tensor | None = None
self.db: torch.Tensor | None = None
self._x: torch.Tensor | None = None # input cache for backward
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
x : (B, M) — batch of B inputs, each M-dimensional
out : (B, N) — batch of B outputs, each N-dimensional
Cost: O(B·M·N) — one matrix multiply + bias broadcast
"""
self._x = x # cache for backward pass
return x @ self.W + self.b # (B,M)@(M,N) + broadcast (N,) → (B,N)
1.2 Backward Pass — Deriving Every Gradient
Given $\frac{\partial L}{\partial \mathbf{Y}}$ (gradient of the scalar loss $L$ w.r.t. this layer's output, shape $B\times N$), we need three gradients: one for each parameter ($\mathbf{W}$, $\mathbf{b}$) and one to pass upstream ($\mathbf{X}$).
Gradient w.r.t. W. Write out a single element: $Y_{b,n} = \sum_m X_{b,m}\,W_{m,n}$. By the chain rule, $\frac{\partial L}{\partial W_{m,n}} = \sum_b \frac{\partial L}{\partial Y_{b,n}}\cdot X_{b,m}$. Stacking all $(m,n)$ pairs in matrix form:
Cost: $(M\times B)$ times $(B\times N)$ matrix multiply $\Rightarrow \mathcal{O}(BMN)$.
Gradient w.r.t. b. Each $b_n$ contributes to every sample: $\frac{\partial L}{\partial b_n} = \sum_b \frac{\partial L}{\partial Y_{b,n}}$, i.e. sum the upstream gradient over the batch axis.
Gradient w.r.t. X. $\frac{\partial L}{\partial X_{b,m}} = \sum_n \frac{\partial L}{\partial Y_{b,n}}\cdot W_{m,n}$. In matrix form:
| Operation | Shapes | FLOPs | Big-O |
|---|---|---|---|
Forward Y = X @ W + b | (B,M)·(M,N)→(B,N) | 2BMN + BN | O(BMN) |
Backward dW = Xᵀ @ dY | (M,B)·(B,N)→(M,N) | 2BMN | O(BMN) |
Backward db = dY.sum(0) | (B,N)→(N,) | BN | O(BN) |
Backward dX = dY @ Wᵀ | (B,N)·(N,M)→(B,M) | 2BMN | O(BMN) |
The backward pass costs $\approx 3\times$ the forward pass — two matrix multiplies for parameters, one for the input gradient. This $3\times$ rule holds for any matrix multiply.
def backward(self, dout: torch.Tensor) -> torch.Tensor:
"""
dout : (B, N) — dL/dY from the layer above
Returns dL/dX : (B, M) — gradient to pass further upstream.
Also stores dW and db for the optimizer step.
"""
# dL/dW = X^T @ dL/dY shape: (M,B)·(B,N) → (M,N) O(BMN)
self.dW = self._x.T @ dout
# dL/db = sum over batch shape: (N,) O(BN)
self.db = dout.sum(dim=0)
# dL/dX = dL/dY @ W^T shape: (B,N)·(N,M) → (B,M) O(BMN)
return dout @ self.W.T
def step(self, lr: float) -> None:
"""Vanilla SGD: move parameters opposite to their gradient."""
self.W -= lr * self.dW
self.b -= lr * self.db
2. Activation Functions
Stack two linear layers with no activation between them: $\mathbf{Y} = (\mathbf{X}\mathbf{W}_1)\mathbf{W}_2 = \mathbf{X}(\mathbf{W}_1\mathbf{W}_2)$. The product $\mathbf{W}_1\mathbf{W}_2$ is itself a single matrix — no matter how many layers you add, the network stays linear. Activation functions introduce non-linearity after each layer, giving the network the capacity to approximate any continuous function (Universal Approximation Theorem).
2.1 The Six Functions
import torch
import math
# ── Sigmoid ──────────────────────────────────────────────────────────────
class Sigmoid:
"""σ(x) = 1 / (1 + e^{-x}) Range: (0, 1)"""
def forward(self, x):
self._s = 1.0 / (1.0 + torch.exp(-x))
return self._s
def backward(self, dout):
# σ'(x) = σ(x)(1 − σ(x)) — reuse cached forward output
return dout * self._s * (1.0 - self._s)
# ── Tanh ─────────────────────────────────────────────────────────────────
class Tanh:
"""tanh(x) = (e^x − e^{-x})/(e^x + e^{-x}) Range: (−1, 1)"""
def forward(self, x):
self._t = torch.tanh(x) # uses stable fused kernel
return self._t
def backward(self, dout):
# tanh'(x) = 1 − tanh²(x)
return dout * (1.0 - self._t ** 2)
# ── ReLU ──────────────────────────────────────────────────────────────────
class ReLU:
"""max(0, x) Range: [0, ∞)"""
def forward(self, x):
self._mask = x > 0 # boolean gate: True where x > 0
return x * self._mask
def backward(self, dout):
# Gradient is 1 where x > 0, 0 elsewhere — gate the upstream gradient
return dout * self._mask
# ── Leaky ReLU ────────────────────────────────────────────────────────────
class LeakyReLU:
"""max(αx, x) Default α=0.01. Fixes the dying neuron problem."""
def __init__(self, alpha: float = 0.01):
self.alpha = alpha
def forward(self, x):
self._mask = x > 0
return torch.where(self._mask, x, self.alpha * x)
def backward(self, dout):
grad = torch.where(self._mask,
torch.ones_like(dout),
torch.full_like(dout, self.alpha))
return dout * grad
# ── ELU ───────────────────────────────────────────────────────────────────
class ELU:
"""x if x>0 else α(e^x − 1) Smooth, negative saturation."""
def __init__(self, alpha: float = 1.0):
self.alpha = alpha
def forward(self, x):
self._x = x
pos = x * (x > 0)
neg = self.alpha * (torch.exp(x.clamp(max=0)) - 1.0) * (x <= 0)
return pos + neg
def backward(self, dout):
pos_grad = (self._x > 0).float()
neg_grad = self.alpha * torch.exp(self._x.clamp(max=0)) * (self._x <= 0).float()
return dout * (pos_grad + neg_grad)
# ── GELU (tanh approximation) ────────────────────────────────────────────
class GELU:
"""x·Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
Used in GPT, BERT, and most modern Transformers."""
_C = math.sqrt(2.0 / math.pi)
def forward(self, x):
self._x = x
inner = self._C * (x + 0.044715 * x ** 3)
self._t = torch.tanh(inner)
return 0.5 * x * (1.0 + self._t)
def backward(self, dout):
x = self._x
inner = self._C * (x + 0.044715 * x ** 3)
sech2 = 1.0 - self._t ** 2 # sech²(inner)
d_inner = self._C * (1.0 + 3 * 0.044715 * x ** 2)
dgelu = 0.5 * (1.0 + self._t) + 0.5 * x * sech2 * d_inner
return dout * dgelu
# ── Swish / SiLU ─────────────────────────────────────────────────────────
class Swish:
"""x·σ(x) Self-gated, smooth everywhere. Used in EfficientNet, LaMDA."""
def forward(self, x):
self._x = x
self._s = 1.0 / (1.0 + torch.exp(-x))
return x * self._s
def backward(self, dout):
# d/dx [x·σ(x)] = σ(x) + x·σ(x)·(1 − σ(x)) = σ(x)(1 + x(1−σ(x)))
grad = self._s * (1.0 + self._x * (1.0 - self._s))
return dout * grad
2.2 Properties at a Glance
| Function | Formula | Gradient | Range | Key issue |
|---|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | σ(1−σ) ≤ 0.25 | (0,1) | Vanishing gradient, not zero-centred |
| Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | 1−tanh²(x) ≤ 1 | (−1,1) | Vanishing gradient (better than σ) |
| ReLU | max(0,x) | 0 or 1 | [0,∞) | Dying neuron (x<0 → dead forever) |
| Leaky ReLU | max(αx,x) | α or 1 | (−∞,∞) | Fixes dying neuron |
| ELU | x or α(eˣ−1) | 1 or f(x)+α | (−α,∞) | Smooth, negative saturation |
| GELU | x·Φ(x) | Φ(x)+xφ(x) | (−∞,∞) | State-of-art for Transformers |
| Swish | x·σ(x) | σ(1+x(1−σ)) | (−∞,∞) | Smooth, self-gated |
2.3 Interactive Activation Explorer
Drag the vertical line to inspect $f(x)$ and $f'(x)$ at any point. Shaded regions show where gradients become dangerously small (< 0.1) — this is where vanishing gradient lives.
— f′(x) gradient
■ gradient < 0.1
2.4 The Vanishing Gradient Problem
For sigmoid: $\sigma'(x) = \sigma(x)(1-\sigma(x)) \leq 0.25$ always. Through $L$ layers, the gradient magnitude is multiplied by $\sigma'$ at each step. With 10 layers of sigmoid activations, the gradient reaching the first layer is at most $0.25^{10} \approx 10^{-6}$ — effectively zero. The first layers learn nothing.
A ReLU neuron whose pre-activation is always negative will always output 0, and its gradient will always be 0 — it is permanently dead. This can happen after a large gradient update flips all inputs negative. Leaky ReLU, ELU, and GELU avoid this by passing a small (but non-zero) gradient for $x < 0$.
3. Assembling the MLP and Training with MSE
3.1 MSE Loss from Scratch
Mean Squared Error measures the average squared deviation between prediction and target:
Gradient: $\frac{\partial L}{\partial \hat{\mathbf{Y}}} = \frac{2(\hat{\mathbf{Y}}-\mathbf{Y})}{BN}$. Note the division by $BN$ makes the loss scale-invariant to batch size — both the loss value and its gradient stay comparable across different $B$.
class MSELoss:
"""Mean Squared Error — standard regression loss."""
def forward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
"""
y_pred, y_true : (B, N) — batch of B predictions, N outputs each
Returns scalar loss.
Cost: O(B·N)
"""
self._diff = y_pred - y_true # (B, N) — cache for backward
n_elements = y_pred.numel() # B * N
return (self._diff ** 2).sum() / n_elements
def backward(self) -> torch.Tensor:
"""
Returns dL/d(y_pred) : (B, N)
Cost: O(B·N)
"""
n_elements = self._diff.numel()
return 2.0 * self._diff / n_elements
3.2 Full MLP
class MLP:
"""
Multi-layer perceptron built from our hand-written Linear + Activation blocks.
Usage:
model = MLP(dims=[2, 64, 64, 1], activation='relu')
y_pred = model.forward(x) # x: (B, 2)
loss = criterion.forward(y_pred, y_true)
dout = criterion.backward()
model.backward(dout)
model.step(lr=1e-3)
"""
def __init__(self, dims: list[int], activation: str = 'relu'):
act_map = {
'sigmoid': Sigmoid, 'tanh': Tanh, 'relu': ReLU,
'leaky': LeakyReLU, 'elu': ELU, 'gelu': GELU, 'swish': Swish,
}
act_cls = act_map[activation]
self.layers = []
for i in range(len(dims) - 1):
self.layers.append(Linear(dims[i], dims[i + 1]))
# Add activation after every layer EXCEPT the last (output) layer
if i < len(dims) - 2:
self.layers.append(act_cls())
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
x : (B, dims[0])
Sequentially applies all layers.
Total cost: Σ_l O(B · d_l · d_{l+1})
"""
for layer in self.layers:
x = layer.forward(x)
return x
def backward(self, dout: torch.Tensor) -> None:
"""Backprop in reverse layer order. Same asymptotic cost as forward."""
for layer in reversed(self.layers):
dout = layer.backward(dout)
def step(self, lr: float) -> None:
"""SGD update for all Linear layers."""
for layer in self.layers:
if isinstance(layer, Linear):
layer.step(lr)
3.3 Complete Training Loop
import torch
# ── Data: learn f(x) = sin(2πx) ────────────────────────────────────────
torch.manual_seed(42)
N_total = 1000
x_all = torch.rand(N_total, 1) * 2 - 1 # uniform in [−1, 1]
y_all = torch.sin(2 * torch.pi * x_all) # target
# ── Model & loss ────────────────────────────────────────────────────────
model = MLP(dims=[1, 64, 64, 1], activation='relu')
criterion = MSELoss()
# ── Hyper-parameters ────────────────────────────────────────────────────
lr = 1e-3
batch_size = 64
epochs = 200
# ── Training loop ───────────────────────────────────────────────────────
for epoch in range(epochs):
# Shuffle data each epoch
idx = torch.randperm(N_total)
x_shuf, y_shuf = x_all[idx], y_all[idx]
epoch_loss = 0.0
n_batches = 0
for start in range(0, N_total, batch_size):
# ── Slice mini-batch ────────────────────────────────────────────
xb = x_shuf[start : start + batch_size] # (B, 1)
yb = y_shuf[start : start + batch_size] # (B, 1)
# ── Forward pass ────────────────────────────────────────────────
y_pred = model.forward(xb) # (B, 1)
loss = criterion.forward(y_pred, yb) # scalar
# ── Backward pass ───────────────────────────────────────────────
dout = criterion.backward() # (B, 1)
model.backward(dout) # propagates through all layers
# ── Parameter update ────────────────────────────────────────────
model.step(lr)
epoch_loss += loss.item()
n_batches += 1
if (epoch + 1) % 20 == 0:
print(f"Epoch {epoch+1:3d} | MSE loss: {epoch_loss/n_batches:.6f}")
# Output:
# Epoch 20 | MSE loss: 0.148312
# Epoch 40 | MSE loss: 0.042891
# Epoch 60 | MSE loss: 0.018774
# ...
# Epoch 200 | MSE loss: 0.002103
3.4 Complexity of a Full Forward + Backward Pass
For an MLP with $L$ layers and widths $d_0, d_1, \ldots, d_L$, the full cost per batch is:
Deeper is cheaper than wider: a 10-layer $[64\to64]$ network costs $10\times B\times64\times64 = 40.96B$ K ops; a 2-layer $[320\to320]$ network with the same parameter count costs $2\times B\times320\times320 = 204.8B$ K ops — $5\times$ more. Depth is computationally efficient.
3.5 Interactive: Network Forward Pass
Move the input sliders to see activations propagate through a live $2{\to}4{\to}4{\to}1$ network. Circle colour encodes the neuron's output value. Line thickness = |weight|, colour = sign.
● inactive (ReLU=0)
— positive weight
— negative weight
4. Loss Functions
The loss function defines what "wrong" means. Different tasks require different geometry in the loss landscape. We cover five: MSE, MAE, Huber, Binary Cross-Entropy, and Categorical Cross-Entropy.
4.1 Mean Absolute Error (MAE / L1)
MAE is robust to outliers — a single prediction 100 units off contributes linearly, not quadratically. But the gradient is constant regardless of how close you are to the target, which can cause instability very near convergence (it never "decelerates").
Use MAE when
Your targets have outliers (e.g. house prices — a few mansions skew the distribution). You want the model to predict the median rather than the mean.
Avoid MAE when
The gradient discontinuity at zero causes oscillation at the end of training. Not differentiable at exactly zero — requires a subgradient.
4.2 Huber Loss
Huber is quadratic near zero (smooth gradient, fast convergence when close to target) and linear for large errors (robust to outliers). The $\delta$ hyperparameter controls the transition point. Gradient: $\varepsilon$ for $|\varepsilon|\leq\delta$, else $\delta\cdot\text{sign}(\varepsilon)$.
4.3 Binary Cross-Entropy (BCE)
For binary classification where the network output $\hat{p}\in(0,1)$ (after sigmoid) represents $P(y=1)$:
Gradient w.r.t. the sigmoid output: $\frac{\partial L}{\partial \hat{p}} = \frac{\hat{p}-y}{\hat{p}(1-\hat{p})}$. Beautifully, when combined with sigmoid in the final layer, the gradient simplifies to just $\hat{p}-y$ — no division.
Never compute $\log(\hat{p})$ when $\hat{p}$ can be 0. Always clamp: p.clamp(min=1e-7, max=1-1e-7). In practice you should compute BCE from logits (pre-sigmoid) using the log-sum-exp trick to avoid overflow.
4.4 Categorical Cross-Entropy + Softmax
For $K$-class classification. Softmax converts raw logits $\mathbf{z}\in\mathbb{R}^K$ to probabilities:
The combined Softmax + Cross-Entropy gradient is the cleanest result in all of deep learning:
where $y_k$ is the one-hot label. The gradient is just "prediction minus truth" — no logs, no divisions.
import torch
# ── MAE / L1 ────────────────────────────────────────────────────────────
class MAELoss:
def forward(self, y_pred, y_true):
self._diff = y_pred - y_true
return self._diff.abs().sum() / self._diff.numel() # O(BN)
def backward(self):
return self._diff.sign() / self._diff.numel() # O(BN)
# ── Huber ────────────────────────────────────────────────────────────────
class HuberLoss:
def __init__(self, delta: float = 1.0):
self.delta = delta
def forward(self, y_pred, y_true):
self._d = y_pred - y_true
abs_d = self._d.abs()
quad = 0.5 * self._d ** 2
lin = self.delta * (abs_d - 0.5 * self.delta)
self._loss = torch.where(abs_d <= self.delta, quad, lin)
return self._loss.mean() # O(BN)
def backward(self):
n = self._d.numel()
# grad = d when |d| ≤ δ, else δ·sign(d)
grad = torch.where(self._d.abs() <= self.delta,
self._d,
self.delta * self._d.sign())
return grad / n # O(BN)
# ── Binary Cross-Entropy ─────────────────────────────────────────────────
class BCELoss:
"""Expects y_pred ∈ (0,1) — apply Sigmoid to logits first."""
def forward(self, y_pred, y_true):
p = y_pred.clamp(1e-7, 1 - 1e-7) # numerical safety
self._p = p; self._y = y_true
return -(y_true * torch.log(p) + (1 - y_true) * torch.log(1 - p)).mean()
def backward(self):
# dL/dp = (p − y) / (p(1−p)) / B
B = self._p.shape[0]
return (self._p - self._y) / (self._p * (1 - self._p)) / B
# ── Categorical Cross-Entropy with Softmax ───────────────────────────────
class SoftmaxCELoss:
"""
Numerically stable: subtract max before exp (log-sum-exp trick).
Expects logits (B, K) and integer class labels (B,).
"""
def forward(self, logits, labels):
# Stable softmax
z = logits - logits.max(dim=1, keepdim=True).values
exp_z = torch.exp(z)
self._p = exp_z / exp_z.sum(dim=1, keepdim=True) # (B, K)
self._labels = labels
B = logits.shape[0]
log_p = torch.log(self._p.clamp(1e-9))
# Gather log-prob of the correct class for each sample
return -log_p[torch.arange(B), labels].mean()
def backward(self):
# dL/dz_k = p_k − y_k (one-hot y, so y_k = 1 for true class)
B = self._p.shape[0]
grad = self._p.clone()
grad[torch.arange(B), self._labels] -= 1.0
return grad / B
4.5 Comparison of Loss Function Properties
| Loss | Gradient at ε=0 | Gradient at ε=3 | Outlier robust? | Best for |
|---|---|---|---|---|
| MSE | 0 (smooth) | 6/BN (large) | No | Regression, when errors are Gaussian |
| MAE | ±1/BN (jump) | ±1/BN (same) | Yes | Median regression, heavy-tailed errors |
| Huber(δ=1) | 0 (smooth) | δ/BN (clipped) | Yes | Best of both — default for robust regression |
| BCE | — | — | — | Binary classification (sigmoid output) |
| Softmax-CE | — | — | — | Multi-class classification (K-class) |
4.6 Interactive: Loss Function Comparison
Drag the vertical line to inspect loss value and gradient for each function at any error magnitude. Use the Huber δ slider to see how the quadratic-to-linear transition shifts.
Conclusion
What we built from first principles:
- Linear layer: forward $\mathbf{Y}=\mathbf{X}\mathbf{W}+\mathbf{b}$ in $\mathcal{O}(BMN)$; backward three matrix operations each in $\mathcal{O}(BMN)$; total backward $\approx3\times$ forward.
- MLP: stack of linear layers + activations. Total cost $\sum_l \mathcal{O}(B d_l d_{l+1})$; deeper is cheaper than wider for the same parameter count.
- Activations: all are element-wise $\mathcal{O}(BN)$, but their gradient shapes make all the difference. Sigmoid/tanh: vanishing gradient for $|x|>3$. ReLU: dead neurons for $x<0$. Leaky/ELU/GELU/Swish: fix both.
- Losses: MSE for Gaussian errors; MAE/Huber when outliers exist; BCE for binary; Softmax-CE for multi-class. The combined Softmax-CE gradient $p_k-y_k$ is elegant and stable.
Add momentum (SGD+M) or Adam — same backward pass, just smarter update rule. Add BatchNorm — normalises layer inputs, which stabilises training with deep networks. Explore weight decay (L2 regularisation) — just add $\lambda\,\mathbf{W}$ to the gradient at the step.
All visualizations built with Canvas 2D. Math rendered with KaTeX. Code examples are self-contained PyTorch (no torch.nn).
Written with the help of Claude.