Computer Vision Fundamentals

The Geometry of a Camera

How a camera maps the 3D world onto a 2D pixel grid — step by step, with interactive visualizations for every stage.

The Full Pipeline at a Glance

A 3D world point passes through four coordinate spaces before becoming a pixel on your screen. The first two steps are matrix multiplications, and we can compose them into a single $3 \times 4$ camera matrix $P$; the last step is a divide by depth.

World Space: 3D world point $(X_w,Y_w,Z_w)$
  → Extrinsics $[R\,|\,t]$ →
Camera Space: 3D camera point $(X_c,Y_c,Z_c)$
  → Intrinsics $K$ →
Image Space: homogeneous point $(x',y',w')$
  → Perspective divide ($\div Z_c$) →
Screen Space: 2D pixel $(u,v)$

$$\begin{bmatrix}u \\ v \\ 1\end{bmatrix} \sim P \begin{bmatrix}X_w \\ Y_w \\ Z_w \\ 1\end{bmatrix},\qquad P = K\,[R\,|\,t]$$

The $\sim$ means "equal up to scale" (the homogeneous divide). Let's build each piece from scratch.

Primer: Homogeneous Coordinates

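To rotate and translate a point in one matrix multiply, we need a trick. In 3D Cartesian coordinates, rotation is a matrix multiplication but translation is a vector addition, so the two cannot be merged into a single $3\times3$ matrix product.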

The fix: lift the point into 4D by appending a $1$. Now both rotation and translation fit into a single $3\times4$ multiply.

$$\underbrace{\begin{bmatrix}X_w\\Y_w\\Z_w\\1\end{bmatrix}}_{\text{Homogeneous}} \xrightarrow{\times\,[R|t]} \begin{bmatrix}X_c\\Y_c\\Z_c\end{bmatrix} = R\begin{bmatrix}X_w\\Y_w\\Z_w\end{bmatrix}+t$$

When you convert back, just divide by the last coordinate. Any nonzero scalar multiple $\lambda(x,y,w)^T$ represents the same 2D point $(x/w,\,y/w)$, because $\lambda$ cancels in the division.
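A minimal sketch of both claims, using only the Python standard library. The rotation, translation, and point values are illustrative assumptions, and `matvec` and `dehomogenize` are helper names introduced here, not part of any library.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Illustrative values: a 90-degree rotation about Z and an arbitrary shift.
R = [[0.0, -1.0, 0.0],
     [1.0,  0.0, 0.0],
     [0.0,  0.0, 1.0]]
t = [1.0, 2.0, 3.0]
X_w = [4.0, 5.0, 6.0]

# Cartesian version: rotate, then add the translation (two operations).
two_step = [r + ti for r, ti in zip(matvec(R, X_w), t)]

# Homogeneous version: append 1 and multiply once by the 3x4 matrix [R|t].
Rt = [row + [ti] for row, ti in zip(R, t)]
one_step = matvec(Rt, X_w + [1.0])

def dehomogenize(p):
    """Divide by the last coordinate and drop it."""
    return [c / p[-1] for c in p[:-1]]

# Two scalar multiples of the same homogeneous vector give the same 2D point.
a = dehomogenize([2.0, 4.0, 2.0])
b = dehomogenize([5.0, 10.0, 5.0])

print(two_step, one_step)
print(a, b)
```

Both routes give the same camera-space point, and the two scaled vectors collapse to the same 2D point after the divide.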

0 — World Space: Where Does Everything Live?

Before any camera math, we need a shared stage: the world coordinate system. It is a fixed reference frame defined by a chosen origin point and three perpendicular axes (X, Y, Z). Every object, every camera, and every measurement in your scene is expressed relative to this one frame.

The Origin Is Your Choice

There is no "correct" world origin. It is wherever you decide it should be.

Once chosen, every 3D point $(X_w, Y_w, Z_w)$ is measured in your chosen units (metres, millimetres, etc.) along your chosen axes. Crucially, the origin choice changes the numbers but not the physics: the objects don't move, only their description changes.

Interact: Click a preset below to move the world origin. The boxes stay in place, but their coordinate labels update — because coordinates are always relative to the origin.

World Origin & Frame panel (initial state): origin $O = (0.0,\ 0.0,\ 0.0)$; axes X → East / Right, Y → Up, Z → South / Forward; the object coordinates are listed in this frame.

Point Coordinates Change, Distances Don't

Moving the origin changes $(X_w, Y_w, Z_w)$ values, but it never changes the distance between two objects, or the angle between two directions. Those are intrinsic geometric properties. This is why you can freely pick any convenient origin — it only affects bookkeeping, not geometry.

$$\text{If origin moves by } \Delta O,\quad X_w^{\text{new}} = X_w^{\text{old}} - \Delta O_x \;\text{(and similarly Y, Z)}$$
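The shift rule and the invariance of distances can be checked numerically. This is a small sketch with made-up box positions and an arbitrary origin shift; `shift_origin` is a hypothetical helper name.

```python
import math

def shift_origin(point, delta_o):
    """Re-express a point after the origin moves by delta_o (new = old - delta)."""
    return tuple(p - d for p, d in zip(point, delta_o))

# Illustrative object positions in the old frame, and an arbitrary origin shift.
box_a = (1.0, 0.0, 2.0)
box_b = (4.0, 4.0, 2.0)
delta_o = (10.0, -3.0, 5.0)

new_a = shift_origin(box_a, delta_o)
new_b = shift_origin(box_b, delta_o)

# The coordinates change, but the distance between the boxes does not.
dist_old = math.dist(box_a, box_b)
dist_new = math.dist(new_a, new_b)

print(new_a, new_b)
print(dist_old, dist_new)
```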

Axis Conventions Vary by Field

The orientation of the axes is also a convention. Different communities disagree — which causes bugs whenever you mix tools. The key rule: always check which convention a library uses before plugging in numbers.

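| System | X | Y | Z | Camera looks toward |
|---|---|---|---|---|
| OpenCV (this blog) | Right | Down ↓ | Into scene | +Z |
| OpenGL / WebGL / Three.js | Right | Up ↑ | Out of screen | −Z |
| ROS / Robotics | Forward | Left | Up ↑ | +X |
| Unreal Engine | Forward | Right | Up ↑ | +X |
| Unity 3D | Right | Up ↑ | Into scene | +Z |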
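As one concrete case of mixing tools, going from the OpenCV camera convention to the OpenGL one listed above amounts to negating Y and Z. The sketch below assumes both frames share the same origin and X axis; `opencv_to_opengl` is a name made up for this example.

```python
def opencv_to_opengl(p_cam):
    """Convert a camera-space point from OpenCV axes (X right, Y down, Z into
    the scene) to OpenGL axes (X right, Y up, Z out of the screen)."""
    x, y, z = p_cam
    return (x, -y, -z)

# A point 8 units in front of an OpenCV camera, 1 unit below the optical axis.
p_opencv = (0.5, 1.0, 8.0)
p_opengl = opencv_to_opengl(p_opencv)

# The same flip converts back, because negating twice is the identity.
round_trip = opencv_to_opengl(p_opengl)

print(p_opengl)
print(round_trip)
```

In the OpenGL frame the point sits at negative Z, in line with a camera that looks toward −Z.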

From World Space to Camera Space

A camera is just another object living at some position $\mathbf{C} = (C_x, C_y, C_z)$ in world space, pointing in some direction. The job of the extrinsic matrix is to express every world point relative to the camera instead of relative to the world origin. After this transformation, the camera sits at the origin and looks along +Z — which is the coordinate frame the intrinsic matrix $K$ expects.

$$\underbrace{P_{\text{camera}}}_{\text{point relative to camera}} = R\,\underbrace{P_{\text{world}}}_{\text{point in world}} + \underbrace{t}_{t = -R\,C}$$

$R$ rotates world axes to align with camera axes. $t$ shifts the origin from the world origin to the camera position. Together, $[R\,|\,t]$ is a rigid-body transformation — it preserves distances and angles.
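A short sketch of the relation $t = -R\,C$: once $t$ is built this way, the camera centre maps to the camera-space origin, and the position can be recovered as $C = -R^T t$. The rotation angle and camera position are illustrative assumptions, and the helper names are invented for this example.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Illustrative world-to-camera rotation: 30 degrees about the Y axis.
theta = math.radians(30.0)
c, s = math.cos(theta), math.sin(theta)
R = [[c, 0.0, s],
     [0.0, 1.0, 0.0],
     [-s, 0.0, c]]

C = [0.0, 2.0, 4.0]                 # camera position in world space (example)
t = [-x for x in matvec(R, C)]      # t = -R C

def world_to_camera(p_world):
    """P_camera = R * P_world + t."""
    return [r + ti for r, ti in zip(matvec(R, p_world), t)]

center_in_cam = world_to_camera(C)                    # should be the origin
C_recovered = [-x for x in matvec(transpose(R), t)]   # C = -R^T t

print(center_in_cam)
print(C_recovered)
```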

1 — Extrinsics: World → Camera Space

The extrinsic matrix $[R\,|\,t]$ answers: where is the camera sitting in the world, and which way is it pointing?

$$\begin{bmatrix}X_c\\Y_c\\Z_c\end{bmatrix} = \underbrace{\begin{bmatrix}r_{11}&r_{12}&r_{13}&t_x\\r_{21}&r_{22}&r_{23}&t_y\\r_{31}&r_{32}&r_{33}&t_z\end{bmatrix}}_{[R\,|\,t]}\begin{bmatrix}X_w\\Y_w\\Z_w\\1\end{bmatrix}$$

Interact: Move the camera around the world scene below. The colored boxes are fixed reference objects. Watch how the rotation matrix $R$ changes as you turn the camera, and how $t$ tracks its position. The small inset shows what the camera actually sees.

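Camera POV panel (initial state): camera position in world $\mathbf{C} = (0.00,\ 2.00,\ 4.00)$; rotation (camera axes → world) is the $3\times3$ identity, whose columns are the camera X (red), Y (green), and Z (blue) axes expressed in world coordinates.

Note: the panel reports the camera pose, which is the inverse of the extrinsics. Its rotation readout is $R^T$, and its translation readout (labelled $t_x, t_y, t_z$) is the camera position $\mathbf{C}$; the extrinsic translation itself is $t = -R\,\mathbf{C}$.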

Key Insight: $R$ Is Orthogonal

Because $R$ is a pure rotation, its columns are mutually perpendicular unit vectors. This means $R^{-1} = R^T$ — you can undo any rotation just by transposing. The determinant of $R$ is always $+1$ (no reflection, no scaling).
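These properties are easy to verify numerically. The sketch below checks $R^T R = I$ and $\det R = +1$ for one example rotation (40 degrees about Z, an arbitrary choice); the helpers are written out by hand so the block needs no external libraries.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

theta = math.radians(40.0)
c, s = math.cos(theta), math.sin(theta)
R = [[c, -s, 0.0],
     [s,  c, 0.0],
     [0.0, 0.0, 1.0]]

# R^T R should be the identity, which is the same as saying R^-1 = R^T.
RtR = matmul(transpose(R), R)
is_orthogonal = all(
    math.isclose(RtR[i][j], 1.0 if i == j else 0.0, abs_tol=1e-12)
    for i in range(3) for j in range(3)
)
determinant = det3(R)

print(is_orthogonal, determinant)
```

A matrix that passes the orthogonality check but has determinant −1 would contain a reflection, which is a common symptom of a mixed-up axis convention.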

2 — Intrinsics: Camera Space → Image Plane

The intrinsic matrix $K$ encodes the physical properties of the camera's sensor and lens. It maps the 3D camera-space point $(X_c, Y_c, Z_c)$ to a 2D image point.

$$\begin{bmatrix}x'\\y'\\w'\end{bmatrix} = \underbrace{\begin{bmatrix}f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1\end{bmatrix}}_{K} \begin{bmatrix}X_c\\Y_c\\Z_c\end{bmatrix}$$

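The four parameters that matter most are the focal lengths $f_x$ and $f_y$ (in pixels) and the principal point $(c_x, c_y)$; the skew $\gamma$ is zero or very close to it for most cameras.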

After multiplying, we get $w' = Z_c$ (the depth). Dividing $x'$ and $y'$ by $w'$ gives the final pixel.

$$u = \frac{x'}{w'} = f_x\frac{X_c}{Z_c} + c_x, \qquad v = \frac{y'}{w'} = f_y\frac{Y_c}{Z_c} + c_y$$
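The two formulas can be traced with the numbers from this section's interactive panel ($f_x = f_y = 500$, principal point $(320, 240)$, camera-space point $(0, 1.5, 8)$). This is a plain-Python sketch; `matvec` is a helper defined here.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
p_cam = [0.0, 1.5, 8.0]        # (X_c, Y_c, Z_c)

# Multiply by K to get the homogeneous image point (x', y', w').
x_h, y_h, w_h = matvec(K, p_cam)

# w' equals the depth Z_c; dividing by it gives the pixel coordinates.
u = x_h / w_h
v = y_h / w_h

print(w_h)
print(u, v)
```

The point lands on the vertical centre line ($u = c_x$) because $X_c = 0$, and below the principal point because Y points down in this convention.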

Interact: The left panel shows the 3D pinhole geometry — drag to orbit. The right panel shows the image plane: where does your chosen 3D point land in pixels? Adjust the K matrix parameters and watch the dot move.

Pinhole Geometry panel (camera space), initial state: object at $(X_c, Y_c, Z_c) = (0.0,\ 1.5,\ 8.0)$, giving $X_c/Z_c = 0.000$ and $Y_c/Z_c = 0.188$.

Image Plane panel ($640\times480$ sensor), with the live matrix

$$K = \begin{bmatrix}500 & 0 & 320\\ 0 & 500 & 240\\ 0 & 0 & 1\end{bmatrix}$$

$u = f_x \cdot (X_c/Z_c) + c_x = 320$, and $v = f_y \cdot (Y_c/Z_c) + c_y \approx 334$.

Why does $f_x$ behave like zoom?

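$f_x$ multiplies the ratio $X_c/Z_c$, which is the tangent of the ray's horizontal angle from the optical axis. Doubling $f_x$ doubles every point's horizontal offset from the principal point, and so doubles the width of every object in the image, exactly like zooming in. This is why telephoto lenses (long focal length) make distant objects appear large.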

3 — Perspective Divide: Why Far Objects Look Small

After applying $K$, we have the homogeneous coordinate $(x', y', w')$. Converting to actual pixels requires dividing by $w'$. Since the bottom row of $K$ is $[0\;0\;1]$, we always get $w' = Z_c$ — the depth. So:

$$u = \frac{x'}{Z_c},\quad v = \frac{y'}{Z_c}$$

Dividing by $Z_c$ is what makes this perspective projection. The deeper an object (larger $Z_c$), the smaller it appears in the image. This is the mathematical reason for the basic visual phenomenon: objects shrink with distance.

Proof via Similar Triangles

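Consider a side view of the pinhole camera. A point at height $Y_c$ and depth $Z_c$ casts a ray through the pinhole. By similar triangles, the image height $y'$ on an image plane at distance $f$ satisfies the relation below. (Here $y'$ is a metric height on the image plane, not the homogeneous coordinate from the previous sections.)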

$$\frac{y'}{f} = \frac{Y_c}{Z_c} \quad\Longrightarrow\quad y' = f \cdot \frac{Y_c}{Z_c}$$

Interact: Adjust the object height, depth, and focal length. Watch the two similar triangles (red = real, cyan = image) scale together, always maintaining the ratio $Y_c/Z_c = y'/f$.

Example readout: $y' = f \times Y_c / Z_c = 3.0 \times 3.0 / 8.0 = 1.125$, with ratio $Y_c/Z_c = 0.375$.

Connecting $f$ to Field of View

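Focal length and field of view (FoV) move in opposite directions: the longer the focal length, the narrower the view. Given an image width $W$ and focal length $f_x$, both in pixels, the full horizontal field of view (twice the half-angle) is: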

$$\text{FoV}_h = 2\arctan\!\left(\frac{W/2}{f_x}\right)$$

A small $f_x$ → wide angle. A large $f_x$ → narrow telephoto. This is why wide-angle lenses seem to "fit more in the frame."
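A quick sketch of the formula and its inverse, assuming $W$ and $f_x$ are both in pixels. The 640-pixel width and $f_x = 500$ are the example values used elsewhere in this post; the function names are invented here.

```python
import math

def horizontal_fov_deg(width_px, fx_px):
    """Full horizontal field of view in degrees: 2 * atan((W/2) / f_x)."""
    return math.degrees(2.0 * math.atan((width_px / 2.0) / fx_px))

def fx_from_fov(width_px, fov_deg):
    """Inverse relation: the focal length (pixels) that gives a requested FoV."""
    return (width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

fov = horizontal_fov_deg(640, 500.0)        # roughly 65 degrees
fov_wide = horizontal_fov_deg(640, 250.0)   # halving f_x widens the view
fx_back = fx_from_fov(640, fov)             # recovers the original f_x

print(fov, fov_wide, fx_back)
```

Note that halving $f_x$ does not double the angle, because the relation goes through an arctangent.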

4 — Full Pipeline: End-to-End Calculator

Let's trace a real point through every step. Adjust the 3D world point and camera setup; each matrix multiplication is shown with the actual computed numbers.

Live Pipeline Calculator: $(X_w,Y_w,Z_w,1)^T \xrightarrow{K[R|t]} (u,v)$

Step 1 — Apply Extrinsics [R|t]

...
Result: Camera coords ...

Step 2 — Apply Intrinsics K

...
Raw: Homogeneous $(x', y', w')$ ...

Step 3 — Perspective Divide (÷$Z_c$)

$u = x' / Z_c = $ ... = ... px
$v = y' / Z_c = $ ... = ... px
Image plane: 640×480 px
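The three steps can also be traced offline. The sketch below uses hypothetical inputs (identity rotation, a camera 5 units behind the world origin, and the example $K$ with $f = 500$ and principal point $(320, 240)$) and checks that the step-by-step route agrees with the composed matrix $P = K\,[R\,|\,t]$.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
C = [0.0, 0.0, -5.0]                       # camera position in world (example)
t = [-x for x in matvec(R, C)]             # t = -R C
Rt = [row + [ti] for row, ti in zip(R, t)] # 3x4 extrinsic matrix [R|t]

X_w = [1.0, 0.5, 5.0]                      # world point (example)

# Step 1: extrinsics, world -> camera space.
p_cam = matvec(Rt, X_w + [1.0])
# Step 2: intrinsics, camera space -> homogeneous image point.
x_h, y_h, w_h = matvec(K, p_cam)
# Step 3: perspective divide by w' = Z_c.
uv_steps = (x_h / w_h, y_h / w_h)

# Same result from the single composed 3x4 camera matrix P = K [R|t].
P = matmul(K, Rt)
xp, yp, wp = matvec(P, X_w + [1.0])
uv_composed = (xp / wp, yp / wp)

print(p_cam)
print(uv_steps, uv_composed)
```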

5 — Lens Distortion (Real Cameras)

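The K matrix assumes a perfect pinhole, a mathematical idealization. Real lenses introduce distortion, which makes straight lines curve in the image. The two dominant types are radial distortion (coefficients $k_1, k_2$; barrel or pincushion bending that grows with distance from the image centre) and tangential distortion (coefficients $p_1, p_2$; caused by the lens and sensor not being perfectly parallel):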

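$$r^2 = x_n^2 + y_n^2,\qquad x_{\text{dist}} = x_n(1 + k_1 r^2 + k_2 r^4) + 2p_1 x_n y_n + p_2(r^2 + 2x_n^2)$$

$$y_{\text{dist}} = y_n(1 + k_1 r^2 + k_2 r^4) + p_1(r^2 + 2y_n^2) + 2p_2 x_n y_n$$

Here $(x_n, y_n) = (X_c/Z_c,\ Y_c/Z_c)$ are the normalized image coordinates, before $K$ is applied.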

Camera calibration (e.g., using a chessboard with OpenCV's calibrateCamera) estimates $K$ and the distortion coefficients simultaneously.
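To make the distortion model concrete, here is a sketch that applies it to a normalized point $(x_n, y_n)$. The $x$ component is the equation above; the $y$ component follows the same Brown-Conrady form that OpenCV uses. The coefficient values are invented for illustration, and `distort` is not an OpenCV function.

```python
def distort(x_n, y_n, k1, k2, p1, p2):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to a point
    given in normalized image coordinates (x_n, y_n) = (X_c/Z_c, Y_c/Z_c)."""
    r2 = x_n * x_n + y_n * y_n
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    x_d = x_n * radial + 2.0 * p1 * x_n * y_n + p2 * (r2 + 2.0 * x_n * x_n)
    y_d = y_n * radial + p1 * (r2 + 2.0 * y_n * y_n) + 2.0 * p2 * x_n * y_n
    return x_d, y_d

# With all coefficients at zero the model reduces to the ideal pinhole.
ideal = distort(0.5, 0.25, 0.0, 0.0, 0.0, 0.0)

# A negative k1 (illustrative value) pulls points toward the centre: barrel distortion.
barrel = distort(0.5, 0.0, -0.2, 0.0, 0.0, 0.0)

print(ideal)
print(barrel)
```

The distorted normalized point is then mapped to pixels with $K$ as usual: $u = f_x\,x_{\text{dist}} + c_x$, $v = f_y\,y_{\text{dist}} + c_y$.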

6 — From Photons to Pixels: ISO & Noise

Once the geometry is done, the physical photons hitting each pixel are converted to a digital value. The ISO setting acts as a gain applied to the photoelectron signal. Amplifying the signal also amplifies sensor noise, which is why high-ISO photos look grainy.

$$\text{Pixel Value} = \text{clamp}\!\left(\text{ISO}_{\text{gain}} \times (\text{Photons} + \mathcal{N}(0,\sigma)),\;0,\;255\right)$$

Interact: Raise ISO and watch the scene brighten — then become noisy. At ISO 25600 the noise dominates.

Digital Signal panel (initial state, dark scene): base photons 15, ISO 100, gain $1.0\times$, noise $\sigma = 0$, brightness 15/255.
7 — Do Real Cameras Run This Math?

Here is the answer most people get wrong: no. The matrix multiplications $[R\,|\,t]$ and $K$ are not computed inside the camera hardware. The perspective projection happens entirely through physics — light and optics do it automatically. The math is a model that lets software reason about what the physics did.

The Physical Pipeline Inside a Camera

Click each stage below for details. Notice where the camera geometry math maps onto the physics, and where actual computation begins.

① Scene — Light Travelling in Straight Lines

Photons bounce off every surface and travel outward in straight lines. This straight-line travel is the physical mechanism behind perspective projection: an object twice as far away subtends roughly half the angle, so it appears half as large. No computation needed: the geometry is baked into the physics of light.

This is what $[R\,|\,t]$ and $K$ model mathematically: they describe the same geometry that photons obey naturally.

② Aperture & Lens — Optical Projection

The aperture (f-number) controls how much light enters. The lens bends incoming rays so that every photon from a single scene point converges to a single point on the sensor. This convergence is the physical implementation of focal length $f_x, f_y$.

A longer focal-length lens bends light at a shallower angle → smaller field of view → apparent zoom. A wider lens bends more aggressively → larger FoV → distortion at the edges (barrel/pincushion, modelled by $k_1, k_2$). Still no computation — this is pure optics.

③ CMOS/CCD Sensor — Photons → Electrons

Each pixel is a photodiode that accumulates electrons proportional to the number of photons hitting it during the exposure. Modern sensors use a Bayer colour filter array: a repeating RGGB mosaic so each physical pixel captures only one colour channel. A 12 MP sensor captures ~6 M green, ~3 M red, and ~3 M blue measurements — not 12 M full-colour pixels.
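The channel counts follow from the 2×2 RGGB tile. A small sketch that counts channels over a patch of the mosaic and scales the fractions to 12 MP; the patch size is arbitrary (any even dimensions give the same fractions) and the function names are made up for this example.

```python
def bayer_channel(row, col):
    """Colour captured at (row, col) in an RGGB mosaic:
    even rows alternate R, G and odd rows alternate G, B."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"

def channel_fractions(height, width):
    """Fraction of photosites assigned to each colour channel."""
    counts = {"R": 0, "G": 0, "B": 0}
    for row in range(height):
        for col in range(width):
            counts[bayer_channel(row, col)] += 1
    total = height * width
    return {channel: n / total for channel, n in counts.items()}

fractions = channel_fractions(6, 8)  # small patch with even dimensions

# Scale the fractions to a 12 MP sensor.
megapixels = 12
per_channel_mp = {channel: f * megapixels for channel, f in fractions.items()}

print(fractions)
print(per_channel_mp)
```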

The ISO setting sets the gain applied to this signal before the next stage, amplifying both signal and noise.

④ ADC — Analog → Digital (RAW)

An Analog-to-Digital Converter reads the voltage on each photodiode and converts it to a binary integer. Consumer cameras use 10–14 bits per channel, giving 1024–16384 discrete brightness levels. The resulting grid of single-channel values is what a RAW file stores: a direct readout of the sensor, before any image processing.
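The level counts are simply $2^{\text{bits}}$. The sketch below also shows an idealized quantizer; the reference voltage `v_ref` and the function names are assumptions for illustration, and real ADCs add offsets and non-linearities that are ignored here.

```python
def adc_levels(bits):
    """Number of discrete output codes for an ADC with the given bit depth."""
    return 2 ** bits

def quantize(voltage, v_ref, bits):
    """Idealized ADC: map a voltage in [0, v_ref] to an integer code in
    [0, 2**bits - 1], clamping anything outside the input range."""
    levels = adc_levels(bits)
    clamped = max(0.0, min(v_ref, voltage))
    return min(levels - 1, int(clamped / v_ref * levels))

print(adc_levels(10), adc_levels(14))
print(quantize(0.5, 1.0, 10))   # mid-scale input on a 10-bit converter
print(quantize(1.0, 1.0, 12))   # full-scale input saturates at the top code
```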

⑤ ISP — Image Signal Processor

The ISP is a dedicated chip (or SoC block) that runs image algorithms in real time at 30–120 fps. This is where actual computation first happens inside the camera, in steps such as demosaicing, denoising, and compression.

⑥ Output — JPEG / RAW / Video Stream

The processed image exits as a JPEG, HEIF, RAW file, or a live MIPI/USB stream to a host computer. At this point the data is a 2D array of RGB values. All 3D depth information has been permanently lost in the projection — this is why recovering 3D structure from images (SfM, stereo, depth estimation) is a hard inverse problem.
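The loss of depth is easy to demonstrate: any two camera-space points on the same ray through the pinhole land on the same pixel. A sketch with the example $K$ from earlier ($f = 500$, principal point $(320, 240)$) and two made-up points, one twice as far away as the other.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def project(K, p_cam):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x_h, y_h, w_h = matvec(K, p_cam)
    return (x_h / w_h, y_h / w_h)

K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]

near = [1.0, 0.5, 4.0]
far = [2.0, 1.0, 8.0]    # same direction from the camera, twice the depth

pixel_near = project(K, near)
pixel_far = project(K, far)

print(pixel_near, pixel_far)
```

From the pixel alone there is no way to tell which of the two points was imaged, which is why a second view or a prior about the scene is needed to recover depth.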

On smartphones, a final NPU (Neural Processing Unit) pass may run computational photography: Night Sight, Super Res Zoom, computational bokeh, or scene-adaptive sharpening — these are matrix-heavy neural network operations on 2D images.

So where does $P = K\,[R\,|\,t]$ actually run?

In software, not in the camera. The projection matrix is used whenever a program needs to reason about the relationship between 3D geometry and 2D images.

Real-Time Performance

Inside the Camera (ISP)

Dedicated fixed-function hardware. Processes 50 MP at 30 fps (1.5 Gpixels/s). No general-purpose CPU involved in demosaic/denoise/compress. Uses 100–500 mW.

In a Vision System (GPU)

An NVIDIA Jetson can run full camera projection + neural detection on 4 cameras at 30 fps. A data-centre GPU processes batch camera projections for 3D reconstruction at thousands of fps equivalent.

Smartphone Computational Photo

Apple A17 Pro NPU: ~35 TOPS for neural-network-based denoising, super-resolution, and bokeh synthesis. Runs in parallel with the ISP on each captured frame.

Camera Calibration Workflow

Capture 20–50 images of a chessboard → OpenCV detects corners → non-linear optimisation estimates $K$, $k_1$–$k_3$, $p_1$–$p_2$ → reprojection error typically < 0.5 px RMS.
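The quality metric in this workflow, RMS reprojection error, can be sketched on its own: project the known board corners with the estimated parameters and compare against the detected corners. The corner coordinates below are invented, and the function is a hypothetical helper using one common definition of the metric, not OpenCV's implementation.

```python
import math

def rms_reprojection_error(observed, projected):
    """Root-mean-square pixel distance between detected corners and the
    corners predicted by the calibrated camera model."""
    squared = [
        (u_o - u_p) ** 2 + (v_o - v_p) ** 2
        for (u_o, v_o), (u_p, v_p) in zip(observed, projected)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Made-up detections and model predictions, in pixels.
observed = [(100.0, 100.0), (200.0, 150.0)]
projected = [(100.3, 100.4), (199.7, 149.6)]

error = rms_reprojection_error(observed, projected)
print(error)
```

Each predicted corner here is 0.5 px away from its detection, so the RMS error is 0.5 px, right at the quality threshold mentioned above.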