Computer Vision Fundamentals
The Geometry of a Camera
How a camera maps the 3D world onto a 2D pixel grid — step by step, with interactive visualizations for every stage.
The Full Pipeline at a Glance
A 3D world point travels through four transformations before becoming a pixel on your screen. Each stage is handled by a matrix multiplication — and we can compose them all into a single $3 \times 4$ camera matrix $P$.
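In the notation built up over the rest of this post (with $K$ the intrinsic matrix and $[R\,|\,t]$ the extrinsic matrix, both introduced below), that composition can be sketched as:

$$
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim P \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}, \qquad P = K\,[R\,|\,t]
$$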
The $\sim$ means "equal up to scale" (the homogeneous divide). Let's build each piece from scratch.
Primer: Homogeneous Coordinates
To rotate and translate a point in one matrix multiply, we need a trick. In 3D Cartesian coordinates:
- Rotation: multiply by a $3\times3$ matrix $R$.
- Translation: add a vector $t$. You cannot embed addition into a matrix multiply in 3D.
The fix: lift the point into 4D by appending a $1$. Now both rotation and translation fit into a single $3\times4$ multiply.
To convert back, divide by the last coordinate and drop it. For any nonzero scalar $\lambda$, the multiple $\lambda(x,y,w)^T$ represents the same 2D point $(x/w,\,y/w)$, provided $w \neq 0$.
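As a minimal sketch of the idea, the plain-Python snippet below lifts a point by appending a 1, applies a $3\times4$ matrix in one multiply, and shows that scaled homogeneous vectors map back to the same 2D point. The helper names (`to_homogeneous`, `from_homogeneous`, `matvec`) and the numbers are illustrative assumptions, not part of any library.

```python
def to_homogeneous(p):
    """Append a 1 to a Cartesian point given as a sequence of floats."""
    return list(p) + [1.0]

def from_homogeneous(h):
    """Divide by the last coordinate, then drop it."""
    w = h[-1]
    if w == 0:
        raise ValueError("a point with w = 0 has no Cartesian equivalent")
    return [c / w for c in h[:-1]]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# A 3x4 matrix [R | t] with identity rotation and translation (1, 2, 3).
Rt = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
]
moved = matvec(Rt, to_homogeneous([10.0, 20.0, 30.0]))

# Two homogeneous vectors that differ only by a scale factor of 3.
a = from_homogeneous([2.0, 4.0, 2.0])
b = from_homogeneous([6.0, 12.0, 6.0])
```

Here the translation happens inside the multiply, and `a` and `b` come out as the same 2D point.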
0 — World Space: Where Does Everything Live?
Before any camera math, we need a shared stage: the world coordinate system. It is a fixed reference frame defined by a chosen origin point and three perpendicular axes (X, Y, Z). Every object, every camera, and every measurement in your scene is expressed relative to this one frame.
The Origin Is Your Choice
There is no "correct" world origin. It is wherever you decide it should be:
- A robotics lab might set the origin at the robot's starting position.
- A construction site uses a surveyed ground-control point.
- A film studio might use the center of the stage.
- A self-driving car stack resets the origin to the car's GPS position at boot time.
Once chosen, every 3D point $(X_w, Y_w, Z_w)$ is measured in your chosen units (meters, millimetres, etc.) along your chosen axes. Crucially, the origin choice changes the numbers but not the physics — the objects don't move, only how we describe them changes.
Interact: Click a preset below to move the world origin. The boxes stay in place, but their coordinate labels update — because coordinates are always relative to the origin.
Axis legend (interactive view): Y is up; Z is south / forward.
Point Coordinates Change, Distances Don't
Moving the origin changes $(X_w, Y_w, Z_w)$ values, but it never changes the distance between two objects, or the angle between two directions. Those are intrinsic geometric properties. This is why you can freely pick any convenient origin — it only affects bookkeeping, not geometry.
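A quick way to convince yourself is to shift the origin in code and compare distances. This is a small sketch with made-up box positions; `shift_origin` is a hypothetical helper that re-expresses a point relative to a new origin.

```python
import math

def shift_origin(point, new_origin):
    """Express `point` relative to `new_origin` (same axes, same units)."""
    return tuple(p - o for p, o in zip(point, new_origin))

# Two hypothetical boxes, in meters, relative to the old origin.
box_a = (2.0, 0.0, 5.0)
box_b = (6.0, 3.0, 5.0)

new_origin = (10.0, -4.0, 1.5)
box_a_shifted = shift_origin(box_a, new_origin)
box_b_shifted = shift_origin(box_b, new_origin)

distance_before = math.dist(box_a, box_b)
distance_after = math.dist(box_a_shifted, box_b_shifted)
```

The coordinate tuples change, but `distance_before` and `distance_after` agree.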
Axis Conventions Vary by Field
The orientation of the axes is also a convention. Different communities disagree — which causes bugs whenever you mix tools. The key rule: always check which convention a library uses before plugging in numbers.
| System | X | Y | Z | Camera looks toward |
|---|---|---|---|---|
| OpenCV (this blog) | Right | Down ↓ | Into scene | +Z |
| OpenGL / WebGL / Three.js | Right | Up ↑ | Out of screen | −Z |
| ROS / Robotics | Forward | Left | Up ↑ | +X |
| Unreal Engine | Forward | Right | Up ↑ | +X |
| Unity 3D | Right | Up ↑ | Into scene | +Z |
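As one example of bridging two rows of the table, a point expressed in an OpenCV-style camera frame can be re-expressed in an OpenGL-style camera frame by flipping the Y and Z axes. The function name below is an assumption for illustration, and the sketch covers only camera-frame points, not full pose or projection matrices.

```python
def opencv_to_opengl_camera(point):
    """Convert a camera-frame point from (X right, Y down, Z forward)
    to (X right, Y up, Z backward) by negating Y and Z."""
    x, y, z = point
    return (x, -y, -z)

# A point 5 units in front of the camera and 1 unit below the optical axis.
p_opencv = (0.0, 1.0, 5.0)
p_opengl = opencv_to_opengl_camera(p_opencv)

# The flip is its own inverse, so applying it twice returns the original.
round_trip = opencv_to_opengl_camera(p_opengl)
```

In the OpenGL-style frame the same point sits at negative Y (below) and negative Z (in front of the camera).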
From World Space to Camera Space
A camera is just another object living at some position $\mathbf{C} = (C_x, C_y, C_z)$ in world space, pointing in some direction. The job of the extrinsic matrix is to express every world point relative to the camera instead of relative to the world origin. After this transformation, the camera sits at the origin and looks along +Z — which is the coordinate frame the intrinsic matrix $K$ expects.
$R$ rotates world axes to align with camera axes. $t$ shifts the origin from the world origin to the camera position. Together, $[R\,|\,t]$ is a rigid-body transformation — it preserves distances and angles.
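In symbols, using the same $R$ and $t$, the world-to-camera transformation can be written as:

$$
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + t = [R\,|\,t] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
$$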
1 — Extrinsics: World → Camera Space
The extrinsic matrix $[R\,|\,t]$ answers: where is the camera sitting in the world, and which way is it pointing?
- $R$: a $3\times3$ rotation matrix describing the camera's orientation. Because $R$ maps world coordinates to camera coordinates, its rows are the camera's local X, Y, Z axes expressed in world coordinates (its columns are the world axes expressed in camera coordinates).
- $t$: a $3\times1$ translation vector. In the OpenCV convention, $t = -R\,C$ where $C$ is the camera's world-space position. It is not the camera position itself.
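The relationship $t = -R\,C$ is easy to check numerically. The sketch below uses a hypothetical world-to-camera rotation about the Y axis and a made-up camera position; the helper names are illustrative only.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rotation_y(theta):
    """Rotation matrix about the Y axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def world_to_camera(R, t, Xw):
    """Compute X_c = R X_w + t."""
    return [r + ti for r, ti in zip(matvec(R, Xw), t)]

R = rotation_y(math.radians(30))     # hypothetical world-to-camera rotation
C = [1.0, 2.0, -5.0]                 # hypothetical camera position (world)
t = [-c for c in matvec(R, C)]       # t = -R C

# The camera's own position must land on the camera-space origin.
camera_in_camera_space = world_to_camera(R, t, C)

# The world origin lands at t, at the same distance the camera is from it.
world_origin_in_camera_space = world_to_camera(R, t, [0.0, 0.0, 0.0])
```

Note that `t` differs from `C` as soon as the camera is rotated, which is the usual source of confusion.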
Interact: Move the camera around the world scene below. The colored boxes are fixed reference objects. Watch how the rotation matrix $R$ changes as you turn the camera, and how $t$ tracks its position. The small inset shows what the camera actually sees.
Key Insight: $R$ Is Orthogonal
Because $R$ is a pure rotation, its columns are mutually perpendicular unit vectors. This means $R^{-1} = R^T$ — you can undo any rotation just by transposing. The determinant of $R$ is always $+1$ (no reflection, no scaling).
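Both properties can be verified in a few lines. This is a sketch in plain Python with an arbitrary rotation about the Z axis; the helpers are written out by hand so the snippet has no dependencies.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def rotation_z(theta):
    """Rotation matrix about the Z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

R = rotation_z(math.radians(40))          # arbitrary example angle
should_be_identity = matmul(transpose(R), R)
determinant = det3(R)
```

Up to floating-point rounding, `should_be_identity` is the identity matrix and `determinant` is 1.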
2 — Intrinsics: Camera Space → Image Plane
The intrinsic matrix $K$ encodes the physical properties of the camera's sensor and lens. It maps the 3D camera-space point $(X_c, Y_c, Z_c)$ to a 2D image point.
The five parameters that matter most:
- $f_x, f_y$: focal lengths in pixels. Larger values mean more zoom. For square pixels, $f_x = f_y$.
- $c_x, c_y$: principal point, where the optical axis meets the sensor; usually near the image center $(W/2, H/2)$.
- $\gamma$: skew. Nonzero only when the pixel axes are not perpendicular; effectively zero for modern cameras.
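Arranged in the usual upper-triangular layout (assumed here, consistent with the OpenCV convention this post follows), the multiplication reads:

$$
\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}, \qquad K = \begin{bmatrix} f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
$$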
After multiplying, we get $w' = Z_c$ (the depth). Dividing $x'$ and $y'$ by $w'$ gives the final pixel.
Interact: The left panel shows the 3D pinhole geometry — drag to orbit. The right panel shows the image plane: where does your chosen 3D point land in pixels? Adjust the K matrix parameters and watch the dot move.
Why does $f_x$ behave like zoom?
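$f_x$ multiplies the ratio $X_c/Z_c$, which is the tangent of the ray's horizontal angle from the optical axis. Doubling $f_x$ doubles each point's horizontal offset from the principal point, so every object appears twice as wide, exactly like zooming in. This is why telephoto lenses (long focal length) make distant objects appear large.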
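A small numeric check makes the zoom effect concrete. The sketch below assumes zero skew and uses made-up values for the focal lengths, principal point, and object size; `project_u` is a hypothetical helper.

```python
def project_u(fx, cx, Xc, Zc):
    """Horizontal pixel coordinate of a camera-space point (zero skew)."""
    return fx * Xc / Zc + cx

cx = 320.0  # hypothetical principal point, in pixels

# An object 1 unit wide, centered on the optical axis, at depth 4.
width_px = project_u(500.0, cx, 0.5, 4.0) - project_u(500.0, cx, -0.5, 4.0)

# Same object, same depth, focal length doubled.
width_px_zoomed = project_u(1000.0, cx, 0.5, 4.0) - project_u(1000.0, cx, -0.5, 4.0)
```

With these numbers the object spans 125 px at $f_x = 500$ and 250 px at $f_x = 1000$.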
3 — Perspective Divide: Why Far Objects Look Small
After applying $K$, we have the homogeneous coordinate $(x', y', w')$. Converting to actual pixels requires dividing by $w'$. Since the bottom row of $K$ is $[0\;0\;1]$, we always get $w' = Z_c$ — the depth. So:
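Carrying out the multiplication with the parameters $f_x, f_y, c_x, c_y, \gamma$ defined earlier and dividing by $w' = Z_c$, the pixel coordinates come out as:

$$
u = \frac{x'}{w'} = \frac{f_x X_c + \gamma\, Y_c}{Z_c} + c_x, \qquad v = \frac{y'}{w'} = \frac{f_y Y_c}{Z_c} + c_y
$$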
Dividing by $Z_c$ is what makes this perspective projection. The deeper an object (larger $Z_c$), the smaller it appears in the image. This is the mathematical reason for the basic visual phenomenon: objects shrink with distance.
Proof via Similar Triangles
Consider a side view of the pinhole camera. A point at height $Y_c$ and depth $Z_c$ casts a ray through the pinhole. By similar triangles, the image height $y'$ satisfies:
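With $f$ denoting the distance from the pinhole to the image plane, the similar-triangles relation is:

$$
\frac{y'}{f} = \frac{Y_c}{Z_c} \quad\Longrightarrow\quad y' = f\,\frac{Y_c}{Z_c}
$$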
Interact: Adjust the object height, depth, and focal length. Watch the two similar triangles (red = real, cyan = image) scale together, always maintaining the ratio $Y_c/Z_c = y'/f$.
Connecting $f$ to Field of View
Focal length and field of view (FoV) are inversely related. Given a sensor half-width $W/2$ and focal length $f$, the horizontal half-angle is:
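Writing the half-angle as $\theta_h/2$ (a symbol introduced here for illustration), with $W$ and $f$ in the same units, the relation is:

$$
\frac{\theta_h}{2} = \arctan\!\left(\frac{W/2}{f}\right) \quad\Longrightarrow\quad \theta_h = 2\arctan\!\left(\frac{W}{2f}\right)
$$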
A small $f_x$ → wide angle. A large $f_x$ → narrow telephoto. This is why wide-angle lenses seem to "fit more in the frame."
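The relationship is simple to evaluate. The sketch below assumes the image width and $f_x$ are both given in pixels; the function name and the example values are illustrative.

```python
import math

def horizontal_fov_deg(image_width_px, fx_px):
    """Full horizontal field of view, in degrees, for a pinhole camera."""
    return math.degrees(2.0 * math.atan(image_width_px / (2.0 * fx_px)))

# Hypothetical 1920-pixel-wide image with a short and a long focal length.
wide_fov = horizontal_fov_deg(1920, 960.0)    # f_x equals half the width
tele_fov = horizontal_fov_deg(1920, 4000.0)
```

When $f_x$ equals half the image width the field of view is 90°, and it narrows as $f_x$ grows.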
4 — Full Pipeline: End-to-End Calculator
Let's trace a real point through every step. Adjust the 3D world point and camera setup; each matrix multiplication is shown with the actual computed numbers.
Step 1 — Apply Extrinsics [R|t]
Step 2 — Apply Intrinsics K
Step 3 — Perspective Divide (÷$Z_c$)
$v = y' / Z_c = $ ... = ... px
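The three steps can also be traced offline with fixed numbers. This is a minimal sketch, not the calculator's actual code: the intrinsics, camera pose, and world point are made-up values, skew is zero, and lens distortion is ignored.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def project(K, R, t, Xw):
    """Project a world point to pixel coordinates (u, v)."""
    # Step 1: extrinsics, X_c = R X_w + t
    Xc = [r + ti for r, ti in zip(matvec(R, Xw), t)]
    if Xc[2] <= 0:
        raise ValueError("point is at or behind the camera plane")
    # Step 2: intrinsics, (x', y', w') = K X_c
    x_p, y_p, w_p = matvec(K, Xc)
    # Step 3: perspective divide by w' = Z_c
    return x_p / w_p, y_p / w_p

K = [[800.0, 0.0, 320.0],
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]                # camera axes aligned with world axes
C = [0.0, 0.0, -4.0]                 # camera 4 units behind the world origin
t = [-c for c in matvec(R, C)]       # t = -R C

u, v = project(K, R, t, [1.0, 0.5, 0.0])
```

For this setup the camera-space point is $(1, 0.5, 4)$, so $u = 800 \cdot 1/4 + 320 = 520$ px and $v = 800 \cdot 0.5/4 + 240 = 340$ px.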
5 — Lens Distortion (Real Cameras)
The K matrix assumes a perfect pinhole — a mathematical idealization. Real lenses introduce distortion, which makes straight lines curve in the image. The two dominant types:
- Radial distortion $(k_1, k_2, k_3)$: barrel or pincushion warping. A wide-angle lens bows lines outward (barrel distortion, $k_1 < 0$).
- Tangential distortion $(p_1, p_2)$: the lens is not perfectly parallel to the sensor.
Camera calibration (e.g., using a chessboard with OpenCV's calibrateCamera) estimates $K$ and the distortion coefficients simultaneously.
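To see what the coefficients do, here is a sketch of the polynomial radial-plus-tangential model in the form OpenCV documents for its basic five-coefficient case. It operates on normalized camera coordinates $x = X_c/Z_c$, $y = Y_c/Z_c$; the function name and the coefficient values are illustrative assumptions.

```python
def distort(x, y, k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Apply radial (k1, k2, k3) and tangential (p1, p2) distortion
    to a point in normalized camera coordinates."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d

# With all coefficients zero the model reduces to the ideal pinhole.
ideal = distort(0.5, 0.5)

# A negative k1 pulls off-axis points toward the center (barrel distortion).
barrel = distort(0.5, 0.5, k1=-0.2)

# The optical axis itself is never moved by this model.
center = distort(0.0, 0.0, k1=-0.2, p1=0.01, p2=0.01)
```

The distorted coordinates are then mapped to pixels with $K$ as before.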
6 — From Photons to Pixels: ISO & Noise
Once the geometry is done, the physical photons hitting each pixel are converted to a digital value. The ISO setting is a gain applied to the signal read from each pixel (the photoelectron count), typically as analog amplification before digitization. Amplifying the signal also amplifies the noise that comes with it, which is why high-ISO photos look grainy.
Interact: Raise ISO and watch the scene brighten — then become noisy. At ISO 25600 the noise dominates.
7 — Do Real Cameras Run This Math?
Here is the answer most people get wrong: no. The matrix multiplications $[R\,|\,t]$ and $K$ are not computed inside the camera hardware. The perspective projection happens entirely through physics — light and optics do it automatically. The math is a model that lets software reason about what the physics did.
The Physical Pipeline Inside a Camera
Click each stage below for details. Notice where the camera geometry math maps onto the physics, and where actual computation begins.
① Scene — Light Travelling in Straight Lines
Photons bounce off every surface and travel outward in straight lines. This straight-line travel is the physical mechanism behind perspective projection: an object twice as far away subtends roughly half the angle, so it appears about half as large. No computation is needed; the geometry is baked into the physics of light.
This is what $[R\,|\,t]$ and $K$ model mathematically: they describe the same geometry that photons obey naturally.
② Aperture & Lens — Optical Projection
The aperture (f-number) controls how much light enters. The lens bends incoming rays so that the light from a single in-focus scene point converges to a single point on the sensor. This convergence is the physical implementation of focal length $f_x, f_y$.
A longer focal-length lens bends light at a shallower angle → smaller field of view → apparent zoom. A wider lens bends more aggressively → larger FoV → distortion at the edges (barrel/pincushion, modelled by $k_1, k_2$). Still no computation — this is pure optics.
③ CMOS/CCD Sensor — Photons → Electrons
Each pixel is a photodiode that accumulates electrons proportional to the number of photons hitting it during the exposure. Modern sensors use a Bayer colour filter array: a repeating RGGB mosaic so each physical pixel captures only one colour channel. A 12 MP sensor captures ~6 M green, ~3 M red, and ~3 M blue measurements — not 12 M full-colour pixels.
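The per-channel counts follow directly from the mosaic layout. The sketch below counts photosites for an RGGB pattern on a hypothetical 4000 × 3000 (12 MP) sensor; `bayer_counts` is an illustrative helper.

```python
def bayer_counts(width, height):
    """Count red, green, and blue photosites in an RGGB Bayer mosaic."""
    pattern = (("R", "G"), ("G", "B"))   # the repeating 2x2 tile
    counts = {"R": 0, "G": 0, "B": 0}
    for row in range(2):
        for col in range(2):
            # Number of sensor rows/columns that share this tile position.
            rows = (height - row + 1) // 2
            cols = (width - col + 1) // 2
            counts[pattern[row][col]] += rows * cols
    return counts

counts = bayer_counts(4000, 3000)
```

Half the photosites are green and a quarter each are red and blue, which is why demosaicing has to interpolate two of the three channels at every pixel.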
The ISO setting amplifies this signal before the next stage, boosting both signal and noise.
④ ADC — Analog → Digital (RAW)
An Analog-to-Digital Converter reads the amplified voltage from each photodiode and converts it to a binary integer. Consumer cameras use 10–14 bits per channel, giving 1024–16384 discrete brightness levels. This grid of single-channel values is what a RAW file stores: a near-direct readout of the sensor, before demosaicing or other image processing.
⑤ ISP — Image Signal Processor
The ISP is a dedicated chip (or SoC block) that runs image algorithms in real time at 30–120 fps. This is where actual computation first happens inside the camera:
- Demosaicing: interpolates the Bayer RGGB grid into full-colour RGB for every pixel (bilinear, AHD, or neural).
- Noise reduction: bilateral filtering, BM3D, or neural denoising.
- White balance: multiplies R, G, B channels by three calibrated scalars — a $3\times3$ diagonal matrix multiply per pixel.
- Sharpening & tone mapping: unsharp masking, HDR→SDR curve.
- Lens distortion correction: a per-pixel lookup table or polynomial warp using $k_1, k_2$ — this is the one place $K$-related math actually runs in hardware.
- Compression: JPEG (block DCT), HEIF (typically HEVC intra-frame coding), H.264/H.265 for video.
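Of the stages listed, white balance is the simplest to write down. The sketch below shows that per-channel gains and a $3\times3$ diagonal matrix multiply give the same result; the gain values and the pixel are made up, and real ISPs derive the gains from calibration or scene statistics.

```python
def white_balance(pixel, gains):
    """Scale each of the R, G, B values by its own gain."""
    return tuple(g * c for g, c in zip(gains, pixel))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return tuple(sum(m * x for m, x in zip(row, v)) for row in M)

gains = (2.0, 1.0, 1.5)            # hypothetical R, G, B gains
pixel = (0.25, 0.5, 0.5)           # linear RGB, before white balance

balanced = white_balance(pixel, gains)

# The same operation expressed as a diagonal matrix multiply.
D = [[gains[0], 0.0, 0.0],
     [0.0, gains[1], 0.0],
     [0.0, 0.0, gains[2]]]
balanced_via_matrix = matvec(D, pixel)
```

Both forms produce the same balanced pixel; hardware typically uses the three-multiply form.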
⑥ Output — JPEG / RAW / Video Stream
The processed image exits as a JPEG, HEIF, RAW file, or a live MIPI/USB stream to a host computer. At this point the data is a 2D array of RGB values. All 3D depth information has been permanently lost in the projection — this is why recovering 3D structure from images (SfM, stereo, depth estimation) is a hard inverse problem.
On smartphones, a final NPU (Neural Processing Unit) pass may run computational photography: Night Sight, Super Res Zoom, computational bokeh, or scene-adaptive sharpening — these are matrix-heavy neural network operations on 2D images.
The projection math runs in software, not in the camera. The projection matrix $P$ is used whenever a program needs to reason about the relationship between 3D geometry and 2D images:
- Computer Vision (OpenCV, etc.): predicts where a known 3D world point will appear in the image, $(u, v, 1)^T \sim K[R|t]\,(X_w, Y_w, Z_w, 1)^T$, or triangulates 3D points from multiple images (the inverse problem).
- SLAM / 3D Reconstruction: estimates $R, t$ for each camera frame and recovers 3D structure simultaneously, re-estimating $P$ at every frame.
- Augmented Reality: renders virtual 3D objects at the correct 2D screen position using $P$, in real time at 60–90 fps.
- Autonomous Vehicles: dedicated GPU/FPGA accelerators compute $K[R|t]$ for LiDAR-camera fusion, object detection, and occupancy grids at 100+ fps.
- Camera Calibration: OpenCV's `calibrateCamera` solves for $K$, $k_1, k_2$, and $[R|t]$ by minimising reprojection error on a chessboard; this is an optimisation over $P$, not a single matrix multiply.
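The quantity being minimised in calibration can be sketched in a few lines. The function below computes an RMS reprojection error from detected and predicted pixel positions; the point values are made up, and the exact normalisation a given library reports may differ from this one.

```python
import math

def rms_reprojection_error(observed, predicted):
    """Root-mean-square pixel distance between detected corners and
    the positions the camera model predicts for them."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty point lists of equal length")
    squared = [
        (uo - up) ** 2 + (vo - vp) ** 2
        for (uo, vo), (up, vp) in zip(observed, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical detected corners and their model predictions, in pixels.
observed = [(100.0, 200.0), (150.0, 250.0)]
predicted = [(100.3, 200.4), (150.0, 250.0)]

error_px = rms_reprojection_error(observed, predicted)
```

Here one corner is off by 0.5 px and the other is exact, giving an RMS error of about 0.35 px. A calibration routine adjusts the camera parameters until this number stops decreasing.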
Real-Time Performance
Inside the Camera (ISP)
Dedicated fixed-function hardware. Processes 50 MP at 30 fps (1.5 Gpixels/s). No general-purpose CPU involved in demosaic/denoise/compress. Uses 100–500 mW.
In a Vision System (GPU)
An NVIDIA Jetson can run full camera projection + neural detection on 4 cameras at 30 fps. A data-centre GPU processes batch camera projections for 3D reconstruction at thousands of fps equivalent.
Smartphone Computational Photo
Apple A17 Pro NPU: ~35 TOPS for neural-network-based denoising, super-resolution, and bokeh synthesis. Runs in parallel with the ISP on each captured frame.
Camera Calibration Workflow
Capture 20–50 images of a chessboard → OpenCV detects corners → non-linear optimisation estimates $K$, $k_1$–$k_3$, $p_1$–$p_2$ → reprojection error typically < 0.5 px RMS.