Computer Vision Fundamentals

The Geometry of a Camera

How a camera maps the 3D world onto a 2D pixel grid — step by step, with interactive visualizations for every stage.

The Full Pipeline at a Glance

A 3D world point passes through three transformations before becoming a pixel on your screen: extrinsics, intrinsics, and the perspective divide. The first two are matrix multiplications, and we can compose them into a single $3 \times 4$ camera matrix $P$. The divide is the one non-linear step.

3D world point $(X_w,Y_w,Z_w)$ (world space) → extrinsics $[R|t]$ → 3D camera point $(X_c,Y_c,Z_c)$ (camera space) → intrinsics → homogeneous $(x',y',w')$ (image space) → perspective divide ($\div Z_c$) → 2D pixel $(u,v)$ (screen space)

$$\begin{bmatrix}u \\ v \\ 1\end{bmatrix} \sim P \begin{bmatrix}X_w \\ Y_w \\ Z_w \\ 1\end{bmatrix},\qquad P = K\,[R\,|\,t]$$

The $\sim$ means "equal up to scale" (the homogeneous divide). Let's build each piece from scratch.

Primer: Homogeneous Coordinates

To rotate and translate a point in one matrix multiply, we need a trick. In 3D Cartesian coordinates:

Rotation: multiply by a $3\times3$ matrix $R$.
Translation: add a vector $t$. You cannot embed addition into a matrix multiply in 3D.

The fix: lift the point into 4D by appending a $1$. Now both rotation and translation fit into a single $3\times4$ multiply.

$$\underbrace{\begin{bmatrix}X_w\\Y_w\\Z_w\\1\end{bmatrix}}_{\text{Homogeneous}} \xrightarrow{\times\,[R|t]} \begin{bmatrix}X_c\\Y_c\\Z_c\end{bmatrix} = R\begin{bmatrix}X_w\\Y_w\\Z_w\end{bmatrix}+t$$

When you convert back, just divide by the last coordinate. Any nonzero scalar multiple $\lambda(x,y,w)^T$ represents the same 2D point $(x/w,\,y/w)$, provided $w \neq 0$.
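As a minimal sketch of the round trip, here it is in plain Python. The helper names `to_homogeneous` and `from_homogeneous` are illustrative, not from any library.

```python
def to_homogeneous(point):
    """Append a 1 to a Cartesian point: (x, y) -> (x, y, 1)."""
    return tuple(point) + (1.0,)

def from_homogeneous(h):
    """Divide by the last coordinate and drop it: (x, y, w) -> (x/w, y/w)."""
    *coords, w = h
    if w == 0:
        raise ValueError("w = 0 is a point at infinity; no Cartesian equivalent")
    return tuple(c / w for c in coords)

p = to_homogeneous((3.0, 4.0))        # (3.0, 4.0, 1.0)
scaled = tuple(2.5 * c for c in p)    # (7.5, 10.0, 2.5): a different triple
same_point = from_homogeneous(scaled) # same 2D point after the divide
```

Scaling the homogeneous triple changes all three numbers, yet the divide recovers the original point. That is the sense in which $\sim$ means "equal up to scale".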

0 — World Space: Where Does Everything Live?

Before any camera math, we need a shared stage: the world coordinate system. It is a fixed reference frame defined by a chosen origin point and three perpendicular axes (X, Y, Z). Every object, every camera, and every measurement in your scene is expressed relative to this one frame.

The Origin Is Your Choice

There is no "correct" world origin. It is wherever you decide it should be:

A robotics lab might set the origin at the robot's starting position.
A construction site uses a surveyed ground-control point.
A film studio might use the center of the stage.
A self-driving car stack resets the origin to the car's GPS position at boot time.

Once chosen, every 3D point $(X_w, Y_w, Z_w)$ is measured in your chosen units (meters, millimetres, etc.) along your chosen axes. Crucially, the origin choice changes the numbers but not the physics — the objects don't move, only how we describe them changes.

Interact: Click a preset below to move the world origin. The boxes stay in place, but their coordinate labels update — because coordinates are always relative to the origin.

World Origin & Frame

Origin $O$ = (0.0, 0.0, 0.0)

Axes: X → East / Right, Y → Up, Z → South / Forward

(The panel also lists each object's coordinates in this frame.)

Point Coordinates Change, Distances Don't

Moving the origin changes $(X_w, Y_w, Z_w)$ values, but it never changes the distance between two objects, or the angle between two directions. Those are intrinsic geometric properties. This is why you can freely pick any convenient origin — it only affects bookkeeping, not geometry.

$$\text{If origin moves by } \Delta O,\quad X_w^{\text{new}} = X_w^{\text{old}} - \Delta O_x \;\text{(and similarly Y, Z)}$$
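A quick numerical check of this claim, as a sketch: the two box positions and the origin shift below are made-up values, and `shift_origin` is an illustrative helper.

```python
import math

def shift_origin(point, delta):
    """Coordinates of `point` after the origin moves by `delta`."""
    return tuple(p - d for p, d in zip(point, delta))

# Hypothetical object positions and origin move (not from the article).
box_a = (1.0, 0.0, 2.0)
box_b = (4.0, 0.0, 6.0)
delta_o = (10.0, -3.0, 0.5)

a_new = shift_origin(box_a, delta_o)
b_new = shift_origin(box_b, delta_o)

dist_old = math.dist(box_a, box_b)  # distance in the old frame
dist_new = math.dist(a_new, b_new)  # distance in the new frame
```

Every coordinate changes, but the separation between the two boxes stays at 5 units in both frames.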

Axis Conventions Vary by Field

The orientation of the axes is also a convention. Different communities disagree — which causes bugs whenever you mix tools. The key rule: always check which convention a library uses before plugging in numbers.

System	X	Y	Z	Camera looks toward
OpenCV (this blog)	Right	Down ↓	Into scene	+Z
OpenGL / WebGL / Three.js	Right	Up ↑	Out of screen	−Z
ROS / Robotics	Forward	Left	Up ↑	+X
Unreal Engine	Forward	Right	Up ↑	+X
Unity 3D	Right	Up ↑	Into scene	+Z
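To illustrate the kind of conversion this table implies, here is a sketch for the first two rows only. Both frames are right-handed and share the X axis, so going from OpenCV camera coordinates to OpenGL camera coordinates flips Y and Z. The function names are illustrative.

```python
def opencv_to_opengl(p):
    """Camera-space point: OpenCV axes (x right, y down, z into scene)
    -> OpenGL axes (x right, y up, z out of screen)."""
    x, y, z = p
    return (x, -y, -z)

def opengl_to_opencv(p):
    """The flip is its own inverse, so the same sign changes convert back."""
    x, y, z = p
    return (x, -y, -z)

# A point 8 units in front of an OpenCV camera, above the image centre
# (negative y, because OpenCV's y axis points down).
p_cv = (0.5, -1.0, 8.0)
p_gl = opencv_to_opengl(p_cv)   # in front of an OpenGL camera means negative z
```

Conversions involving the left-handed engines (Unreal, Unity) also change handedness and need more care than a pair of sign flips.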

From World Space to Camera Space

A camera is just another object living at some position $\mathbf{C} = (C_x, C_y, C_z)$ in world space, pointing in some direction. The job of the extrinsic matrix is to express every world point relative to the camera instead of relative to the world origin. After this transformation, the camera sits at the origin and looks along +Z — which is the coordinate frame the intrinsic matrix $K$ expects.

$$\underbrace{P_{\text{camera}}}_{\text{point relative to camera}} = R\,\underbrace{P_{\text{world}}}_{\text{point in world}} + \underbrace{t}_{t = -R\,C}$$

$R$ rotates world axes to align with camera axes. $t$ shifts the origin from the world origin to the camera position. Together, $[R\,|\,t]$ is a rigid-body transformation — it preserves distances and angles.

1 — Extrinsics: World → Camera Space

The extrinsic matrix $[R\,|\,t]$ answers: where is the camera sitting in the world, and which way is it pointing?

$R$: a $3\times3$ rotation matrix describing the camera's orientation. Because it maps world coordinates to camera coordinates, its rows are the camera's local X, Y, Z axes expressed in world coordinates (equivalently, these are the columns of $R^T$).
$t$: a $3\times1$ translation vector. In the OpenCV convention, $t = -R\,C$ where $C$ is the camera's world-space position. It is not the camera position directly; it is the world origin expressed in camera coordinates.

$$\begin{bmatrix}X_c\\Y_c\\Z_c\end{bmatrix} = \underbrace{\begin{bmatrix}r_{11}&r_{12}&r_{13}&t_x\\r_{21}&r_{22}&r_{23}&t_y\\r_{31}&r_{32}&r_{33}&t_z\end{bmatrix}}_{[R\,|\,t]}\begin{bmatrix}X_w\\Y_w\\Z_w\\1\end{bmatrix}$$
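The relationship $t = -R\,C$ is easy to get wrong, so here is a small sketch. It assumes an unrotated camera ($R = I$) at position $(0, 2, 4)$; the helper names are illustrative.

```python
def mat_vec(m, v):
    """Multiply a 3x3 nested-list matrix by a length-3 vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def translation_from_centre(R, C):
    """Extrinsic translation t = -R C for a camera centred at C."""
    return [-x for x in mat_vec(R, C)]

def world_to_camera(R, t, p_world):
    """Apply [R | t]: X_c = R X_w + t."""
    return [r + ti for r, ti in zip(mat_vec(R, p_world), t)]

R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]          # assumed: camera axes aligned with world axes
C = [0.0, 2.0, 4.0]            # assumed camera position in world space
t = translation_from_centre(R, C)

camera_in_own_frame = world_to_camera(R, t, C)        # the camera centre itself
point_in_camera = world_to_camera(R, t, [1.0, 2.0, 10.0])
```

The camera centre maps to the origin of camera space, which is a handy sanity check for any extrinsic matrix: if $R\,C + t$ is not zero, $t$ was built incorrectly.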

Interact: Move the camera around the world scene below. The colored boxes are fixed reference objects. Watch how the rotation matrix $R$ changes as you turn the camera, and how $t$ changes as you move it (recall $t = -R\,C$, so it is derived from the position rather than equal to it). The small inset shows what the camera actually sees.

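Camera POV

Panel defaults: camera position in world (0.00, 2.00, 4.00), which the panel labels $t_x$, $t_y$, $t_z$; yaw (turn left/right) 0°, pitch (tilt up/down) 0°, roll (rotate) 0°. Note that these three values are the position $C$, not the extrinsic $t = -R\,C$.

Rotation matrix shown (camera axes → world):

$$\begin{bmatrix}1.00&0.00&0.00\\0.00&1.00&0.00\\0.00&0.00&1.00\end{bmatrix}$$

Its columns are the camera X (red), Y (green), Z (blue) axes in world coordinates, so the panel displays the transpose of the extrinsic $R$. At the default orientation both are the identity.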

Key Insight: $R$ Is Orthogonal

Because $R$ is a pure rotation, its columns are mutually perpendicular unit vectors. This means $R^{-1} = R^T$ — you can undo any rotation just by transposing. The determinant of $R$ is always $+1$ (no reflection, no scaling).
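Both properties can be checked numerically. The sketch below builds a yaw rotation about the Y axis (the sign convention is an assumption) and verifies $R^T R = I$ and $\det R = 1$. The helpers are illustrative.

```python
import math

def yaw_matrix(deg):
    """Rotation by `deg` degrees about the Y axis (assumed sign convention)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_mul(a, b):
    """Product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

R = yaw_matrix(30.0)
RtR = mat_mul(transpose(R), R)   # should be the identity, up to rounding
det_R = det3(R)                  # should be +1, up to rounding
```

In practice this is a useful assertion to run on any $R$ that comes out of an optimiser or a file, since accumulated rounding can make it drift away from a true rotation.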

2 — Intrinsics: Camera Space → Image Plane

The intrinsic matrix $K$ encodes the physical properties of the camera's sensor and lens. It maps the 3D camera-space point $(X_c, Y_c, Z_c)$ to a 2D image point.

$$\begin{bmatrix}x'\\y'\\w'\end{bmatrix} = \underbrace{\begin{bmatrix}f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1\end{bmatrix}}_{K} \begin{bmatrix}X_c\\Y_c\\Z_c\end{bmatrix}$$

The five parameters:

$f_x, f_y$: focal lengths in pixels. Higher means more zoom. For square pixels, $f_x = f_y$.
$c_x, c_y$: principal point, where the optical axis hits the sensor, usually near the image center $(W/2, H/2)$.
$\gamma$: skew. Nonzero only when the pixel axes are not perpendicular; effectively zero for modern cameras.

After multiplying, we get $w' = Z_c$ (the depth). Dividing $x'$ and $y'$ by $w'$ gives the final pixel (taking $\gamma = 0$):

$$u = \frac{x'}{w'} = f_x\frac{X_c}{Z_c} + c_x, \qquad v = \frac{y'}{w'} = f_y\frac{Y_c}{Z_c} + c_y$$
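The same computation as a short sketch, written row by row so each line corresponds to a row of $K$. The function name `project` is illustrative; the numbers are the defaults from the interactive panel in this section.

```python
def project(p_cam, fx, fy, cx, cy, skew=0.0):
    """Apply K to a camera-space point, then divide by w' = Z_c.
    Returns the pixel (u, v)."""
    X, Y, Z = p_cam
    if Z <= 0:
        raise ValueError("point is at or behind the camera (Z_c <= 0)")
    x_h = fx * X + skew * Y + cx * Z   # first row of K
    y_h = fy * Y + cy * Z              # second row of K
    w_h = Z                            # third row of K: [0 0 1]
    return (x_h / w_h, y_h / w_h)

u, v = project((0.0, 1.5, 8.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Here `u` is 320.0 and `v` is 333.75, which the panel rounds to 334. The guard on `Z` matters in real code: points behind the camera still produce finite pixel values after the divide, and they are meaningless.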

Interact: The left panel shows the 3D pinhole geometry — drag to orbit. The right panel shows the image plane: where does your chosen 3D point land in pixels? Adjust the K matrix parameters and watch the dot move.

Pinhole Geometry (Camera Space): object $X_c$ (left/right) = 0.0, $Y_c$ (up/down) = 1.5, $Z_c$ (depth, farther = smaller) = 8.0, giving $X_c/Z_c$ = 0.000 and $Y_c/Z_c$ = 0.188.

Image Plane (640×480 sensor): drag the 2D dot to move the 3D point.

K Matrix (Live), with $f_x$ = 500, $f_y$ = 500, $c_x$ = 320, $c_y$ = 240:

u = fx·(Xc/Zc) + cx = 320
v = fy·(Yc/Zc) + cy = 334

Why does $f_x$ behave like zoom?

$f_x$ multiplies the ratio $X_c/Z_c$, which is the tangent of the horizontal viewing angle. Doubling $f_x$ doubles every point's offset from the principal point, so every object's image doubles in size, exactly like zooming in. This is why telephoto lenses (long focal length) make distant objects appear large.

3 — Perspective Divide: Why Far Objects Look Small

After applying $K$, we have the homogeneous coordinate $(x', y', w')$. Converting to actual pixels requires dividing by $w'$. Since the bottom row of $K$ is $[0\;0\;1]$, we always get $w' = Z_c$ — the depth. So:

$$u = \frac{x'}{Z_c},\quad v = \frac{y'}{Z_c}$$

Dividing by $Z_c$ is what makes this perspective projection. The deeper an object (larger $Z_c$), the smaller it appears in the image. This is the mathematical reason for the basic visual phenomenon: objects shrink with distance.

Proof via Similar Triangles

Consider a side view of the pinhole camera. A point at height $Y_c$ and depth $Z_c$ casts a ray through the pinhole. By similar triangles, the image height $y'$ (measured on the image plane in the same units as $f$, not the homogeneous $y'$ used earlier) satisfies:

$$\frac{y'}{f} = \frac{Y_c}{Z_c} \quad\Longrightarrow\quad y' = f \cdot \frac{Y_c}{Z_c}$$

Interact: Adjust the object height, depth, and focal length. Watch the two similar triangles (red = real, cyan = image) scale together, always maintaining the ratio $Y_c/Z_c = y'/f$.

Panel defaults: object height $Y_c$ = 3.0, object depth $Z_c$ (distance) = 8.0, focal length $f$ = 3.0, so $y' = f \times Y_c/Z_c = 3.0 \times 3.0 / 8.0 = 1.125$ (ratio $Y_c/Z_c$ = 0.375).

Connecting $f$ to Field of View

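Focal length and field of view (FoV) are inversely related. Given an image width $W$ and a focal length $f_x$ in the same units (pixels here), the half-angle is $\arctan\!\big((W/2)/f_x\big)$, and the full horizontal field of view is twice that: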

$$\text{FoV}_h = 2\arctan\!\left(\frac{W/2}{f_x}\right)$$

A small $f_x$ → wide angle. A large $f_x$ → narrow telephoto. This is why wide-angle lenses seem to "fit more in the frame."
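A sketch of the conversion in both directions, assuming the image width and $f_x$ are both in pixels. The function names are illustrative.

```python
import math

def horizontal_fov_deg(image_width_px, fx_px):
    """Full horizontal field of view in degrees."""
    return math.degrees(2.0 * math.atan((image_width_px / 2.0) / fx_px))

def fx_from_fov(image_width_px, fov_deg):
    """Inverse: the focal length (in pixels) that yields a given horizontal FoV."""
    return (image_width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

fov = horizontal_fov_deg(640, 500)        # roughly 65 degrees
fov_wide = horizontal_fov_deg(640, 320)   # f_x equal to half the width: 90 degrees
fx_back = fx_from_fov(640, fov)           # recovers 500
```

The inverse form is often the more practical one: datasheets and game engines tend to specify FoV, while $K$ needs $f_x$.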

4 — Full Pipeline: End-to-End Calculator

Let's trace a real point through every step. Adjust the 3D world point and camera setup; each matrix multiplication is shown with the actual computed numbers.

Live Pipeline Calculator: $(X_w,Y_w,Z_w,1)^T \xrightarrow{K[R|t]} (u,v)$

Inputs: world $X_w$ = 2.0, $Y_w$ = 1.5, $Z_w$ = -5.0; camera $f_x = f_y$ = 600; camera yaw = 0°.

Step 1 — Apply Extrinsics [R|t]

...

Result: Camera coords ...

Step 2 — Apply Intrinsics K

...

Raw: Homogeneous $(x', y', w')$ ...

Step 3 — Perspective Divide (÷$Z_c$)

          $u = x' / Z_c = $ ... = ... px

          $v = y' / Z_c = $ ... = ... px

640×480
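The whole chain can also be collapsed into the single matrix $P = K\,[R\,|\,t]$. The sketch below uses the calculator's default world point and focal length. The camera position $(0, 0, -15)$, the identity rotation, and the principal point $(320, 240)$ are assumptions chosen for illustration, since the calculator's own camera pose is not shown here.

```python
def mat_mul(a, b):
    """Product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def camera_matrix(K, R, C):
    """P = K [R | t] with t = -R C, returned as a 3x4 nested list."""
    t = [-sum(r * c for r, c in zip(row, C)) for row in R]
    Rt = [row + [ti] for row, ti in zip(R, t)]
    return mat_mul(K, Rt)

def project_world(P, p_world):
    """Multiply P by the homogeneous world point, then do the perspective divide."""
    p_h = list(p_world) + [1.0]
    x_h, y_h, w_h = [sum(p * x for p, x in zip(row, p_h)) for row in P]
    return (x_h / w_h, y_h / w_h)

K = [[600.0, 0.0, 320.0],
     [0.0, 600.0, 240.0],
     [0.0, 0.0, 1.0]]           # f = 600 from the calculator; (cx, cy) assumed
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]           # yaw 0 degrees
C = [0.0, 0.0, -15.0]           # assumed camera position, not from the article

P = camera_matrix(K, R, C)
u, v = project_world(P, (2.0, 1.5, -5.0))
```

With these assumptions the point sits at depth $Z_c = 10$, so $u = 600 \cdot 2/10 + 320 = 440$ and $v = 600 \cdot 1.5/10 + 240 = 330$, matching what the three separate steps would give.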

5 — Lens Distortion (Real Cameras)

The K matrix assumes a perfect pinhole — a mathematical idealization. Real lenses introduce distortion, which makes straight lines curve in the image. The two dominant types:

Radial distortion $(k_1, k_2, k_3)$: barrel or pincushion warping. A wide-angle lens bows lines outward (barrel distortion, $k_1 < 0$).
Tangential distortion $(p_1, p_2)$: the lens is not perfectly parallel to the sensor.

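With normalized camera coordinates $x_n = X_c/Z_c$, $y_n = Y_c/Z_c$ and $r^2 = x_n^2 + y_n^2$:

$$\begin{aligned}x_{\text{dist}} &= x_n(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2p_1 x_n y_n + p_2(r^2 + 2x_n^2)\\ y_{\text{dist}} &= y_n(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1(r^2 + 2y_n^2) + 2p_2 x_n y_n\end{aligned}$$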
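A sketch of the distortion model as code. It follows the form documented for OpenCV, including the $k_3 r^6$ radial term and the matching equation for $y$. The inputs are normalized coordinates ($X_c/Z_c$, $Y_c/Z_c$), and the function name `distort` is illustrative.

```python
def distort(x_n, y_n, k1=0.0, k2=0.0, p1=0.0, p2=0.0, k3=0.0):
    """Apply radial (k1, k2, k3) and tangential (p1, p2) distortion
    to a normalized image point. Returns the distorted normalized point."""
    r2 = x_n * x_n + y_n * y_n
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_d = x_n * radial + 2.0 * p1 * x_n * y_n + p2 * (r2 + 2.0 * x_n * x_n)
    y_d = y_n * radial + p1 * (r2 + 2.0 * y_n * y_n) + 2.0 * p2 * x_n * y_n
    return x_d, y_d

# Barrel distortion (k1 < 0) pulls an off-centre point toward the centre.
x_d, y_d = distort(0.5, 0.0, k1=-0.2)

# With all coefficients zero the point is unchanged (ideal pinhole).
unchanged = distort(0.3, 0.4)
```

The distorted point is still in normalized coordinates. To get pixels, apply $f_x$, $f_y$, $c_x$, $c_y$ afterwards, in that order.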

Camera calibration (e.g., using a chessboard with OpenCV's calibrateCamera) estimates $K$ and the distortion coefficients simultaneously.

6 — From Photons to Pixels: ISO & Noise

Once the geometry is done, the photons hitting each pixel are converted to a digital value. The ISO setting acts as a gain on the sensor signal, modelled here as a simple multiplier on the photoelectron count. Amplifying the signal also amplifies sensor noise, which is why high-ISO photos look grainy.

$$\text{Pixel Value} = \text{clamp}\!\left(\text{ISO}_{\text{gain}} \times (\text{Photons} + \mathcal{N}(0,\sigma)),\;0,\;255\right)$$

Interact: Raise ISO and watch the scene brighten — then become noisy. At ISO 25600 the noise dominates.
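The formula above can be simulated in a few lines. This sketch assumes gain is 1.0× at ISO 100 and scales linearly with ISO. The function name and the `base_iso` parameter are illustrative.

```python
import random

def pixel_value(photons, iso, sigma, base_iso=100, rng=random):
    """clamp(gain * (photons + N(0, sigma)), 0, 255), rounded to an integer.
    Assumes gain = iso / base_iso, i.e. 1.0x at ISO 100."""
    gain = iso / base_iso
    noisy = photons + rng.gauss(0.0, sigma)   # noise is added before the gain
    return max(0, min(255, round(gain * noisy)))

dark = pixel_value(15, iso=100, sigma=0)        # noise-free dark pixel
brighter = pixel_value(15, iso=1600, sigma=0)   # 16x gain
clipped = pixel_value(15, iso=25600, sigma=0)   # saturates at 255

# With noise, a seeded generator keeps the run reproducible.
noisy = pixel_value(15, iso=1600, sigma=2, rng=random.Random(0))
```

Because the noise term sits inside the parentheses, a deviation of 2 photons at ISO 1600 becomes a swing of about 32 brightness levels, which is the grain you see.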

Digital Signal panel defaults: base photons 15 (dark scene), ISO 100, gain 1.0×, noise $\sigma$ = 0, brightness 15/255. The ISO (gain) slider trades brightness for noise: higher is brighter but noisier.

7 — Do Real Cameras Run This Math?

Here is the answer most people get wrong: no. The matrix multiplications $[R\,|\,t]$ and $K$ are not computed inside the camera hardware. The perspective projection happens entirely through physics — light and optics do it automatically. The math is a model that lets software reason about what the physics did.

The Physical Pipeline Inside a Camera

Click each stage below for details. Notice where the camera geometry math maps onto the physics, and where actual computation begins.

① Scene — Light Travelling in Straight Lines

Photons bounce off every surface and travel outward in straight lines. This straight-line travel is the physical mechanism behind perspective projection: an object twice as far away subtends roughly half the angle, so it appears half as large. No computation is needed, because the geometry is baked into the physics of light.

This is what $[R\,|\,t]$ and $K$ model mathematically: they describe the same geometry that photons obey naturally.

② Aperture & Lens — Optical Projection

The aperture (f-number) controls how much light enters. The lens bends incoming rays so that every photon from a single scene point converges to a single point on the sensor. This convergence is the physical implementation of focal length $f_x, f_y$.

A longer focal-length lens bends light at a shallower angle → smaller field of view → apparent zoom. A wider lens bends more aggressively → larger FoV → distortion at the edges (barrel/pincushion, modelled by $k_1, k_2$). Still no computation — this is pure optics.

③ CMOS/CCD Sensor — Photons → Electrons

Each pixel is a photodiode that accumulates electrons proportional to the number of photons hitting it during the exposure. Modern sensors use a Bayer colour filter array: a repeating RGGB mosaic so each physical pixel captures only one colour channel. A 12 MP sensor captures ~6 M green, ~3 M red, and ~3 M blue measurements — not 12 M full-colour pixels.

The ISO setting multiplies the charge before the next stage, amplifying both signal and noise.
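The 2:1:1 split between green, red, and blue falls straight out of the RGGB tile. A small sketch that counts samples on a tiny sensor (the helper names are illustrative):

```python
def bayer_channel(row, col):
    """Colour sampled at (row, col) in an RGGB mosaic:
    even rows alternate R, G; odd rows alternate G, B."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"

def channel_counts(height, width):
    """Count how many photosites sample each colour."""
    counts = {"R": 0, "G": 0, "B": 0}
    for r in range(height):
        for c in range(width):
            counts[bayer_channel(r, c)] += 1
    return counts

counts = channel_counts(4, 6)   # a toy 24-pixel sensor
```

Half of the 24 photosites are green and a quarter each are red and blue. The same proportions give the ~6 M / ~3 M / ~3 M figures for a 12 MP sensor.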

④ ADC — Analog → Digital (RAW)

An Analog-to-Digital Converter reads the voltage from each photodiode and converts it to a binary integer. Consumer cameras use 10–14 bits per channel, giving 1024–16384 discrete brightness levels. This grid of single-channel values is what a RAW file stores: a direct readout of the sensor, before any image processing.

⑤ ISP — Image Signal Processor

The ISP is a dedicated chip (or SoC block) that runs image algorithms in real time at 30–120 fps. This is where actual computation first happens inside the camera:

Demosaicing: interpolates the Bayer RGGB grid into full-colour RGB for every pixel (bilinear, AHD, or neural).
Noise reduction: bilateral filtering, BM3D, or neural denoising.
White balance: multiplies R, G, B channels by three calibrated scalars — a $3\times3$ diagonal matrix multiply per pixel.
Sharpening & tone mapping: unsharp masking, HDR→SDR curve.
Lens distortion correction: a per-pixel lookup table or polynomial warp using $k_1, k_2$ — this is the one place $K$-related math actually runs in hardware.
Compression: JPEG (block DCT), HEIF (HEVC intra-frame coding), H.264/H.265 for video.
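Of the stages above, white balance is the simplest to show in code. The sketch below applies three per-channel gains to one 8-bit pixel. The gain values are made up for illustration; a real ISP derives them from the scene or a calibration.

```python
def white_balance(pixel, gains):
    """Scale (R, G, B) by per-channel gains and clamp to 8 bits.
    Equivalent to multiplying the pixel by the diagonal matrix diag(gains)."""
    return tuple(min(255, round(v * g)) for v, g in zip(pixel, gains))

gains = (2.0, 1.0, 1.5)                         # hypothetical R, G, B gains
grey = white_balance((60, 120, 80), gains)      # a colour cast becomes neutral
clamped = white_balance((200, 10, 10), gains)   # the red channel saturates
```

The first pixel comes out as equal R, G, and B values, which is the goal: a surface that should be grey ends up with no colour cast.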

⑥ Output — JPEG / RAW / Video Stream

The processed image exits as a JPEG, HEIF, RAW file, or a live MIPI/USB stream to a host computer. At this point the data is a 2D array of RGB values. All 3D depth information has been permanently lost in the projection — this is why recovering 3D structure from images (SfM, stereo, depth estimation) is a hard inverse problem.

On smartphones, a final NPU (Neural Processing Unit) pass may run computational photography: Night Sight, Super Res Zoom, computational bokeh, or scene-adaptive sharpening — these are matrix-heavy neural network operations on 2D images.

So where does $P = K\,[R\,|\,t]$ actually run?

In software, not in the camera. The projection matrix is used whenever a program needs to reason about the relationship between 3D geometry and 2D images:

Computer Vision (OpenCV, etc.): predicts where a known 3D world point $P_w$ (in homogeneous form) will appear in the image ($(u,v,1)^T \sim K[R|t]\,P_w$), or triangulates 3D points from multiple images (the inverse problem).
SLAM / 3D Reconstruction: estimates $R, t$ of each camera frame and recovers 3D structure simultaneously, iterating $P$ at every frame.
Augmented Reality: renders virtual 3D objects at the correct 2D screen position using $P$, in real time at 60–90 fps.
Autonomous Vehicles: dedicated GPU/FPGA accelerators compute $K[R|t]$ for LiDAR-camera fusion, object detection, and occupancy grids at 100+ fps.
Camera Calibration: OpenCV's calibrateCamera solves for $K$, $k_1, k_2$, and $[R|t]$ by minimising reprojection error on a chessboard — this is optimisation over $P$, not a single matrix multiply.

Real-Time Performance

Inside the Camera (ISP)

Dedicated fixed-function hardware. Processes 50 MP at 30 fps (1.5 Gpixels/s). No general-purpose CPU involved in demosaic/denoise/compress. Uses 100–500 mW.

In a Vision System (GPU)

An NVIDIA Jetson can run full camera projection + neural detection on 4 cameras at 30 fps. A data-centre GPU processes batch camera projections for 3D reconstruction at thousands of fps equivalent.

Smartphone Computational Photo

Apple A17 Pro NPU: ~35 TOPS for neural-network-based denoising, super-resolution, and bokeh synthesis. Runs in parallel with the ISP on each captured frame.

Camera Calibration Workflow

Capture 20–50 images of a chessboard → OpenCV detects corners → non-linear optimisation estimates $K$, $k_1$–$k_3$, $p_1$–$p_2$ → reprojection error typically < 0.5 px RMS.
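To make the "reprojection error" figure concrete, here is a sketch of one common definition: the root-mean-square pixel distance between detected corners and the corners predicted by the calibrated model. The sample points are made up, and the function name is illustrative.

```python
import math

def rms_reprojection_error(observed, projected):
    """RMS of the Euclidean pixel distances between matched point pairs."""
    squared = [(uo - up) ** 2 + (vo - vp) ** 2
               for (uo, vo), (up, vp) in zip(observed, projected)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical detected corners and their model predictions, in pixels.
observed = [(100.0, 100.0), (200.0, 100.0), (100.0, 200.0), (200.0, 200.0)]
projected = [(100.3, 100.4), (200.0, 100.0), (99.7, 199.6), (200.0, 200.0)]

err = rms_reprojection_error(observed, projected)   # about 0.35 px
```

Two of the four corners are off by half a pixel and two are exact, which averages out to roughly 0.35 px RMS, inside the typical range quoted above.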