3D Computer Vision · Geometry
Understanding 3D Shape Representations
From raw point clouds to neural implicit fields — every major way computers describe 3D geometry, with interactive 3D visualizations throughout.
Before a computer can reason about a 3D object — render it, reconstruct it, measure it, or learn from it — it needs a way to represent that object in memory. There is no single perfect representation. Every format makes trade-offs: memory versus detail, rendering speed versus editability, discrete versus continuous. This article builds your intuition for each from the ground up, then shows you the math for comparing and regularizing them.
1. How Computers Describe 3D Shapes
We cover five representations. Switch between them below — all show the same underlying shape.
1.1 Point Cloud
A point cloud is the simplest representation: an unordered set $\mathcal{P}=\{p_i\in\mathbb{R}^3\}_{i=1}^N$. LiDAR scanners, depth cameras, and photogrammetry all produce point clouds natively.
Advantages
Sensor-native (LiDAR, RGBD). No topology needed. Easy to concatenate and transform. $O(N)$ memory.
Disadvantages
No surface connectivity — normals, edges, and interiors require extra work. Non-uniform density. Hard to render directly.
1.2 Mesh
A polygon mesh pairs a vertex set $V\subset\mathbb{R}^3$ with a face set $F$ of vertex-index triples. Almost exclusively triangles are used because GPUs are optimized for them. The wireframe overlay and vertex dots in the scene above make the triangle structure explicit.
Advantages
GPU-native rasterization. Explicit surface — normals, UV coords, and curvature are analytic. Memory-efficient for smooth shapes.
Disadvantages
Topology changes (holes, merges) are hard. Fixed connectivity is brittle for learning. Quality depends on mesh regularity.
1.3 Voxel Grid
A voxel grid divides 3D space into a cubic lattice where each cell stores occupancy or a scalar. The resolution slider above shows the $O(N^3)$ memory problem directly.
Advantages
Regular structure enables 3D convolutions. Boolean operations trivial. Simple occupancy queries.
Disadvantages
$O(N^3)$ memory — most voxels empty. Discretization artifacts. $256^3$ grid at float32 = 64 MB per channel.
1.4 KD-Tree
A KD-Tree is a spatial index built on a point cloud. It recursively splits space with axis-aligned hyperplanes, enabling $O(\log N)$ nearest-neighbor queries. The interactive 2D demo below lets you click to query the nearest point — the search visits only the highlighted cells.
1.5 Tri-plane Representation
The tri-plane (Chan et al., EG3D 2022) projects a 3D query point $p=(x,y,z)$ onto three axis-aligned feature planes and bilinearly interpolates a feature from each. Drag the sliders above (switch to the Tri-plane tab) to move the query point and watch the projections update live.
What is $\pi_{XY}$? A simple orthographic projection — it just drops one coordinate. $\pi_{XY}(p)=(x,y)$ discards $z$. No parameters, no learning. Think of shining a light along the $Z$-axis and looking at the shadow on the $XY$ floor.
What is $f_{XY}$? A learned 2D feature texture — a grid of $H\times W$ cells, each storing a $C$-dimensional vector. Nothing more than a 2D array of numbers (like an image with $C$ channels) that is jointly optimized during training. It is not a neural network; it is a parameter tensor. Reading a value from it at a non-integer position uses standard GPU bilinear interpolation — exactly the same as sampling a texture in a game engine.
Step 1 — Project: $\pi_{XY}(p)=(x,y)$, $\;\pi_{XZ}(p)=(x,z)$, $\;\pi_{YZ}(p)=(y,z)$. Three 2D coordinates, one per plane.
Step 2 — Sample: Bilinear-interpolate each feature texture at the projected coordinate → three $C$-dim feature vectors $\mathbf{f}_1,\mathbf{f}_2,\mathbf{f}_3$.
Step 3 — Aggregate: $\mathbf{f} = \mathbf{f}_1+\mathbf{f}_2+\mathbf{f}_3$ (element-wise sum).
Step 4 — Decode: Pass $\mathbf{f}$ through a small MLP (the only neural network in the pipeline) that outputs density $\sigma$ and colour $\mathbf{c}$.
Advantages
$128\times$ cheaper than a $128^3$ voxel grid ($3\times H\times W\times C$ vs $N^3\times C$). GPU-native bilinear sampling. Differentiable end-to-end. Training is fast because gradients flow through simple texture lookups.
Disadvantages
Axis-aligned bias — a feature uncorrelated across all three planes (e.g. a diagonal feature) cannot be captured. Slightly less expressive than full 3D voxel grids at the same resolution. The three planes must share the same spatial extent.
| Representation | Memory | Differentiable | Renderable | Connectivity |
|---|---|---|---|---|
| Point Cloud | O(N) | Yes | Splatting | None |
| Mesh | O(V+F) | Partial | GPU native | Explicit |
| Voxel Grid | O(N³) | Yes | Raycasting | Implicit |
| KD-Tree | O(N) | No | Indirect | Spatial |
| Tri-plane | O(3HWC) | Yes | Via MLP | Implicit |
2. Surface Normals and the Signed Distance Function
Surface Normals
At any smooth point on a surface the normal vector $\mathbf{n}$ is the unit vector perpendicular to the tangent plane. The three tabs below show three fundamentally different normal modes — switch between them to see the visual difference directly.
Face normals — flat shading
Every point within a triangle shares the same normal — perpendicular to the face plane. This gives the angular, faceted look you see above. You can literally count the triangles from the lighting breaks.
Vertex normals — smooth shading
Adjacent face normals are averaged at each vertex. The GPU interpolates (Phong interpolation) across the face. The surface appears smooth even though the geometry is identical — just coarser arrows, smoother shading.
A normal vector has components $(n_x, n_y, n_z)\in[-1,1]^3$. We cannot store negative numbers in a pixel (which lives in $[0,1]$), so we linearly remap each component:
When a component is $+1$ the channel is $1.0$ (bright); when it is $-1$ the channel is $0.0$ (black); when it is $0$ the channel is $0.5$ (medium grey). Six canonical directions decode to:
| Normal direction | $(n_x,\,n_y,\,n_z)$ | RGB = (n+1)/2 | Colour |
|---|---|---|---|
| Right (+X) | (+1, 0, 0) | (1.0, 0.5, 0.5) | bright red |
| Left (−X) | (−1, 0, 0) | (0.0, 0.5, 0.5) | dark teal |
| Up (+Y) | (0, +1, 0) | (0.5, 1.0, 0.5) | bright green |
| Down (−Y) | (0, −1, 0) | (0.5, 0.0, 0.5) | dark purple |
| Toward you (+Z) | (0, 0, +1) | (0.5, 0.5, 1.0) | bright blue |
| Away from you (−Z) | (0, 0, −1) | (0.5, 0.5, 0.0) | olive/dark |
The sphere above shows all of this at once: the right hemisphere is reddish, the top is greenish, and the front-facing area is bluish. Diagonals blend smoothly. This encoding is used in every tangent-space normal map in games, in NeRF normal supervision, and in neural rendering quality metrics.
Signed Distance Function (SDF)
The SDF encodes the signed distance from any point in space to the nearest surface point. Move the query point (yellow sphere) with the sliders — watch the SDF value and the yellow line (shortest path to surface) update in real time.
● query outside (SDF > 0)
● on surface (SDF ≈ 0)
— nearest-surface line
3. Measuring Distance Between Two Shapes
Training any 3D generative or reconstruction model requires a loss that measures how different two point sets are. Three metrics dominate: Chamfer (differentiable average), Hausdorff (worst-case coverage), and Earth Mover's Distance (optimal transport). Each captures a fundamentally different geometric property.
3.1 Chamfer Distance
Every point in $P$ finds its nearest neighbor in $Q$, and vice versa. The Chamfer Distance averages all these squared distances. Differentiable almost everywhere — the standard training loss. Key flaw: multiple points in $P$ can match the same point in $Q$ (many-to-one), so local clumping is rewarded.
3.2 Hausdorff Distance
The worst-case nearest-neighbor distance. One outlier dominates — 99% perfect coverage does not help. The scenarios below show exactly when this matters: try the "One outlier" preset to see Chamfer stay flat while Hausdorff jumps dramatically.
3.3 Earth Mover's Distance (EMD)
EMD enforces a bijection $\phi$ — each point in $P$ matches to a unique point in $Q$ (one-to-one). This is the minimum-cost way to "transport" mass from $P$ to $Q$, solved exactly with the Hungarian algorithm. Requires $|P|=|Q|$. EMD $\geq$ Chamfer always, because Chamfer's many-to-one matching is a relaxation of the bijection constraint.
In Chamfer mode, many blue P-points may all draw their line to the same orange Q-point (the shared nearest neighbor). In EMD mode, each Q-point is matched to exactly one P-point. The optimal bijection "spreads" assignments more evenly — watch how the lines change when you switch modes.
Drag any point. Switch metrics or try a preset scenario to see how each metric responds differently.
Chamfer — training losses
Differentiable, KD-tree efficient, robust to small outliers. Standard in PointNet, FoldingNet, occupancy networks. Many-to-one bias can cause "collapsed" reconstructions.
Hausdorff — worst-case coverage
One large gap dominates. Critical for safety (autonomous driving, medical). Never used as a training loss — non-differentiable and noise-sensitive.
EMD — balanced transport
Bijection prevents collapse; measures the true cost of rearranging $P$ into $Q$. More meaningful for generative models. $O(n^3)$ Hungarian algorithm — expensive for large sets.
EMD ≥ Chamfer always
Chamfer relaxes the bijection. If nearest-neighbor matching is already bijective, $\text{EMD}=\text{CD}$. Otherwise the forced bijection in EMD forces some assignments to worse matches.
4. Explicit vs. Implicit Representations
All 3D representations fall into one of two philosophies. Explicit: directly list the surface elements. Implicit: define a scalar field whose zero level set is the surface. The 3D side-by-side below renders the same torus shape both ways in real time.
Mesh, point cloud, NURBS — every element directly describes part of the surface. Hand to GPU rasterizer, done.
SDF, occupancy network, NeRF — the surface is $\{x:f(x)=0\}$. Must ray-march or run Marching Cubes to get geometry. But topology changes, interior queries, and blending are free.
Ray Marching — How Implicit Surfaces Are Rendered
Since an implicit surface has no triangles to rasterize, the GPU must search for where a ray intersects the zero level set. Sphere tracing is the canonical algorithm: at each position along the ray, evaluate the SDF. Because the SDF tells you the exact distance to the nearest surface point, you can safely advance the ray by exactly that amount without overshooting — each step takes the largest possible safe jump.
The key insight: when you are far from the surface, $f$ is large and you take big steps. Near the surface, $f\to 0$ and you take tiny, precise steps. This converges much faster than any fixed-step ray marching. Click anywhere on the scene below to fire a ray and watch sphere tracing step by step.
→ step forward by SDF
● hit (SDF < ε)
● miss (t > t_max)
Click anywhere to aim a ray from the eye (white dot, left). Each yellow circle has radius = SDF at that point — the ray can safely step by that amount. Background heatmap: blue = inside, red = outside.
The right 3D panel below uses exactly this algorithm in a GLSL fragment shader — every pixel fires a ray and sphere-traces the torus SDF. The torus is never stored as triangles; it is defined by a single mathematical function evaluated per pixel.
| Explicit | Implicit | |
|---|---|---|
| Surface access | Directly listed | Extract (Marching Cubes) or ray-march |
| Topology change | Very hard | Natural — level set evolves |
| Resolution | Fixed at creation | Query at any resolution |
| GPU rendering | Rasterization | Ray marching / vol. rendering |
| Inside/outside | Expensive (ray casting) | Free — just check sign of $f$ |
Conclusion
We have traveled from the simplest possible 3D representation — an unordered set of points — to live ray-marched implicit surfaces. The connecting thread:
- Point clouds are what sensors give you. Everything else adds structure: connectivity (mesh), spatial index (KD-tree, voxel), or compactness (tri-plane).
- Normals bridge geometry and shading. Face normals (one per triangle, flat look) vs. vertex normals (averaged at shared vertices, smooth look) are the same geometry rendered differently. The normal map encoding $(\mathbf{n}+1)/2\to\text{RGB}$ is universal in graphics and neural rendering.
- Chamfer, Hausdorff, and EMD are three lenses on shape similarity. Chamfer sees average coverage (use for training). Hausdorff sees worst-case gaps (use for evaluation). EMD enforces a fair bijection (use for generative model quality).
- Sphere tracing is what makes implicit surfaces renderable: step by the SDF value, converging faster far from the surface and slowing near it. The right panel above renders an entire torus this way — no triangles, pure math.
- Explicit vs. implicit is not binary. 3DGS is explicitly positioned but implicitly optimized. NeRF is implicitly defined but explicitly ray-marched.
DeepSDF (Park et al., 2019) · EG3D Tri-plane (Chan et al., 2022) · NeRF (Mildenhall et al., 2020) · 3D Gaussian Splatting (Kerbl et al., 2023) · PointNet (Qi et al., 2017)
All visualizations built with Three.js (3D) and Canvas 2D. Math rendered with KaTeX. Shapes generated procedurally. Written with the help of Claude.