3D Computer Vision · Geometry

Understanding 3D Shape Representations

From raw point clouds to neural implicit fields — every major way computers describe 3D geometry, with interactive 3D visualizations throughout.

Before a computer can reason about a 3D object — render it, reconstruct it, measure it, or learn from it — it needs a way to represent that object in memory. There is no single perfect representation. Every format makes trade-offs: memory versus detail, rendering speed versus editability, discrete versus continuous. This article builds your intuition for each from the ground up, then shows you the math for comparing and regularizing them.

1. How Computers Describe 3D Shapes

We cover five representations. Switch between them below — all show the same underlying shape.

Point Cloud

1.1 Point Cloud

A point cloud is the simplest representation: an unordered set $\mathcal{P}=\{p_i\in\mathbb{R}^3\}_{i=1}^N$. LiDAR scanners, depth cameras, and photogrammetry all produce point clouds natively.

$$\mathcal{P}=\{(x_i,y_i,z_i)\}_{i=1}^N\subset\mathbb{R}^3$$

Advantages

Sensor-native (LiDAR, RGBD). No topology needed. Easy to concatenate and transform. $O(N)$ memory.

Disadvantages

No surface connectivity — normals, edges, and interiors require extra work. Non-uniform density. Hard to render directly.

1.2 Mesh

A polygon mesh pairs a vertex set $V\subset\mathbb{R}^3$ with a face set $F$ of vertex-index triples. Almost exclusively triangles are used because GPUs are optimized for them. The wireframe overlay and vertex dots in the scene above make the triangle structure explicit.

$$\text{Mesh}=(V,F),\quad V\subset\mathbb{R}^3,\quad F\subset\binom{[|V|]}{3}$$

Advantages

GPU-native rasterization. Explicit surface — normals, UV coords, and curvature are analytic. Memory-efficient for smooth shapes.

Disadvantages

Topology changes (holes, merges) are hard. Fixed connectivity is brittle for learning. Quality depends on mesh regularity.

1.3 Voxel Grid

A voxel grid divides 3D space into a cubic lattice where each cell stores occupancy or a scalar. The resolution slider above shows the $O(N^3)$ memory problem directly.

$$V[i,j,k]\in\{0,1\}\text{ (occupancy)},\quad\text{Memory: }O(N^3)$$

Advantages

Regular structure enables 3D convolutions. Boolean operations trivial. Simple occupancy queries.

Disadvantages

$O(N^3)$ memory — most voxels empty. Discretization artifacts. $256^3$ grid at float32 = 64 MB per channel.

1.4 KD-Tree

A KD-Tree is a spatial index built on a point cloud. It recursively splits space with axis-aligned hyperplanes, enabling $O(\log N)$ nearest-neighbor queries. The interactive 2D demo below lets you click to query the nearest point — the search visits only the highlighted cells.

KD-Tree — 2D Interactive (click to query)

1.5 Tri-plane Representation

The tri-plane (Chan et al., EG3D 2022) projects a 3D query point $p=(x,y,z)$ onto three axis-aligned feature planes and bilinearly interpolates a feature from each. Drag the sliders above (switch to the Tri-plane tab) to move the query point and watch the projections update live.

$$F(p)=f_{XY}(\pi_{XY}(p))+f_{XZ}(\pi_{XZ}(p))+f_{YZ}(\pi_{YZ}(p))$$ $$\pi_{XY}(p)=(x,y),\quad\pi_{XZ}(p)=(x,z),\quad\pi_{YZ}(p)=(y,z)$$

What is $\pi_{XY}$? A simple orthographic projection — it just drops one coordinate. $\pi_{XY}(p)=(x,y)$ discards $z$. No parameters, no learning. Think of shining a light along the $Z$-axis and looking at the shadow on the $XY$ floor.

What is $f_{XY}$? A learned 2D feature texture — a grid of $H\times W$ cells, each storing a $C$-dimensional vector. Nothing more than a 2D array of numbers (like an image with $C$ channels) that is jointly optimized during training. It is not a neural network; it is a parameter tensor. Reading a value from it at a non-integer position uses standard GPU bilinear interpolation — exactly the same as sampling a texture in a game engine.

The full query pipeline

Step 1 — Project: $\pi_{XY}(p)=(x,y)$, $\;\pi_{XZ}(p)=(x,z)$, $\;\pi_{YZ}(p)=(y,z)$. Three 2D coordinates, one per plane.
Step 2 — Sample: Bilinear-interpolate each feature texture at the projected coordinate → three $C$-dim feature vectors $\mathbf{f}_1,\mathbf{f}_2,\mathbf{f}_3$.
Step 3 — Aggregate: $\mathbf{f} = \mathbf{f}_1+\mathbf{f}_2+\mathbf{f}_3$ (element-wise sum).
Step 4 — Decode: Pass $\mathbf{f}$ through a small MLP (the only neural network in the pipeline) that outputs density $\sigma$ and colour $\mathbf{c}$.

Advantages

$128\times$ cheaper than a $128^3$ voxel grid ($3\times H\times W\times C$ vs $N^3\times C$). GPU-native bilinear sampling. Differentiable end-to-end. Training is fast because gradients flow through simple texture lookups.

Disadvantages

Axis-aligned bias — a feature uncorrelated across all three planes (e.g. a diagonal feature) cannot be captured. Slightly less expressive than full 3D voxel grids at the same resolution. The three planes must share the same spatial extent.

RepresentationMemoryDifferentiableRenderableConnectivity
Point CloudO(N)YesSplattingNone
MeshO(V+F)PartialGPU nativeExplicit
Voxel GridO(N³)YesRaycastingImplicit
KD-TreeO(N)NoIndirectSpatial
Tri-planeO(3HWC)YesVia MLPImplicit

2. Surface Normals and the Signed Distance Function

Surface Normals

At any smooth point on a surface the normal vector $\mathbf{n}$ is the unit vector perpendicular to the tangent plane. The three tabs below show three fundamentally different normal modes — switch between them to see the visual difference directly.

$$\mathbf{n}_{\text{face}}=\frac{(v_1-v_0)\times(v_2-v_0)}{|(v_1-v_0)\times(v_2-v_0)|},\qquad\mathbf{n}_{\text{vertex}}=\frac{\textstyle\sum_{f\ni v}\mathbf{n}_f}{|\textstyle\sum_{f\ni v}\mathbf{n}_f|}$$
Face Normals

Face normals — flat shading

Every point within a triangle shares the same normal — perpendicular to the face plane. This gives the angular, faceted look you see above. You can literally count the triangles from the lighting breaks.

Vertex normals — smooth shading

Adjacent face normals are averaged at each vertex. The GPU interpolates (Phong interpolation) across the face. The surface appears smooth even though the geometry is identical — just coarser arrows, smoother shading.

Color-coded Normals (Normal Map) — How the encoding works

A normal vector has components $(n_x, n_y, n_z)\in[-1,1]^3$. We cannot store negative numbers in a pixel (which lives in $[0,1]$), so we linearly remap each component:

$$R = \frac{n_x+1}{2},\quad G = \frac{n_y+1}{2},\quad B = \frac{n_z+1}{2}$$

When a component is $+1$ the channel is $1.0$ (bright); when it is $-1$ the channel is $0.0$ (black); when it is $0$ the channel is $0.5$ (medium grey). Six canonical directions decode to:

Normal direction $(n_x,\,n_y,\,n_z)$ RGB = (n+1)/2 Colour
Right (+X)(+1, 0, 0)(1.0, 0.5, 0.5)bright red
Left (−X)(−1, 0, 0)(0.0, 0.5, 0.5)dark teal
Up (+Y)(0, +1, 0)(0.5, 1.0, 0.5)bright green
Down (−Y)(0, −1, 0)(0.5, 0.0, 0.5)dark purple
Toward you (+Z)(0, 0, +1)(0.5, 0.5, 1.0)bright blue
Away from you (−Z)(0, 0, −1)(0.5, 0.5, 0.0)olive/dark

The sphere above shows all of this at once: the right hemisphere is reddish, the top is greenish, and the front-facing area is bluish. Diagonals blend smoothly. This encoding is used in every tangent-space normal map in games, in NeRF normal supervision, and in neural rendering quality metrics.

Signed Distance Function (SDF)

The SDF encodes the signed distance from any point in space to the nearest surface point. Move the query point (yellow sphere) with the sliders — watch the SDF value and the yellow line (shortest path to surface) update in real time.

$$\text{SDF}(x)=\begin{cases}-d(x,\mathcal{S})&x\text{ inside}\\\phantom{-}d(x,\mathcal{S})&x\text{ outside}\end{cases},\qquad d(x,\mathcal{S})=\min_{s\in\mathcal{S}}\|x-s\|_2$$
SDF — 3D Interactive (drag to orbit, sliders to move query point)
SDF Query
Query point
SDF value
Status
|distance|
query inside (SDF < 0)
query outside (SDF > 0)
on surface (SDF ≈ 0)
nearest-surface line

3. Measuring Distance Between Two Shapes

Training any 3D generative or reconstruction model requires a loss that measures how different two point sets are. Three metrics dominate: Chamfer (differentiable average), Hausdorff (worst-case coverage), and Earth Mover's Distance (optimal transport). Each captures a fundamentally different geometric property.

3.1 Chamfer Distance

$$\text{CD}(P,Q)=\frac{1}{|P|}\sum_{p\in P}\min_{q\in Q}\|p-q\|^2+\frac{1}{|Q|}\sum_{q\in Q}\min_{p\in P}\|p-q\|^2$$

Every point in $P$ finds its nearest neighbor in $Q$, and vice versa. The Chamfer Distance averages all these squared distances. Differentiable almost everywhere — the standard training loss. Key flaw: multiple points in $P$ can match the same point in $Q$ (many-to-one), so local clumping is rewarded.

3.2 Hausdorff Distance

$$d_H(P,Q)=\max\!\left(\max_{p\in P}\min_{q\in Q}\|p-q\|,\;\max_{q\in Q}\min_{p\in P}\|q-p\|\right)$$

The worst-case nearest-neighbor distance. One outlier dominates — 99% perfect coverage does not help. The scenarios below show exactly when this matters: try the "One outlier" preset to see Chamfer stay flat while Hausdorff jumps dramatically.

3.3 Earth Mover's Distance (EMD)

$$\text{EMD}(P,Q)=\min_{\phi:P\xrightarrow{\sim}Q}\frac{1}{|P|}\sum_{i=1}^{|P|}\|p_i-\phi(p_i)\|$$

EMD enforces a bijection $\phi$ — each point in $P$ matches to a unique point in $Q$ (one-to-one). This is the minimum-cost way to "transport" mass from $P$ to $Q$, solved exactly with the Hungarian algorithm. Requires $|P|=|Q|$. EMD $\geq$ Chamfer always, because Chamfer's many-to-one matching is a relaxation of the bijection constraint.

Chamfer vs EMD — the key difference

In Chamfer mode, many blue P-points may all draw their line to the same orange Q-point (the shared nearest neighbor). In EMD mode, each Q-point is matched to exactly one P-point. The optimal bijection "spreads" assignments more evenly — watch how the lines change when you switch modes.

Chamfer Distance
CD
Hausdorff
EMD
|P| = |Q|

Drag any point. Switch metrics or try a preset scenario to see how each metric responds differently.

Chamfer — training losses

Differentiable, KD-tree efficient, robust to small outliers. Standard in PointNet, FoldingNet, occupancy networks. Many-to-one bias can cause "collapsed" reconstructions.

Hausdorff — worst-case coverage

One large gap dominates. Critical for safety (autonomous driving, medical). Never used as a training loss — non-differentiable and noise-sensitive.

EMD — balanced transport

Bijection prevents collapse; measures the true cost of rearranging $P$ into $Q$. More meaningful for generative models. $O(n^3)$ Hungarian algorithm — expensive for large sets.

EMD ≥ Chamfer always

Chamfer relaxes the bijection. If nearest-neighbor matching is already bijective, $\text{EMD}=\text{CD}$. Otherwise the forced bijection in EMD forces some assignments to worse matches.

4. Explicit vs. Implicit Representations

All 3D representations fall into one of two philosophies. Explicit: directly list the surface elements. Implicit: define a scalar field whose zero level set is the surface. The 3D side-by-side below renders the same torus shape both ways in real time.

Explicit

Mesh, point cloud, NURBS — every element directly describes part of the surface. Hand to GPU rasterizer, done.

Implicit

SDF, occupancy network, NeRF — the surface is $\{x:f(x)=0\}$. Must ray-march or run Marching Cubes to get geometry. But topology changes, interior queries, and blending are free.

Ray Marching — How Implicit Surfaces Are Rendered

Since an implicit surface has no triangles to rasterize, the GPU must search for where a ray intersects the zero level set. Sphere tracing is the canonical algorithm: at each position along the ray, evaluate the SDF. Because the SDF tells you the exact distance to the nearest surface point, you can safely advance the ray by exactly that amount without overshooting — each step takes the largest possible safe jump.

$$x_{t+1}=x_t+f(x_t)\cdot\hat{d},\qquad\hat{d}=\text{ray direction}$$ $$\text{Stop when }|f(x_t)|<\varepsilon\text{ (hit) or }t>t_{\max}\text{ (miss)}$$

The key insight: when you are far from the surface, $f$ is large and you take big steps. Near the surface, $f\to 0$ and you take tiny, precise steps. This converges much faster than any fixed-step ray marching. Click anywhere on the scene below to fire a ray and watch sphere tracing step by step.

Sphere Tracing — 2D Step-by-step
Sphere Tracing
Steps taken
SDF at pos
Statusclick to start
sphere of radius = SDF
step forward by SDF
hit (SDF < ε)
miss (t > t_max)

Click anywhere to aim a ray from the eye (white dot, left). Each yellow circle has radius = SDF at that point — the ray can safely step by that amount. Background heatmap: blue = inside, red = outside.

The right 3D panel below uses exactly this algorithm in a GLSL fragment shader — every pixel fires a ray and sphere-traces the torus SDF. The torus is never stored as triangles; it is defined by a single mathematical function evaluated per pixel.

Explicit vs Implicit — 3D (drag each panel to orbit)
Explicit — Triangle Mesh
Implicit — Ray-marched SDF
ExplicitImplicit
Surface accessDirectly listedExtract (Marching Cubes) or ray-march
Topology changeVery hardNatural — level set evolves
ResolutionFixed at creationQuery at any resolution
GPU renderingRasterizationRay marching / vol. rendering
Inside/outsideExpensive (ray casting)Free — just check sign of $f$

Conclusion

We have traveled from the simplest possible 3D representation — an unordered set of points — to live ray-marched implicit surfaces. The connecting thread:

  1. Point clouds are what sensors give you. Everything else adds structure: connectivity (mesh), spatial index (KD-tree, voxel), or compactness (tri-plane).
  2. Normals bridge geometry and shading. Face normals (one per triangle, flat look) vs. vertex normals (averaged at shared vertices, smooth look) are the same geometry rendered differently. The normal map encoding $(\mathbf{n}+1)/2\to\text{RGB}$ is universal in graphics and neural rendering.
  3. Chamfer, Hausdorff, and EMD are three lenses on shape similarity. Chamfer sees average coverage (use for training). Hausdorff sees worst-case gaps (use for evaluation). EMD enforces a fair bijection (use for generative model quality).
  4. Sphere tracing is what makes implicit surfaces renderable: step by the SDF value, converging faster far from the surface and slowing near it. The right panel above renders an entire torus this way — no triangles, pure math.
  5. Explicit vs. implicit is not binary. 3DGS is explicitly positioned but implicitly optimized. NeRF is implicitly defined but explicitly ray-marched.
Further Reading

DeepSDF (Park et al., 2019) · EG3D Tri-plane (Chan et al., 2022) · NeRF (Mildenhall et al., 2020) · 3D Gaussian Splatting (Kerbl et al., 2023) · PointNet (Qi et al., 2017)

All visualizations built with Three.js (3D) and Canvas 2D. Math rendered with KaTeX. Shapes generated procedurally. Written with the help of Claude.