by Haytham ElFadeel - [email protected]
published 2023, updated 2025
Massive amounts of effort and money have gone into scaling (compute, data, parameters) and into architectural and efficiency tweaks (Mixture-of-Experts, attention variants, better optimizers, better parallelism). All of this is valuable, but it also raises an uncomfortable question:
If our deep neural networks are already “universal approximators,” why does it still feel like we are brute-forcing representation, sometimes achieving good performance without learning good structure? Why do LLMs need essentially all of the web’s text to learn to write basic sentences? Why does it take millions (if not billions) of images before a model reliably generates a hand with five fingers?
This essay argues that the bottleneck is representational efficiency, not raw expressivity. The headline claim of “universal approximation” is correct but incomplete: it does not tell you how many parameters, how many regions, how much depth, or how much data is needed to represent the structure of interest in a way that training actually finds.
A model class being a universal approximator means: for any (reasonable) target function $f^*$ and error tolerance $\epsilon$, there exist parameters $\theta$ such that $f_\theta$ approximates $f^*$ within $\epsilon$.
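One standard way to make “within $\epsilon$” precise (the uniform-norm form of the classical results, e.g., Cybenko 1989; Hornik 1991) is:

$\forall\, \epsilon > 0\ \ \exists\, \theta:\quad \sup_{x \in K} \left| f_\theta(x) - f^*(x) \right| < \epsilon$,

for any continuous $f^*$ on a compact set $K \subset \mathbb{R}^d$.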
But universality says nothing about:

- how many parameters, regions, or layers are needed to hit that tolerance;
- how much data is needed before training finds such parameters;
- whether gradient-based training will find that representation at all.
So the right question is not “can it represent it?”, but:
What is the approximation rate and the learnability of that representation under realistic training?
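To make “approximation rate” concrete, here is a toy sketch (plain NumPy, my own construction rather than anything from the literature): a one-hidden-layer ReLU network built by hand to interpolate $f^*(x) = x^2$ at equally spaced knots on $[0,1]$. Universality is satisfied trivially; the interesting quantity is how fast the sup error falls as the width grows (here roughly as $1/\text{width}^2$).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def interpolating_relu_net(f, width):
    """One-hidden-layer ReLU net whose output is the piecewise-linear
    interpolant of f at `width` + 1 equally spaced knots on [0, 1]."""
    knots = np.linspace(0.0, 1.0, width + 1)
    slopes = np.diff(f(knots)) / np.diff(knots)      # slope of the interpolant on each interval
    w_out = np.diff(slopes, prepend=0.0)             # output weight = change of slope at each knot
    b_hidden = -knots[:-1]                           # hidden unit i computes relu(x - knots[i])
    def net(x):
        h = relu(x[:, None] + b_hidden[None, :])     # hidden activations, shape (n_points, width)
        return h @ w_out + f(knots[0])
    return net

def target(x):
    return x ** 2

xs = np.linspace(0.0, 1.0, 10_001)
for width in (4, 16, 64, 256):
    net = interpolating_relu_net(target, width)
    err = np.max(np.abs(net(xs) - target(xs)))
    print(f"width {width:4d}   sup error {err:.2e}")  # error shrinks roughly like 1 / width**2
```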
Consider a standard MLP:
$h^{(0)} = x,\quad h^{(\ell+1)} = \sigma\!\left(W^{(\ell)} h^{(\ell)} + b^{(\ell)}\right), \quad f(x) = W^{(L)}h^{(L)}+b^{(L)}$.
With ReLU, $\sigma(z)=\max(z,0)$, define a gate vector at each layer:
$g^{(\ell)}(x) = \mathbf{1}\{W^{(\ell)}h^{(\ell)}(x)+b^{(\ell)} > 0\}$
Conditioned on a fixed gating pattern across layers, each ReLU unit acts as either the identity (“on”) or zero (“off”). The entire network collapses to an affine function:
$f(x) = A_{g}x + c_{g}$.
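A minimal numerical check of this collapse (plain NumPy, my own toy setup, not code from the paper): freeze the gates at a particular input, compose the per-layer affine maps into a single $(A_g, c_g)$, and confirm it reproduces the network’s output.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [3, 8, 8, 2]                                   # input, two hidden layers, output
Ws = [rng.standard_normal((m, n)) / np.sqrt(n) for n, m in zip(dims[:-1], dims[1:])]
bs = [rng.standard_normal(m) for m in dims[1:]]

x = rng.standard_normal(dims[0])

# Forward pass, recording the gate vector g^(l)(x) = 1{W h + b > 0} at each hidden layer.
h, gates = x, []
for W, b in zip(Ws[:-1], bs[:-1]):
    pre = W @ h + b
    gates.append((pre > 0).astype(float))
    h = np.maximum(pre, 0.0)
y = Ws[-1] @ h + bs[-1]
print("gate pattern:", [g.astype(int).tolist() for g in gates])

# With the gates frozen, each layer is the affine map h -> diag(g) (W h + b),
# so the whole network composes into a single affine map x -> A_g x + c_g.
A, c = np.eye(dims[0]), np.zeros(dims[0])
for W, b, g in zip(Ws[:-1], bs[:-1], gates):
    A = (g[:, None] * W) @ A                          # diag(g) W A
    c = g * (W @ c + b)                               # diag(g) (W c + b)
A = Ws[-1] @ A
c = Ws[-1] @ c + bs[-1]

print("collapse matches forward pass:", np.allclose(y, A @ x + c))   # True on this region
```

The same $(A_g, c_g)$ holds for every input that produces the same gating pattern, which is exactly the region structure described next.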
So the input space is partitioned into regions (polytopes) where the gating pattern is constant, and on each region the network is affine. This is the intuitive core formalized in A Spline Theory of Deep Networks (Balestriero & Baraniuk, 2018), which expresses such networks using max-affine spline operators (MASOs) and shows how these models implement a “template matching” view: choose an affine template based on region membership.
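One way to see the partition empirically (again a NumPy sketch with an arbitrary small random network, not anything from the paper) is to sample a dense 2-D grid of inputs and count the distinct gating patterns; each distinct pattern corresponds to one affine region the grid happens to hit.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [2, 16, 16, 1]                                 # 2-D input so the regions are easy to picture
Ws = [rng.standard_normal((m, n)) / np.sqrt(n) for n, m in zip(dims[:-1], dims[1:])]
bs = [0.1 * rng.standard_normal(m) for m in dims[1:]]

# Dense grid over [-2, 2]^2, flattened into a batch of inputs.
ticks = np.linspace(-2.0, 2.0, 400)
X = np.stack(np.meshgrid(ticks, ticks), axis=-1).reshape(-1, 2)

# Batched forward pass that concatenates the gate bits of every hidden layer
# into one boolean code per input.
H, codes = X, []
for W, b in zip(Ws[:-1], bs[:-1]):
    pre = H @ W.T + b
    codes.append(pre > 0)
    H = np.maximum(pre, 0.0)
codes = np.concatenate(codes, axis=1)                 # shape (160000, 32)

n_patterns = len(np.unique(codes, axis=0))
print(f"distinct gating patterns hit by the grid: {n_patterns}")
```

Within each of those regions the network is exactly the affine map from the previous sketch; the “template matching” view is simply: identify the region, then apply its affine template.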