<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Elements of a Vector Space - Deep learning theory</title><link href="https://elonlit.com/" rel="alternate"></link><link href="https://elonlit.com/feeds/deep-learning-theory.atom.xml" rel="self"></link><id>https://elonlit.com/</id><updated>2026-05-04T17:17:00-07:00</updated><entry><title>A Theory of Deep Learning</title><link href="https://elonlit.com/scrivings/a-theory-of-deep-learning/" rel="alternate"></link><published>2026-05-04T17:17:00-07:00</published><updated>2026-05-04T17:17:00-07:00</updated><author><name>Elon Litman</name></author><id>tag:elonlit.com,2026-05-04:/scrivings/a-theory-of-deep-learning/</id><summary type="html">&lt;p&gt;&lt;img alt="Flowers" class="invert-dark" src="/images/flowers.png" width="512px"&gt;&lt;/p&gt;
&lt;section&gt;
&lt;p&gt;Borges wrote a story about a man named Funes who, after a horseback accident, acquires the ability to perceive and remember everything. Every leaf on every tree. Every ripple on every stream at every moment. He is the perfect empiricist. Infinite data, infinite recall, infinite resolution. And he cannot think …&lt;/p&gt;&lt;/section&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Flowers" class="invert-dark" src="/images/flowers.png" width="512px"&gt;&lt;/p&gt;
&lt;section&gt;
&lt;p&gt;Borges wrote a story about a man named Funes who, after a horseback accident, acquires the ability to perceive and remember everything. Every leaf on every tree. Every ripple on every stream at every moment. He is the perfect empiricist. Infinite data, infinite recall, infinite resolution. And he cannot think. Because thinking, as Borges understood, requires forgetting. Funes could reconstruct entire days from memory but could not understand why the dog at 3:14, seen from the side, should be called the same thing as the dog at 3:15, seen from the front.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;I suspect [that Funes] was not very good at thinking. To think is to ignore (or forget) differences, to generalize, to abstract. In the teeming world of Ireneo Funes there was nothing but particulars.&lt;/em&gt;&lt;label for="sn-borges" class="margin-toggle sidenote-number"&gt;&lt;/label&gt;&lt;input type="checkbox" id="sn-borges" class="margin-toggle"&gt;&lt;span class="sidenote"&gt;Jorge Luis Borges, "Funes the Memorious," in &lt;em&gt;Ficciones&lt;/em&gt; (1944).&lt;/span&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Later in the story, Borges conjures Locke, who in the seventeenth century postulated an impossible language in which each individual thing, each stone, each bird and each branch, would have its own name. Funes projected an analogous language but discarded it because it seemed too general to him, too ambiguous. Deep learning theory has built Locke's language and is well on its way to Funes'. More parameters. More data. Deeper networks. More compute. Uniform convergence people, optimization people, NTK people, PAC-Bayes people, stability people, mean-field people, all working on the same problem, none of them speaking the same language, each proving bounds that are vacuous under the others' assumptions.&lt;/p&gt;
&lt;p&gt;Deep learning today is where chemistry was before Lavoisier: an alchemy, a practice that works built on a theory that doesn't. Everyone agrees this is a problem. Few believe it is a solvable one. At the Diffusion Group at Stanford, we have been trying for some time to answer a question most of our colleagues consider premature and quixotic: &lt;em&gt;why does deep learning work?&lt;/em&gt; We think we have an answer.&lt;/p&gt;
&lt;p&gt;But first, to see why the question is hard, start with what classical theory predicts. Classical statistical learning theory posits the bias-variance tradeoff: too simple and you underfit the data, too expressive and you overfit. Deep neural networks are highly expressive and overparameterized&amp;mdash;they have far more parameters than data points and can fit any possible labeling of the data, even a random one. During training, the network interpolates the training data perfectly, noise included, achieving zero training error. Surely, the test error should be catastrophic.&lt;label for="sn-zhang" class="margin-toggle sidenote-number"&gt;&lt;/label&gt;&lt;input type="checkbox" id="sn-zhang" class="margin-toggle"&gt;&lt;span class="sidenote"&gt;Zhang &lt;em&gt;et al.&lt;/em&gt;, "Understanding Deep Learning (Still) Requires Rethinking Generalization," &lt;em&gt;Communications of the ACM&lt;/em&gt; 64, no. 3 (2021). The original 2017 version demonstrated that standard architectures can memorize random labels, establishing that classical capacity-based explanations of generalization are insufficient.&lt;/span&gt; But then, the test error&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;is also very low.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This is called &lt;em&gt;benign overfitting.&lt;/em&gt; It violates the most basic intuition in statistical learning theory.&lt;label for="sn-benign" class="margin-toggle sidenote-number"&gt;&lt;/label&gt;&lt;input type="checkbox" id="sn-benign" class="margin-toggle"&gt;&lt;span class="sidenote"&gt;Bartlett &lt;em&gt;et al.&lt;/em&gt;, "Benign Overfitting in Linear Regression," &lt;em&gt;PNAS&lt;/em&gt; 117, no. 48 (2020).&lt;/span&gt; The network fits the training data exactly, noise and all, yet at test time the noise is somehow rendered harmless.&lt;/p&gt;
&lt;p&gt;Plotting test error against model capacity for neural networks doesn't yield the expected U-shaped curve; instead it shows &lt;em&gt;double descent.&lt;/em&gt; Test error goes up as model complexity increases, then comes back &lt;em&gt;down&lt;/em&gt; past the interpolation threshold.&lt;label for="sn-dd" class="margin-toggle sidenote-number"&gt;&lt;/label&gt;&lt;input type="checkbox" id="sn-dd" class="margin-toggle"&gt;&lt;span class="sidenote"&gt;Belkin &lt;em&gt;et al.&lt;/em&gt;, "Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-off," &lt;em&gt;PNAS&lt;/em&gt; 116, no. 32 (2019).&lt;/span&gt; At the exact moment the network gains the capacity to memorize everything, it begins to generalize.&lt;/p&gt;
&lt;p&gt;&lt;img alt="DD" class="invert-dark" src="/images/double_descent_concept.png" style="display: block; margin: auto;" width="512px"&gt;&lt;/p&gt;
&lt;p&gt;Gradient descent, given infinitely many solutions that interpolate the data, picks ones that generalize (usually low &lt;span class="math"&gt;\(\ell_2\)&lt;/span&gt;-norm, low nuclear norm, approximately low-rank). This is called &lt;em&gt;implicit bias&lt;/em&gt;.&lt;label for="sn-implicit" class="margin-toggle sidenote-number"&gt;&lt;/label&gt;&lt;input type="checkbox" id="sn-implicit" class="margin-toggle"&gt;&lt;span class="sidenote"&gt;Gunasekar &lt;em&gt;et al.&lt;/em&gt;, "Implicit Regularization in Matrix Factorization," &lt;em&gt;NeurIPS&lt;/em&gt; (2017), and Soudry &lt;em&gt;et al.&lt;/em&gt;, "The Implicit Bias of Gradient Descent on Separable Data," &lt;em&gt;JMLR&lt;/em&gt; 19 (2018).&lt;/span&gt;&lt;/p&gt;
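&lt;p&gt;The cleanest instance of implicit bias is linear: on an underdetermined least-squares problem, gradient descent initialized at zero converges to the minimum-&lt;span class="math"&gt;\(\ell_2\)&lt;/span&gt;-norm interpolant. A minimal sketch, with sizes, step size, and step count chosen arbitrarily:&lt;/p&gt;

```python
import numpy as np

# Implicit-bias sketch: 20 equations, 100 unknowns, so infinitely many exact
# solutions exist. Gradient descent from zero picks the minimum-l2-norm one,
# which coincides with the pseudo-inverse solution.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 100))
y = rng.standard_normal(20)

w = np.zeros(100)
lr = 0.005                           # step size kept small for stability
for _ in range(20000):
    w -= lr * A.T @ (A @ w - y)      # gradient of 0.5 * norm(A w - y)^2

w_minnorm = np.linalg.pinv(A) @ y    # the minimum-norm interpolant
gap = np.linalg.norm(w - w_minnorm)  # should be tiny at convergence
```

&lt;p&gt;The iterate never leaves the row space of &lt;code&gt;A&lt;/code&gt;, because every gradient lies in it; that, not any explicit penalty, is what selects the minimum-norm solution.&lt;/p&gt;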
&lt;p&gt;Lastly, in cases where the data-generating distribution is highly structured and the network doesn't possess the right inductive bias, the network memorizes the training set and then, hundreds of thousands of steps later, suddenly generalizes. This is &lt;em&gt;grokking.&lt;/em&gt;&lt;label for="sn-grokking" class="margin-toggle sidenote-number"&gt;&lt;/label&gt;&lt;input type="checkbox" id="sn-grokking" class="margin-toggle"&gt;&lt;span class="sidenote"&gt;Power &lt;em&gt;et al.&lt;/em&gt;, "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets," arXiv:2201.02177 (2022).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Our explanation is available via preprint &lt;a href="https://arxiv.org/abs/2605.01172"&gt;here&lt;/a&gt;.&lt;label for="sn-litman" class="margin-toggle sidenote-number"&gt;&lt;/label&gt;&lt;input type="checkbox" id="sn-litman" class="margin-toggle"&gt;&lt;span class="sidenote"&gt;Litman &amp;amp; Guo, "A Theory of Generalization in Deep Learning," arXiv:2605.01172.&lt;/span&gt; It comes with proofs, experiments, and an &lt;a href="https://github.com/elonlit/PopRiskMinimization"&gt;algorithm&lt;/a&gt; that allows you to train on the population risk of any model, loss function, and dataset.&lt;/p&gt;
&lt;/section&gt;
&lt;section&gt;
&lt;h2&gt;The Theory&lt;/h2&gt;
&lt;p&gt;The standard approach treats a neural network as a point in a hypothesis class, attempting to bound its complexity across billions of parameters. We propose a radical &lt;em&gt;Vereinfachung&lt;/em&gt;: abandoning the parameter space entirely. Instead, we analyze the network as a dynamical system strictly in &lt;em&gt;output space&lt;/em&gt;, focusing on how predictions evolve and where &lt;em&gt;error flows&lt;/em&gt;. Stack all training outputs into a vector &lt;span class="math"&gt;\(U_S \in \mathbb{R}^{np}\)&lt;/span&gt;. Form the Jacobian &lt;span class="math"&gt;\(J_S = D_w U_S\)&lt;/span&gt;, the matrix of partial derivatives of every output with respect to every parameter. The object that governs everything is the empirical Neural Tangent Kernel (eNTK):&lt;label for="sn-ntk" class="margin-toggle sidenote-number"&gt;&lt;/label&gt;&lt;input type="checkbox" id="sn-ntk" class="margin-toggle"&gt;&lt;span class="sidenote"&gt;Jacot &lt;em&gt;et al.&lt;/em&gt;, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks," &lt;em&gt;NeurIPS&lt;/em&gt; (2018).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;
&lt;div class="math"&gt;$$K_{SS}(w) = J_S(w) J_S(w)^\top$$&lt;/div&gt;
&lt;/p&gt;
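&lt;p&gt;Concretely, the eNTK can be assembled directly from the Jacobian. A minimal sketch using finite differences on a toy network of my own invention, not any particular architecture from the paper:&lt;/p&gt;

```python
import numpy as np

# eNTK sketch: build the Jacobian J_S of all training outputs with respect to
# all parameters (by central finite differences on a tiny scalar-output MLP),
# then form K_SS = J_S J_S^T. Sizes and the architecture are illustrative only.
rng = np.random.default_rng(2)
n, d, h = 8, 3, 16                   # training points, input dim, hidden width
X = rng.standard_normal((n, d))
w = rng.standard_normal(d * h + h)   # flat parameter vector: [W1 | w2]

def outputs(w):
    W1 = w[: d * h].reshape(d, h)
    w2 = w[d * h :]
    return np.tanh(X @ W1) @ w2      # shape (n,): one output per training point

eps = 1e-6
J = np.zeros((n, w.size))            # J_S: rows are outputs, columns parameters
for k in range(w.size):
    e = np.zeros(w.size)
    e[k] = eps
    J[:, k] = (outputs(w + e) - outputs(w - e)) / (2 * eps)

K = J @ J.T                          # empirical NTK: an n x n PSD Gram matrix
```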
&lt;p&gt;This matrix tells you, for every pair of training points, how much a gradient step on one moves the prediction on the other. Under gradient flow, the training outputs and their gradient evolve as&lt;/p&gt;
&lt;p&gt;
&lt;div class="math"&gt;$$\partial_t u = -K_{SS} g$$&lt;/div&gt;
&lt;div class="math"&gt;$$\partial_t g = -B K_{SS} g$$&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;where &lt;span class="math"&gt;\(g = \nabla \Phi_S(u)\)&lt;/span&gt; is the output gradient and &lt;span class="math"&gt;\(B = \nabla^2 \Phi_S(u)\)&lt;/span&gt; is the loss Hessian. The test outputs evolve in parallel through the cross-kernel &lt;span class="math"&gt;\(K_{QS} = J_Q J_S^\top\)&lt;/span&gt;:&lt;/p&gt;
&lt;p&gt;
&lt;div class="math"&gt;$$\partial_t U_Q = -K_{QS} g$$&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;This holds for any differentiable architecture and any convex loss, without any infinite-width or depth limit. The loss itself dissipates as&lt;/p&gt;
&lt;p&gt;
&lt;div class="math"&gt;$$\frac{d}{dt}\Phi_S(u(t)) = -g(t)^\top K_{SS}(t) \, g(t) = -\|J_S^\top g\|_2^2$$&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;Loss decreases at a rate set by the kernel. Decompose &lt;span class="math"&gt;\(g\)&lt;/span&gt; along eigenvectors &lt;span class="math"&gt;\(v_i\)&lt;/span&gt; of &lt;span class="math"&gt;\(K_{SS}\)&lt;/span&gt; with eigenvalues &lt;span class="math"&gt;\(\lambda_i\)&lt;/span&gt;. For squared loss the residual &lt;span class="math"&gt;\(r = u - y\)&lt;/span&gt; obeys &lt;span class="math"&gt;\(\partial_t r = -M(t)r\)&lt;/span&gt; where &lt;span class="math"&gt;\(M = K_{SS}/n\)&lt;/span&gt;, so the component along &lt;span class="math"&gt;\(v_i\)&lt;/span&gt; decays as &lt;span class="math"&gt;\(e^{-\lambda_i t / n}\)&lt;/span&gt;. A mode with eigenvalue &lt;span class="math"&gt;\(10\lambda\)&lt;/span&gt; is learned ten times faster. On any finite training horizon, modes below some eigenvalue threshold have barely moved. Given infinite time, all modes are interpolated, noise included.&lt;/p&gt;
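&lt;p&gt;In the fixed-kernel case this mode-wise decay can be checked numerically. A sketch with a toy kernel, initial residual, and horizon chosen arbitrarily:&lt;/p&gt;

```python
import numpy as np

# Spectral-decay sketch (fixed-kernel, squared-loss case): the residual obeys
# dr/dt = -(K/n) r, so its component along eigenvector v_i decays as
# exp(-lambda_i t / n). We Euler-integrate and compare to the closed form.
rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
K = A @ A.T                          # a fixed PSD kernel with spread eigenvalues
lam, V = np.linalg.eigh(K)

r0 = rng.standard_normal(n)          # initial residual u(0) - y
t, dt = 2.0, 1e-4
r = r0.copy()
for _ in range(int(t / dt)):
    r -= dt * (K @ r) / n            # Euler step of gradient flow

predicted = (V.T @ r0) * np.exp(-lam * t / n)   # closed-form mode-wise decay
simulated = V.T @ r                  # matches, up to Euler discretization error
```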
&lt;p&gt;In the feature learning regime, the kernel is not fixed. As the parameters move, the eigenvectors rotate and the eigenvalues shift, so signal and noise get rearranged. Here is the kernel rotating (plotted by centering and normalizing its Gram matrix, extracting eigenstructure changes relative to initialization, mapping those changes into a shaded deformed surface):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Kernel" class="invert-dark" src="/images/kernel_evolution.png" style="display: block; margin: auto;" width="512px"&gt;&lt;/p&gt;
&lt;p&gt;To capture the cumulative effect of the entire training trajectory, we take the time integral of the eNTK:&lt;/p&gt;
&lt;p&gt;
&lt;div class="math"&gt;$$\mathcal{W}_S(s,T) = \int_s^T P_g(\tau,s)^\top K_{SS}(\tau) P_g(\tau,s) \, d\tau$$&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;where &lt;span class="math"&gt;\(P_g\)&lt;/span&gt; is the propagator of the gradient ODE. The eigenvalue of &lt;span class="math"&gt;\(\mathcal{W}_S\)&lt;/span&gt; along direction &lt;span class="math"&gt;\(\psi_j\)&lt;/span&gt; is the total integrated squared reachability of that direction over the entire training window:&lt;/p&gt;
&lt;p&gt;
&lt;div class="math"&gt;$$\lambda_j = \int_s^T \|J_S(\tau)^\top P_g(\tau,s) \psi_j\|_2^2 \, d\tau$$&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;Directions with large &lt;span class="math"&gt;\(\lambda_j\)&lt;/span&gt; are where training dissipated loss. This is the &lt;em&gt;signal channel&lt;/em&gt;, &lt;span class="math"&gt;\(\text{range}(\mathcal{W}_S)\)&lt;/span&gt;. Directions with &lt;span class="math"&gt;\(\lambda_j = 0\)&lt;/span&gt; are where training dissipated nothing. This is the &lt;em&gt;reservoir&lt;/em&gt;, &lt;span class="math"&gt;\(\ker(\mathcal{W}_S)\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Now define the test transfer operator&lt;/p&gt;
&lt;p&gt;
&lt;div class="math"&gt;$$G_Q(T,s) = \int_s^T K_{QS}(\tau) P_g(\tau,s) \, d\tau$$&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;which propagates the initial gradient to test displacement: &lt;span class="math"&gt;\(U_Q(T) - U_Q(s) = -G\,g(s)\)&lt;/span&gt;. We show that &lt;em&gt;&lt;span class="math"&gt;\(G\)&lt;/span&gt; vanishes on the reservoir.&lt;/em&gt; &lt;span class="math"&gt;\(\ker \mathcal{W} \subseteq \ker G\)&lt;/span&gt;. Thus, whatever the network memorized in the reservoir is invisible at test time. The point of overparameterization, of depth, of inductive bias, is to give the kernel a spectrum that puts signal in the channel and noise in the reservoir.&lt;/p&gt;
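&lt;p&gt;In the frozen-kernel, squared-loss case everything above can be computed by quadrature, and the inclusion &lt;span class="math"&gt;\(\ker \mathcal{W} \subseteq \ker G\)&lt;/span&gt; checked directly. A toy sketch of my own, with more training outputs than parameters so the reservoir is nontrivial:&lt;/p&gt;

```python
import numpy as np

# Reservoir sketch (frozen Jacobians, squared loss): with 6 training outputs and
# only 3 parameters, K_SS = J_S J_S^T has a 3-dimensional null space. We build
# the integrated kernel W and the test transfer operator G by simple quadrature
# and check that the near-null directions of W are invisible to G.
rng = np.random.default_rng(4)
n, p, m = 6, 3, 4                    # train outputs, parameters, test outputs
J_S = rng.standard_normal((n, p))
J_Q = rng.standard_normal((m, p))
K_SS, K_QS = J_S @ J_S.T, J_Q @ J_S.T
lam, V = np.linalg.eigh(K_SS)

def P_g(tau):
    # propagator of dg/dt = -(K_SS / n) g, i.e. expm(-K_SS tau / n)
    return V @ np.diag(np.exp(-lam * tau / n)) @ V.T

dtau = 1e-3
taus = np.arange(0.0, 3.0, dtau)
W = sum(P_g(t).T @ K_SS @ P_g(t) for t in taus) * dtau
G = sum(K_QS @ P_g(t) for t in taus) * dtau

w_eigvals, w_eigvecs = np.linalg.eigh(W)        # eigenvalues in ascending order
reservoir = w_eigvecs[:, :3]         # the 3 near-null directions of W
leak = np.linalg.norm(G @ reservoir) # near zero: the reservoir is test-invisible
```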
&lt;/section&gt;
&lt;section&gt;
&lt;h2&gt;The Field, Reinterpreted&lt;/h2&gt;
&lt;p&gt;This theory unifies the major puzzles of deep learning theory under one mechanism.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Benign overfitting&lt;/em&gt; is noise sitting in the reservoir at interpolation. The network memorized the noise in the train set, but the noise is in the reservoir &lt;span class="math"&gt;\(\ker \mathcal{W}_S\)&lt;/span&gt;, which is test-invisible. It doesn't matter.&lt;label for="sn-pedagogy" class="margin-toggle sidenote-number"&gt;&lt;/label&gt;&lt;input type="checkbox" id="sn-pedagogy" class="margin-toggle"&gt;&lt;span class="sidenote"&gt;
As a pedagogical sidenote: yes, I know that in highly overparameterized networks this is technically a soft reservoir of near-zero eigenvalues rather than strictly a mathematical null space, but treating it as a hard boundary is the best way to build intuition for why that trapped noise disappears at test time.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Double descent&lt;/em&gt; is noise moving between the signal channel and the reservoir as model capacity sweeps across interpolation. At the interpolation threshold, noise briefly enters the signal channel and test error spikes. Past it, the noise gets absorbed back into the reservoir.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Implicit bias&lt;/em&gt; is the spectral schedule of &lt;span class="math"&gt;\(\mathcal{W}_S(t)\)&lt;/span&gt; filling the signal channel from the largest kernel eigenvalue down. Gradient flow learns parsimonious, high-mobility modes first and low-mobility modes last. By strictly confining its test predictions to this accumulated signal channel, the network acts as a Moore-Penrose pseudo-inverse over the realized path, effectively finding the minimum-norm solution in the dynamic feature space rather than the static parameter space.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Grokking&lt;/em&gt; is signal migrating from the reservoir into the signal channel as the kernel evolves over training. The network memorizes first (fast noise-fitting modes saturate early), then generalizes later (slow signal modes finally enter the signal channel).&lt;/p&gt;
&lt;p&gt;By the way: the same operators that explain generalization also give you a way to train directly on population risk. Treating each training point in a minibatch as a one-point held-out test set against the rest and localizing to a single optimizer step collapses the operator expression to a per-parameter rule: update parameter &lt;span class="math"&gt;\(k\)&lt;/span&gt; if and only if &lt;/p&gt;
&lt;p&gt;
&lt;div class="math"&gt;$$\mu_k^2 &amp;gt; \frac{\sigma_k^2}{b-1}$$&lt;/div&gt;
&lt;/p&gt;
&lt;p&gt;That is, if the batch signal on a parameter exceeds its leave-one-out noise, update it; if not, skip it. This is a one-line change to Adam that accelerates grokking by &lt;span class="math"&gt;\(5 \times\)&lt;/span&gt;, suppresses memorization in PINNs, and improves DPO fine-tuning, eliminating the need for validation sets entirely.&lt;/p&gt;
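&lt;p&gt;The gating criterion itself is a few lines. What follows is my own minimal sketch of the test &lt;span class="math"&gt;\(\mu_k^2 &amp;gt; \sigma_k^2/(b-1)\)&lt;/span&gt; applied to per-example gradients of a toy linear model; it is not the paper's Adam integration, and every name in it is hypothetical:&lt;/p&gt;

```python
import numpy as np

# Hypothetical sketch of the per-parameter gating rule mu_k^2 > sigma_k^2/(b-1)
# on a toy linear-regression minibatch, applied to a plain SGD step. This is an
# illustration of the criterion only, not the paper's algorithm.
rng = np.random.default_rng(5)
b, d = 32, 10                        # batch size, number of parameters
X = rng.standard_normal((b, d))
w = rng.standard_normal(d)
y = X @ rng.standard_normal(d)       # labels from a different "true" weight

residual = X @ w - y                 # per-example residuals, shape (b,)
per_example_grads = X * residual[:, None]       # shape (b, d): one grad per row

mu = per_example_grads.mean(axis=0)             # batch signal per parameter
sigma2 = per_example_grads.var(axis=0, ddof=1)  # unbiased per-parameter variance
mask = mu ** 2 > sigma2 / (b - 1)    # update only where signal beats the
                                     # leave-one-out noise estimate
lr = 0.1
w -= lr * mask * mu                  # masked gradient step
```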
&lt;/section&gt;
&lt;section&gt;
&lt;h2&gt;What the Future Holds&lt;/h2&gt;
&lt;p&gt;The math indicates several exciting areas of research on the horizon. The first implication is that we have been training neural networks with a tragic amount of waste. Gradient descent currently functions as a pointwise simulation of a dynamical system whose asymptotic behavior we can characterize in closed form. This exact characterization is possible because in output space, training dynamics can be understood through a locally linear differential equation along the realized path, where dominant eigenmodes of the evolving kernel equilibrate exponentially fast. Forcing an optimizer to slowly step through these solved directions is highly inefficient and suggests a path to analytically jump to the final network state.&lt;/p&gt;
&lt;p&gt;Our theory also provides the foundation necessary to train neural networks directly on the population risk, completely bypassing the fundamental compromise of machine learning. Moving away from pure empirical risk minimization allows networks to target true generalization natively during the training process, eliminating overfitting as we understand it.&lt;/p&gt;
&lt;p&gt;Finally, understanding that overparameterization primarily serves to create a larger test-invisible reservoir invites a fundamental rethinking of model architecture. We can now explore whether it is possible to achieve the generalization benefits of infinite scale by designing smaller, highly efficient models that optimally sequester label noise. &lt;span class="math"&gt;\(\blacksquare\)&lt;/span&gt;&lt;/p&gt;
&lt;/section&gt;</content><category term="Deep learning theory"></category></entry></feed>