Your Transformer is Secretly an EOT Solver
Elon Litman
October 21, 2025
Filed under “Deep learning”
Nine months ago, or was it ten? I've learned not to trust my sense of time when deep in a problem. Anyone faced with a major deadline or enveloped in the mire of one particular question for an extended period intuitively understands that Time becomes something else then. Not the measured progression of hours, but rather a kind of thickness, a medium through which one moves with increasing difficulty, like a swimmer who has ventured too far from shore and finds the water has become viscous, resistant. I was doing what I had seen others do, what perhaps I had been doing all along without quite admitting it to myself: trying to invent yet another subquadratic attention mechanism. Yes, another one. As if the world needed more of these things. (This is perhaps too cynical. If you are the future inventor of FlashAttention 4, thank you. The world does need efficient implementations, and even asymptotic improvements in memory complexity, but such gains are best measured in reduced constant factors rather than in shaving \(\mathcal{O}(\log n + \log\log n)\) time down to \(\mathcal{O}(\log n + \frac{\log\log n}{\log\log\log n})\).) There are already so many. I knew this. You know this. Anyone who works in this field knows this. And yet we persist, as men have always persisted in rebuilding Babel, each generation convinced that this time, with these new tricks, with some recherché materials, with this understanding our predecessors lacked, the tower will stand.
But mine, of course, was going to be different. Mine was going to be new and quirky. Mine was going to reflect practical considerations: real constants and implementation details rather than mere Big-O algorithmic obscurantism.
The idea, such as it was, involved an approximation. But not just any approximation. This was to be an approximation that became asymptotically accurate as \(n\) (sequence length) grew. Which is to say: the kind of thing that's wrong in all the ways that don't matter and right in the one way that does. There was, naturally, some unanticipated problem. There is always some unanticipated problem. It's practically a theorem in this field.
Why Asymptotically Accurate Subquadratic Attention is Impossible
Let me explain the nature of that unanticipated problem I mentioned. It is mathematically impossible for any subquadratic attention mechanism to be asymptotically accurate on all inputs.
The intuition is a \(\textit{needle in a haystack}\) construction. The softmax function requires knowing all query-key similarities in each row to compute the normalization constant correctly. If an algorithm inspects only \(o(n^2)\) query-key pairs, then by the pigeonhole principle, there must exist at least one query row \(i^*\) for which it examines only \(o(n)\) of the \(n\) possible keys. Place an arbitrarily large similarity score in one of the uninspected positions, the "needle," and the algorithm will fail catastrophically because it lacks information essential to computing the attention weights.
More formally, define a subquadratic algorithm as one computing \(o(n^2)\) query-key dot products, and asymptotically accurate as producing output that converges to the exact attention matrix in the limit as \(n \to \infty\). Assume, for contradiction, that such an algorithm \(\mathcal{A}\) exists.
By subquadraticity, the average number of keys inspected per query is \(o(n)\). Therefore, for all sufficiently large \(n\), there exists a query index \(i^*\) for which some key index \(j^*\) goes uninspected. Now construct an adversarial input: let \(q_{i^*} = e_1\), let \(k_{j^*} = M_n e_1\) where \(M_n = \sqrt{d_k}\,\ln(n^2)\) (so the scaled similarity \(q_{i^*}^\top k_{j^*}/\sqrt{d_k}\) equals \(\ln(n^2)\)), let \(v_{j^*} = e_1\), and set all other queries, keys, and values to zero vectors. The exact attention output for row \(i^*\) places weight \(n^2/(n^2 + n - 1)\) on position \(j^*\), so it converges to \(e_1\) as \(n \to \infty\). But since \(\mathcal{A}\) never inspects \((i^*, j^*)\), it has no information about this dominant contribution and must output approximately zero, yielding a non-vanishing error lower bound of \(\|e_1\|_2 = 1\) in the limit.
The contradiction arises from the global normalization inherent in softmax: computing accurate attention weights requires complete information about all query-key similarities in each row. A subquadratic algorithm cannot provide this complete information for all rows simultaneously, making asymptotic accuracy impossible.
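To make the construction concrete, here is a small numerical sketch of my own (not from the paper). It compares exact attention on the needle row against `sparse_row`, a stand-in for any subquadratic scheme that only ever looks at the keys it inspects and, adversarially, misses \(j^*\).

```python
import numpy as np

def exact_row(q, K, V):
    """Exact scaled dot-product attention output for a single query q."""
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)                 # all n query-key similarities
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def sparse_row(q, K, V, inspected):
    """Stand-in for a subquadratic scheme: only the keys in `inspected` are seen."""
    d_k = K.shape[1]
    scores = K[inspected] @ q / np.sqrt(d_k)      # o(n) similarities only
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[inspected]

n, d_k = 4096, 64
rng = np.random.default_rng(0)

# Adversarial input: everything is zero except the needle at position j*.
q_star = np.zeros(d_k); q_star[0] = 1.0               # q_{i*} = e_1
K = np.zeros((n, d_k)); V = np.zeros((n, d_k))
j_star = 1234
K[j_star, 0] = np.sqrt(d_k) * np.log(n**2)            # k_{j*} = M_n e_1
V[j_star, 0] = 1.0                                     # v_{j*} = e_1

# A sparse method that inspects o(n) keys and happens to miss j*.
inspected = rng.choice(np.delete(np.arange(n), j_star), size=64, replace=False)

exact = exact_row(q_star, K, V)
approx = sparse_row(q_star, K, V, inspected)
print("exact[0]  =", exact[0])                         # close to 1: mass concentrates on j*
print("approx[0] =", approx[0])                        # exactly 0: the needle is never seen
print("error     =", np.linalg.norm(exact - approx))   # close to 1, does not vanish with n
```

The exact row output is essentially \(e_1\), the sparse output is exactly zero, and the gap does not shrink as \(n\) grows.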
So I walked away. I came back. I walked away again. And then, I had a thought. The thought was this: perhaps there is no way forward through brute force. Perhaps the only way through is to find some hidden mathematical structure, some symmetry or equivalence that's been sitting there all along, waiting for someone to notice it.
That thought, as it happens, was correct. Which is how I ended up writing this paper. (This blog post is a companion to my arXiv paper "Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport" (arXiv:2508.08369). That paper extends this analysis to show that the backward-pass gradient is mathematically identical to an advantage-based policy gradient, and demonstrates how the EOT formulation induces a specific information geometry characterized by the Fisher Information Matrix. The unified view reveals attention as a mechanism where the forward pass performs optimal inference and the backward pass implements a manifold-aware learning update.) The paper, to be clear, has nothing to do with subquadratic attention. So I'm telling the story here: how trying to build something uninteresting led, through an impossibility result, to something genuinely interesting, an exact equivalence between scaled dot-product attention and entropic optimal transport.
Attention as Entropic Optimal Transport
We will demonstrate the capstone result of the paper: that the scaled dot-product attention weights for a collection of queries are equivalent to the unique solution of a particular one-sided entropic optimal transport problem. Specifically, one where the costs are given by the negative query-key dot products, the source space is equipped with the counting measure, and the target space is left unconstrained.
We begin by constructing this optimal transport problem. Consider the standard setup for computing attention weights for a set of \(n\) query vectors \(q_1,\dots,q_n\), collected rowwise into a matrix \(Q \in \mathbb{R}^{n \times d_k}\). These attend over a collection of \(m\) key vectors \(k_1,\dots,k_m\), collected rowwise into a matrix \(K \in \mathbb{R}^{m \times d_k}\).
Our objective is to formulate an optimal transport problem whose unique solution is exactly the attention weight matrix \(A \in \mathbb{R}_{+}^{n \times m}\), where each component \(A_{ij}\) is defined by scaled dot-product attention with temperature \(\tau = \sqrt{d_k}\):
\[
A_{ij} = \frac{\exp\!\left(q_i^\top k_j / \tau\right)}{\sum_{l=1}^{m} \exp\!\left(q_i^\top k_l / \tau\right)}.
\]
Each row of \(A\) sums to one, so \(A\) is row-stochastic.
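For readers who prefer code, here is a minimal NumPy sketch of this definition (the function name `attention_weights` is mine, not the paper's). It builds \(A\) with temperature \(\tau = \sqrt{d_k}\) and checks that it is row-stochastic and strictly positive.

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights A = softmax(Q K^T / sqrt(d_k))."""
    tau = np.sqrt(Q.shape[1])                     # temperature tau = sqrt(d_k)
    scores = Q @ K.T / tau                        # (n, m) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=1, keepdims=True)       # row-wise normalization

rng = np.random.default_rng(0)
n, m, d_k = 5, 7, 16
Q, K = rng.standard_normal((n, d_k)), rng.standard_normal((m, d_k))

A = attention_weights(Q, K)
assert np.allclose(A.sum(axis=1), 1.0)   # A is row-stochastic
assert np.all(A > 0)                     # and strictly positive
```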
The formulation proceeds by specifying the source and target index sets, the measures on them, the feasible couplings, the cost, and the entropic regularization. (The entropic regularization of optimal transport, which makes the problem computationally tractable, was introduced by Marco Cuturi in his seminal 2013 paper "Sinkhorn Distances: Lightspeed Computation of Optimal Transport." This regularization trades exact optimality for numerical stability and scalability.)
Query source space and measure
The source index set is \(\{1,2,\dots,n\}\), one index per query vector. The source measure \(a\) assigns unit mass to every source index. In other words, \(a_i = 1\) for all \(i\), so the total source mass is \(n\). You can think of each query \(q_i\) as emitting one unit of attention mass to be distributed across the target indices.
Key target space
The target index set is \(\{1,2,\dots,m\}\), one index per key vector. Each \(j\) is a potential destination for mass originating at any source \(i\).
A critical choice is how to treat the target marginal \(b\) on the target set. Unlike classical optimal transport, which fixes both the source and target marginals, we leave the target marginal free. The distribution of mass arriving at the targets is not constrained ahead of time and will be determined by the optimization. This is a one-sided optimal transport problem.
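A quick way to see the one-sidedness numerically (my own illustration, not code from the paper): the row sums of the attention matrix are pinned to one, while the column sums come out as whatever the optimization produces.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d_k = 6, 4, 8
Q, K = rng.standard_normal((n, d_k)), rng.standard_normal((m, d_k))

scores = Q @ K.T / np.sqrt(d_k)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

print(A.sum(axis=1))  # rows: constrained to 1 (source marginal a_i = 1)
print(A.sum(axis=0))  # columns: unconstrained, determined by the optimization
```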
Transport plans and feasibility
A transport plan is a nonnegative matrix \(P \in \mathbb{R}_{+}^{n \times m}\), where \(P_{ij}\) is the amount of mass sent from source \(i\) to target \(j\). Feasibility is enforced only through the source constraints
\[
\sum_{j=1}^{m} P_{ij} = 1 \quad \text{for all } i \in \{1,\ldots,n\}.
\]
There are no column constraints because the target marginal is free. The feasible set is therefore
\[
\mathcal{U} = \left\{ P \in \mathbb{R}_{+}^{n \times m} \;:\; \sum_{j=1}^{m} P_{ij} = 1 \ \text{for all } i \right\}.
\]
Each row of any feasible \(P\) lies in the standard probability simplex of dimension \(m-1\).
Cost and entropy
We need a transport cost \(C_{ij}\) for moving mass from \(i\) to \(j\). Attention prefers high similarity \(q_i^\top k_j\), while optimal transport minimizes cost. We therefore define cost as negative similarity:
\[
C_{ij} = -\,q_i^\top k_j.
\]
The total cost of a plan \(P\) is
\[
\langle C, P \rangle = \sum_{i=1}^{n}\sum_{j=1}^{m} C_{ij} P_{ij} = -\sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij}\, q_i^\top k_j,
\]
so minimizing cost is the same as maximizing the similarity-weighted sum.
To entropically regularize the problem, we use the matrix Shannon entropy
\[
H(P) = -\sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij} \log P_{ij},
\]
with the convention \(0 \log 0 = 0\). We scale the negative entropy by a positive parameter \(\varepsilon\) and choose \(\varepsilon\) to match the attention temperature, \(\varepsilon = \tau = \sqrt{d_k}\).
Objective and constrained problem
Combine cost and negative entropy to form
\[
J(P) = \langle C, P \rangle - \varepsilon H(P).
\]
Substituting the expressions for cost and entropy gives
\[
J(P) = -\sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij}\, q_i^\top k_j + \varepsilon \sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij} \log P_{ij}.
\]
The generalized, one-sided entropic optimal transport problem that matches full scaled dot-product attention is
\[
P^\star = \operatorname*{arg\,min}_{P \in \mathcal{U}} \; J(P).
\]
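As a sanity check on this objective, the sketch below (my own code; the function name `objective` and the random-perturbation test are assumptions of mine, not material from the paper) evaluates \(J\) at the attention matrix and at a thousand random feasible plans; the attention matrix always attains the smallest value.

```python
import numpy as np

def objective(P, Q, K, eps):
    """J(P) = <C, P> - eps * H(P), with C_ij = -q_i^T k_j and 0 log 0 = 0."""
    cost = -np.sum((Q @ K.T) * P)
    P_safe = np.where(P > 0, P, 1.0)          # log(1) = 0 implements 0 log 0 = 0
    neg_entropy = np.sum(P * np.log(P_safe))
    return cost + eps * neg_entropy

rng = np.random.default_rng(0)
n, m, d_k = 4, 6, 9
Q, K = rng.standard_normal((n, d_k)), rng.standard_normal((m, d_k))
eps = np.sqrt(d_k)                            # epsilon = tau = sqrt(d_k)

# Attention matrix A, the candidate minimizer.
S = Q @ K.T / eps
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

J_A = objective(A, Q, K, eps)
for _ in range(1000):
    P = rng.random((n, m))                    # random feasible plan:
    P /= P.sum(axis=1, keepdims=True)         # nonnegative, unit row sums
    assert objective(P, Q, K, eps) >= J_A - 1e-9
print("A attains the smallest J(P) among all sampled feasible plans.")
```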
Existence and Uniqueness (theorem only)
Let the source be \(\{1,\dots,n\}\) and the target be \(\{1,\dots,m\}\). Let \(q_i\) and \(k_j\) be finite vectors in the common key dimension \(d_k\). Let the feasible set be
\[
\mathcal{U} = \left\{ P \in \mathbb{R}_{+}^{n \times m} \;:\; \sum_{j=1}^{m} P_{ij} = 1 \ \text{for all } i \in \{1,\dots,n\} \right\}.
\]
For any \(\varepsilon > 0\), the functional
\[
J(P) = \sum_{i=1}^{n}\sum_{j=1}^{m} C_{ij} P_{ij} + \varepsilon \sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij} \log P_{ij}, \qquad C_{ij} = -\,q_i^\top k_j,
\]
attains a unique minimum \(P^\star\) on \(\mathcal{U}\).
EOT solution equals the attention matrix
With existence and uniqueness of the minimizer for the generalized entropic optimal transport problem over all queries and keys in hand, we now derive its analytical form and show that it equals the scaled dot-product attention matrix.
Let \(A = (A_{ij})\) be the attention matrix with temperature \(\tau = \sqrt{d_k}\):
\[
A_{ij} = \frac{\exp\!\left(q_i^\top k_j / \tau\right)}{\sum_{l=1}^{m} \exp\!\left(q_i^\top k_l / \tau\right)}.
\]
Consider the one-sided entropic OT problem
\[
\min_{P \in \mathcal{U}} \; J(P) = -\sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij}\, q_i^\top k_j + \varepsilon \sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij} \log P_{ij}
\]
with \(\varepsilon = \tau\). The unique minimizer \(P^\star\) equals \(A\).
Proof
We minimize \(J(P)\) subject to the \(n\) equality constraints \(\sum_{j=1}^m P_{ij} = 1\) for each \(i\), with the nonnegativity constraints \(P_{ij} \ge 0\). Introduce Lagrange multipliers \(\lambda_1,\dots,\lambda_n\) for the row-sum constraints and write the Lagrangian
\[
\mathcal{L}(P,\lambda) = \sum_{i=1}^{n}\sum_{j=1}^{m} C_{ij} P_{ij} + \varepsilon \sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij} \log P_{ij} - \sum_{i=1}^{n} \lambda_i \left( \sum_{j=1}^{m} P_{ij} - 1 \right).
\]
Take the partial derivative with respect to an arbitrary \(P_{ij}\) and set it to zero at the optimum. For any optimal solution with strictly positive entries (which will hold here), the stationarity condition is
\[
\frac{\partial \mathcal{L}}{\partial P_{ij}} = C_{ij} + \varepsilon \left( \log P_{ij} + 1 \right) - \lambda_i = 0.
\]
Solve for \(P_{ij}\):
\[
P_{ij} = \exp\!\left( \frac{\lambda_i - C_{ij}}{\varepsilon} - 1 \right) = \exp\!\left( \frac{\lambda_i}{\varepsilon} - 1 \right) \exp\!\left( -\frac{C_{ij}}{\varepsilon} \right).
\]
Let \(Z_i = \exp\left(\frac{\lambda_i}{\varepsilon} - 1\right)\), which depends only on the row index \(i\). Then, using \(C_{ij} = -q_i^\top k_j\),
\[
P_{ij} = Z_i \exp\!\left( \frac{q_i^\top k_j}{\varepsilon} \right).
\]
Enforce the row-sum constraint to determine \(Z_i\):
\[
\sum_{j=1}^{m} P_{ij} = Z_i \sum_{j=1}^{m} \exp\!\left( \frac{q_i^\top k_j}{\varepsilon} \right) = 1
\quad\Longrightarrow\quad
Z_i = \frac{1}{\sum_{l=1}^{m} \exp\!\left( q_i^\top k_l / \varepsilon \right)}.
\]
Substitute back:
\[
P^\star_{ij} = \frac{\exp\!\left( q_i^\top k_j / \varepsilon \right)}{\sum_{l=1}^{m} \exp\!\left( q_i^\top k_l / \varepsilon \right)}.
\]
Set \(\varepsilon = \tau\) and obtain
\[
P^\star_{ij} = \frac{\exp\!\left( q_i^\top k_j / \tau \right)}{\sum_{l=1}^{m} \exp\!\left( q_i^\top k_l / \tau \right)} = A_{ij}.
\]
So the unique minimizer \(P^\star\) equals the scaled dot-product attention matrix \(A\). Positivity holds because exponentials are positive and the denominator is positive as long as dot products are finite. The solution lies in the relative interior and satisfies the KKT conditions. Uniqueness follows from strict convexity of \(J(P)\) on the convex feasible set.
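The theorem can also be checked numerically. The sketch below is mine, not the paper's: it minimizes \(J\) directly with exponentiated-gradient (mirror descent) steps, which keep every row of \(P\) on the probability simplex, and the iterates converge to the softmax attention matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d_k = 3, 5, 16
Q, K = rng.standard_normal((n, d_k)), rng.standard_normal((m, d_k))
eps = np.sqrt(d_k)                               # epsilon = tau = sqrt(d_k)
C = -(Q @ K.T)                                   # cost C_ij = -q_i^T k_j

# Exponentiated-gradient (mirror) descent on each row of P, starting uniform.
P = np.full((n, m), 1.0 / m)
eta = 0.1
for _ in range(2000):
    grad = C + eps * (np.log(P) + 1.0)           # dJ/dP_ij
    P = P * np.exp(-eta * grad)                  # multiplicative update
    P /= P.sum(axis=1, keepdims=True)            # project back onto row simplices

# Closed-form solution: scaled dot-product attention with tau = sqrt(d_k).
S = Q @ K.T / eps
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

print("max |P* - A| =", np.abs(P - A).max())     # essentially machine precision
```

In log-space the per-row update is a contraction with factor \(|1 - \eta\varepsilon|\) toward the closed-form solution, so a modest number of iterations already reaches machine precision; the step size \(\eta\) is a choice of mine, needing only \(\eta\varepsilon < 2\).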
The Discovery
I should also mention that this is not, in the strictest sense, a full transport problem. In classical optimal transport, you fix both where the mass starts and where it ends up. But here, we only constrain the sources. Each query emits exactly one unit of attention mass (a unit Dirac measure), and we let the optimization decide how that mass gets distributed across the keys. There are no constraints on the target side. This makes it a one-sided transport problem, which turns out to be exactly what attention needs. If we had fixed both marginals, we would have gotten something else entirely, something that wouldn't match the softmax structure at all.
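To make that contrast concrete, here is a sketch under assumptions of my own (uniform marginals on both sides and a hand-rolled Sinkhorn loop; none of this is code from the paper). Fixing the target marginal as well and running Sinkhorn's alternating normalizations yields a doubly constrained plan whose rows no longer match the softmax attention weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d_k = 4, 4, 8                    # n = m so both uniform marginals are compatible
Q, K = rng.standard_normal((n, d_k)), rng.standard_normal((m, d_k))
eps = np.sqrt(d_k)

Gibbs = np.exp(Q @ K.T / eps)          # kernel exp(-C/eps) with C = -Q K^T

# One-sided problem (free target marginal): normalize the rows -> softmax attention.
A = Gibbs / Gibbs.sum(axis=1, keepdims=True)

# Two-sided problem: fix both marginals (one unit per row AND per column)
# and solve with Sinkhorn's alternating scalings.
u, v = np.ones(n), np.ones(m)
for _ in range(500):
    u = np.ones(n) / (Gibbs @ v)
    v = np.ones(m) / (Gibbs.T @ u)
P_two_sided = u[:, None] * Gibbs * v[None, :]

print("row sums   :", P_two_sided.sum(axis=1))   # rows are 1 ...
print("column sums:", P_two_sided.sum(axis=0))   # ... and now the columns are too
print("max |A - P_two_sided| =", np.abs(A - P_two_sided).max())  # typically well away from 0
```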
Memory is unreliable, especially when it comes to invention. I don't remember the precise \(\textit{Eureka!}\) moment of stumbling across this equivalence, though I suspect tangential work on Sinkhorn attention warmed me up. (Sinkhorn attention, introduced by Tay et al. in "Sparse Sinkhorn Attention," uses the Sinkhorn algorithm, the iterative alternating row/column normalization used to solve entropic optimal transport, to approximate attention with linear complexity. The connection to optimal transport was already implicit in their approach.) Looking back, I can see the proof clearly.
Apparently this is not uncommon. It reminds me of Unix. Brian Kernighan admits he coined "Unics" as a pun during the early Bell Labs days, but even he and his collaborators couldn't determine exactly who decided on the final spelling "Unix." Ritchie and McIlroy credited Kernighan; Peter Neumann recalled that Bell Labs PR likely influenced the change to avoid the "eunuchs" joke. No one remembers. (Wikipedia)
What happens is that thinking is computationally expensive: you explore dead ends, trace connections that lead nowhere, hold contradictory possibilities simultaneously. When you arrive at a solution, you amortize the cost by compressing the search path into the final answer and discarding intermediate states. This is your mind pruning its search history after convergence, like AlphaGo discarding its tree search after the move is chosen. The moment you see it, the path to seeing it disappears. Once the connection is made, the confusion that preceded it becomes incomprehensible because the information-theoretic cost of preserving that confusion exceeds any utility it might have held. \(\blacksquare\)