Lyapunov Exponents for Attention Composition

AI Research • January 9, 2026

First Lyapunov exponent framework for analyzing eigenvalue dynamics in composed attention layers, bridging transformer theory with dynamical systems.

Python

Deep Learning

Transformers

Dynamical Systems

Research Paper

Key Features

•First computation of full Lyapunov spectrum for attention products
•Proof that Lambda_1 = 0 exactly and Lambda_k < 0 for k > 1
•Temperature-spectral gap relationship quantification
•Refined closed-form formula for rank collapse prediction
•Residual connection mechanism analysis (2.4x contraction reduction)
•All theoretical results experimentally verified

Abstract

I develop the first Lyapunov exponent framework for analyzing eigenvalue dynamics in composed attention layers. Building on foundational rank collapse results, I provide novel tools connecting transformer theory to dynamical systems.

Theoretical Framework

For a sequence of attention matrices $A_{1}, A_{2}, \dots, A_{L}$ , the $k$ -th Lyapunov exponent is defined as:

$Λ_{k} = lim_{L \to \infty} \frac{1}{L} lo g ∣ σ_{k} (P_{L}) ∣$

where $σ_{k}$ denotes the $k$ -th singular value of the product $P_{L} = A_{1} A_{2} \dots A_{L}$ .

Key Results

Theorem 1: Dominant Lyapunov Exponent

For any sequence of row-stochastic attention matrices:

$Λ_{1} = 0$

Proof: The all-ones vector $1$ satisfies $A 1 = 1$ for any stochastic $A$ , so $Λ_{1} = lim_{L \to \infty} \frac{1}{L} lo g (1) = 0$ .

Theorem 2: Contraction Exponents

For i.i.d. random attention matrices with spectral gap $γ < 1$ :

$Λ_{k} < 0 for all k > 1$

Theorem 3: Collapse Prediction Formula

The number of layers until effective rank drops below threshold $r$ is:

$L_{collapse} = \frac{l o g ( \frac{r - 1}{d - 1} )}{l o g γ}$

where $d$ is dimension and $γ = ∣ λ_{2} ∣$ is the second eigenvalue magnitude.

Experimental Results

Lyapunov Spectrum (d=50, T=1.0, L=100 layers):

$Λ_{1} = 0.000000$ (std dev $< 1 0^{- 15}$ ), verified to machine precision
$Λ_{2} = - 1.790$ (std dev $0.013$ )
$Λ_{3} = - 1.805$ (std dev $0.011$ )

Temperature Effect on Spectral Collapse

Lower temperature leads to sharper attention and faster collapse:

T = 0.5: $∣ λ_{2} ∣ = 0.417$ , $Λ_{2} = - 0.875$ (Slowest)
T = 1.0: $∣ λ_{2} ∣ = 0.195$ , $Λ_{2} = - 1.636$ (Moderate)
T = 2.0: $∣ λ_{2} ∣ = 0.080$ , $Λ_{2} = - 2.524$ (Fast)

Residual Connection Analysis

Residual connections reduce $∣ Λ_{2} ∣$ by factor $\approx 2.4 \times$ , slowing information loss through layers.

With residual connections, gradients at layer 1 improve from $5.5 \times 1 0^{- 7}$ to $2.7 \times 1 0^{- 3}$ , a $\sim 5000 \times$ improvement.

Citation

@misc{gibbs2026lyapunov,
  title={Lyapunov Exponents for Attention Composition: A Dynamical Systems Perspective on Deep Transformers},
  author={Gibbs, Tyler},
  year={2026},
  publisher={Zenodo},
  doi={10.5281/zenodo.18202128}
}