Gauge Fiber Bundle Geometry of Transformers
Abstract
We give a geometry-first account of Transformers with GeLU activations. Building on a companion NeurReps paper that completely characterizes the head-wise gauge symmetries of multi-head attention, we treat the maximal head-wise symmetry group as given and study the induced geometry on the resulting quotient of functionally distinct models. On a generic regular set of parameters, this symmetry group acts freely and properly, so the parameter space fibers over a quotient manifold with gauge orbits as fibers. We equip this bundle with an Ehresmann connection induced by the ambient Euclidean metric; its horizontal spaces resolve the degeneracy of the Fisher–Rao (FR) metric along gauge directions. This framework clarifies that the natural gradient is the horizontal Riesz representative of the loss differential with respect to the FR metric on the quotient. We show that the connection has generically nonzero curvature, which implies path-dependent holonomy in parameter updates. We also clarify the roles of the attention (MHA) and feed-forward (FFN) blocks: MHA parameters carry the gauge symmetry, whereas FFN gradients are strictly horizontal because the FFN parameters are invariant under the MHA gauge group. We turn these ideas into practical diagnostics (a gauge-aware gradient split and a small-loop holonomy estimator) and report consistency checks that agree with the theory. Architectural choices such as RoPE appear as principled gauge reductions (e.g., the per-head Q/K gauge freedom drops from dₖ² to dₖ dimensions).
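To make the gauge-aware gradient split concrete, the following is a minimal NumPy sketch for a single attention head. It assumes the standard per-head Q/K gauge action W_Q ↦ W_Q M, W_K ↦ W_K M^{-⊤} with M ∈ GL(dₖ), which leaves W_Q W_Kᵀ unchanged and matches the dₖ² count above; the function names are illustrative and not the paper's implementation.

```python
# Minimal sketch (not the paper's code): split a Euclidean gradient into
# vertical (gauge) and horizontal parts for one attention head, assuming the
# gauge action W_Q -> W_Q M, W_K -> W_K M^{-T} with M invertible (d_k x d_k).
import numpy as np

def vertical_basis(W_Q, W_K):
    """Columns span the vertical (gauge) tangent subspace at (W_Q, W_K).

    Each Lie-algebra generator A = E_{ij} induces the infinitesimal motion
    (W_Q A, -W_K A^T); flattening and concatenating gives one basis column.
    """
    d_model, d_k = W_Q.shape
    cols = []
    for i in range(d_k):
        for j in range(d_k):
            A = np.zeros((d_k, d_k))
            A[i, j] = 1.0
            dWQ = W_Q @ A           # variation of W_Q along the generator
            dWK = -W_K @ A.T        # compensating variation of W_K
            cols.append(np.concatenate([dWQ.ravel(), dWK.ravel()]))
    return np.stack(cols, axis=1)   # shape: (2 * d_model * d_k, d_k**2)

def gradient_split(W_Q, W_K, g_Q, g_K):
    """Euclidean-orthogonal split of the gradient into horizontal + vertical."""
    g = np.concatenate([g_Q.ravel(), g_K.ravel()])
    V = vertical_basis(W_Q, W_K)
    coeffs, *_ = np.linalg.lstsq(V, g, rcond=None)  # project onto span(V)
    g_vert = V @ coeffs
    return g - g_vert, g_vert

# Usage: the vertical fraction ||g_vert|| / ||g|| measures how much of a raw
# gradient points along functionally irrelevant gauge directions.
rng = np.random.default_rng(0)
d_model, d_k = 16, 4
W_Q, W_K = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))
g_Q, g_K = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))
g_h, g_v = gradient_split(W_Q, W_K, g_Q, g_K)
print("vertical fraction:", np.linalg.norm(g_v) / np.linalg.norm(g_v + g_h))
```

Under the same assumptions, FFN parameters contribute no vertical directions, so their gradients are horizontal by construction, as stated in the abstract.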