Complete Characterization of Gauge Symmetries in Transformer Architectures
Abstract
Modern Transformers possess redundant parameter symmetries that leave their function unchanged. We establish the complete gauge group structure for the canonical Transformer family, which encompasses standard architectures including GPT-2, BERT, LLaMA, and Qwen. For canonical Transformers with standard multi-head attention, we prove global maximality: the gauge group equals exactly G_max = ((GL(d_k))^h × (GL(d_v))^h) ⋊ S_h on the generic stratum where projection matrices have full column rank and head-wise attention controllability holds. For architectures with rotary position embeddings (RoPE) or relative encodings, as used in LLaMA and Qwen, the gauge group becomes G_RoPE = ((C_RoPE)^h × (GL(d_v))^h) ⋊ S_h, where C_RoPE is the commutant of the position-dependent rotations, typically reducing to (GL(1,ℂ))^{d_k/2} for standard RoPE implementations. We prove maximality through three key results: characterizing the Lie algebra of infinitesimal symmetries as 𝔤_max = ⨁_{i=1}^h 𝔤𝔩(d_k) ⊕ ⨁_{i=1}^h 𝔤𝔩(d_v) for canonical models, establishing that attention weights must be preserved up to head permutation under gauge equivalence, and demonstrating that query–key and value–output transformations necessarily factorize independently. These gauge symmetries persist through LayerNorm and extend to complete architectures, with the full model gauge group being G_Model = ∏_{l=1}^L G_Layer^{(l)}. Our characterization reveals over 1.1 million redundant dimensions in a 110M-parameter Transformer Base model. Experiments on pretrained GPT-2 models from 124M to 1.5B parameters confirm that valid gauge transformations preserve model outputs to machine precision, while invalid transformations produce large errors, empirically supporting maximality.
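To make the per-head gauge move concrete, the sketch below (an illustrative toy, not the paper's code) numerically checks it for a single attention head: right-multiply W_Q by an arbitrary A ∈ GL(d_k) while right-multiplying W_K by A^{-T}, and right-multiply W_V by B ∈ GL(d_v) while left-multiplying W_O by B^{-1}. All names and shapes (d_model, d_k, d_v, the toy sizes) are assumptions made for illustration; an invalid move outside the gauge group is included to mirror the valid/invalid comparison in the experiments.

```python
# Minimal numerical sketch (assumed toy setup, not the paper's implementation):
# verifies that the per-head GL(d_k) x GL(d_v) gauge transformation leaves a
# single attention head's output unchanged, while a non-gauge move does not.
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 8, 64, 16, 16  # illustrative sizes

X = rng.standard_normal((n, d_model))
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))
W_O = rng.standard_normal((d_v, d_model))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head(X, W_Q, W_K, W_V, W_O):
    # Single-head scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V W_O.
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[1])
    return softmax(scores) @ (X @ W_V) @ W_O

# Well-conditioned invertible gauge elements A in GL(d_k), B in GL(d_v).
A = np.eye(d_k) + 0.1 * rng.standard_normal((d_k, d_k))
B = np.eye(d_v) + 0.1 * rng.standard_normal((d_v, d_v))

out_ref = head(X, W_Q, W_K, W_V, W_O)

# Valid gauge move: W_Q -> W_Q A, W_K -> W_K A^{-T}, W_V -> W_V B, W_O -> B^{-1} W_O.
out_gauged = head(X, W_Q @ A, W_K @ np.linalg.inv(A).T,
                  W_V @ B, np.linalg.inv(B) @ W_O)
print("valid gauge move,  max |diff|:",
      np.abs(out_ref - out_gauged).max())   # near machine precision

# Invalid move: transform W_V without compensating W_O -> the function changes.
out_broken = head(X, W_Q, W_K, W_V @ B, W_O)
print("invalid move,      max |diff|:",
      np.abs(out_ref - out_broken).max())   # large, far from machine precision
```

As a back-of-envelope check of the dimension count quoted above, assume a Base-scale configuration of L = 12 layers, h = 12 heads, and d_k = d_v = 64 (an assumption; the abstract does not fix these values): counting only the continuous attention gauge directions gives h(d_k² + d_v²) = 98,304 per layer, hence 12 × 98,304 = 1,179,648 ≈ 1.1M redundant dimensions, consistent with the figure stated in the abstract.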