Clustering in Self-Attention Dynamics with Wasserstein-Fisher-Rao Gradient Flows
Ziang Chen · Yury Polyanskiy · Philippe Rigollet
Abstract
Motivated by the Wasserstein gradient flow structure and the clustering behavior of transformers (Geshkovski et al., 2023; Geshkovski et al., 2025), we study the Wasserstein–Fisher–Rao (WFR) gradient flow for minimizing an energy functional on the sphere induced by a pairwise potential. We show that the WFR gradient flow for self-attention dynamics almost surely converges to a single-cluster configuration, avoiding entrapment in multi-cluster states. Furthermore, we extend this result to the Kuramoto model. Numerical experiments corroborate our theoretical findings.