

Poster in Workshop: Mathematics of Modern Machine Learning (M3L)

Transformers as Support Vector Machines

Davoud Ataee Tarzanagh · Yingcong Li · Christos Thrampoulidis · Samet Oymak


Abstract: The transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens $X$ and makes them interact through pairwise similarities computed as $\text{softmax}(XQK^\top X^\top)$, where $(K,Q)$ are the trainable key-query parameters. In this work, we establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer-products of token pairs. This formalism allows us to characterize the implicit bias of 1-layer transformers optimized with gradient descent: (1) Optimizing the attention layer, parameterized by $(K,Q)$, with vanishing regularization, converges in direction to an SVM solution minimizing the nuclear norm of the combined parameter $W := KQ^\top$. Instead, directly parameterizing by $W$ minimizes a Frobenius norm SVM objective. (2) Complementing this, for the $W$-parameterization, we prove the local/global directional convergence of gradient descent under suitable geometric conditions, and propose a more general SVM equivalence that predicts the implicit bias of attention with nonlinear heads/MLPs.
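To make the abstract's notation concrete, here is a minimal NumPy sketch (not the authors' code; the toy dimensions are illustrative assumptions) showing that the $(K,Q)$-parameterized attention map $\text{softmax}(XQK^\top X^\top)$ coincides with the combined-parameter form $\text{softmax}(XW^\top X^\top)$ for $W := KQ^\top$, even though, per the abstract, the two parameterizations induce different implicit biases (nuclear-norm vs. Frobenius-norm SVM) under gradient descent.

```python
# Minimal illustrative sketch of the two attention parameterizations
# discussed in the abstract. Dimensions and random weights are assumptions.
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 4                       # T tokens, each of dimension d
X = rng.standard_normal((T, d))   # input token sequence

# (K, Q)-parameterization: attention scores are X Q K^T X^T.
K = rng.standard_normal((d, d))
Q = rng.standard_normal((d, d))
A_kq = softmax(X @ Q @ K.T @ X.T)  # (T, T) attention map

# Combined parameterization W := K Q^T, so X Q K^T X^T = X W^T X^T.
W = K @ Q.T
A_w = softmax(X @ W.T @ X.T)

# The two forms compute identical attention maps on the forward pass ...
assert np.allclose(A_kq, A_w)
# ... but, per the abstract, gradient descent on (K, Q) biases W toward a
# nuclear-norm SVM solution, while training W directly biases it toward a
# Frobenius-norm SVM solution.
```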
