The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
Lorenzo Noci · Chuning Li · Mufan Li · Bobby He · Thomas Hofmann · Chris Maddison · Dan Roy

Wed Dec 13 08:45 AM -- 10:45 AM (PST) @ Great Hall & Hall B1+B2 #822

In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
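The two architectural modifications named in the abstract (centering the Softmax output at identity, and temperature-scaling the logits) can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification for intuition only: the function name `shaped_attention`, the single-head setup, and the specific centering form are illustrative, and the paper's precise width-dependent choice of the temperature is not reproduced here.

```python
import numpy as np

def shaped_attention(q, k, tau):
    """Illustrative sketch of the modified attention from the abstract.

    Two changes relative to standard Softmax attention:
      1. logits are divided by a temperature tau (in the paper, tau is
         chosen to grow with width);
      2. the Softmax output is centered at the identity, so that in the
         large-temperature regime the attention matrix stays close to I
         instead of collapsing toward the uniform (rank-one) matrix.
    """
    T = q.shape[0]
    logits = q @ k.T / tau                                   # temperature-scaled logits
    a = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable Softmax
    softmax = a / a.sum(axis=-1, keepdims=True)
    # As tau -> infinity, softmax -> uniform matrix (1/T), so subtracting
    # the uniform matrix and adding I centers the output at identity.
    return np.eye(T) + softmax - np.ones((T, T)) / T

# At large temperature the attention matrix is near the identity,
# and rows still sum to one at any temperature.
rng = np.random.default_rng(0)
q, k = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
print(np.allclose(shaped_attention(q, k, tau=1e6), np.eye(4), atol=1e-4))
```

The centering is what prevents the rank-degeneracy issue mentioned in the abstract: plain Softmax attention drifts toward the rank-one uniform matrix with depth, whereas the centered version perturbs around the identity.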

Author Information

Lorenzo Noci (ETH Zürich)
Chuning Li (University of Toronto)
Mufan Li (Princeton University)
Bobby He (University of Oxford)
Thomas Hofmann (ETH Zürich)
Chris Maddison (University of Toronto)
Dan Roy (University of Toronto; Vector Institute)