
Jump Self-attention: Capturing High-order Statistics in Transformers
Haoyi Zhou · Siyang Xiao · Shanghang Zhang · Jieqi Peng · Shuai Zhang · Jianxin Li

Wed Nov 30 09:00 AM -- 11:00 AM (PST) @ Hall J #216

The recent success of Transformers has benefited many real-world applications, owing to their capability of building long-range dependencies through pairwise dot-products. However, the strong assumption that elements attend directly to each other limits performance on tasks with high-order dependencies, such as natural language understanding and image captioning. To address this, we are the first to define Jump Self-attention (JAT) for building Transformers. Inspired by the movement of pieces in English Draughts, we introduce a spectral convolutional technique to compute JAT on the dot-product feature map. This technique allows JAT to propagate within each self-attention head and is interchangeable with canonical self-attention. We further develop higher-order variants under a multi-hop assumption to increase generality. Moreover, the proposed architecture is compatible with pre-trained models. Through extensive experiments, we empirically show that our method significantly improves performance on ten different tasks.
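The abstract describes attention that propagates beyond direct pairwise interactions. As a rough intuition only, not the paper's actual spectral-convolution formulation, the multi-hop idea can be sketched by raising the canonical attention map to a power, so that tokens attend through intermediaries; the function name `jump_self_attention` and the `hops` parameter are illustrative assumptions, not names from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def jump_self_attention(Q, K, V, hops=2):
    """Illustrative multi-hop attention sketch (not the paper's method).

    hops=1 reduces to canonical scaled dot-product self-attention;
    hops>1 lets token i reach token j through intermediate tokens,
    a simplified stand-in for the 'jump' behavior described above.
    """
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))          # canonical attention map
    A_jump = np.linalg.matrix_power(A, hops)   # multi-hop propagation
    return A_jump @ V

# Toy usage: 5 tokens with dimension 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
out = jump_self_attention(Q, K, V, hops=2)
```

Since each row of the attention map sums to one, a product of such maps is again row-stochastic, so the multi-hop output remains a convex combination of the value vectors.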

Author Information

Haoyi Zhou (Beihang University)
Siyang Xiao (Beijing University of Aeronautics and Astronautics)
Shanghang Zhang (UC Berkeley)
Jieqi Peng (Beihang University)
Shuai Zhang (Beihang University)
Jianxin Li (Beihang University)
