Multi-head attention empowers the recent success of transformers, the state-of-the-art models in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, a computation that arises from using unnormalized Gaussian kernels under the assumption that the queries follow a mixture of Gaussian distributions. There is no guarantee that this assumption holds in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose the FourierFormer, a new class of transformers in which the dot-product kernels are replaced by novel generalized Fourier integral kernels. Unlike dot-product kernels, which require choosing a good covariance matrix to capture the dependency among the features of the data, the generalized Fourier integral kernels capture such dependency automatically and remove the need to tune the covariance matrix. We theoretically prove that our proposed Fourier integral kernels can efficiently approximate any key and query distributions. Compared to conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads. We empirically corroborate the advantages of FourierFormers over baseline transformers in a variety of practical applications, including language modeling and image classification.
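For intuition, below is a minimal, hypothetical sketch in PyTorch contrasting standard dot-product attention with an attention whose weights come from a product-of-sinc kernel, one common instantiation of the Fourier integral theorem. The product-of-sinc form, the bandwidth parameter R, and the normalization of the signed kernel weights are illustrative assumptions for this sketch, not the exact FourierFormer construction.

```python
import math
import torch


def dot_product_attention(Q, K, V):
    """Standard softmax attention: exp(q.k / sqrt(d)) acts as an
    unnormalized Gaussian kernel between each query and key."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)
    return torch.softmax(scores, dim=-1) @ V


def fourier_integral_attention(Q, K, V, R=1.0, eps=1e-6):
    """Hypothetical sketch: replace the exponential kernel with a
    product-of-sinc kernel
        k(q, k) = prod_j sin(R * (q_j - k_j)) / (q_j - k_j),
    then normalize the (signed) kernel values over the keys."""
    diff = Q.unsqueeze(-2) - K.unsqueeze(-3)          # (..., L_q, L_k, d)
    # sin(R*x)/x -> R as x -> 0, so guard near-zero differences explicitly.
    sinc = torch.where(diff.abs() < eps,
                       torch.full_like(diff, R),
                       torch.sin(R * diff) / diff)
    weights = sinc.prod(dim=-1)                       # (..., L_q, L_k)
    weights = weights / (weights.sum(dim=-1, keepdim=True) + eps)
    return weights @ V


# Toy usage on random queries, keys, and values.
Q = torch.randn(2, 8, 16)   # (batch, sequence length, head dimension)
K = torch.randn(2, 8, 16)
V = torch.randn(2, 8, 16)
out_dot = dot_product_attention(Q, K, V)
out_fourier = fourier_integral_attention(Q, K, V)
print(out_dot.shape, out_fourier.shape)   # both torch.Size([2, 8, 16])
```

The point of the sketch is the structural change the abstract describes: attention weights are produced by a kernel evaluated directly on the coordinate-wise differences between queries and keys, with no covariance matrix (or temperature) to tune, rather than by an exponentiated dot product.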
Author Information
Tan Nguyen (University of California, Los Angeles)
Minh Pham (University of California, Los Angeles)
Tam Nguyen (University of Texas at Austin)
Khai Nguyen (University of Texas at Austin)
Stanley Osher (University of California, Los Angeles)
Nhat Ho (University of Texas at Austin)
More from the Same Authors

- 2022: Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models
  Qiujiang Jin · Aryan Mokhtari · Nhat Ho · Tongzheng Ren
- 2022 Poster: Amortized Projection Optimization for Sliced Wasserstein Generative Models
  Khai Nguyen · Nhat Ho
- 2022 Poster: Revisiting Sliced Wasserstein on Images: From Vectorization to Convolution
  Khai Nguyen · Nhat Ho
- 2022 Poster: Stochastic Multiple Target Sampling Gradient Descent
  Hoang Phan · Ngoc Tran · Trung Le · Toan Tran · Nhat Ho · Dinh Phung
- 2022 Poster: Beyond black box densities: Parameter learning for the deviated components
  Dat Do · Nhat Ho · XuanLong Nguyen
- 2022 Poster: Improving Neural Ordinary Differential Equations with Nesterov's Accelerated Gradient Method
  Ho Huu Nghia Nguyen · Tan Nguyen · Huyen Vo · Stanley Osher · Thieu Vo
- 2022 Poster: Improving Transformer with an Admixture of Attention Heads
  Tan Nguyen · Tam Nguyen · Hai Do · Khai Nguyen · Vishwanath Saragadam · Minh Pham · Khuong Duy Nguyen · Nhat Ho · Stanley Osher
- 2021: Stan Osher Talk
  Stanley Osher
- 2021 Poster: Structured Dropout Variational Inference for Bayesian Neural Networks
  Son Nguyen · Duong Nguyen · Khai Nguyen · Khoat Than · Hung Bui · Nhat Ho
- 2021 Poster: On Robust Optimal Transport: Computational Complexity and Barycenter Computation
  Khang Le · Huy Nguyen · Quang M Nguyen · Tung Pham · Hung Bui · Nhat Ho
- 2021 Poster: FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention
  Tan Nguyen · Vai Suliafu · Stanley Osher · Long Chen · Bao Wang
- 2021 Poster: Heavy Ball Neural Ordinary Differential Equations
  Hedi Xia · Vai Suliafu · Hangjie Ji · Tan Nguyen · Andrea Bertozzi · Stanley Osher · Bao Wang
- 2020 Poster: Projection Robust Wasserstein Distance and Riemannian Optimization
  Tianyi Lin · Chenyou Fan · Nhat Ho · Marco Cuturi · Michael Jordan
- 2020 Poster: Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast Algorithm
  Tianyi Lin · Nhat Ho · Xi Chen · Marco Cuturi · Michael Jordan
- 2020 Spotlight: Projection Robust Wasserstein Distance and Riemannian Optimization
  Tianyi Lin · Chenyou Fan · Nhat Ho · Marco Cuturi · Michael Jordan
- 2020 Poster: MomentumRNN: Integrating Momentum into Recurrent Neural Networks
  Tan Nguyen · Richard Baraniuk · Andrea Bertozzi · Stanley Osher · Bao Wang
- 2019 Poster: ResNets Ensemble via the Feynman-Kac Formalism to Improve Natural and Robust Accuracies
  Bao Wang · Zuoqiang Shi · Stanley Osher
- 2018 Poster: Deep Neural Nets with Interpolating Function as Output Activation
  Bao Wang · Xiyang Luo · Zhen Li · Wei Zhu · Zuoqiang Shi · Stanley Osher