Power law attention biases for molecular transformers
Jay Shen
Abstract
Transformers are the go-to architecture for many data modalities. While they have been applied extensively to molecular property prediction, they do not dominate there as they do in language and vision. One cause may be the lack of effective structural biases that capture relevant interatomic relationships. Here, we investigate attention biases as a simple and natural way to encode that structure. Motivated by physical power laws, we propose a family of simple attention biases, $b_{ij} = p \log \| \mathbf{r}_i - \mathbf{r}_j \|$, that weight attention probabilities according to interatomic distance. On the QM9 dataset, this approach outperforms positional encodings and graph attention while remaining competitive with more complex Gaussian kernel biases. We also show that good attention biases can compensate for a complete ablation of scaled dot-product attention, suggesting a low-cost path toward interpretable molecular transformers.
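The bias enters additively at the logit level, so after the softmax it acts multiplicatively on the attention weights as $\|\mathbf{r}_i - \mathbf{r}_j\|^{p}$, which is where the power-law interpretation comes from. Below is a minimal sketch, not the authors' implementation, of how such a bias could be added to scaled dot-product attention; the function names (`power_law_bias`, `biased_attention`), the clamp/diagonal handling, and the choice of `p` are all illustrative assumptions.

```python
# Sketch only: power-law distance bias b_ij = p * log ||r_i - r_j|| added to
# attention logits before the softmax. Conventions here (epsilon clamp, zero
# self-bias, single head, no masking) are assumptions, not the paper's code.

import torch
import torch.nn.functional as F

def power_law_bias(coords: torch.Tensor, p: float, eps: float = 1e-6) -> torch.Tensor:
    """coords: (n_atoms, 3) Cartesian positions -> (n_atoms, n_atoms) bias matrix."""
    dist = torch.cdist(coords, coords)           # pairwise distances ||r_i - r_j||
    bias = p * torch.log(dist.clamp_min(eps))    # clamp avoids log(0) on the diagonal
    bias.fill_diagonal_(0.0)                     # one possible convention for i == j
    return bias

def biased_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     bias: torch.Tensor) -> torch.Tensor:
    """q, k, v: (n_atoms, d). The bias is added to the logits, so softmax turns
    it into a multiplicative factor ||r_i - r_j||^p on the attention weights."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5 + bias
    return F.softmax(logits, dim=-1) @ v

# Toy usage: 5 atoms, feature dimension 16; a negative p down-weights distant pairs.
coords = torch.randn(5, 3)
q = k = v = torch.randn(5, 16)
out = biased_attention(q, k, v, power_law_bias(coords, p=-2.0))
```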