All are Worth Words: a ViT Backbone for Score-based Diffusion Models
Fan Bao · Chongxuan LI · Yue Cao · Jun Zhu
Event URL: https://openreview.net/forum?id=WfkBiPO5dsG

Vision transformers (ViT) have shown promise in various vision tasks, including low-level ones, while the U-Net remains dominant in score-based diffusion models. In this paper, we perform a systematic empirical study of ViT-based architectures in diffusion models. Our results suggest that adding extra long skip connections (as in the U-Net) to ViT is crucial for diffusion models. The new ViT architecture, together with other improvements, is referred to as U-ViT. On several popular visual datasets, U-ViT achieves generation results competitive with state-of-the-art U-Net-based models while requiring a comparable, if not smaller, amount of parameters and computation.
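The key architectural idea above — pairing shallow and deep transformer blocks with U-Net-style long skip connections — can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' implementation: `block` stands in for a full transformer block, dimensions and the concatenate-then-project skip handling are hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # token embedding dimension (hypothetical toy value)

def block(x, w):
    # Stand-in for a transformer block: a residual layer with a GELU-like
    # nonlinearity (tanh approximation). Real U-ViT blocks use attention + MLP.
    h = x @ w
    return x + 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))

def u_vit_forward(x, depth=4):
    """U-ViT-style wiring: the first `depth` blocks push their outputs onto
    a stack; each of the last `depth` blocks pops the matching shallow
    output, concatenates it (the long skip connection), and projects back
    to D dimensions before applying the block, mirroring the U-Net."""
    skips = []
    for _ in range(depth):  # "shallow" half: record outputs for skips
        x = block(x, rng.standard_normal((D, D)) * 0.1)
        skips.append(x)
    for _ in range(depth):  # "deep" half: consume skips in reverse order
        x = np.concatenate([x, skips.pop()], axis=-1)      # long skip: concat
        x = x @ (rng.standard_normal((2 * D, D)) * 0.1)    # linear projection
        x = block(x, rng.standard_normal((D, D)) * 0.1)
    return x

tokens = rng.standard_normal((16, D))  # 16 tokens (e.g., image patches)
out = u_vit_forward(tokens)
print(out.shape)  # → (16, 8): token count and width are preserved
```

Note the design point the abstract emphasizes: the skip connections span the whole network (block 1 to block N, block 2 to block N-1, ...), not just adjacent layers, which is what distinguishes this wiring from a plain ViT's local residual connections.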

Author Information

Fan Bao (Tsinghua University)
Chongxuan LI (Renmin University of China)

Assistant Professor @ RUC

Yue Cao (Microsoft Research)
Jun Zhu (Tsinghua University)