
HRFormer: High-Resolution Vision Transformer for Dense Prediction
Yuhui Yuan · Rao Fu · Lang Huang · Weihong Lin · Chao Zhang · Xilin Chen · Jingdong Wang

Wed Dec 08 04:30 PM -- 06:00 PM (PST)

We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer, which produces low-resolution representations at high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet [45]), along with local-window self-attention that performs self-attention over small non-overlapping image windows [21], to improve memory and computation efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks, e.g., HRFormer outperforms Swin Transformer [27] by 1.3 AP on COCO pose estimation with 50% fewer parameters and 30% fewer FLOPs. Code is available at: https://github.com/HRNet/HRFormer
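The two ingredients the abstract names — partitioning a feature map into non-overlapping windows for local self-attention, and a small convolution that lets neighbouring windows exchange information — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (which lives in the linked repository); the function names and the averaging 3x3 kernel are illustrative assumptions, chosen only to show that a zero-padded 3x3 depthwise convolution moves information across window borders.

```python
import numpy as np

def window_partition(x, ws):
    # Split an (H, W, C) feature map into non-overlapping ws x ws windows,
    # returning (num_windows, ws*ws, C) — the shape local self-attention
    # would operate on, one window at a time.
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    # Inverse of window_partition: stitch windows back into (H, W, C).
    C = windows.shape[-1]
    x = windows.reshape(H // ws, W // ws, ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def depthwise_conv3x3(x):
    # Zero-padded 3x3 depthwise convolution with a uniform averaging
    # kernel (an illustrative stand-in for a learned kernel). Each output
    # pixel mixes its 3x3 neighbourhood, so a pixel on a window border
    # receives information from the adjacent window — the cross-window
    # exchange that window-local attention alone cannot provide.
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + H, dx:dx + W]
    return out / 9.0

# With ws=4 on an 8x8 map, a feature placed at the corner of one window
# leaks into the neighbouring window after a single 3x3 convolution.
feat = np.zeros((8, 8, 1))
feat[3, 3, 0] = 9.0                      # last pixel of window (0, 0)
mixed = depthwise_conv3x3(feat)
print(mixed[4, 4, 0])                    # first pixel of window (1, 1): 1.0
```

Stacking such a convolution inside every FFN is what lets HRFormer keep attention cheap (window-local) while still propagating information globally over depth.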

Author Information

Yuhui Yuan (Microsoft Research)
Rao Fu (Brown University)
Lang Huang (Peking University)

I am currently a second-year Ph.D. student at the Department of Information & Communication Engineering, The University of Tokyo. Prior to that, I received a Master’s degree from the Department of Machine Intelligence, School of Electronics Engineering and Computer Science, Peking University in 2021. My research interests include self-supervised representation learning, robust learning from noisy data, and vision transformers.

Weihong Lin (Microsoft)
Chao Zhang (Peking University)
Xilin Chen (Institute of Computing Technology, Chinese Academy of Sciences)
Jingdong Wang (Microsoft Research)
