

Poster

DiTFastAtten: Accelerate Diffusion Transformers Through Efficient Attention Computation

Zhihang Yuan · Lu Pu · Hanling Zhang · Xuefei Ning · Linfeng Zhang · Tianchen Zhao · Shengen Yan · Guohao Dai · Yu Wang

East Exhibit Hall A-C #4705
Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Diffusion Transformers (DiT) have emerged as a powerful tool for image and video generation tasks. However, their quadratic computational complexity due to the self-attention mechanism poses a significant challenge, particularly for high-resolution and long-video tasks. This paper mitigates the computational bottleneck in DiT models by introducing a novel post-training model compression method. We identify three key redundancies in the attention computation during DiT inference: 1. spatial redundancy, where many attention heads focus on local information; 2. temporal redundancy, with high similarity between the attention outputs of neighboring steps; 3. conditional redundancy, where conditional and unconditional inference exhibit significant similarity. To tackle these redundancies, we propose three techniques: 1. Window Attention with Residual Caching to reduce spatial redundancy; 2. Temporal Similarity Reduction to exploit the similarity between steps; 3. Conditional Redundancy Elimination to skip redundant computations during conditional generation. Our method compresses the model's FLOPs to 35% of the original. This work offers a practical solution for deploying DiT models in resource-constrained environments.
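As a rough illustration of two of the redundancies described above, the toy PyTorch module below combines a local attention window (spatial redundancy) with a per-step output cache (temporal redundancy). It is a minimal sketch under assumed names (`CachedWindowAttention`, `window`, `reuse_previous_step`) and a single-head formulation; it is not the authors' implementation, which additionally pairs window attention with residual caching.

```python
# Minimal, hypothetical sketch of two of the redundancies described above.
# Not the authors' code: module and argument names are invented for illustration.
import torch
import torch.nn.functional as F


class CachedWindowAttention(torch.nn.Module):
    """Single-head self-attention that (a) restricts attention to a local
    window, approximating heads that mostly attend locally, and (b) can
    reuse the previous diffusion step's output instead of recomputing."""

    def __init__(self, dim: int, window: int = 64):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.window = window      # local window radius (assumed hyperparameter)
        self.prev_output = None   # cache of the last step's attention output

    def forward(self, x: torch.Tensor, reuse_previous_step: bool = False):
        # Temporal redundancy: neighboring denoising steps produce very
        # similar attention outputs, so on selected steps return the cache.
        if reuse_previous_step and self.prev_output is not None:
            return self.prev_output

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        n = x.shape[1]
        # Spatial redundancy: let each token attend only to tokens within
        # `window` positions (True = attend in the boolean mask).
        idx = torch.arange(n, device=x.device)
        allowed = (idx[None, :] - idx[:, None]).abs() <= self.window
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=allowed)

        self.prev_output = out
        return out


# Example: recompute at one denoising step, reuse the cache at the next.
attn = CachedWindowAttention(dim=128)
tokens = torch.randn(1, 256, 128)
full = attn(tokens)                             # windowed attention is computed
cheap = attn(tokens, reuse_previous_step=True)  # cached output is returned
```

For the third redundancy, a similar cache could let the unconditional branch of classifier-free guidance reuse the conditional branch's attention output on steps where the two are nearly identical, which is the kind of repeated computation the abstract's Conditional Redundancy Elimination targets.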
