Poster
FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention
Yu Lu · Yuanzhi Liang · Linchao Zhu · Yi Yang
East Exhibit Hall A-C #4803
Video diffusion models have made substantial progress in various video generation applications. However, training models for long video tasks requires significant computational and data resources, posing a challenge to developing long video diffusion models. This paper investigates a straightforward and training-free approach to adapt an existing short video diffusion model (e.g., pre-trained on 16-frame videos) for consistent long video generation (e.g., 128 frames). Our preliminary observations indicate that the short video model's temporal attention mechanism ensures temporal consistency but significantly degrades the fidelity and spatial-temporal details of the videos. Our further investigation reveals that this limitation is mainly due to the distortion of high-frequency components in generated long videos. Motivated by this finding, we propose a straightforward yet effective solution: local-global SpectralBlend Temporal Attention (SB-TA). This approach smooths the frequency distribution of long video features during the denoising process by blending the low-frequency components of global video features with the high-frequency components of local video features. This fusion enhances both the consistency and fidelity of long video generation. Based on SB-TA, we develop a new training-free model named FreeLong, which sets a new performance benchmark compared to existing long video generation models. We evaluate FreeLong on multiple base video diffusion models and observe significant improvements. Additionally, our method supports multi-prompt generation, ensuring both visual coherence and seamless transitions between scenes. Anonymous website for the project: https://freelongvideo.github.io
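To make the frequency-blending idea concrete, below is a minimal PyTorch sketch of how low-frequency components of global video features can be combined with high-frequency components of local video features via an FFT-domain box mask. The function name `spectral_blend`, the tensor layout (B, T, C, H, W), and the `low_freq_ratio` cutoff are illustrative assumptions, not the paper's exact formulation; FreeLong's actual filter design and its integration into temporal attention may differ.

```python
import torch


def spectral_blend(global_feat: torch.Tensor,
                   local_feat: torch.Tensor,
                   low_freq_ratio: float = 0.25) -> torch.Tensor:
    """Blend low frequencies of `global_feat` with high frequencies of `local_feat`.

    Both tensors are assumed to have shape (B, T, C, H, W); the FFT is taken
    over the temporal and spatial axes (T, H, W).
    """
    dims = (1, 3, 4)  # temporal + spatial axes
    g_freq = torch.fft.fftn(global_feat, dim=dims)
    l_freq = torch.fft.fftn(local_feat, dim=dims)

    # Shift so that low frequencies sit at the center of each axis.
    g_freq = torch.fft.fftshift(g_freq, dim=dims)
    l_freq = torch.fft.fftshift(l_freq, dim=dims)

    # Box-shaped low-pass mask around the spectrum center (illustrative choice).
    B, T, C, H, W = global_feat.shape
    mask = torch.zeros_like(g_freq.real)
    t0, h0, w0 = int(T * low_freq_ratio), int(H * low_freq_ratio), int(W * low_freq_ratio)
    tc, hc, wc = T // 2, H // 2, W // 2
    mask[:, tc - t0:tc + t0 + 1, :, hc - h0:hc + h0 + 1, wc - w0:wc + w0 + 1] = 1.0

    # Low frequencies from the global branch, high frequencies from the local branch.
    blended = g_freq * mask + l_freq * (1.0 - mask)

    blended = torch.fft.ifftshift(blended, dim=dims)
    return torch.fft.ifftn(blended, dim=dims).real
```

In this sketch the global branch would come from attention over all frames (capturing overall layout and consistency) and the local branch from attention within short windows (preserving fine detail), with the blended features fed back into the denoising step.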