Poster in Workshop: NeurIPS 2023 Workshop on Diffusion Models

VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Paul Couairon ⋅ Clément Rambour ⋅ Jean-Emmanuel HAUGEARD ⋅ Nicolas THOME

Abstract

Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing that ensures strong temporal and spatial consistency. First, we propose to combine atlas-based and pre-trained text-to-image diffusion models to provide a training-free and efficient editing method that, by design, ensures temporal smoothness. Second, we leverage off-the-shelf panoptic segmenters along with edge detectors and adapt their use for conditioned diffusion-based atlas editing. This ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only about one minute, and multiple compatible edits can be generated from a single text prompt.
