Poster
Compositional 3D-aware Video Generation with LLM Director
Hanxin Zhu · Tianyu He · Anni Tang · Junliang Guo · Zhibo Chen · Jiang Bian
East Exhibit Hall A-C #2400
Significant progress has been made in text-to-video generation through the use of powerful generative models and large-scale internet data. However, substantial challenges remain in precisely controlling individual elements within the generated video, such as the movement and appearance of specific characters and the manipulation of viewpoints. In this work, we propose a novel paradigm that generates each element in 3D representation separately and then composites them with priors from Large Language Models (LLMs) and 2D diffusion models. Specifically, given an input textual query, our scheme consists of four stages: 1) we leverage the LLMs as the director to first decompose the complex query into several sub-queries, where each sub-query describes each element of the generated video; 2) to generate each element, pre-trained models are invoked by the LLMs to obtain the corresponding 3D representation; 3) to composite the generated 3D representations, we prompt multi-modal LLMs to produce coarse guidance on the scale, location, and trajectory of different objects; 4) to make the results adhere to natural distribution, we further leverage 2D diffusion priors and use score distillation sampling to refine the composition. Extensive experiments demonstrate that our method can generate high-fidelity videos from text with flexible control over each element.
Live content is unavailable. Log in and register to view live content