EgoAnimate: Generating Human Animations from Egocentric Top-Down Views via Controllable Latent Diffusion Models
Abstract
An ideal digital telepresence experience requires the accurate replication of a1person’s body, clothing, and movements. In order to capture and transfer these2movements into virtual reality, the egocentric (first-person) perspective can be3adopted, which makes it feasible to rely on a portable and cost-effective stand-4alone device that requires no additional front-view cameras. However, this per-5spective also introduces considerable challenges, particularly in learning tasks, as6egocentric data often contains severe occlusions and distorted body proportions.7Human appearance and avatar reconstruction from egocentric views remains rela-8tively underexplored, and approaches that leverage generative priors are rare. This9gap contributes to limited out-of-distribution generalization and greater data and10training requirements. We introduce a controllable latent-diffusion framework11that maps egocentric inputs to a canonical exocentric (frontal T-pose) representa-12tion from which animatable avatars are reconstructed. To our knowledge, this is13the first system to employ a generative diffusion backbone for egocentric avatar14reconstruction. Building on a Stable Diffusion prior with explicit pose/shape con-15ditioning, our method reduces training/data burden and improves generalization16to in-the-wild inputs. The idea of synthesizing fully occluded parts of an object17has been widely explored in various domains. In particular, models such as SiTH18and MagicMan have demonstrated successful 360-degree reconstruction from a19single frontal image. Inspired by these approaches, we propose a pipeline that20reconstructs a frontal view from a highly occluded top-down image using Control-21Net and a Stable Diffusion backbone enabling the synthesis of novel views. Our22objective is to map a single egocentric top-down image to a canonical frontal (e.g.,23T-pose) representation that can be directly consumed by an image-to-motion model24to produce an animatable avatar. This enables motion synthesis from minimal25egocentric input and supports more accessible, data-efficient, and generalizable26telepresence systems.