Expo Demonstration
Generating group photos of multiple people from text and reference images
Ron Tindall
Upper Level Room 29A-D
Reference-based multi-human image generation is emerging as a critical capability for personalization, synthetic data creation, and benchmarking generative models. Unlike single-subject generation, this task requires compositional reasoning to place multiple individuals—each with distinct identities—into a coherent scene guided by a text prompt. Existing models often fail to preserve identities or maintain spatial fidelity, which limits their applicability to real-world scenarios such as social content creation or training vision systems.

Our demo addresses these challenges by showcasing a state-of-the-art system for reference-based multi-human generation. The system takes reference images of multiple individuals and a text description of the desired scene, then produces a high-quality image featuring all participants in context. Built on the Flux-Kontext backbone and trained using synthetic data from DisCo (arXiv:2510.01399), our RL-based approach optimizes multiple rewards, including Human Preference Score (HPS3) and Average ID Similarity. Evaluation on MultiHuman-Testbench (arXiv:2506.20879) confirms state-of-the-art performance.

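To make the multi-reward objective above concrete, the sketch below combines a preference score and an identity-similarity score into a single scalar reward of the kind an RL fine-tuning loop would maximize. This is a minimal illustration under our own assumptions: the weighted-sum aggregation and the scorer stubs (hps_score, id_similarity) are hypothetical placeholders and do not reflect the demo's actual implementation.

    # Minimal sketch, assuming a weighted-sum reward; the demo's actual
    # reward aggregation and weights are not specified in this abstract.
    import random
    from dataclasses import dataclass

    @dataclass
    class RewardWeights:
        hps: float = 1.0       # weight on Human Preference Score (HPS3)
        identity: float = 1.0  # weight on Average ID Similarity

    def hps_score(image) -> float:
        # Stub standing in for a call to an HPS3 preference model.
        return random.random()

    def id_similarity(image, reference_faces) -> float:
        # Stub standing in for mean face-embedding similarity between
        # each reference identity and its match in the generated image.
        return random.random()

    def combined_reward(image, reference_faces, w: RewardWeights) -> float:
        # Scalar reward the generator would be fine-tuned to maximize
        # under an RL objective.
        return (w.hps * hps_score(image)
                + w.identity * id_similarity(image, reference_faces))

    if __name__ == "__main__":
        weights = RewardWeights(hps=1.0, identity=1.0)
        print(combined_reward(image=None, reference_faces=[], w=weights))

In practice the relative weights trade off scene quality against identity preservation; the stubs here would be replaced by the actual preference model and a face-embedding matcher.
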
The demo runs live on a laptop powered by a Snapdragon processor, with fast generation that highlights the efficiency of our solution and its suitability for on-device deployment.