NeurIPS Expo Demonstration Generating group photos of multiple people from text and reference images

Ron Tindall

Upper Level Room 29A-D

Tue 2 Dec noon PST — 3 p.m. PST

Abstract:

Reference-based multi-human image generation is emerging as a critical capability for personalization, synthetic data creation, and benchmarking generative models. Unlike single-subject generation, this task requires compositional reasoning to place multiple individuals, each with a distinct identity, into a coherent scene guided by a text prompt. Existing models often fail to preserve identities or maintain spatial fidelity, which limits their applicability for real-world scenarios such as social content creation or training vision systems.

Our demo addresses these challenges by showcasing a state-of-the-art system for reference-based multi-human generation. The system takes reference images of multiple individuals and a text description of the desired scene, then produces a high-quality image featuring all participants in context. Built on the Flux-Kontext backbone and trained using synthetic data from DisCo (arXiv:2510.01399), our RL-based approach optimizes multiple rewards including Human Preference Score (HPS3) and Average ID Similarity. Evaluation on MultiHuman-Testbench (arXiv:2506.20879) confirms state-of-the-art performance.

This demo showcases fast generation on a laptop powered by a Snapdragon processor, highlighting the efficiency and scalability of our solution.
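
To make the multi-reward idea in the abstract concrete, the sketch below shows one plausible way an identity-similarity reward could be averaged over several people and mixed with a preference score. It is an illustration under stated assumptions, not the system's actual reward: the best-match rule, the weighted sum, the weights, and all function names are hypothetical, and the embeddings are assumed to come from some face encoder that is not specified here.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_id_similarity(
    reference_embeddings: Sequence[Vector],
    generated_embeddings: Sequence[Vector],
) -> float:
    """Assumed definition: for each reference identity, take the best-matching
    face found in the generated image, then average over all references.
    A reference with no good match pulls the average down."""
    if not reference_embeddings or not generated_embeddings:
        return 0.0
    best_matches = [
        max(cosine_similarity(ref, gen) for gen in generated_embeddings)
        for ref in reference_embeddings
    ]
    return sum(best_matches) / len(best_matches)


def combined_reward(
    preference_score: float,
    id_similarity: float,
    w_pref: float = 0.5,  # hypothetical weight, not from the abstract
    w_id: float = 0.5,    # hypothetical weight, not from the abstract
) -> float:
    """Assumed combination rule: a simple weighted sum of the two rewards."""
    return w_pref * preference_score + w_id * id_similarity
```

For example, with two reference identities and only one of them recognizable in the generated image, `average_id_similarity` would return roughly half of what a fully faithful image would score, which is the kind of signal an RL objective could use to penalize dropped or blended identities.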