Skip to yearly menu bar Skip to main content


Qualcomm AI Research

Expo Demonstration

Generating group photos of multiple people from text and reference images

Ron Tindall

[ ]
Tue 2 Dec noon PST — 3 p.m. PST

Abstract:

Reference-based multi-human image generation is emerging as a critical capability for personalization, synthetic data creation, and benchmarking generative models. Unlike single-subject generation, this task requires compositional reasoning to place multiple individuals—each with distinct identities—into a coherent scene guided by a text prompt. Existing models often fail to preserve identities or maintain spatial fidelity, which limits their applicability for real-world scenarios such as social content creation or training vision systems. x000D
x000D
Our demo addresses these challenges by showcasing a state-of-the-art system for reference-based multi-human generation. The system takes reference images of multiple individuals and a text description of the desired scene, then produces a high-quality image featuring all participants in context. Built on the Flux-Kontext backbone and trained using synthetic data from DisCo (arXiv:2510.01399), our RL-based approach optimizes multiple rewards including Human Preference Score (HPS3) and Average ID Similarity. Evaluation on MultiHuman-Testbench (arXiv:2506.20879) confirms state-of-the-art performance. x000D
x000D
This demo showcases fast generation on a laptop powered by a Snapdragon processor, highlighting the efficiency and scalability of our solution.

Chat is not available.