Teaching Vision Language Models to See, Forget, and Imagine
Abstract
Recent advances in vision-language models have enabled rich multimodal understanding and generation, yet challenges remain in controllability, personalization, and responsible adaptation. This talk presents our recent efforts toward making such models more steerable and interpretable: teaching them to see, forget, and imagine. I will discuss methods for personalized adaptation that align multimodal representations with user-specific contexts, concept editing and erasing techniques for targeted knowledge modification and unlearning, and video motion editing frameworks that extend controllable generation into the temporal domain. Together, these directions outline a path toward controllable multimodal intelligence, where large models evolve from passive perception systems into dynamic, human-aligned collaborators.