Oral Session

Oral 5D Vision

Room R06-R09 (level 2)

Thu 14 Dec. 8:00 - 8:15 PST

Visual Instruction Tuning

Haotian Liu · Chunyuan Li · Qingyang Wu · Yong Jae Lee

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. To facilitate future research on visual instruction following, we construct two evaluation benchmarks with diverse and challenging application-oriented tasks. Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4-generated visual instruction tuning data, our model, and code publicly available.
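As a rough illustration of what a relative score like the one quoted above can mean, the sketch below assumes that a judge gives each response a numeric rating and that the relative score is the candidate model's total rating divided by the reference model's total. The abstract does not spell out the formula, so this definition, the function name `relative_score`, and the ratings are all hypothetical.

```python
def relative_score(candidate_ratings, reference_ratings):
    """Return the candidate's total rating as a percentage of the reference's.

    Assumption: ratings are paired per question and come from the same judge.
    """
    if len(candidate_ratings) != len(reference_ratings):
        raise ValueError("ratings must be paired per question")
    reference_total = sum(reference_ratings)
    if reference_total == 0:
        raise ValueError("reference ratings sum to zero")
    return 100.0 * sum(candidate_ratings) / reference_total


# Made-up ratings for three questions (not data from the paper).
candidate = [6, 8, 7]
reference = [8, 10, 10]
score = relative_score(candidate, reference)
print(f"{score:.1f}%")  # prints "75.0%"
```

Under this reading, a score below 100% means the judge rated the candidate lower than the reference in aggregate, not that the candidate answered that fraction of questions correctly.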

Thu 14 Dec. 8:15 - 8:30 PST

EgoEnv: Human-centric environment representations from egocentric video

Tushar Nagarajan · Santhosh Kumar Ramakrishnan · Ruta Desai · James Hillis · Kristen Grauman

First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on human-captured real-world videos from unseen environments. On two human-centric video tasks, we show that models equipped with our environment-aware features consistently outperform their counterparts with traditional clip features. Moreover, despite being trained exclusively on simulated videos, our approach successfully handles real-world videos from HouseTours and Ego4D, and achieves state-of-the-art results on the Ego4D NLQ challenge.

Thu 14 Dec. 8:30 - 8:45 PST

DataComp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre · Gabriel Ilharco · Alex Fang · Jonathan Hayase · Georgios Smyrnis · Thao Nguyen · Ryan Marten · Mitchell Wortsman · Dhruba Ghosh · Jieyu Zhang · Eyal Orgad · Rahim Entezari · Giannis Daras · Sarah Pratt · Vivek Ramanujan · Yonatan Bitton · Kalyani Marathe · Stephen Mussmann · Richard Vencu · Mehdi Cherti · Ranjay Krishna · Pang Wei Koh · Olga Saukh · Alexander Ratner · Shuran Song · Hannaneh Hajishirzi · Ali Farhadi · Romain Beaumont · Sewoong Oh · Alex Dimakis · Jenia Jitsev · Yair Carmon · Vaishaal Shankar · Ludwig Schmidt

Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at
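The filtering step of the workflow described in this abstract could be sketched as follows. This is a minimal illustration, not the benchmark's code: it assumes each candidate image-text pair already carries a precomputed quality score, and the record fields (`id`, `caption`, `score`), the function name `filter_pool`, the threshold, and the data are all hypothetical. Training CLIP on the resulting subset and evaluating it are out of scope here.

```python
def filter_pool(pool, min_score):
    """Keep the image-text pairs whose quality score meets the threshold.

    Assumption: `score` is some precomputed per-pair quality measure;
    how it is obtained is left open.
    """
    return [pair for pair in pool if pair["score"] >= min_score]


# Made-up candidate pool (not real benchmark data).
pool = [
    {"id": "img_0", "caption": "a dog on a beach", "score": 0.31},
    {"id": "img_1", "caption": "click here", "score": 0.12},
    {"id": "img_2", "caption": "red bicycle", "score": 0.27},
]

subset = filter_pool(pool, min_score=0.25)
print([pair["id"] for pair in subset])  # prints ['img_0', 'img_2']
```

Because the training code and test sets are held fixed, any change in downstream accuracy between two runs can be attributed to the choice of subset, which is what makes a simple filter like this comparable against other dataset designs.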