We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge, the largest int8 training to date. Our main focus is int8 as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales we test.
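The stability finding above can be illustrated with a small diagnostic. The following is a minimal sketch, not the authors' implementation: it assumes the standard AdamW second moment update (an exponential moving average of squared gradients with bias correction) and reports, per step, the root mean of the squared gradient divided by that estimate. The function name `update_to_estimate_ratio`, the `eps` value, and the choice of this particular ratio as the monitored quantity are assumptions made for illustration. Values well above 1 would indicate that the squared gradients are under-estimated by the second moment estimator.

```python
import math


def update_to_estimate_ratio(grad_history, beta2=0.999, eps=1e-8):
    """Return, for each step, sqrt(mean(g^2 / v_hat)).

    grad_history: list of gradient vectors (lists of floats), one per step.
    v_hat is the bias-corrected AdamW-style EMA of squared gradients.
    The name and the monitored ratio are hypothetical, for illustration only.
    """
    dim = len(grad_history[0])
    v = [0.0] * dim  # second moment estimate, one entry per coordinate
    ratios = []
    for t, grad in enumerate(grad_history, start=1):
        # Standard EMA update of the squared gradients.
        v = [beta2 * v_i + (1.0 - beta2) * g * g for v_i, g in zip(v, grad)]
        correction = 1.0 - beta2 ** t  # bias correction for the zero init
        mean_ratio = sum(
            g * g / (v_i / correction + eps) for v_i, g in zip(v, grad)
        ) / dim
        ratios.append(math.sqrt(mean_ratio))
    return ratios


# Steady gradients keep the ratio near 1; a sudden 10x jump pushes it above 1.
steady = [[1.0, -1.0] for _ in range(50)]
jump = [[10.0, -10.0]]
ratios = update_to_estimate_ratio(steady + jump)
print(round(ratios[49], 3), ratios[50] > 1.0)
```

With a slowly decaying average (beta2 close to 1), the estimate lags behind a sudden change in gradient scale, which is the situation the abstract associates with loss spikes.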
Author Information
Mitchell Wortsman (University of Washington)
Tim Dettmers (University of Washington)
Luke Zettlemoyer (University of Washington; Meta)
Ari Morcos (DatologyAI)
Ali Farhadi (University of Washington, Allen Institute for Artificial Intelligence)
Ludwig Schmidt (University of Washington)
More from the Same Authors
-
2021 : Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning »
Thomas Liao · Rohan Taori · Deborah Raji · Ludwig Schmidt -
2021 : Learning Background Invariance Improves Generalization and Robustness in Self Supervised Learning on ImageNet and Beyond »
Chaitanya Ryali · David Schwab · Ari Morcos -
2021 : Do ImageNet Classifiers Generalize to ImageNet? »
Benjamin Recht · Becca Roelofs · Ludwig Schmidt · Vaishaal Shankar -
2021 : Evaluating Machine Accuracy on ImageNet »
Vaishaal Shankar · Becca Roelofs · Horia Mania · Benjamin Recht · Ludwig Schmidt -
2021 : Measuring Robustness to Natural Distribution Shifts in Image Classification »
Rohan Taori · Achal Dave · Vaishaal Shankar · Nicholas Carlini · Benjamin Recht · Ludwig Schmidt -
2021 : Robust fine-tuning of zero-shot models »
Mitchell Wortsman · Gabriel Ilharco · Jong Wook Kim · Mike Li · Hanna Hajishirzi · Ali Farhadi · Hongseok Namkoong · Ludwig Schmidt -
2023 : PATHFINDER: Guided Search over Multi-Step Reasoning Paths »
Olga Golovneva · Sean O'Brien · Ramakanth Pasunuru · Tianlu Wang · Luke Zettlemoyer · Maryam Fazel-Zarandi · Asli Celikyilmaz -
2023 : FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation »
Sewon Min · Kalpesh Krishna · Xinxi Lyu · Mike Lewis · Scott Yih · Pang Wei Koh · Mohit Iyyer · Luke Zettlemoyer · Hannaneh Hajishirzi -
2023 : CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement »
Mohammadreza (Reza) Salehi · Mehrdad Farajtabar · Maxwell Horton · Fartash Faghri · Hadi Pouransari · Raviteja Vemulapalli · Oncel Tuzel · Ali Farhadi · Mohammad Rastegari · Sachin Mehta -
2023 : SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore »
Sewon Min · Suchin Gururangan · Eric Wallace · Weijia Shi · Hannaneh Hajishirzi · Noah Smith · Luke Zettlemoyer -
2023 : Data Filtering Networks »
Alex Fang · Albin Madappally Jose · Amit Jain · Ludwig Schmidt · Alexander Toshev · Vaishaal Shankar -
2023 : Retrieval-based Language Models Using a Multi-domain Datastore »
Rulin Shao · Sewon Min · Luke Zettlemoyer · Pang Wei Koh -
2023 : Detecting Pretraining Data from Large Language Models »
Weijia Shi · Anirudh Ajith · Mengzhou Xia · Yangsibo Huang · Daogao Liu · Terra Blevins · Danqi Chen · Luke Zettlemoyer -
2023 : MatFormer: Nested Transformer for Elastic Inference »
Fnu Devvrit · Sneha Kudugunta · Aditya Kusupati · Tim Dettmers · Kaifeng Chen · Inderjit Dhillon · Yulia Tsvetkov · Hannaneh Hajishirzi · Sham Kakade · Ali Farhadi · Prateek Jain -
2023 : Interactive Panel Discussion »
Nazneen Rajani · Tanya Roosta · Tim Dettmers · Minjia Zhang -
2023 : [Paper-Oral 7] MatFormer: Nested Transformer for Elastic Inference »
Fnu Devvrit · Sneha Kudugunta · Aditya Kusupati · Tim Dettmers · Kaifeng Chen · Inderjit Dhillon · Yulia Tsvetkov · Hanna Hajishirzi · Sham Kakade · Ali Farhadi · Prateek Jain -
2023 : Keynote Talk 2 »
Luke Zettlemoyer -
2023 : [Paper-Oral 3] Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data »
Yu Yang · Aaditya Singh · Mostafa Elhoushi · Anas Mahmoud · Kushal Tirumala · Fabian Gloeckle · Baptiste Roziere · Carole-Jean Wu · Ari Morcos · Newsha Ardalani -
2023 Poster: Benchmarking Distribution Shift in Tabular Data with TableShift »
Josh Gardner · Zoran Popovic · Ludwig Schmidt -
2023 Poster: GenEval: An object-focused framework for evaluating text-to-image alignment »
Dhruba Ghosh · Hannaneh Hajishirzi · Ludwig Schmidt -
2023 Poster: Objaverse-XL: A Universe of 10M+ 3D Objects »
Matt Deitke · Ruoshi Liu · Matthew Wallingford · Huong Ngo · Oscar Michel · Aditya Kusupati · Alan Fan · Christian Laforte · Vikram Voleti · Samir Yitzhak Gadre · Eli VanderBilt · Aniruddha Kembhavi · Carl Vondrick · Georgia Gkioxari · Kiana Ehsani · Ludwig Schmidt · Ali Farhadi -
2023 Poster: Does progress on ImageNet transfer to real-world datasets? »
Alex Fang · Simon Kornblith · Ludwig Schmidt -
2023 Poster: PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning »
Florian Bordes · Shashank Shekhar · Mark Ibrahim · Diane Bouchacourt · Pascal Vincent · Ari Morcos -
2023 Poster: D4: Improving LLM Pretraining via Document De-Duplication and Diversification »
Kushal Tirumala · Daniel Simig · Armen Aghajanyan · Ari Morcos -
2023 Poster: DataComp: In search of the next generation of multimodal datasets »
Samir Yitzhak Gadre · Gabriel Ilharco · Alex Fang · Jonathan Hayase · Georgios Smyrnis · Thao Nguyen · Ryan Marten · Mitchell Wortsman · Dhruba Ghosh · Jieyu Zhang · Eyal Orgad · Rahim Entezari · Giannis Daras · Sarah Pratt · Vivek Ramanujan · Yonatan Bitton · Kalyani Marathe · Stephen Mussmann · Richard Vencu · Mehdi Cherti · Ranjay Krishna · Pang Wei Koh · Olga Saukh · Alexander Ratner · Shuran Song · Hannaneh Hajishirzi · Ali Farhadi · Romain Beaumont · Sewoong Oh · Alex Dimakis · Jenia Jitsev · Yair Carmon · Vaishaal Shankar · Ludwig Schmidt -
2023 Oral: DataComp: In search of the next generation of multimodal datasets »
Samir Yitzhak Gadre · Gabriel Ilharco · Alex Fang · Jonathan Hayase · Georgios Smyrnis · Thao Nguyen · Ryan Marten · Mitchell Wortsman · Dhruba Ghosh · Jieyu Zhang · Eyal Orgad · Rahim Entezari · Giannis Daras · Sarah Pratt · Vivek Ramanujan · Yonatan Bitton · Kalyani Marathe · Stephen Mussmann · Richard Vencu · Mehdi Cherti · Ranjay Krishna · Pang Wei Koh · Olga Saukh · Alexander Ratner · Shuran Song · Hannaneh Hajishirzi · Ali Farhadi · Romain Beaumont · Sewoong Oh · Alex Dimakis · Jenia Jitsev · Yair Carmon · Vaishaal Shankar · Ludwig Schmidt -
2023 Poster: AdANNS: A Framework for Adaptive Semantic Search »
Aniket Rege · Aditya Kusupati · Sharan Ranjit S · Alan Fan · Qingqing Cao · Sham Kakade · Prateek Jain · Ali Farhadi -
2023 Poster: Characterizing the Impacts of Semi-supervised Learning for Weak Supervision »
Jeffrey Li · Jieyu Zhang · Ludwig Schmidt · Alexander Ratner -
2023 Poster: Distributed Inference and Fine-tuning of Large Language Models Over The Internet »
Alexander Borzunov · Max Ryabinin · Artem Chumachenko · Dmitry Baranchuk · Tim Dettmers · Younes Belkada · Pavel Samygin · Colin Raffel -
2023 Poster: Localized Symbolic Knowledge Distillation for Visual Commonsense Models »
Jae Sung Park · Jack Hessel · Khyathi Chandu · Paul Pu Liang · Ximing Lu · Peter West · Youngjae Yu · Qiuyuan Huang · Jianfeng Gao · Ali Farhadi · Yejin Choi -
2023 Poster: Toolformer: Language Models Can Teach Themselves to Use Tools »
Timo Schick · Jane Dwivedi-Yu · Roberto Dessi · Roberta Raileanu · Maria Lomeli · Eric Hambro · Luke Zettlemoyer · Nicola Cancedda · Thomas Scialom -
2023 Poster: Effective Robustness against Natural Distribution Shifts for Models with Different Training Data »
Zhouxing Shi · Nicholas Carlini · Ananth Balashankar · Ludwig Schmidt · Cho-Jui Hsieh · Alex Beutel · Yao Qin -
2023 Poster: Are aligned neural networks adversarially aligned? »
Nicholas Carlini · Milad Nasr · Christopher A. Choquette-Choo · Matthew Jagielski · Irena Gao · Pang Wei Koh · Daphne Ippolito · Florian Tramer · Ludwig Schmidt -
2023 Poster: LIMA: Less Is More for Alignment »
Chunting Zhou · Pengfei Liu · Puxin Xu · Srinivasan Iyer · Jiao Sun · Yuning Mao · Xuezhe Ma · Avia Efrat · Ping Yu · LILI YU · Susan Zhang · Gargi Ghosh · Mike Lewis · Luke Zettlemoyer · Omer Levy -
2023 Poster: MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers »
LILI YU · Daniel Simig · Colin Flaherty · Armen Aghajanyan · Luke Zettlemoyer · Mike Lewis -
2023 Oral: Toolformer: Language Models Can Teach Themselves to Use Tools »
Timo Schick · Jane Dwivedi-Yu · Roberto Dessi · Roberta Raileanu · Maria Lomeli · Eric Hambro · Luke Zettlemoyer · Nicola Cancedda · Thomas Scialom -
2023 Poster: QLoRA: Efficient Finetuning of Quantized LLMs »
Tim Dettmers · Artidoro Pagnoni · Ari Holtzman · Luke Zettlemoyer -
2023 Poster: Neural Priming for Sample-Efficient Adaptation »
Matthew Wallingford · Vivek Ramanujan · Alex Fang · Aditya Kusupati · Roozbeh Mottaghi · Aniruddha Kembhavi · Ludwig Schmidt · Ali Farhadi -
2023 Poster: On the Connection between Pre-training Data Diversity and Fine-tuning Robustness »
Vivek Ramanujan · Thao Nguyen · Sewoong Oh · Ali Farhadi · Ludwig Schmidt -
2023 Oral: QLoRA: Efficient Finetuning of Quantized LLMs »
Tim Dettmers · Artidoro Pagnoni · Ari Holtzman · Luke Zettlemoyer -
2023 Poster: Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text »
Wanrong Zhu · Jack Hessel · Anas Awadalla · Samir Yitzhak Gadre · Jesse Dodge · Alex Fang · Youngjae Yu · Ludwig Schmidt · William Yang Wang · Yejin Choi -
2023 Poster: VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models »
Yonatan Bitton · Hritik Bansal · Jack Hessel · Rulin Shao · Wanrong Zhu · Anas Awadalla · Josh Gardner · Rohan Taori · Ludwig Schmidt -
2023 Poster: Improving multimodal datasets with image captioning »
Thao Nguyen · Samir Yitzhak Gadre · Gabriel Ilharco · Sewoong Oh · Ludwig Schmidt -
2022 : Petals: Collaborative Inference and Fine-tuning of Large Models »
Alexander Borzunov · Dmitry Baranchuk · Tim Dettmers · Max Ryabinin · Younes Belkada · Artem Chumachenko · Pavel Samygin · Colin Raffel -
2022 : 8-bit Methods for Efficient Deep Learning »
Tim Dettmers -
2022 Poster: GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale »
Tim Dettmers · Mike Lewis · Younes Belkada · Luke Zettlemoyer -
2022 Poster: Patching open-vocabulary models by interpolating weights »
Gabriel Ilharco · Mitchell Wortsman · Samir Yitzhak Gadre · Shuran Song · Hannaneh Hajishirzi · Simon Kornblith · Ali Farhadi · Ludwig Schmidt -
2022 Poster: Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models »
Kushal Tirumala · Aram Markosyan · Luke Zettlemoyer · Armen Aghajanyan -
2022 Poster: LAION-5B: An open large-scale dataset for training next generation image-text models »
Christoph Schuhmann · Romain Beaumont · Richard Vencu · Cade Gordon · Ross Wightman · Mehdi Cherti · Theo Coombes · Aarush Katta · Clayton Mullis · Mitchell Wortsman · Patrick Schramowski · Srivatsa Kundurthy · Katherine Crowson · Ludwig Schmidt · Robert Kaczmarczyk · Jenia Jitsev -
2022 Poster: Beyond neural scaling laws: beating power law scaling via data pruning »
Ben Sorscher · Robert Geirhos · Shashank Shekhar · Surya Ganguli · Ari Morcos -
2022 Poster: Subgroup Robustness Grows On Trees: An Empirical Baseline Investigation »
Josh Gardner · Zoran Popovic · Ludwig Schmidt -
2022 Poster: Improving Policy Learning via Language Dynamics Distillation »
Victor Zhong · Jesse Mu · Luke Zettlemoyer · Edward Grefenstette · Tim Rocktäschel -
2022 Poster: Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP »
Thao Nguyen · Gabriel Ilharco · Mitchell Wortsman · Sewoong Oh · Ludwig Schmidt -
2022 Poster: Matryoshka Representation Learning »
Aditya Kusupati · Gantavya Bhatt · Aniket Rege · Matthew Wallingford · Aditya Sinha · Vivek Ramanujan · William Howard-Snyder · Kaifeng Chen · Sham Kakade · Prateek Jain · Ali Farhadi -
2021 : Panel Discussion »
Pascal Poupart · Ali Ghodsi · Luke Zettlemoyer · Sameer Singh · Kevin Duh · Yejin Choi · Lu Hou -
2021 : Toward Efficient Training of Large Language Models with Balanced Conditional Compute »
Luke Zettlemoyer -
2021 Oral: Retiring Adult: New Datasets for Fair Machine Learning »
Frances Ding · Moritz Hardt · John Miller · Ludwig Schmidt -
2021 Oral: MERLOT: Multimodal Neural Script Knowledge Models »
Rowan Zellers · Ximing Lu · Jack Hessel · Youngjae Yu · Jae Sung Park · Jize Cao · Ali Farhadi · Yejin Choi -
2021 Poster: Luna: Linear Unified Nested Attention »
Xuezhe Ma · Xiang Kong · Sinong Wang · Chunting Zhou · Jonathan May · Hao Ma · Luke Zettlemoyer -
2021 Poster: MERLOT: Multimodal Neural Script Knowledge Models »
Rowan Zellers · Ximing Lu · Jack Hessel · Youngjae Yu · Jae Sung Park · Jize Cao · Ali Farhadi · Yejin Choi -
2021 Poster: Retiring Adult: New Datasets for Fair Machine Learning »
Frances Ding · Moritz Hardt · John Miller · Ludwig Schmidt -
2021 Poster: Characterizing Generalization under Out-Of-Distribution Shifts in Deep Metric Learning »
Timo Milbich · Karsten Roth · Samarth Sinha · Ludwig Schmidt · Marzyeh Ghassemi · Bjorn Ommer -
2021 Poster: LLC: Accurate, Multi-purpose Learnt Low-dimensional Binary Codes »
Aditya Kusupati · Matthew Wallingford · Vivek Ramanujan · Raghav Somani · Jae Sung Park · Krishna Pillutla · Prateek Jain · Sham Kakade · Ali Farhadi -
2021 : Training Transformers Together »
Alexander Borzunov · Max Ryabinin · Tim Dettmers · quentin lhoest · Lucile Saulnier · Michael Diskin · Yacine Jernite · Thomas Wolf -
2021 Poster: SILG: The Multi-domain Symbolic Interactive Language Grounding Benchmark »
Victor Zhong · Austin W. Hanjie · Sida Wang · Karthik Narasimhan · Luke Zettlemoyer -
2021 Poster: Grounding inductive biases in natural images: invariance stems from variations in data »
Diane Bouchacourt · Mark Ibrahim · Ari Morcos -
2020 : Invited talk - De-noising Sequence-to-Sequence Pre-training »
Luke Zettlemoyer -
2020 Poster: Supermasks in Superposition »
Mitchell Wortsman · Vivek Ramanujan · Rosanne Liu · Aniruddha Kembhavi · Mohammad Rastegari · Jason Yosinski · Ali Farhadi -
2020 Poster: Measuring Robustness to Natural Distribution Shifts in Image Classification »
Rohan Taori · Achal Dave · Vaishaal Shankar · Nicholas Carlini · Benjamin Recht · Ludwig Schmidt -
2020 Spotlight: Measuring Robustness to Natural Distribution Shifts in Image Classification »
Rohan Taori · Achal Dave · Vaishaal Shankar · Nicholas Carlini · Benjamin Recht · Ludwig Schmidt -
2020 Poster: The Generalization-Stability Tradeoff In Neural Network Pruning »
Brian Bartoldson · Ari Morcos · Adrian Barbu · Gordon Erlebacher -
2020 Poster: Pre-training via Paraphrasing »
Mike Lewis · Marjan Ghazvininejad · Gargi Ghosh · Armen Aghajanyan · Sida Wang · Luke Zettlemoyer -
2019 : Contributed Session - Spotlight Talks »
Jonathan Frankle · David Schwab · Ari Morcos · Qianli Ma · Yao-Hung Hubert Tsai · Ruslan Salakhutdinov · YiDing Jiang · Dilip Krishnan · Hossein Mobahi · Samy Bengio · Sho Yaida · Muqiao Yang -
2019 : Lunch Break and Posters »
Xingyou Song · Elad Hoffer · Wei-Cheng Chang · Jeremy Cohen · Jyoti Islam · Yaniv Blumenfeld · Andreas Madsen · Jonathan Frankle · Sebastian Goldt · Satrajit Chatterjee · Abhishek Panigrahi · Alex Renda · Brian Bartoldson · Israel Birhane · Aristide Baratin · Niladri Chatterji · Roman Novak · Jessica Forde · YiDing Jiang · Yilun Du · Linara Adilova · Michael Kamp · Berry Weinstein · Itay Hubara · Tal Ben-Nun · Torsten Hoefler · Daniel Soudry · Hsiang-Fu Yu · Kai Zhong · Yiming Yang · Inderjit Dhillon · Jaime Carbonell · Yanqing Zhang · Dar Gilboa · Johannes Brandstetter · Alexander R Johansen · Gintare Karolina Dziugaite · Raghav Somani · Ari Morcos · Freddie Kalaitzis · Hanie Sedghi · Lechao Xiao · John Zech · Muqiao Yang · Simran Kaur · Qianli Ma · Yao-Hung Hubert Tsai · Ruslan Salakhutdinov · Sho Yaida · Zachary Lipton · Daniel Roy · Michael Carbin · Florent Krzakala · Lenka Zdeborová · Guy Gur-Ari · Ethan Dyer · Dilip Krishnan · Hossein Mobahi · Samy Bengio · Behnam Neyshabur · Praneeth Netrapalli · Kris Sankaran · Julien Cornebise · Yoshua Bengio · Vincent Michalski · Samira Ebrahimi Kahou · Md Rifat Arefin · Jiri Hron · Jaehoon Lee · Jascha Sohl-Dickstein · Samuel Schoenholz · David Schwab · Dongyu Li · Sang Choe · Henning Petzka · Ashish Verma · Zhichao Lin · Cristian Sminchisescu -
2019 Poster: Defending Against Neural Fake News »
Rowan Zellers · Ari Holtzman · Hannah Rashkin · Yonatan Bisk · Ali Farhadi · Franziska Roesner · Yejin Choi -
2019 Poster: Model Similarity Mitigates Test Set Overuse »
Horia Mania · John Miller · Ludwig Schmidt · Moritz Hardt · Benjamin Recht -
2019 Poster: Unlabeled Data Improves Adversarial Robustness »
Yair Carmon · Aditi Raghunathan · Ludwig Schmidt · John Duchi · Percy Liang -
2019 Poster: A Meta-Analysis of Overfitting in Machine Learning »
Becca Roelofs · Vaishaal Shankar · Benjamin Recht · Sara Fridovich-Keil · Moritz Hardt · John Miller · Ludwig Schmidt -
2019 Poster: Discovering Neural Wirings »
Mitchell Wortsman · Ali Farhadi · Mohammad Rastegari -
2019 Poster: One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers »
Ari Morcos · Haonan Yu · Michela Paganini · Yuandong Tian -
2018 Poster: Insights on representational similarity in neural networks with canonical correlation »
Ari Morcos · Maithra Raghu · Samy Bengio -
2017 : End-to-end Learning for Broad Coverage Semantics: SRL, Coreference, and Beyond »
Luke Zettlemoyer -
2008 Poster: Multi-Agent Filtering with Infinitely Nested Beliefs »
Luke Zettlemoyer · Brian Milch · Leslie Kaelbling -
2008 Spotlight: Multi-Agent Filtering with Infinitely Nested Beliefs »
Luke Zettlemoyer · Brian Milch · Leslie Kaelbling