Oral

Language Models are Few-Shot Learners

Tom B Brown ⋅ Benjamin Mann ⋅ Nick Ryder ⋅ Melanie Subbiah ⋅ Jared Kaplan ⋅ Prafulla Dhariwal ⋅ Arvind Neelakantan ⋅ Pranav Shyam ⋅ Girish Sastry ⋅ Amanda Askell ⋅ Sandhini Agarwal ⋅ Ariel Herbert-Voss ⋅ Gretchen M Krueger ⋅ Tom Henighan ⋅ Rewon Child ⋅ Aditya Ramesh ⋅ Daniel Ziegler ⋅ Jeffrey Wu ⋅ Clemens Winter ⋅ Chris Hesse ⋅ Mark Chen ⋅ Eric Sigler ⋅ Mateusz Litwin ⋅ Scott Gray ⋅ Benjamin Chess ⋅ Jack Clark ⋅ Christopher Berner ⋅ Sam McCandlish ⋅ Alec Radford ⋅ Ilya Sutskever ⋅ Dario Amodei

Outstanding Paper

2020 Oral

Abstract

We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
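To make "tasks and few-shot demonstrations specified purely via text" concrete, here is a minimal sketch of how such a prompt might be assembled. The function name `build_few_shot_prompt`, the `Input:`/`Output:` labels, and the translation examples are illustrative assumptions, not the format used in the paper. The resulting string would be passed to the model as context, with no gradient updates involved.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble a plain-text prompt: task description, K demonstrations, then the query.

    The "Input:"/"Output:" layout is a hypothetical convention chosen for
    illustration; it is not taken from the paper.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")  # blank line between demonstrations
    # The final item has no answer; the model is expected to continue the text.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Two demonstrations (K = 2) followed by the item to be completed.
prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("good morning", "bonjour")],
    "thank you",
)
print(prompt)
```

In this sketch the number of demonstrations is just the length of the list, so passing an empty list gives a prompt containing only the task description and the query.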
