Large language models produce human-like text that drives a growing number of applications. However, recent literature and, increasingly, real-world observations have demonstrated that these models can generate language that is toxic, biased, untruthful, or otherwise harmful. Though work to evaluate language model harms is underway, translating foresight about which harms may arise into rigorous benchmarks is not straightforward. To facilitate this translation, we outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks. We then use these characteristics as a lens to identify trends and gaps in existing benchmarks. Finally, we apply them in a case study of the Perspective API, a toxicity classifier that is widely used in harm benchmarks. Our characteristics provide one piece of the bridge that translates between foresight and effective evaluation.
Author Information
Maribeth Rauh (DeepMind)
John Mellor (DeepMind)
Jonathan Uesato (Google DeepMind)
Po-Sen Huang (DeepMind)
Johannes Welbl (Google)
Laura Weidinger (DeepMind)
Sumanth Dathathri (DeepMind)
Amelia Glaese (DeepMind)
Geoffrey Irving (DeepMind)
Iason Gabriel (DeepMind)
William Isaac (DeepMind)
Lisa Anne Hendricks (DeepMind)
More from the Same Authors
- 2021 Spotlight: Make Sure You're Unsure: A Framework for Verifying Probabilistic Specifications »
  Leonard Berrada · Sumanth Dathathri · Krishnamurthy Dvijotham · Robert Stanforth · Rudy Bunel · Jonathan Uesato · Sven Gowal · M. Pawan Kumar
- 2022 Poster: An empirical analysis of compute-optimal large language model training »
  Jordan Hoffmann · Sebastian Borgeaud · Arthur Mensch · Elena Buchatskaya · Trevor Cai · Eliza Rutherford · Diego de Las Casas · Lisa Anne Hendricks · Johannes Welbl · Aidan Clark · Thomas Hennigan · Eric Noland · Katherine Millican · George van den Driessche · Bogdan Damoc · Aurelia Guy · Simon Osindero · Karén Simonyan · Erich Elsen · Oriol Vinyals · Jack Rae · Laurent Sifre
- 2022 Social: Ethics Review - Open Discussion »
  Deborah Raji · William Isaac · Cherie Poland · Alexandra Luccioni
- 2022 Poster: Fine-tuning language models to find agreement among humans with diverse preferences »
  Michiel Bakker · Martin Chadwick · Hannah Sheahan · Michael Tessler · Lucy Campbell-Gillingham · Jan Balaguer · Nat McAleese · Amelia Glaese · John Aslanides · Matt Botvinick · Christopher Summerfield
- 2021 Poster: Make Sure You're Unsure: A Framework for Verifying Probabilistic Specifications »
  Leonard Berrada · Sumanth Dathathri · Krishnamurthy Dvijotham · Robert Stanforth · Rudy Bunel · Jonathan Uesato · Sven Gowal · M. Pawan Kumar
- 2020: Q&A: William Isaac (DeepMind): Can Cooperation Make AI (and Society) Fairer?, with Natasha Jaques (Google) [moderator] »
  William Isaac · Natasha Jaques
- 2020: Invited Speaker: William Isaac (DeepMind) on Can Cooperation make AI (and Society) Fairer? »
  William Isaac
- 2019: Poster Session 1 »
  Han-Hung Lee · Asir Saeed · Terence Broad · Jon Gillick · Aaron Hertzmann · Gunjan Aggarwal · Eun Jee Sung · Alex Champandard · Junghyun Park · John Mellor · Vincent Herrmann · Da Gin Wu · Seri Lee · Park Jieun · TaeJae Han · wonseok jung · Seungil Kim
- 2019: Panel Discussion »
  Linda Smith · Josh Tenenbaum · Lisa Anne Hendricks · James McClelland · Timothy Lillicrap · Jesse Thomason · Jason Baldridge · Louis-Philippe Morency
- 2019: Lisa Anne Hendricks »
  Lisa Anne Hendricks
- 2019 Poster: Are Labels Required for Improving Adversarial Robustness? »
  Jean-Baptiste Alayrac · Jonathan Uesato · Po-Sen Huang · Alhussein Fawzi · Robert Stanforth · Pushmeet Kohli
- 2017: Poster Session 1 and Lunch »
  Sumanth Dathathri · Akshay Rangamani · Prakhar Sharma · Aruni RoyChowdhury · Madhu Advani · William Guss · Chulhee Yun · Corentin Hardy · Michele Alberti · Devendra Sachan · Andreas Veit · Takashi Shinozaki · Peter Chin