A core part of AI alignment is training AI systems to be helpful, or more generally, to interact with humans appropriately. We study this problem in the context of large language models. Prior work has focused on training these models to perform specific tasks or follow instructions. In contrast, we believe helpfulness requires back-and-forth interaction between the AI and the human it is trying to assist. Here, we consider a multi-step interaction in which a human asks a question, and the AI has an opportunity to ask a clarifying question to resolve ambiguities before responding. The assistance framework formalizes the idea of an AI that aims to maximize the human's reward while being ignorant of the human's reward function. Prior work solved toy assistance environments using exact POMDP solvers as well as deep reinforcement learning. We take a behavioral cloning approach and fine-tune GPT-3 so that it can respond to clear input questions directly, clarify the intent behind vague input questions, and respond based on the clarification it receives. We show that this approach yields quantitative improvements in answer accuracy over a baseline that cannot ask for clarifications. While the assistance framework assumes that the correct behavior of an AI is to infer and maximize a human's reward, our approach can be used to learn any interaction protocol between the AI and the human. We believe that exploring interaction protocols that are easy to learn robustly and can be used to "bootstrap" further alignment is a promising direction for future research.
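The abstract does not include implementation details, so the following is only a minimal sketch of how behavioral-cloning data for the two-step interaction protocol could be serialized for GPT-3 fine-tuning in the prompt/completion JSONL format. The episode contents, dialogue formatting, and file name are illustrative assumptions, not the authors' released code.

    # Minimal sketch (assumptions, not the paper's actual pipeline): build
    # prompt/completion records in which the model either answers a clear
    # question directly, or asks a clarifying question and then answers
    # after the human's reply.
    import json

    # Hypothetical training episodes.
    episodes = [
        {   # clear question -> direct answer
            "question": "What year did Apollo 11 land on the Moon?",
            "clarification": None,
            "answer": "1969",
        },
        {   # vague question -> clarifying question -> answer
            "question": "How long does the flight take?",
            "clarification": ("Which route do you mean?", "London to New York"),
            "answer": "Roughly eight hours.",
        },
    ]

    def to_finetune_records(ep):
        """Turn one episode into prompt/completion pairs (behavioral cloning)."""
        if ep["clarification"] is None:
            prompt = f"Human: {ep['question']}\nAI:"
            yield {"prompt": prompt, "completion": " " + ep["answer"]}
        else:
            clar_q, clar_a = ep["clarification"]
            # Step 1: imitate asking the clarifying question.
            prompt1 = f"Human: {ep['question']}\nAI:"
            yield {"prompt": prompt1, "completion": " " + clar_q}
            # Step 2: imitate answering given the human's clarification.
            prompt2 = (f"Human: {ep['question']}\nAI: {clar_q}\n"
                       f"Human: {clar_a}\nAI:")
            yield {"prompt": prompt2, "completion": " " + ep["answer"]}

    with open("assistance_finetune.jsonl", "w") as f:
        for ep in episodes:
            for record in to_finetune_records(ep):
                f.write(json.dumps(record) + "\n")

At inference time, under this sketch, the fine-tuned model would be prompted with "Human: <question>\nAI:"; if it emits a clarifying question, the human's reply is appended and the model is prompted again for the final answer.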
Author Information
Dmitrii Krasheninnikov (University of Cambridge)
Egor Krasheninnikov (University of Cambridge)
David Krueger (University of Cambridge)
More from the Same Authors
- 2021 : Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models
  Enoch Tetteh · David Krueger · Joseph Paul Cohen · Yoshua Bengio
- 2022 : Domain Generalization for Robust Model-Based Offline Reinforcement Learning
  Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger
- 2022 : Mechanistic Lens on Mode Connectivity
  Ekdeep S Lubana · Eric Bigelow · Robert Dick · David Krueger · Hidenori Tanaka
- 2022 : Domain Generalization for Robust Model-Based Offline RL
  Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger
- 2022 : On The Fragility of Learned Reward Functions
  Lev McKinney · Yawen Duan · Adam Gleave · David Krueger
- 2022 : Training Equilibria in Reinforcement Learning
  Lauro Langosco · David Krueger · Adam Gleave
- 2022 : Assistance with large language models
  Dmitrii Krasheninnikov · Egor Krasheninnikov · David Krueger
- 2022 : Unifying Grokking and Double Descent
  Xander Davies · Lauro Langosco · David Krueger
- 2023 Poster: Thinker: Learning to Plan and Act
  Stephen Chung · Ivan Anokhin · David Krueger
- 2023 Workshop: Socially Responsible Language Modelling Research (SoLaR)
  Usman Anwar · David Krueger · Samuel Bowman · Jakob Foerster · Su Lin Blodgett · Roberta Raileanu · Alan Chan · Katherine Lee · Laura Ruis · Robert Kirk · Yawen Duan · Xin Chen · Kawin Ethayarajh
- 2022 Poster: Defining and Characterizing Reward Gaming
  Joar Skalse · Nikolaus Howe · Dmitrii Krasheninnikov · David Krueger
- 2019 : Poster Session
  Rishav Chourasia · Yichong Xu · Corinna Cortes · Chien-Yi Chang · Yoshihiro Nagano · So Yeon Min · Benedikt Boecking · Phi Vu Tran · Kamyar Ghasemipour · Qianggang Ding · Shouvik Mani · Vikram Voleti · Rasool Fakoor · Miao Xu · Kenneth Marino · Lisa Lee · Volker Tresp · Jean-Francois Kagy · Marvin Zhang · Barnabas Poczos · Dinesh Khandelwal · Adrien Bardes · Evan Shelhamer · Jiacheng Zhu · Ziming Li · Xiaoyan Li · Dmitrii Krasheninnikov · Ruohan Wang · Mayoore Jaiswal · Emad Barsoum · Suvansh Sanjeev · Theeraphol Wattanavekin · Qizhe Xie · Sifan Wu · Yuki Yoshida · David Kanaa · Sina Khoshfetrat Pakazad · Mehdi Maasoumy