Interactive machine learning (IML) studies algorithms that learn from data collected through interaction with a computational or human agent in a shared environment, typically via feedback on the model's decisions. In contrast to the standard supervised learning paradigm, IML does not assume access to pre-collected labeled data, which lowers data costs; instead, systems improve over time as users, including non-experts, provide feedback. IML has seen wide success in areas such as video games and recommendation systems.
Although most downstream applications of NLP involve interaction with humans (e.g., via labels, demonstrations, corrections, or evaluation), common NLP models are not built to learn from or adapt to users through interaction. A large research gap remains to be closed before NLP systems can adapt on the fly, through interaction, to the changing needs of humans and dynamic environments.
Sat 7:00 a.m. - 7:05 a.m. | Opening Remarks (Introduction)
Sat 7:05 a.m. - 7:35 a.m. | Karthik Narasimhan: Semantic Supervision for few-shot generalization and personalization (Invited Talk)
A desirable feature of interactive NLP systems is the ability to receive feedback from humans and personalize to new users. Existing paradigms encounter challenges in acquiring new concepts due to their use of discrete labels and scalar rewards. As one way to alleviate this problem, I will present our work on Semantic Supervision (SemSup), which trains models to predict over multiple natural language descriptions of classes (or even structured ones like JSON). SemSup can seamlessly replace any standard supervised learning setup without sacrificing in-distribution accuracy, while providing generalization to unseen concepts and scalability to large label spaces.
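To make the label-description idea concrete, here is a minimal sketch of SemSup-style prediction, assuming a bi-encoder setup: the input and several natural-language descriptions of each class are embedded, and class scores are pooled input-description similarities. The encoders (random embeddings here) and mean pooling are stand-ins for illustration, not the architecture from the talk.

```python
# A minimal sketch of the SemSup idea, assuming a bi-encoder setup:
# score the input against several natural-language descriptions per
# class, so new classes can be added at test time just by writing
# descriptions. Real encoders are replaced by random embeddings here.
import torch
import torch.nn.functional as F

def semsup_logits(input_emb: torch.Tensor, desc_embs: torch.Tensor) -> torch.Tensor:
    """input_emb: (d,) encoding of the input text.
    desc_embs: (num_classes, num_descs, d) encodings of class descriptions."""
    sims = desc_embs @ input_emb   # (num_classes, num_descs) similarity scores
    return sims.mean(dim=-1)       # pool over each class's descriptions

# Stand-in embeddings: 2 classes with 3 descriptions each, dimension 16.
input_emb, desc_embs = torch.randn(16), torch.randn(2, 3, 16)
print(F.softmax(semsup_logits(input_emb, desc_embs), dim=-1))
```

Because classes are represented by text rather than fixed indices, an unseen class can be scored at inference time simply by appending its descriptions to `desc_embs`.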
Sat 7:35 a.m. - 8:05 a.m. | John Langford (Invited Talk)
Sat 8:05 a.m. - 8:35 a.m. | Coffee Break
Sat 8:35 a.m. - 8:50 a.m. | Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization (Contributed Talk)
We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space and a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question arises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce RL4LMs (Reinforcement Learning for Language Models), an open-source modular library for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al., 2020) with an arbitrary reward function. Next, we present GRUE (General Reinforced-language Understanding Evaluation), a benchmark of six language generation tasks supervised not by target strings but by reward functions that capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce Natural Language Policy Optimization (NLPO), an easy-to-use, performant RL algorithm that learns to effectively reduce the combinatorial action space in language generation. We show (1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences, and (2) that NLPO exhibits greater stability and performance than previous policy-gradient methods (e.g., PPO; Schulman et al., 2017), based on both automatic and human evaluation.
Presenter: Prithviraj Ammanabrolu
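As a much-simplified illustration of the setting (this is not the RL4LMs library or the NLPO algorithm), the sketch below runs plain REINFORCE on a HuggingFace causal LM against an arbitrary reward function; the toy reward, model choice, and hyperparameters are invented for illustration.

```python
# A toy REINFORCE loop for the setting described above: optimize a
# HuggingFace LM against an arbitrary reward function. NOT RL4LMs or
# NLPO; the reward function and hyperparameters are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def reward_fn(text: str) -> float:
    return float(len(text.split()) <= 10)  # toy reward: favor short outputs

prompt = "Summarize: interactive NLP systems"
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

for step in range(3):
    # Sample a continuation from the current policy.
    out = model.generate(**inputs, do_sample=True, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
    gen_ids = out[0, prompt_len:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True)

    # Log-probability of the sampled tokens under the current policy.
    logits = model(out).logits[0, :-1]  # logits at position t predict token t+1
    logps = torch.log_softmax(logits, dim=-1)
    gen_logp = logps[prompt_len - 1:].gather(1, gen_ids.unsqueeze(1)).sum()

    # REINFORCE: push up the log-probability of high-reward samples.
    loss = -reward_fn(text) * gen_logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(step, reward_fn(text), text)
```

The instability the abstract mentions is visible even here: the gradient signal is a single scalar reward spread over a combinatorially large space of sampled token sequences, which is what motivates the variance-reduction and action-space-reduction machinery in PPO and NLPO.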
Sat 8:50 a.m. - 9:05 a.m. | WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents (Contributed Talk)
Existing benchmarks for grounding language in interactive environments either lack real-world linguistic elements or prove difficult to scale up due to substantial human involvement in collecting data or feedback signals. To bridge this gap, we develop WebShop, a simulated e-commerce website environment with 1.18 million real-world products and 12,087 crowd-sourced text instructions. Given a text instruction specifying a product requirement, an agent must navigate multiple types of webpages and issue diverse actions to find, customize, and purchase an item. WebShop presents several challenges for language grounding, including understanding compositional instructions, query (re-)formulation, comprehending and acting on noisy text in webpages, and performing strategic exploration.
Presenter: Shunyu Yao
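The interaction pattern described above can be pictured as a gym-style loop over text observations and text actions. The environment, action strings, and hard-coded policy below are invented stand-ins for illustration; the real WebShop environment and action format differ.

```python
# A hypothetical agent loop for a WebShop-style environment, to make the
# interaction pattern concrete. The env and policy here are invented for
# illustration; see the WebShop code for the real interface.
from typing import Tuple

class FakeWebShopEnv:
    """Stand-in environment: text observation in, text action out."""
    def reset(self) -> str:
        return ("Instruction: find a machine-washable red t-shirt "
                "under $20. [Search]")

    def step(self, action: str) -> Tuple[str, float, bool]:
        # A real environment would render the next webpage and score the
        # final purchase against the instruction's attributes and price.
        done = action.startswith("click[buy")
        return "Results page ...", (1.0 if done else 0.0), done

def policy(observation: str) -> str:
    # A grounded language agent would choose queries and clicks from the
    # observed page text; here two steps are hard-coded.
    if "[Search]" in observation:
        return "search[red t-shirt machine washable]"
    return "click[buy now]"

env = FakeWebShopEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(policy(obs))
    total_reward += reward
print(total_reward)
```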
Sat 9:05 a.m. - 9:35 a.m. | Dan Weld: From Advice Taking to Active Learning (Invited Talk)
Sat 9:35 a.m. - 10:05 a.m. | Qian Yang (Invited Talk)
Sat 10:05 a.m. - 11:05 a.m. | Lunch Break
Sat 11:05 a.m. - 12:05 p.m. | Poster Sessions
Sat 12:05 p.m. - 12:20 p.m. | InterFair: Debiasing with Natural Language Feedback for Fair Interpretable Predictions (Contributed Talk)
Debiasing methods in NLP models traditionally focus on isolating information related to a sensitive attribute (such as gender or race). We argue instead that a favorable debiasing method should use sensitive information 'fairly', with explanations, rather than blindly eliminating it. This fair balance is often subjective and can be challenging to achieve algorithmically. We show that an interactive setup in which users can provide feedback achieves a better, fairer balance between task performance and bias mitigation, supported by faithful explanations.
Presenter: Zexue He
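One way to picture the interactive setup is a loop in which the model exposes per-token contributions to its prediction, the user flags a use of sensitive information they consider unfair, and the model re-predicts with that token down-weighted. The toy scorer and feedback mechanism below are invented for illustration and are not the InterFair method.

```python
# A toy interactive-debiasing loop: explain per-token contributions,
# accept user feedback on unfair uses of sensitive tokens, re-predict.
# The lexicon-based scorer is a stand-in for a trained classifier.
from typing import Dict, List

LEXICON = {"nurse": 0.9, "she": 0.8, "experienced": 0.7}

def contributions(tokens: List[str], weights: Dict[str, float]) -> Dict[str, float]:
    # Faithful-by-construction explanation: per-token contributions.
    return {t: weights.get(t, 1.0) * LEXICON.get(t, 0.0) for t in tokens}

def predict(tokens: List[str], weights: Dict[str, float]) -> float:
    return sum(contributions(tokens, weights).values())

tokens = ["she", "is", "an", "experienced", "nurse"]
weights: Dict[str, float] = {}
print(predict(tokens, weights), contributions(tokens, weights))

# User feedback: relying on the gendered token "she" is unfair here.
weights["she"] = 0.0  # down-weight its use rather than deleting the input
print(predict(tokens, weights), contributions(tokens, weights))
```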
Sat 12:20 p.m. - 12:35 p.m. | Error Detection for Interactive Text-to-SQL Semantic Parsing (Contributed Talk)
Despite remarkable progress in Text-to-SQL semantic parsing, the performance of state-of-the-art parsers is still far from perfect. At the same time, modern deep-learning-based Text-to-SQL parsers are often over-confident, casting doubt on their trustworthiness in interactive settings. In this paper, we propose to train parser-agnostic error detectors for Text-to-SQL semantic parsers. We test our approach with SmBoP and show that our model outperforms parser-dependent uncertainty measures in simulated interactive evaluations. As a result, when used for answer triggering or interaction triggering in interactive semantic parsing systems, our model can effectively improve the usability of the base parser.
Presenter: Shijie Chen
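A minimal sketch of how such a detector plugs into an interactive parser, as framed in the abstract: a separate classifier scores (question, predicted SQL) pairs, and a low correctness score triggers interaction with the user instead of silently executing the query. The scorer, threshold, and prompt text below are hypothetical placeholders.

```python
# Parser-agnostic error detection for interactive text-to-SQL, sketched:
# a detector scores the parse, and low confidence triggers a user check.
from dataclasses import dataclass

@dataclass
class Parse:
    question: str
    sql: str

def error_detector(parse: Parse) -> float:
    """Placeholder for a trained classifier; returns P(SQL is correct).
    A real detector would encode the question/SQL pair with a pretrained
    encoder, trained on the base parser's past successes and failures."""
    return 0.35  # pretend the detector is suspicious of this parse

def answer_or_ask(parse: Parse, threshold: float = 0.5) -> str:
    if error_detector(parse) >= threshold:
        return f"EXECUTE: {parse.sql}"
    # Interaction trigger: hand the parse back to the user to confirm.
    return f"ASK USER: I parsed your question as `{parse.sql}`. Is that right?"

print(answer_or_ask(Parse("How many flights leave on Monday?",
                          "SELECT COUNT(*) FROM flights WHERE day = 'Mon'")))
```

Because the detector only consumes the question and the produced SQL, it can wrap any base parser without access to its internals, which is what "parser-agnostic" buys over the parser's own confidence scores.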
Sat 12:35 p.m. - 1:05 p.m. | Anca Dragan: Learning human preferences from language (Invited Talk)
In classic instruction following, language like "I'd like the JetBlue flight" maps to actions (e.g., selecting that flight). However, language also conveys information about a user's underlying reward function (e.g., a general preference for JetBlue), which can allow a model to carry out desirable actions in new contexts. In this talk, I'll share a model that infers rewards from language pragmatically: reasoning about how speakers choose utterances not only to elicit desired actions, but also to reveal information about their preferences.
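One standard way to formalize "reasoning about how speakers choose utterances" is a rational-speech-acts-style Bayesian listener: infer P(reward | utterance) under a model of a speaker who picks utterances roughly optimally for their reward. The sketch below is in that spirit, with invented utterances, candidate rewards, and usefulness scores; it is not the model from the talk.

```python
# A toy RSA-style pragmatic listener: infer a posterior over candidate
# reward functions from an utterance, assuming a Boltzmann-rational
# speaker. All utterances, rewards, and scores are invented stand-ins.
import math

rewards = ["prefers JetBlue", "prefers cheapest", "no preference"]
utterances = ["I'd like the JetBlue flight", "Book the cheap one"]

# How useful each utterance is to a speaker holding each reward.
usefulness = {
    ("I'd like the JetBlue flight", "prefers JetBlue"): 2.0,
    ("I'd like the JetBlue flight", "prefers cheapest"): 0.0,
    ("I'd like the JetBlue flight", "no preference"): 0.5,
    ("Book the cheap one", "prefers JetBlue"): 0.0,
    ("Book the cheap one", "prefers cheapest"): 2.0,
    ("Book the cheap one", "no preference"): 0.5,
}

def speaker(utt: str, r: str, beta: float = 1.0) -> float:
    """P(utterance | reward): a Boltzmann-rational speaker model."""
    z = sum(math.exp(beta * usefulness[(u, r)]) for u in utterances)
    return math.exp(beta * usefulness[(utt, r)]) / z

def listener(utt: str) -> dict:
    """P(reward | utterance) via Bayes' rule with a uniform prior."""
    post = {r: speaker(utt, r) for r in rewards}
    z = sum(post.values())
    return {r: p / z for r, p in post.items()}

print(listener("I'd like the JetBlue flight"))
```

The posterior over rewards, rather than a single mapped action, is what lets the agent act sensibly in new contexts (e.g., preferring JetBlue on a future booking).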
Sat 1:05 p.m. - 1:35 p.m. | Coffee Break
Sat 1:35 p.m. - 2:05 p.m. | Aida Nematzadeh: On Evaluating Neural Representations (Invited Talk)
There has been increased interest in developing general-purpose pretrained models across domains such as language, vision, and multimodal learning. This approach is appealing because we can pretrain models on large datasets once and then adapt them to various tasks using smaller supervised datasets. Moreover, these models achieve impressive results on a range of benchmarks, often performing better than task-specific models. Finally, this pretraining approach processes data passively, without relying on active interaction with humans. In this talk, I will first discuss which aspects of language children can learn passively, and to what extent interacting with others might require developing a theory of mind. Next, I will discuss the need for better evaluation pipelines to understand the shortcomings and strengths of pretrained models. In particular, I will talk about: (1) the necessity of directly measuring real-world performance (as opposed to relying on benchmark performance), (2) the importance of strong baselines, and (3) how to design probing datasets that measure specific capabilities of our models. I will focus on commonsense reasoning, verb understanding, and theory of mind as challenging domains for our existing pretrained models.
Sat 2:05 p.m. - 2:50 p.m. | Panel Discussion (Panel)
Sat 2:50 p.m. - 2:55 p.m. | Closing Remarks (Closing)