Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-Oriented Dialogue Systems
Yihao Feng · Shentao Yang · Shujian Zhang · Jianguo Zhang · Caiming Xiong · Mingyuan Zhou · Huan Wang
Event URL: https://openreview.net/forum?id=pHpEfbrkJEk

When learning task-oriented dialogue (TOD) agents, one can naturally utilize reinforcement learning (RL) techniques to train conversational strategies that achieve user-specific goals. Existing work on training TOD agents mainly focuses on developing advanced RL algorithms, while the design of the reward function itself is not well studied. This paper discusses how to better learn and utilize reward functions for training TOD agents. Specifically, we propose two generalized objectives for reward function learning, inspired by classical learning-to-rank losses. Further, to address the high variance of policy gradient estimation with REINFORCE, we leverage the Gumbel-Softmax trick to better estimate the gradient for TOD policies, which significantly improves training stability. With the above techniques, we outperform state-of-the-art results on the end-to-end dialogue task on the MultiWOZ 2.0 dataset.
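As a rough illustration of the two ideas in the abstract, the sketch below shows (i) a pairwise learning-to-rank style objective for reward learning and (ii) a Gumbel-Softmax relaxation used in place of REINFORCE when optimizing the policy against the learned reward. All names (`reward_net`, `policy_logits`, `better_turn`, `worse_turn`) and shapes are hypothetical assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_net, better_turn, worse_turn):
    """Learning-to-rank style reward objective (assumed pairwise form):
    the turn that better advances the user goal should receive a higher
    learned reward than the worse turn."""
    r_better = reward_net(better_turn)   # shape: (batch,)
    r_worse = reward_net(worse_turn)     # shape: (batch,)
    return -F.logsigmoid(r_better - r_worse).mean()

def gumbel_softmax_policy_objective(policy_logits, reward_net, tau=1.0):
    """Instead of the high-variance REINFORCE estimator, draw a relaxed
    (differentiable) one-hot action via the Gumbel-Softmax trick so the
    learned reward can be backpropagated through the sampled action."""
    relaxed_action = F.gumbel_softmax(policy_logits, tau=tau, hard=False)
    return -reward_net(relaxed_action).mean()  # minimize negative reward
```

This is a minimal sketch under the stated assumptions; the paper's actual objectives generalize the ranking loss and apply the relaxation within sequence-level TOD policy training.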

Author Information

Yihao Feng (Salesforce Research)

Researcher from Salesforce Research

Shentao Yang (The University of Texas at Austin)
Shujian Zhang (UT Austin)
Jianguo Zhang (Salesforce AI Research)
Caiming Xiong (Salesforce Research)
Mingyuan Zhou (University of Texas at Austin)
Huan Wang (Salesforce Research)
