
Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets
Philippe Laban · Chien-Sheng Wu · Wenhao Liu · Caiming Xiong

Sat Dec 03 08:15 AM -- 08:25 AM (PST)
Event URL: https://openreview.net/forum?id=3fCT7OAleao

Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. To pass an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We invite the community to adopt NND as a generic method for NLG evaluation and contribute new NND test collections.
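The NND comparison described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes candidates are represented by hypothetical per-token log-probabilities already obtained from some model, and simply checks whether the high-quality candidate receives a higher total likelihood than the near-negative one.

```python
def sequence_log_likelihood(token_log_probs):
    """Log-likelihood of a candidate = sum of its per-token log-probabilities."""
    return sum(token_log_probs)

def nnd_test(high_quality_log_probs, near_negative_log_probs):
    """An NND test passes when the model assigns a strictly higher
    likelihood to the high-quality candidate than to the near-negative one."""
    return (sequence_log_likelihood(high_quality_log_probs)
            > sequence_log_likelihood(near_negative_log_probs))

def nnd_pass_rate(test_pairs):
    """test_pairs: list of (high_quality_log_probs, near_negative_log_probs)."""
    passed = sum(nnd_test(hq, nn) for hq, nn in test_pairs)
    return passed / len(test_pairs)

# Toy example with made-up per-token log-probabilities:
test_pairs = [
    ([-0.1, -0.2, -0.3], [-0.5, -0.9, -0.4]),  # model prefers the good candidate -> pass
    ([-1.2, -1.5], [-0.3, -0.2]),              # model prefers the near-negative  -> fail
]
print(nnd_pass_rate(test_pairs))  # -> 0.5
```

In the paper's full setup the pass rate is additionally broken down by the error category annotated on each near-negative candidate, yielding an error-type profile per model.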

Author Information

Philippe Laban (Salesforce.com)

Philippe is a Research Scientist at Salesforce Research, New York. Previously he completed his Ph.D. in Computer Science at UC Berkeley, advised by Marti Hearst and John Canny.

Chien-Sheng Wu (Salesforce Research)
Wenhao Liu (Salesforce inc.)
Caiming Xiong (Salesforce Research)
