
Workshop
Human Evaluation of Generative Models
Divyansh Kaushik · Jennifer Hsia · Jessica Huynh · Yonadav Shavit · Samuel Bowman · Ting-Hao Huang · Douwe Kiela · Zachary Lipton · Eric Michael Smith

Sat Dec 03 07:30 AM -- 02:15 PM (PST) @ Room 290

Sat 7:30 a.m. - 7:45 a.m.  Opening Remarks
Divyansh Kaushik

Sat 7:45 a.m. - 8:15 a.m.  Invited Keynote by Jason Weston (Keynote)
Jason Weston

Sat 8:15 a.m. - 8:25 a.m.  Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets (Oral)
Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. To pass an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We invite the community to adopt NND as a generic method for NLG evaluation and contribute new NND test collections.
Philippe Laban · Chien-Sheng Wu · Wenhao Liu · Caiming Xiong

Sat 8:25 a.m. - 8:35 a.m.  Are GAN Biased? Evaluating GAN-Generated Facial Images via Crowdsourcing (Oral)
Generative models produce astonishingly high-resolution and realistic facial images. However, reliably evaluating the quality of these images remains challenging, not to mention performing a systematic investigation of the potential biases in generative adversarial networks (GANs). In this paper, we argue that crowdsourcing can be used to measure the biases in GANs quantitatively. We showcase an investigation that examines whether GAN-generated facial images with darker skin tones are of worse quality.
We ask crowd workers to guess whether an image is real or fake, and use this as a proxy metric for estimating the quality of facial images generated by state-of-the-art GANs. The results show that GANs generate worse-quality images with darker skin tones as compared to images with lighter skin tones.
Hangzhi Guo · Lizhen Zhu · Ting-Hao Huang

Sat 8:35 a.m. - 8:45 a.m.  Towards Credible Human Evaluation of Open-Domain Dialog Systems Using Interactive Setup (Oral)
Evaluating open-domain conversation models has been an open challenge due to the open-ended nature of conversations. In addition to static evaluations, recent work has started to explore a variety of per-turn and per-dialog interactive evaluation mechanisms and to provide advice on the best setup. In this work, we adopt the interactive evaluation framework and further apply it to multiple models, with a focus on per-turn evaluation techniques. Apart from the widely used setting where participants select the best response among different candidates at each turn, we adopt one more novel per-turn evaluation setting, where participants can select all appropriate responses, with different fallback strategies to continue the conversation when no response is selected. We evaluate these settings based on sensitivity and consistency using four GPT2-based models that differ in model size or fine-tuning data. To generalize to any group of models with no prior assumptions on their rankings, and to control evaluation costs for all setups, we also propose a methodology to estimate the required sample size, given a minimum performance gap of interest, before running most experiments. Our comprehensive human evaluation results shed light on how to conduct credible human evaluations of open-domain dialog systems using the interactive setup, and suggest additional future directions.
Sijia Liu · Patrick Lange · Behnam Hedayatnia · Alexandros Papangelis · Di Jin · Andrew Wirth · Yang Liu · Dilek Hakkani-Tur

Sat 8:45 a.m. - 9:30 a.m.  Panel on Technical Challenges Associated with Reliable Human Evaluations of Generative Models (Discussion Panel)
Long Ouyang · Tongshuang Wu · Zachary Lipton

Sat 9:30 a.m. - 11:00 a.m.  Lunch Break (Break)

Sat 11:00 a.m. - 11:50 a.m.  Discussion on Policy Challenges Associated with Generative Models (Discussion Panel)
Irene Solaiman · Russell Wald · Yonadav Shavit · Long Ouyang

Sat 11:50 a.m. - 12:00 p.m.  Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark (Oral)
We provide a new multi-task benchmark for evaluating text-to-image models and perform a human evaluation comparing two of the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings. Text-to-image generation has seen rapid progress, to the point that many recent models have demonstrated their ability to create realistic high-resolution images for various prompts. However, current text-to-image methods, and the broader body of research in vision-language understanding, still struggle with intricate text prompts that contain many objects with multiple attributes and relationships. We introduce a new text-to-image benchmark that contains a suite of fifty tasks and applications that capture a model's ability to handle different features of a text prompt: for example, asking a model to generate a varying number of the same object to measure its ability to count, or providing a text prompt with several objects that each have a different attribute to measure its ability to match objects and attributes.
Rather than subjectively evaluating text-to-image results on a set of prompts, our new multi-task benchmark consists of challenge tasks at three difficulty levels (easy, medium, and hard), along with human ratings for each generated image.
Vitali Petsiuk · Alexander E. Siemenn · Saisamrit Surbehera · Qi Qi Chin · Keith Tyser · Gregory Hunter · Arvind Raghavan · Yann Hicke · Bryan Plummer · Ori Kerret · Tonio Buonassisi · Kate Saenko · Armando Solar-Lezama · Iddo Drori

Sat 12:00 p.m. - 12:10 p.m.  Can There be Art Without an Artist? (Oral)
Generative Adversarial Network (GAN) based art has proliferated in the past year, going from a shiny new tool for generating fake human faces to a stage where anyone can generate thousands of artistic images with minimal effort. Some of these images are now "good" enough to win accolades from qualified judges. In this paper, we explore how generative models have impacted artistry, not only from a qualitative point of view, but also from the angle of exploitation of artisans: both via plagiarism, where models are trained on their artwork without permission, and via profit shifting, where profits in the art market have shifted from art creators to model owners or to traders in the unregulated secondary crypto market. This confluence of factors risks completely detaching humans from the artistic process, devaluing the labor of artists and distorting the public perception of the value of art.
Avijit Ghosh · Genoveva Fossas

Sat 12:10 p.m. - 12:20 p.m.  Best Prompts for Text-to-Image Models and How to Find Them (Oral)
Recent progress in generative models, especially in text-guided diffusion models, has enabled the production of aesthetically pleasing imagery resembling the works of professional human artists. However, one has to carefully compose the textual description, called the prompt, and augment it with a set of clarifying keywords.
Since aesthetics are challenging to evaluate computationally, human feedback is needed to determine the optimal prompt formulation and keyword combination. In this paper, we present a human-in-the-loop approach to learning the most useful combination of prompt keywords using a genetic algorithm. We also show how such an approach can improve the aesthetic appeal of images depicting the same descriptions.
Nikita Pavlichenko · Fedor Zhdanov · Dmitry Ustalov

Sat 12:20 p.m. - 12:30 p.m.  Evaluation of Synthetic Datasets for Conversational Recommender Systems (Oral)
For researchers leveraging large language models (LLMs) to generate training datasets, especially for conversational recommender systems, the absence of robust evaluation frameworks has been a long-standing problem. The efficiency brought about by LLMs in the data generation phase is impeded during evaluation of the generated data, since evaluation generally requires human raters to ensure that the generated data is of high quality and has sufficient diversity. Since the quality of training data is critical for downstream applications, it is important to develop metrics that evaluate quality holistically and identify biases. In this paper, we present a framework that takes a multi-faceted approach to evaluating datasets produced by generative models, and we discuss the advantages and limitations of various evaluation methods.
Harsh Lara · Manoj Tiwari

Sat 12:30 p.m. - 12:45 p.m.  Coffee Break (Break)

Sat 12:45 p.m. - 1:35 p.m.  Panel and Q&A with Science Funders Interested in Reliable Human Evaluation of Generative Models (Panel)
Brittany Smith · Eric Sears · Yonadav Shavit

Sat 1:35 p.m. - 1:45 p.m.  Operationalizing Specifications, In Addition to Test Sets for Evaluating Constrained Generative Models (Oral)
In this work, we present recommendations on the evaluation of state-of-the-art generative models for constrained generation tasks.
Progress on generative models has been rapid in recent years. These large-scale models have had three impacts. First, the fluency of generation in both language and vision modalities has rendered common average-case evaluation metrics much less useful in diagnosing system errors. Second, user expectations around these models and their feted public releases have made the technical problem of out-of-domain generalization less useful. Third, the same substrate models now form the basis of a number of applications, driven both by the utility of their representations and by phenomena such as in-context learning. Our evaluation methodologies, however, haven't adapted to these changes. More concretely, while the methods of interacting with models have risen in abstraction level, a similar rise has not been observed in evaluation practices. In this paper, we argue that the scale of generative models could be exploited to raise the abstraction level at which evaluation itself is conducted, and we provide recommendations for doing so. Our recommendations leverage specifications as a powerful instrument for evaluating generation quality and are readily applicable to a variety of tasks.
Vikas Raunak · Matt Post · Arul Menezes

Sat 1:45 p.m. - 1:55 p.m.  Sensemaking Interfaces for Human Evaluation of Language Model Outputs (Oral)
Ensuring a language model doesn't generate problematic text is difficult. Traditional evaluation methods, like automatic measures or human annotation, can fail to detect all problems, whether because system designers were not aware of a kind of problem they should attempt to detect, or because an automatic measure fails to reliably detect certain kinds of problems. In this paper, we propose sensemaking tools as a robust and open-ended method for evaluating the large number of linguistic outputs produced by a language model.
We demonstrate one potential sensemaking interface based on concordance tables, showing that we are able to detect problematic outputs and distributional shifts in minutes, despite not knowing exactly what kind of problems to look for.
Katy Gero · Jonathan Kummerfeld · Elena Glassman

Sat 1:55 p.m. - 2:05 p.m.  The Reasonable Effectiveness of Diverse Evaluation Data (Oral)
In this paper, we present findings from a semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI chatbot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and we correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development, specifically human evaluation of generative models, against the backdrop of growing work on socio-technical evaluations.
Lora Aroyo · Mark Diaz · Christopher M. Homan · Vinodkumar Prabhakaran · Alex Taylor · Ding Wang

Sat 2:05 p.m. - 2:15 p.m.  Closing Remarks
Jessica Huynh
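
The Near-Negative Distinction evaluation described in the 8:15 a.m. abstract reduces to a simple pairwise check: a model passes a test when it scores the high-quality candidate above the near-negative one, and failures are tallied per error type. The sketch below illustrates that core mechanism only; the scorer, test tuples, and function names are illustrative stand-ins, not the authors' implementation (a real scorer would sum token log-probabilities from an NLG model).

```python
from collections import Counter
from typing import Callable

def nnd_evaluate(score_fn: Callable[[str, str], float],
                 tests: list[tuple[str, str, str, str]]):
    """Return the overall NND pass rate and per-error-type failure counts.

    Each test is (context, high_quality, near_negative, error_type);
    score_fn(context, candidate) returns the model's log-likelihood of the
    candidate given the context (higher means more likely).
    """
    passes = 0
    failures_by_error = Counter()
    for context, good, bad, error_type in tests:
        # A test is passed when the known-good candidate is more likely.
        if score_fn(context, good) > score_fn(context, bad):
            passes += 1
        else:
            failures_by_error[error_type] += 1
    return passes / len(tests), failures_by_error

# Toy scorer standing in for a real NLG model: prefers shorter candidates.
def toy_score(context: str, candidate: str) -> float:
    return -float(len(candidate.split()))

tests = [
    ("Q: capital of France?", "Paris.",
     "Paris is the capital of Germany.", "factual"),
    ("Summarize: the cat sat.", "A cat sat.",
     "A cat sat on on the mat mat.", "fluency"),
]
pass_rate, failures = nnd_evaluate(toy_score, tests)
```

The per-error-type tally is what distinguishes NND from a single-number metric: it shows which kinds of known errors the model still fails to rank below good outputs.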
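
The 8:35 a.m. abstract proposes estimating the required sample size, given a minimum performance gap of interest, before running evaluation experiments. A standard way to do this for pairwise win rates is a two-proportion power calculation under the normal approximation; the sketch below uses that textbook formula as an illustration, and the paper's actual methodology may differ.

```python
from statistics import NormalDist

def required_sample_size(p1: float, p2: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Ratings per model needed to detect the gap p1 - p2 (two-sided z-test).

    p1, p2: the two win rates (proportions) we want to tell apart.
    alpha: false-positive rate; power: probability of detecting a true gap.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = z.inv_cdf(power)            # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return int(n) + 1  # round up to a whole number of ratings

# Detecting a 10-point gap (55% vs 45% per-turn win rate) at the usual
# alpha = 0.05, power = 0.8 thresholds:
n = required_sample_size(0.55, 0.45)
```

The practical point of running this before collecting annotations is cost control: halving the minimum gap of interest roughly quadruples the number of ratings needed, so the gap one commits to detecting largely determines the evaluation budget.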