Workshop
Human in the Loop Learning (HiLL) Workshop at NeurIPS 2022
Shanghang Zhang · Hao Dong · Wei Pan · Pradeep Ravikumar · Vittorio Ferrari · Fisher Yu · Xin Wang · Zihan Ding

Fri Dec 02 06:30 AM -- 03:00 PM (PST) @ Room 396

Recent years have witnessed the rising need for machine learning systems that can interact with humans in the learning loop. Such systems can be applied to computer vision, natural language processing, robotics, and human-computer interaction. Creating and running such systems calls for interdisciplinary research in artificial intelligence, machine learning, and software engineering design, which we abstract as Human in the Loop Learning (HiLL).

The HiLL workshop aims to bring together researchers and practitioners working on the broad areas of HiLL, ranging from interactive/active learning algorithms for real-world decision-making systems (e.g., autonomous driving vehicles, robotic systems, etc.), human-inspired learning that mitigates the gap between human intelligence and machine intelligence, human-machine collaborative learning that creates a more powerful learning system, lifelong learning that transfers knowledge to learn new tasks over a lifetime, as well as interactive system designs (e.g., data visualization, annotation systems, etc.).

The HiLL workshop continues the previous effort to provide a platform for researchers from interdisciplinary areas to share their recent research. In this year’s workshop, a special feature is to encourage the discussion on the interactive and collaborative learning between human and machine learning agents: Can they be organically combined to create a more powerful learning system? We believe the theme of the workshop will be of interest to broad NeurIPS attendees, especially those who are interested in interdisciplinary study.

Fri 6:30 a.m. - 7:00 a.m. Opening Remarks
Fri 7:00 a.m. - 7:30 a.m. Interactive Imitation Learning in Robotics (Invited Talk) Jens Kober
Fri 7:30 a.m. - 8:00 a.m. What to learn from humans? (Invited Talk) Danica Kragic
Fri 8:00 a.m. - 8:30 a.m. Human in the Loop Learning for Robot Navigation and Task Learning from Implicit Human Feedback (Invited Talk) Peter Stone
Fri 8:30 a.m. - 9:00 a.m. Language models and interactive decision-making (Invited Talk) Igor Mordatch
Fri 9:00 a.m. - 9:30 a.m. Collaborative AI for assisting virtual laboratories (Invited Talk) I will discuss two ideas: (1) virtual laboratories for science and R&D, aiming to introduce an interface between algorithms and domain research that enables AI-driven scale advantages, and (2) AI-based ‘sidekick’ assistants. The purpose of the assistants is to help other agents reach their goals, even when they are not yet able to specify the goal explicitly or it is evolving. Such assistants can help with prior knowledge elicitation, at the simplest, and zero-shot assistance as the worst case. Ultimately they should be helpful for human domain experts in running experiments and solving research problems in virtual laboratories. I invite researchers to join the virtual laboratory movement: domain scientists by hosting a virtual laboratory in their field, methods researchers by contributing new methods to virtual laboratories, and human-in-the-loop ML researchers by developing the assistants. Samuel Kaski
Fri 9:30 a.m. - 10:00 a.m. Let’s Give Domain Experts a Choice by Creating Many Approximately-Optimal Machine Learning Models (Invited Talk) Cynthia Rudin
Fri 10:00 a.m. - 11:00 a.m. Poster
Fri 11:00 a.m. - 11:10 a.m. Human Interventions in Concept Graph Networks (Contributed Talk)
Fri 11:10 a.m. - 11:20 a.m. Nano: Nested Human-in-the-Loop Reward Learning for Controlling Distribution of Generated Text (Contributed Talk)
Fri 11:20 a.m. - 11:30 a.m. Differentiable User Models (Contributed Talk)
Fri 11:30 a.m. - 12:00 p.m. TBD (Invited Talk) Dan Bohus
Fri 12:00 p.m. - 12:30 p.m. TBD (Invited Talk) Brenna Argall
Fri 12:30 p.m. - 1:00 p.m. Aligning Humans and Robots: Active Elicitation of Informative and Compatible Queries (Invited Talk) Dorsa Sadigh
Fri 1:00 p.m. - 1:30 p.m. Imitation, Innovation and Caregiving in Children and AI (Invited Talk) Alison Gopnik
Fri 1:30 p.m. - 2:00 p.m. Long-Tailed High-Stakes Human-Machine Interaction (Invited Talk) Ding Zhao
Fri 2:00 p.m. - 3:00 p.m. Panel discussion (Panel)
Fri 3:00 p.m. - Poster
- Modeling Semantic Correlation and Hierarchy for Real-world Wildlife Recognition (Poster) In wildlife imagery, the main challenges for a model to assist human annotation are two-fold: (1) the training dataset is usually imbalanced, which makes the model's suggestions biased, and (2) there are complex taxonomies in the classes. We establish a simple and efficient baseline, including the debiasing loss function and the hyperbolic network architecture, to address these issues and achieve noticeable improvements in image classification accuracy compared to a naive method. Moreover, we propose leveraging the semantic correlation to train the model more effectively by adding a co-occurrence layer to our model during training. The proposed semantic correlation-based learning method significantly improves the performance. We demonstrate the efficacy of our method on both our real-world wildlife aerial survey recognition dataset and the public image classification datasets CIFAR100-LT and CIFAR10-LT. Dong-Jin Kim · Zhongqi Miao · Yunhui Guo · Stella Yu · Kyle Landolt · Mark Koneff · Travis Harrison 🔗 - Making Your First Choice: To Address Cold Start Problem in Vision Active Learning (Poster) Active learning promises to improve annotation efficiency by iteratively selecting the most important data to be annotated first. 
However, we uncover a striking contradiction to this promise: active learning fails to select data as efficiently as random selection at the first few choices. We identify this as the cold start problem in vision active learning, caused by a biased and outlier initial query. This paper seeks to address the cold start problem by exploiting the three advantages of contrastive learning: (1) no annotation is required; (2) label diversity is ensured by pseudo-labels to mitigate bias; (3) typical data is determined by contrastive features to reduce outliers. Experiments are conducted on CIFAR-10-LT and three medical imaging datasets (i.e. Colon Pathology, Abdominal CT, and Blood Cell Microscope). Our initial query not only significantly outperforms existing active querying strategies but also surpasses random selection by a large margin. We foresee our solution to the cold start problem as a simple yet strong baseline to choose the initial query for vision active learning. Code is available: https://github.com/c-liangyu/CSVAL Liangyu Chen · Yutong Bai · Siyu Huang · Yongyi Lu · Bihan Wen · Alan Yuille · Zongwei Zhou 🔗 - Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning (Poster) Preference-based reinforcement learning (RL) algorithms help avoid the pitfalls of hand-crafted reward functions by distilling them from human preference feedback, but they remain impractical due to the burdensome number of labels required from the human, even for relatively simple tasks. In this work, we demonstrate that encoding environment dynamics in the reward function (REED) dramatically reduces the number of preference labels required in state-of-the-art preference-based RL frameworks. We hypothesize that REED-based methods better partition the state-action space and facilitate generalization to state-action pairs not included in the preference dataset. 
REED iterates between encoding environment dynamics in a state-action representation via a self-supervised temporal consistency task, and bootstrapping the preference-based reward function from the state-action representation. Whereas prior approaches train only on the preference-labelled trajectory pairs, REED exposes the state-action representation to all transitions experienced during policy training. We explore the benefits of REED within the PrefPPO (Christiano et al., 2017) and PEBBLE (Lee et al., 2021) preference learning frameworks and demonstrate improvements across experimental conditions to both the speed of policy learning and the final policy performance. For example, on quadruped-walk and walker-walk with 50 preference labels, REED-based reward functions recover 83% and 66% of ground truth reward policy performance, whereas without REED only 38% and 21% are recovered. For some domains, REED-based reward functions result in policies that outperform policies trained on the ground truth reward. Katherine Metcalf · Miguel Sarabia · Barry-John Theobald 🔗 - (When) Are Contrastive Explanations of Reinforcement Learning Helpful? (Poster) Global explanations of a reinforcement learning (RL) agent's expected behavior can make it safer to deploy. However, such explanations are often difficult to understand because of the complicated nature of many RL policies. Effective human explanations are often contrastive, referencing a known contrast (policy) to reduce redundancy. At the same time, these explanations also require the additional effort of referencing that contrast when evaluating an explanation. We conduct a user study to understand whether and when contrastive explanations might be preferable to complete explanations that do not require referencing a contrast. We find that complete explanations are generally more effective when they are the same size or smaller than a contrastive explanation of the same policy, and no worse when they are larger. 
This suggests that contrastive explanations are not sufficient to solve the problem of effectively explaining reinforcement learning policies, and require additional careful study for use in this context. Sanjana Narayanan · Isaac Lage · Finale Doshi-Velez 🔗 - Human Interventions in Concept Graph Networks (Poster) Deploying Graph Neural Networks requires trustworthy models whose interpretable structure and reasoning can support effective human interactions and model checking. Existing explainers fail to address this issue because they provide post-hoc explanations, which neither allow human interaction nor make the model itself more interpretable. To fill this gap, we introduce the Concept Distillation Module, the first differentiable concept-distillation approach for graph networks. The proposed approach is a layer that can be plugged into any graph network to make it explainable by design, by first distilling graph concepts from the latent space and then using these to solve the task. Our results demonstrate that this approach allows graph networks to: (i) support effective human interventions at test time: these can increase human trust as well as significantly improve model performance, (ii) provide high-quality concept-based logic explanations for their prediction, and (iii) attain model accuracy comparable with their equivalent vanilla versions. Lucie Charlotte Magister · Pietro Barbiero · Dmitry Kazhdan · Federico Siciliano · Gabriele Ciravegna · Fabrizio Silvestri · Mateja Jamnik · Pietro Lió 🔗 - Identifying the Context Shift between Test Benchmarks and Production Data (Poster) Machine learning models are often brittle on production data despite achieving high accuracy on benchmark datasets. Benchmark datasets have traditionally served dual purposes: first, benchmarks offer a standard on which machine learning researchers can compare different methods, and second, benchmarks provide a model, albeit imperfect, of the real world. 
The incompleteness of test benchmarks (and the data upon which models are trained) hinders robustness in machine learning, enables shortcut learning, and leaves models systematically prone to err on out-of-distribution and adversarially perturbed data. The mismatch between a single static benchmark dataset and a production dataset has traditionally been described as a dataset shift. In an effort to clarify how to address the mismatch between test benchmarks and production data, we introduce context shift to describe semantically meaningful changes in the underlying data generation process. Moreover, we identify three methods for addressing context shift that would otherwise lead to model prediction errors: first, we describe how human intuition and expert knowledge can identify semantically meaningful features upon which models systematically fail, second, we detail how dynamic benchmarking - with its focus on capturing the data generation process - can promote generalizability through corroboration, and third, we highlight that clarifying a model's limitations can reduce unexpected errors. Robust machine learning is focused on model performance beyond benchmarks, and as such, we consider three model organism domains – facial expression recognition, deepfake detection, and medical diagnosis – to highlight how implicit assumptions in benchmark tasks lead to errors in practice. By paying close attention to the role of context, researchers can design more comprehensive benchmarks, reduce context shift errors, and increase generalizability. Matt Groh 🔗 - Continually Learned Pavlovian Signalling Without Forgetting for Human-in-the-Loop Robotic Control (Poster) Artificial limbs are sophisticated devices to assist people with tasks of daily living. Despite advanced robotic prostheses demonstrating similar motion capabilities to biological limbs, users report that they are difficult and non-intuitive to use. 
Providing more effective feedback from the device to the user has therefore become a topic of increased interest. In particular, prediction learning methods from the field of reinforcement learning (specifically, an approach termed Pavlovian signalling) have been proposed as one approach for better modulating feedback in prostheses since they can adapt during continuous use. One challenge identified in these learning methods is that they can forget previously learned predictions when a user begins to successfully act upon delivered feedback. The present work directly addresses this challenge, contributing new evidence on the impact of algorithmic choices, such as on- or off-policy methods and representation choices, on the Pavlovian signalling from a machine to a user during their control of a robotic arm. Two conditions of algorithmic differences were studied using different scenarios of controlling a robotic arm: an automated motion system and human participant piloting. Contrary to expectations, off-policy learning did not provide the expected solution to the forgetting problem. We instead identified beneficial properties of a look-ahead state representation that made existing approaches able to learn (and not forget) predictions in support of Pavlovian signalling. This work therefore contributes new insight into the challenges of providing learned predictive feedback from a prosthetic device, and demonstrates avenues for more dynamic signalling in future human-machine interactions. Adam Parker · Michael Dawson · Patrick M Pilarski 🔗 - MultiViz: Towards Visualizing and Understanding Multimodal Models (Poster) The promise of multimodal models for real-world applications has inspired research in visualizing and understanding their internal mechanics with the end goal of empowering stakeholders to visualize model behavior, perform model debugging, and promote trust in machine learning models. 
However, modern multimodal models are typically black-box neural networks, which makes it challenging to understand their internal mechanics. How can we visualize the internal modeling of multimodal interactions in these models? Our paper aims to fill this gap by proposing MultiViz, a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages: (1) unimodal importance: how each modality contributes towards downstream modeling and prediction, (2) cross-modal interactions: how different modalities relate to each other, (3) multimodal representations: how unimodal and cross-modal interactions are represented in decision-level features, and (4) multimodal prediction: how decision-level features are composed to make a prediction. MultiViz is designed to operate on diverse modalities, models, tasks, and research areas. Through experiments on 8 trained models across 6 real-world tasks, we show that the complementary stages in MultiViz together enable users to (1) simulate model predictions, (2) assign interpretable concepts to features, (3) perform error analysis on model misclassifications, and (4) use insights from error analysis to debug models. MultiViz is publicly available, will be regularly updated with new interpretation tools and metrics, and welcomes inputs from the community. Paul Pu Liang · · Gunjan Chhablani · Nihal Jain · Zihao Deng · Xingbo Wang · Louis-Philippe Morency · Ruslan Salakhutdinov 🔗 - Environment Design for Inverse Reinforcement Learning (Poster) The task of learning a reward function from expert demonstrations suffers from high sample complexity as well as inherent limitations to what can be learned from demonstrations in a given environment. As the samples used for reward learning require human input, which is generally expensive, much effort has been dedicated towards designing more sample-efficient algorithms. 
Moreover, even with abundant data, current methods can still fail to learn insightful reward functions that are robust to minor changes in the environment dynamics. We approach these challenges differently than prior work by improving the sample-efficiency as well as the robustness of learned rewards through adaptively designing a sequence of demonstration environments for the expert to act in. We formalise a framework for this environment design process, in which learner and expert repeatedly interact, and construct algorithms that actively seek information about the rewards by carefully curating environments for the human to demonstrate the task in. Thomas Kleine Buening · Christos Dimitrakakis 🔗 - Improving Named Entity Recognition in Telephone Conversations via Effective Active Learning with Human in the Loop (Poster) Telephone transcription data can be very noisy due to speech recognition errors, disfluencies, etc. Not only is annotating such data very challenging for the annotators, but such data may also contain many annotation errors even after the annotation job is completed, resulting in very poor model performance. In this paper, we present an active learning framework that leverages human in the loop learning to identify data samples from the annotated dataset for re-annotation that are more likely to contain annotation errors. In this way, we largely reduce the need for data re-annotation across the whole dataset. We conduct extensive experiments with our proposed approach for Named Entity Recognition and observe that by re-annotating only about 6% of the training instances out of the whole dataset, the F1 score for a certain entity type can be significantly improved by about 25%. 
Md Tahmid Rahman Laskar · Cheng Chen · Xue-Yong Fu · Shashi Bhushan 🔗 - Interactive Medical Image Segmentation with Self-Adaptive Confidence Calibration (Poster) Interactive medical segmentation based on human-in-the-loop is a novel paradigm that draws on human expert knowledge to assist medical image segmentation. However, existing methods often fall into what we call the interactive misunderstanding, the essence of which is the dilemma of trading off short- and long-term interaction information. To better utilize the interactive information at various timescales, we propose an interactive segmentation framework, called interactive MEdical segmentation with self-adaptive Confidence CAlibration (MECCA), which combines action-based confidence learning and multi-agent reinforcement learning. A novel confidence network is learned by predicting the alignment level of the action with the short-term interactive information. A confidence-based reward shaping mechanism is then proposed to explicitly incorporate the confidence into the policy gradient calculation, thus directly correcting the model's interactive misunderstanding. Furthermore, MECCA also enables user-friendly interactions by reducing the interaction intensity and difficulty via label generation and interaction guidance, respectively. Numerical experiments on different segmentation tasks show that MECCA can significantly improve short- and long-term interactive information utilization efficiency with remarkably fewer labeled samples. The demo video is available at https://bit.ly/mecca-demo-video. 
Wenhao Li · Chuyun Shen · Qisen Xu · Bin Hu · · Haibin Cai · Fengping Zhu · Yuxin Li · Xiangfeng Wang 🔗 - Generating Personalized Counterfactual Interventions for Algorithmic Recourse by Eliciting User Preferences (Poster) Counterfactual interventions are a powerful tool to explain the decisions of a black-box decision process, and to enable algorithmic recourse. They are a sequence of actions that, if performed by a user, can overturn an unfavourable decision made by an automated decision system. However, most of the current methods provide interventions without considering the user's preferences. For example, a user might prefer certain actions over others. In this work, we present the first human-in-the-loop approach to perform algorithmic recourse by eliciting user preferences. We introduce a polynomial procedure to ask choice-set questions which maximize the Expected Utility of Selection (EUS), and use it to iteratively refine our cost estimates in a Bayesian setting. We integrate this preference elicitation strategy into a reinforcement learning agent coupled with Monte Carlo Tree Search for efficient exploration, so as to provide personalized interventions achieving algorithmic recourse. An experimental evaluation on synthetic and real-world datasets shows that a handful of queries achieves a substantial reduction in the cost of interventions with respect to user-independent alternatives. Giovanni De Toni · Paolo Viappiani · Bruno Lepri · Andrea Passerini 🔗 - Description2Font: Font Generation via Style Description (Poster) Typeface design plays a vital role in graphic and communication design. Different fonts suit different scenarios and can express different emotions and messages. Font design still requires the participation of professional designers who can create individual font styles for particular requirements. There has also been some use of generative adversarial networks (GANs) for font generation. 
However, the annotation requirements of the font generation dataset are high and hard to meet, and the machine-generated font cannot meet the designer’s requirements. Therefore, the dataset annotations restrict the generated font variance. Based on the observation of current font generation models, we propose an easy solution for the font generation task. Instead of using attributes annotated by the dataset to represent the font style vector, we introduce a transformer-based pre-trained language model into the font generation task, to learn the mapping between the font style description and the font style vector. We evaluated the proposed font generation model on both existing font style descriptions and newly created font style descriptions. The generated fonts show that the proposed model can generate quality and patent-free fonts based on the input style description provided by the designer. Pan Wang · Xun Zhang · Peter Childs · Kunpyo Lee · Stephen Jia WANG 🔗 - Nano: Nested Human-in-the-Loop Reward Learning for Controlling Distribution of Generated Text (Poster) Pretrained language models have demonstrated extraordinary capabilities in language generation. However, real-world tasks often require controlling the distribution of generated text in order to mitigate bias, promote fairness, and achieve personalization. Existing techniques for controlling the distribution of generated text only work with quantified distributions, which require pre-defined categories, proportions of the distribution, or an existing corpus following the desired distributions. However, many important distributions, such as personal preferences, are unquantified. In this work, we tackle the problem of generating text following arbitrary distributions (quantified and unquantified) by proposing Nano, a few-shot human-in-the-loop training algorithm that continuously learns from human feedback. 
Nano achieves state-of-the-art results on single topic/attribute as well as quantified distribution control compared to previous works. We also show that Nano is able to learn unquantified distributions, achieve personalization, and capture differences between different individuals' personal preferences with high sample efficiency. Xiang Fan · · Paul Pu Liang · Ruslan Salakhutdinov · Louis-Philippe Morency 🔗 - Towards Informed Design and Validation Assistance in Computer Games Using Imitation Learning (Poster) In games, as in many other domains, design validation and testing is a significant challenge as systems are growing in size and manual testing is becoming infeasible. This paper proposes a new approach to automated game validation. Our method leverages a data-driven imitation learning technique, which requires little effort and time and no knowledge of machine learning or programming, that designers can use to efficiently train game testing agents. We investigate the validity of our approach through a user study with industry experts. The survey results show that ours is indeed a valid approach to game validation and that data-driven programming would be a useful aid to reducing effort and increasing quality of modern playtesting. The survey also highlights several open challenges. With the help of the most recent literature, we analyze the identified challenges and propose future research directions suitable for maximizing the utility of our approach. Alessandro Sestini · Carl Joakim Bergdahl · Konrad Tollmar · Andrew Bagdanov · Linus Gisslén 🔗 - Identifying Spurious Correlations and Correcting them with an Explanation-based Learning (Poster) Identifying spurious correlations learned by a trained model is at the core of refining a trained model and building a trustworthy model. We present a simple method to identify spurious correlations that have been learned by a model trained for image classification problems. 
We apply image-level perturbations and monitor changes in certainties of predictions made using the trained model. We demonstrate this approach using an image classification dataset that contains images with synthetically generated spurious regions and show that the trained model was overdependent on spurious regions. Moreover, we remove the learned spurious correlations with an explanation-based learning approach. Misgina Tsighe Hagos · Kathleen Curran · Brian Mac Namee 🔗 - "I pick you choose": Joint human-algorithm decision making in multi-armed bandits (Poster) Online learning in multi-armed bandits has been a rich area of research for decades, resulting in numerous "no-regret" algorithms that efficiently learn the arm with highest expected reward. However, in many settings the final decision of which arm to pull isn't under the control of the algorithm itself. For example, a driving app typically suggests a subset of routes (arms) to the driver, who ultimately makes the final choice about which to select. Typically, the human also wishes to learn the optimal arm based on historical reward information, but decides which arm to pull based on a potentially different objective function, such as being more or less myopic about exploiting near-term rewards. In this paper, we show when this joint human-algorithm system can achieve good performance. Specifically, we explore multiple possible frameworks for human objectives and give theoretical regret bounds. Finally, we include experimental results exploring how regret varies with the human decision-maker's objective, as well as the number of arms presented. Kate Donahue · Sreenivas Gollapudi · Kostas Kollias 🔗 - Active metric learning and classification using similarity queries (Poster) Active learning is commonly used to train label-efficient models by adaptively selecting the most informative queries. 
Most active learning strategies are designed to either learn a representation of the data (e.g., embedding or metric learning) or perform well on a task (e.g., classification) on the data. However, many machine learning tasks involve a combination of both representation learning and a task-specific goal. Motivated by this, we propose a novel unified query framework that can be applied to any problem in which a key component is learning a representation of the data that reflects similarity. Our approach builds on nearest neighbor (NN) queries which seek to select samples that result in improved embeddings. The queries consist of a reference and a set of objects, with an oracle selecting the object most similar (i.e., nearest) to the reference. In order to reduce the number of solicited queries, they are chosen adaptively according to an information theoretic criterion. We demonstrate the effectiveness of the proposed strategy on two tasks -- active metric learning and active classification -- using a variety of synthetic and real world datasets. In particular, we demonstrate that actively selected NN queries outperform recently developed active triplet selection methods in a deep metric learning setting. Further, we show that in classification, actively selecting class labels can be reformulated as a process of selecting the most informative NN query, allowing direct application of our method. Namrata Nadagouda · Austin Xu · Mark Davenport 🔗 - Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models (Poster) Preference-based reinforcement learning (PbRL) can enable robots to learn to perform tasks based on an individual's preferences without requiring a hand-crafted reward function. However, existing approaches either assume access to a high-fidelity simulator or analytic model or take a model-free approach that requires extensive, possibly unsafe online environment interactions. 
In this paper, we study the benefits and challenges of using a learned dynamics model when performing PbRL. In particular, we provide evidence that a learned dynamics model offers the following benefits when performing PbRL: (1) preference elicitation and policy optimization require significantly fewer environment interactions than model-free PbRL, (2) diverse preference queries can be synthesized safely and efficiently as a byproduct of standard model-based RL, and (3) reward pretraining based on suboptimal demonstrations can be performed without any environmental interaction. Our paper provides empirical evidence that learned dynamics models enable robots to learn customized policies based on user preferences in ways that are safer and more sample efficient than prior preference learning approaches. Yi Liu · Gaurav Datta · Ellen Novoseller · Daniel Brown 🔗 - Utilizing supervised models to infer consensus labels and their quality from data with multiple annotators (Poster) Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to estimate: (1) A consensus label for each example that aggregates the individual annotations (more accurately than aggregation via majority-vote or other algorithms used in crowdsourcing); (2) A confidence score for how likely each consensus label is correct (via well-calibrated estimates that account for the: number of annotations for each example and their agreement, prediction-confidence from a trained classifier, and trustworthiness of each annotator vs. the classifier); (3) A rating for each annotator quantifying the overall correctness of their labels. While many algorithms have been proposed to estimate related quantities in crowdsourcing, these often rely on sophisticated generative models with iterative inference schemes, whereas CROWDLAB is based on simple weighted ensembling. 
Many algorithms also rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB, in contrast, utilizes any classifier model trained on these features, which can generalize between examples with similar features. In evaluations on real-world multi-annotator image data, our proposed method provides better estimates for (1)-(3) than many alternative algorithms. Hui Wen Goh · Ulyana Tkachenko · Jonas Mueller 🔗 - Active-Learning-as-a-Service: An Automatic and Efficient MLOps System for Data-Centric AI (Poster) The success of today's AI applications requires not only model training (model-centric) but also data engineering (data-centric). In data-centric AI, active learning (AL) plays a vital role, but current AL tools 1) require users to manually select AL strategies, and 2) cannot perform AL tasks efficiently. To this end, this paper presents an automatic and efficient MLOps system for AL, named ALaaS (Active-Learning-as-a-Service). Specifically, 1) ALaaS implements an AL agent, including a performance predictor and a workflow controller, to decide the most suitable AL strategies given users' datasets and budgets. We call this a predictive-based successive halving early-stop (PSHEA) procedure. 2) ALaaS adopts a server-client architecture to support an AL pipeline and implements stage-level parallelism for high efficiency. Meanwhile, caching and batching techniques are employed to further accelerate the AL process. In addition to efficiency, ALaaS ensures accessibility through the design philosophy of configuration-as-a-service. Extensive experiments show that ALaaS outperforms all other baselines in terms of latency and throughput. Also, guided by the AL agent, ALaaS can automatically select and run AL strategies for non-expert users under different datasets and budgets. Our code is available at https://github.com/MLSysOps/Active-Learning-as-a-Service.
Yizheng Huang · Huaizheng Zhang · Yuanming Li · Chiew Tong Lau · Yang You 🔗 - Feasible and Desirable Counterfactual Generation by Preserving Human Defined Constraints (Poster) We present a human-in-the-loop approach to generate counterfactual (CF) explanations that preserve global and local feasibility constraints. Global feasibility constraints refer to the causal constraints necessary for generating actionable CF explanations. Assuming a domain expert with knowledge of unary and binary causal constraints, our approach efficiently employs this knowledge to generate CF explanations by rejecting gradient steps that violate these constraints. Local feasibility constraints are user-level constraints necessary for generating desirable CF explanations. We extract these constraints from the end-user of the model and exploit them during CF generation via a user-defined distance metric. Through user studies, we demonstrate that incorporating causal constraints during CF generation results in significantly better explanations in terms of feasibility and desirability for participants. Adopting local and global feasibility constraints simultaneously improves user satisfaction, but does not significantly improve desirability for participants compared to incorporating only global constraints. Homayun Afrabandpey · Michael Spranger 🔗 - Leveraging Human Features at Test-Time (Poster) Machine learning (ML) models can make decisions based on large amounts of data, but they may be missing important context. For example, a model trained to predict psychiatric outcomes may know nothing about a patient's social support system, and social support may look different for different patients. In this work, we explore strategies for querying for a small, additional set of these human features that are relevant for each specific instance at test time, so as to incorporate this information while minimizing the burden to the user to label feature values.
We define the problem of querying users for an instance-specific set of human feature values, and propose algorithms to solve it. We show in experiments on real datasets that our approach outperforms a feature selection baseline that chooses the same set of human features for all instances. Isaac Lage · Sonali Parbhoo · Finale Doshi-Velez 🔗 - OpenAL: Evaluation and Interpretation of Active Learning Strategies (Poster) Despite the vast body of literature on Active Learning (AL), there is no comprehensive and open benchmark allowing for efficient and simple comparison of proposed samplers. Additionally, the variability in experimental settings across the literature makes it difficult to choose a sampling strategy, which is critical due to the one-off nature of AL experiments. To address those limitations, we introduce a flexible and open-source framework to easily run and compare AL sampling strategies on a collection of realistic tasks. The proposed benchmark is augmented with interpretability metrics and statistical analysis methods to understand when and why some samplers outperform others. Last but not least, practitioners can easily extend the benchmark by submitting their own AL samplers. William JONAS · Alexandre Abraham · Léo Dreyfus-Schmidt 🔗 - IAdet: Simplest human-in-the-loop object detection (Poster) This work proposes a strategy for training models while annotating data named Intelligent Annotation (IA). IA involves three modules: (1) assisted data annotation, (2) background model training, and (3) active selection of the next datapoints. Under this framework, we open-source the IAdet tool, which is specific to single-class object detection. Additionally, we devise a method for automatically evaluating such a human-in-the-loop system. For the PASCAL VOC dataset, the IAdet tool reduces the database annotation time by 25% while providing a trained model for free. These results are obtained for a deliberately very simple IAdet design.
As a consequence, IAdet is susceptible to multiple easy improvements, paving the way for powerful human-in-the-loop object detection systems. Franco Marchesoni-Acland 🔗 - Towards customizable reinforcement learning agents: Enabling preference specification through online vocabulary expansion (Poster) There is a growing interest in developing automated agents that can work alongside humans. In addition to completing the assigned task, such an agent will undoubtedly be expected to behave in a manner that is preferred by the human. This requires the human to communicate their preferences to the agent. To achieve this, the current approaches either require the users to specify the reward function or the preference is interactively learned from queries that ask the user to compare trajectories. The former approach can be challenging if the internal representation used by the agent is inscrutable to the human while the latter is unnecessarily cumbersome for the user if their preference can be specified more easily in symbolic terms. In this work, we propose PRESCA (PREference Specification through Concept Acquisition), a system that allows users to specify their preferences in terms of concepts that they understand. PRESCA maintains a set of such concepts in a shared vocabulary. If the relevant concept is not in the shared vocabulary, then it is learned. To make learning a new concept more efficient, PRESCA leverages causal associations between the target concept and concepts that are already known. Additionally, the effort of learning the new concept is amortized by adding the concept to the shared vocabulary for supporting preference specification in future interactions. We evaluate PRESCA by using it on a Minecraft environment and show that it can be effectively used to make the agent align with the user's preference. 
Utkarsh Soni · Sarath Sreedharan · Mudit Verma · Lin Guan · Matthew Marquez · Subbarao Kambhampati 🔗 - Knowledge-driven Active Learning (Poster) The deployment of Deep Learning (DL) models is still precluded in those contexts where the amount of supervised data is limited. To address this issue, active learning strategies aim at minimizing the amount of labelled data required to train a DL model. Most active learning strategies are based on uncertain sample selection, and are often even restricted to samples lying close to the decision boundary. These techniques are theoretically sound, but an understanding of the selected samples based on their content is not straightforward, further driving non-experts to consider DL as a black-box. For the first time, here we propose a different approach, taking into consideration common domain-knowledge and enabling non-expert users to train a model with fewer samples. In our Knowledge-driven Active Learning (KAL) framework, rule-based knowledge is converted into logic constraints and their violation is checked as a natural guide for sample selection. We show that even simple relationships among data and output classes offer a way to spot predictions for which the model needs supervision. The proposed approach (i) outperforms many active learning strategies in terms of average F1 score, particularly in those contexts where domain knowledge is rich. Furthermore, we empirically demonstrate that (ii) KAL discovers data distributions lying far from the initial training data, unlike uncertainty-based strategies, (iii) it assures domain experts that the provided knowledge is respected by the model on test data, and (iv) it can be employed even when domain-knowledge is not available by coupling it with an XAI technique. Finally, we show that KAL is also suitable for object recognition tasks and that its computational demand is low, unlike many recent active learning strategies.
Gabriele Ciravegna · Frederic Precioso · Marco Gori 🔗 - Minimizing Annotation Effort via Spectral Sampling (Poster) If one has a budget of N annotations, which samples should be annotated? We propose a sampling strategy based on minimizing redundancy. Our method represents unlabeled data in the form of a Hankel matrix, and uses the notion of spectral max-volume to find a compact informative sub-block from which annotation samples are drawn. Ariadna Quattoni 🔗 - Enabling Learning as a Joint Task via Paraphrasing (Poster) Human in the loop learning (HiLL) approaches involve a repeated exchange of information between human teachers and learning agents. Rather than adhering to the traditional paradigm of learning from human feedback, wherein an active learning system may repeatedly query a human teacher for information, we consider a shift to a more collaborative learning approach where the algorithmic learner can also share information with a human teacher. We propose that the same interaction types (Showing, Categorizing, Sorting, and Evaluating) that are effective in learning from human feedback should be effective in conveying information from the algorithmic learner, and that doing so will improve learning outcomes. We present examples of how these interactions can be used to share information from the algorithmic learner in the form of paraphrases, outline a user study and experimental design for studying the impact of these paraphrases, and present metrics for evaluating their effects on learning and teaching outcomes. Pallavi Koppol · Russell Wong · Henny Admoni · Reid Simmons 🔗 - Symbol Guided Hindsight Priors for Reward Learning from Human Preferences (Poster) Specifying rewards for reinforcement learning (RL) agents is challenging. Preference-based RL (PbRL) mitigates these challenges by inferring a reward from feedback over sets of trajectories.
However, the effectiveness of PbRL is limited by the amount of feedback needed to reliably recover the structure of the target reward. We present the PRIor Over Rewards (PRIOR) framework, which incorporates priors about the structure of the reward function and the preference feedback into the reward learning process. Our initial experiments demonstrate that imposing these priors as soft constraints on the reward learning objective reduces the amount of feedback required by half and improves overall reward recovery. Additionally, we demonstrate that using an abstract state space for the computation of the priors further improves the reward learning and the agent's performance. Mudit Verma · Katherine Metcalf 🔗 - Assistance with large language models (Poster) A core part of AI alignment is training AI systems to be helpful, or more generally, to interact with humans appropriately. We look at this problem in the context of large language models. Past works have focused on training these models to perform specific tasks, or follow instructions. In contrast, we believe helpfulness requires back-and-forth interaction between the AI and the human it is trying to assist. Here, we consider a multi-step interaction in which a human asks a question, and the AI has an opportunity to ask a clarifying question to resolve ambiguities before responding. The assistance framework formalizes the idea of an AI which aims to maximize the human's reward but is ignorant of the human reward function. Past works solved toy assistance environments using exact POMDP solvers as well as deep reinforcement learning. We apply a behavioral cloning approach, and fine-tune GPT-3 such that it can respond to clear input questions directly, clarify the intent behind vague input questions, and respond based on the clarification it receives. We show that this approach leads to quantitative improvements in answer accuracy compared to a baseline that cannot ask for clarifications. 
While the assistance framework assumes the correct behavior of an AI is to infer and maximize a human's reward, our approach can be used to learn any interaction protocol between the AI and the human. We believe that exploring interaction protocols that are easy to learn robustly, and that can be used to "bootstrap" further alignment, is a promising direction for future research. Dmitrii Krasheninnikov · Egor Krasheninnikov · David Krueger 🔗 - Mapping of Financial Services datasets using Human-in-the-Loop (Poster) Increasing access to financial services data helps accelerate the monitoring and management of datasets and facilitates better business decision-making. However, financial services datasets are typically vast, often spanning terabytes, and contain both structured and unstructured data. It is a laborious task to comb through all the data and map them reasonably. Mapping the data is important for performing comprehensive analysis and making informed business decisions. Based on client engagements, we have observed that there is a lack of industry standards for definitions of key terms and a lack of governance for maintaining business processes. This typically leads to disconnected siloed datasets generated from disintegrated systems. To address these challenges, we developed a novel methodology, DaME (Data Mapping Engine), that performs data mapping by training a data mapping engine and utilizing human-in-the-loop techniques. The results from the industrial application and evaluation of DaME on a financial services dataset are encouraging, suggesting that it can help automate data mapping and improve the system's human-in-the-loop learning. In this application, accuracy on our dataset is 69%, compared with 34% for the existing state of the art. It has also helped improve the productivity of industry practitioners by saving them 14,000 hours of time spent manually mapping vast data stores over a period of ten months.
SHUBHI ASTHANA · Ruchi Mahindru 🔗 - Batch Active Learning from the Perspective of Sparse Approximation (Poster) Active learning enables efficient model training by leveraging interactions between machine learning agents and human annotators. We study and propose a novel framework that formulates batch active learning from the perspective of sparse approximation. Our active learning method aims to find an informative subset from the unlabeled data pool such that the corresponding training loss function approximates its full data pool counterpart. We realize the framework as sparsity-constrained discontinuous optimization problems, which explicitly balance uncertainty and representation for large-scale applications and can be solved by greedy or proximal iterative hard thresholding algorithms. The proposed method can adapt to various settings, including both Bayesian and non-Bayesian neural networks. Numerical experiments show that our work achieves competitive performance across different settings with lower computational complexity. Maohao Shen · Yibo Jacky Zhang · Bowen Jiang · Sanmi Koyejo 🔗 - Contextual Visual Feature Learning for Zero-Shot Recognition of Human-Object Interactions (Poster) Real-world visual recognition of an object involves not only its own semantics but also those surrounding it. Supervised learning of contextual relationships is restrictive and impractical with the combinatorial explosion of possible relationships among a group of objects. Our key insight is to formulate visual context not as a relationship classification problem, but as a representation learning problem, where objects located close in the feature space have similar visual contexts. Such a model is infinitely scalable with respect to the number of objects or their relationships. We develop a contextual visual feature learning model without any supervision on relationships.
We characterize visual context in terms of spatial configuration of semantics between objects and their surrounds, and derive pixel-to-segment learning losses that capture visual similarity, semantic co-occurrences, and structural correlation. Visual context emerges in a completely data-driven fashion, with objects in similar contexts mapped to close points in the feature space. Most strikingly, when benchmarked on HICO for recognizing human-object interactions, our unsupervised model trained only on MSCOCO significantly outperforms the supervised baseline and approaches the supervised state-of-the-art, both trained specifically on HICO with annotated relationships! Tsung-Wei Ke · Dong-Jin Kim · Stella Yu · Liang Gou · Liu Ren 🔗 - TASSAL: Task-Aware Semi-Supervised Active Learning (Poster) Active learning (AL) is useful for incremental model training while concurrently selecting the most informative yet minimum amount of training samples for the human annotator to label. In this paper, we introduce a pool-based efficient task-aware semi-supervised active learning (TASSAL) strategy with a modified selection probability distribution on initial sampling. In contrast to recent approaches that train the core task model separately from the encoding models, our method allows end-to-end and simultaneous learning of both the encoding and core task models. We demonstrate that TASSAL shows promising results and, in our experiments, could outperform recent approaches on the CIFAR-10 dataset. In addition, evaluation of the pseudo-labeled samples against ground-truth labels shows that such an approach could potentially yield additional data to train on, which should be beneficial for downstream tasks.
Erick Chandra · Ramesh Manuvinakurike · Saurav Sahay · Sahisnu Mazumder · Ranganath Krishnan · Jane Yung-jen Hsu 🔗 - Offline Robot Reinforcement Learning with Uncertainty-Guided Human Expert Sampling (Poster) Recent advances in batch (offline) reinforcement learning have shown promising results in learning from available offline data and proved offline reinforcement learning to be an essential toolkit in learning control policies in a model-free setting. An offline reinforcement learning algorithm applied to a dataset collected by a suboptimal non-learning-based algorithm can result in a policy that outperforms the behavior agent used to collect the data. Such a scenario is frequent in robotics, where existing automation is collecting operational data. Although offline learning techniques can learn from data generated by a sub-optimal behavior agent, there is still an opportunity to improve the sample complexity of existing offline reinforcement learning algorithms by strategically introducing human demonstration data into the training process. To this end, we propose a novel approach that uses uncertainty estimation to trigger the injection of human demonstration data and guide policy training towards optimal behavior while reducing overall sample complexity. Our experiments show that this approach is more sample efficient when compared to a naive way of combining expert data with data collected from a sub-optimal agent. We augmented an existing offline reinforcement learning algorithm Conservative Q-Learning with our approach and performed experiments on data collected from MuJoCo and OffWorld Gym learning environments. Ashish Kumar · Ilya Kuzovkin 🔗 - Advice Conformance Verification by Reinforcement Learning agents for Human-in-the-Loop (Poster) Human-in-the-loop (HiL) reinforcement learning is gaining traction in domains with large action and state spaces, and sparse rewards by allowing the agent to take advice from HiL. 
Beyond advice accommodation, a sequential decision-making agent must be able to express the extent to which it was able to utilize the human advice. Subsequently, the agent should provide a means for the HiL to inspect parts of advice that it had to reject in favor of the overall environment objective. We introduce the problem of Advice-Conformance Verification, which requires reinforcement learning (RL) agents to provide assurances to the human in the loop regarding how much of their advice is being conformed to. We then propose a tree-based lingua franca to support this communication, called a Preference Tree. We study two cases of good and bad advice scenarios in MuJoCo's Humanoid environment. Through our experiments, we show that our method can provide an interpretable means of solving the Advice-Conformance Verification problem by conveying whether or not the agent is using the human's advice. Finally, we present a human-user study with 20 participants that validates our method. Mudit Verma · Ayush Kharkwal · Subbarao Kambhampati 🔗 - A Simple Framework for Active Learning to Rank (Poster) Learning to rank (LTR) plays a critical role in search engines: an extremely large number of queries, together with relevant webpages, must be labeled in a timely manner to train and update the online LTR models. To reduce the cost and time of labeling queries and webpages, in this work we study the problem of Active Learning to Rank (active LTR), which selects unlabeled queries for annotation and training. Specifically, we first investigate the criterion Ranking Entropy (RE), which characterizes the entropy of relevant webpages under a query produced by a sequence of online LTR models updated by different checkpoints, using a Query-By-Committee (QBC) method. Then, we explore a new criterion, Prediction Variances (PV), that measures the variance of prediction results for all relevant webpages under a query.
Our empirical studies find that RE may favor low-frequency queries from the pool for labeling, while PV prioritizes high-frequency queries more. Finally, we combine these two complementary criteria as the sample selection strategy for active learning. Extensive experiments with comparisons to baseline algorithms show that the proposed approach could train LTR models achieving higher Discounted Cumulative Gain (i.e., a relative improvement of $\Delta\mathrm{DCG}_4 = 1.38\%$) with the same budgeted labeling efforts, while the proposed strategies could discover 43% more valid training pairs for effective training. Qingzhong Wang · Haifang Li · Haoyi Xiong · Wen Wang · Jiang Bian · Yu Lu · Shuaiqiang Wang · zhicong cheng · Dawei Yin · Dejing Dou 🔗 - Learning from Data through Human-Machine Collaboration (Poster) Machine learning (ML) is considered an effective and efficient tool for extracting useful information from vast amounts of data. Indeed, it is increasingly applied for solving real-life problems in industry and academic research. However, the main problem is that applying ML requires an interdisciplinary education that, for example, allows domain experts to tune the parameters and interpret the analysis. As a result, there is an increasing demand for solutions that enable domain experts to apply machine learning approaches to their datasets without consulting ML experts. In this scenario, we propose a new paradigm that allows machine and human intelligence to cooperate, combining ML and domain expertise for analyzing user data and producing answers. As a proof of concept, we have started developing MLAssistant, a library that understands the research question with the help of user interaction, produces a data science pipeline, and automatically executes the pipeline in order to generate analysis.
The strength of MLAssistant lies in the design of a rich domain-specific language for modeling data analysis pipelines, the use of a suitable neural network for machine translation of research questions, the availability of a vast dictionary of pipelines for matching the translation output, and the use of natural language technology. Sara Pido · Pietro Crovari · Pietro Pinoli 🔗 - Consistent Training via Energy-Based GFlowNets for Modeling Discrete Joint Distributions (Poster) Generative Flow Networks (GFlowNets) have demonstrated significant performance improvements for generating diverse discrete objects x given a reward function R(x) that indicates the utility of the object and is trained independently of the GFlowNet by supervised learning to predict a desirable property y given x. We hypothesize that this can lead to incompatibility between the inductive optimization biases in training R and in training the GFlowNet, potentially leading to worse samples and slow adaptation to changes in the distribution. In this work, we build upon recent work on jointly learning energy-based models with GFlowNets and extend it to learn the joint over multiple variables, such as peptide sequences and their antimicrobial activity; we call the resulting models Joint Energy-Based GFlowNets (JEBGFNs). Joint learning of the energy-based model, used as a reward for the GFlowNet, can resolve the issues of incompatibility since both the reward function R and the GFlowNet sampler are trained jointly. We find that this joint training or joint energy-based formulation leads to significant improvements in generating anti-microbial peptides. As the training sequences arose out of evolutionary or artificial selection for high antibiotic activity, there is structure in the distribution of sequences that reveals information about the antibiotic activity, giving an advantage to modeling their joint generatively vs. pure discriminative modeling.
We also evaluate JEBGFN in an active learning setting for discovering anti-microbial peptides. Chanakya Ekbote · Moksh Jain · Payel Das · Yoshua Bengio 🔗 - Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences (Poster) Generating complex behaviors from goals specified by non-expert users is a crucial aspect of intelligent agents. Interactive reward learning from trajectory comparisons is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide rich-form feedback other than binary preference labels, leading to extremely high feedback complexity and poor user experience. While providing a detailed symbolic specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill the underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which acts as a middle ground, between exact goal specification and reward learning purely from preference labels, by enabling the users to tweak the agent's behavior through nameable concepts (e.g., decreasing the steering sharpness of an autonomous driving agent, or increasing the softness of the movement of a two-legged "sneaky" agent). We propose two different parametric methods that can potentially encode any kind of behavioral attributes from ordered behavior clips. 
We demonstrate the effectiveness of our methods on 4 tasks with 9 different behavioral attributes and show that once the attributes are learned, end users can effortlessly produce desirable agent behaviors by providing feedback only around 10 times. The feedback complexity of our approach is over 10 times less than the learning-from-human-preferences baseline, and this demonstrates that our approach is readily applicable in real-world applications. Lin Guan · Karthik Valmeekam · Subbarao Kambhampati 🔗 - Neural-Symbolic Recursive Machine for Systematic Generalization (Poster) Despite their tremendous success, existing machine learning models still fall short of human-like systematic generalization: learning compositional rules from limited data and applying them to unseen combinations in various domains. We propose Neural-Symbolic Recursive Machine (NSR) to tackle this deficiency. The core representation of NSR is a Grounded Symbol System (GSS) with combinatorial syntax and semantics, which entirely emerges from training data. Akin to the neuroscience studies suggesting separate brain systems for perceptual, syntactic, and semantic processing, NSR implements analogous separate modules of neural perception, syntactic parsing, and semantic reasoning, which are jointly learned by a deduction-abduction algorithm. We prove that NSR is expressive enough to model various sequence-to-sequence tasks. Superior systematic generalization is achieved via the inductive biases of equivariance and recursiveness embedded in NSR. In experiments, NSR achieves state-of-the-art performance in three benchmarks from different domains: SCAN for semantic parsing, PCFG for string manipulation, and HINT for arithmetic reasoning. Specifically, NSR achieves 100% generalization accuracy on SCAN and PCFG and outperforms state-of-the-art models on HINT by about 23%. Our NSR demonstrates stronger generalization than pure neural networks due to its symbolic representation and inductive biases.
NSR also demonstrates better transferability than existing neural-symbolic approaches because it requires less domain-specific knowledge. Qing Li · Yixin Zhu · Yitao Liang · Ying Nian Wu · Song-Chun Zhu · Siyuan Huang 🔗 - A Comparative Survey of Deep Active Learning (Poster) While deep learning (DL) is data-hungry and usually relies on extensive labeled data to deliver good performance, Active Learning (AL) reduces labeling costs by selecting a small proportion of samples from unlabeled data for labeling and training. Therefore, Deep Active Learning (DAL) has risen as a feasible solution for maximizing model performance under a limited labeling cost/budget in recent years. Abundant DAL methods and various literature reviews have been developed and conducted. In this work, we survey and categorize DAL-related works and construct comparative experiments across 10 frequently used image classification datasets and 19 DAL algorithms based on the DeepAL+ toolbox. Our work is the largest comparative study to date. Additionally, we explore some factors (e.g., batch size, number of epochs in the training process) that influence the efficacy of DAL, which provides better references for researchers to design their DAL experiments or carry out DAL-related applications. Xueying Zhan · Qingzhong Wang · Kuan-Hao Huang · Haoyi Xiong · Dejing Dou · Antoni Chan 🔗 - Improving the Strength of Human-Like Models in Chess (Poster) Designing AI systems that capture human-like behavior has attracted growing attention in applications where humans may want to learn from, or need to collaborate with, these AI systems. Many existing works in designing human-like AI have taken a supervised learning approach that learns from data of human behavior, with the goal of creating models that can accurately predict human behavior. While this approach has shown success in capturing human behavior, it also suffers from the drawback of mimicking human mistakes.
Moreover, existing models only capture a snapshot of human behavior, leaving the question of how to improve them largely unanswered. Using chess as an experimental domain, we investigate the question of teaching an existing human-like model to be stronger using a data-efficient curriculum, while maintaining the model's human similarity. To achieve this goal, we extend the concept of curriculum learning to settings with multiple labeling strategies, allowing us to vary both the curriculum (dataset) and the teacher (labeling strategy). We find that the choice of teacher has a strong impact on both playing strength and human similarity; for example, a teacher that is too strong can be less effective at improving playing strength and degrade human similarity more rapidly. We also find that the choice of curriculum can impact these metrics, but to a smaller extent. Finally, we show that our strengthened models achieve human similarity at higher-level datasets, suggesting human-like improvement. Saumik Narayanan · Kassa Korley · Chien-Ju Ho · Siddhartha Sen 🔗 - Fast Adaptation via Human Diagnosis of Task Distribution Shift (Poster) When agents fail in the world, it is important to understand why. Failures are due to underlying distribution shifts in the goals desired by the end user or to the environment layouts that impact the policy's actions. In the case of multi-task policies conditioned on goals, this problem manifests in difficulty in disambiguating between goal and policy failures: is the agent failing because it can't correctly infer what the desired goal is or because it doesn't know how to take actions toward achieving the goal? We hypothesize that successfully disentangling these two failure modes holds important implications for selecting a finetuning strategy. In this paper, we explore the feasibility of leveraging human feedback to diagnose what vs. how failures for efficient adaptation. 
We develop an end-to-end policy training framework that uses attention to produce a human-interpretable representation, a visual masked state, to communicate the agent's intermediate task representation. In experiments with human users in both discrete and continuous control domains, we show that our visual attention mask policy can aid participants in successfully inferring the agent's failure mode significantly better than actions alone. Leveraging this feedback, we show subsequent performance gains during finetuning and discuss implications of using humans to diagnose parameter-level failures. Andi Peng · Mark Ho · Aviv Netanyahu · Julie A Shah · Pulkit Agrawal 🔗 - Participatory Systems for Personalized Prediction (Poster) Machine learning models often request personal information from users to assign more accurate predictions across a heterogeneous population. Personalized models are not built to support informed consent: users cannot "opt-out" of providing personal data, nor understand the effects of doing so. In this work, we introduce a family of personalized prediction models called participatory systems that support informed consent. Participatory systems are interactive prediction models that let users opt into reporting additional personal data at prediction time, and inform them about how their data will improve their predictions. We present a model-agnostic approach for supervised learning tasks where personal data is encoded as "group" attributes (e.g., sex, age group, HIV status). Given a pool of user-specified models, our approach can create a variety of participatory systems that differ in their training requirements and opportunities for informed consent. We conduct a comprehensive empirical study of participatory systems in clinical prediction tasks and compare them to common approaches for personalization. 
Our results show that our approach can produce participatory systems that exhibit large improvements in privacy, fairness, and performance at the population and group levels. Hailey James · Berk Ustun · Chirag Nagpal · Katherine Heller 🔗 - Digital Human Interactive Recommendation Decision-Making Based on Reinforcement Learning (Poster) Digital human recommendation systems have been developed to help customers find their favorite products and are playing an active role in various recommendation contexts. Catching and learning the dynamics of customers' preferences in a timely manner, while meeting their exact requirements, is crucial in the digital human recommendation domain. We design a novel practical digital human interactive recommendation agent framework based on Reinforcement Learning (RL) to improve the efficiency of the interactive recommendation decision-making by leveraging both the digital human features and the superior flexibility of RL. Our proposed framework learns dynamically through real-time interactions between the digital human and customers using state-of-the-art RL algorithms, combined with multimodal embedding and graph embedding, to improve the accuracy of personalization and thus enable the digital human agent to catch the attention of the customer in a timely manner. Experiments on real business data demonstrate that our framework can provide better personalized customer engagement and better customer experiences. Junwu Xiong 🔗 - Can Calibration Improve Sample Prioritization? (Poster) Calibration can reduce overconfident predictions of deep neural networks, but can calibration also accelerate training? In this paper, we show that it can, when used to prioritize some examples for performing subset selection. 
We study the effect of popular calibration techniques in selecting better subsets of samples during training (also called sample prioritization) and observe that calibration can improve the quality of subsets, reduce the number of examples per epoch (by at least 70%), and can thereby speed up the overall training process. We further study the effect of using calibrated pre-trained models coupled with calibration during training to guide sample prioritization, which again seems to improve the quality of samples selected. Ganesh Tata · Gautham Krishna Gudur · Gopinath Chennupati · Mohammad Emtiyaz Khan 🔗 - Conformal Prediction for Resource Prioritisation in Predicting Rare and Dangerous Outcomes (Poster) In a growing number of high-stakes decision-making scenarios, experts are aided by recommendations from machine learning (ML) models. However, predicting rare but dangerous outcomes can prove challenging for both humans and machines. Here we simulate a setting where ML models help law enforcement prioritise human effort in monitoring individuals undergoing radicalisation. We discuss the utility of set-valued predictions in guaranteeing the maximal rate at which dangerous radicalized individuals are missed by an assisted decision-making system. We demonstrate the trade-off between risk and the required human effort. We show that set-valued predictions can help better allocate resources whilst controlling the number of high-risk individuals missed. This work explores using conformal prediction and more general risk control methods for assisting in predicting rare and critical outcomes, and developing methods for more expert-aligned prediction sets. 
Varun Babbar · Umang Bhatt · Miri Zilka · Adrian Weller 🔗 - A Proposal For An Interactive Parliamentary Debate Adjudication System (Poster) The field of argumentation has seen significant growth in recent years through the introduction of Project Debater, Speech by Crowd and several other works that aim to create systems and datasets that can effectively debate humans. While there have been several such works, no system exists that adjudicates a debate to find the winner and provide participants with feedback. Moreover, existing work relies on hundreds, sometimes thousands of annotators to provide arguments and scores, which is costly and time consuming. In this paper, we propose a preliminary idea for a system that can generate a verdict and feedback for a debate and learn interactively from other judges by asking questions about the verdict. Priya Pitre · Omkar Joshi 🔗 - ArgAnalysis35K - A large scale dataset for Argument Quality Detection (Poster) Argument Quality Detection is an emerging field in NLP which has seen significant recent development. However, existing datasets in this field suffer from a lack of quality, quantity and diversity of topics and arguments, specifically the presence of vague arguments that are not persuasive in nature. In this paper, we leverage a combined experience of 10+ years of Parliamentary Debating to create a dataset that covers significantly more topics and has a wide range of sources to capture more diversity of opinion. With 35k high-quality arguments, this is also the largest dataset of its kind to our knowledge. In addition to this contribution, we introduce an innovative argument scoring system based on instance-level annotator reliability and propose a quantitative model of scoring the relevance of arguments to a range of topics. Omkar Joshi · Priya Pitre · Dr. Mrs. Yashodhara V. 
Haribhakta 🔗 - Temporal Logic Imitation: Learning Plan-Satisficing Motion Policies from Demonstrations (Poster) Learning from demonstration (LfD) has successfully solved tasks featuring a long time horizon. However, when the problem complexity also includes human-in-the-loop perturbations, state-of-the-art approaches do not guarantee the successful reproduction of a task. In this work, we identify the roots of this challenge as the failure of a learned continuous policy to satisfy the discrete plan implicit in the demonstration. By utilizing modes (rather than subgoals) as the discrete abstraction and motion policies with both mode invariance and goal reachability properties, we prove our learned continuous policy can simulate any discrete plan specified by a linear temporal logic (LTL) formula. Consequently, an imitator is robust to both task- and motion-level perturbations and guaranteed to achieve task success. Project page: https://sites.google.com/view/ltl-ds Felix Yanwei Wang · Nadia Figueroa · Shen Li · Ankit Shah · Julie A Shah 🔗 - Exploratory Training: When Trainers Learn (Poster) AI and Data systems often present examples and solicit labels from users to learn a target concept. This selection of examples could even be done in an active fashion, i.e., active learning. Current systems assume that users always provide correct labeling with potentially a small, fixed chance of mistake. In several settings, users may have to explore and learn about the underlying data to label examples correctly, particularly for complex target concepts and models. For example, to provide accurate labeling for a model of detecting noisy or abnormal values, users might need to investigate the underlying data to understand typical and clean values in the data. As users gradually learn about the target concept and data, they may revise their labeling strategies. 
Due to the significance and non-stationarity of errors in this setting, current systems may use incorrect labels and learn inaccurate models from the users. We report preliminary results for a user study over real-world datasets on modeling human learning while training the system and lay out the next steps in this investigation. Rajesh Shrestha · Omeed Habibelahian · Arash Termehchy · Papotti Paolo 🔗 - Online Continual Learning from Imbalanced Data with Kullback-Leibler-loss based replay buffer updates (Poster) We propose an online replay-based Continual Learning policy, in which the learner stores data points in a local buffer and replays them during training. The core of our contribution is a new replay buffer content update policy that combines a Kullback-Leibler (K-L) loss and an appropriate modification of the celebrated Reservoir Sampling algorithm. The decisions at each time are whether the newly arriving training data points will be inserted in the buffer, and which existing data points from the buffer will be substituted. We update the buffer content so that the proportion of stored data points from different classes in the buffer approximates a target distribution that depends on the empirical distribution of classes seen in the training data stream. We parameterize the target distribution with a single parameter that allows us to model different target class distributions in the buffer, ranging from the class distribution in the training data stream, to the uniform class distribution, to a distribution with class percentages that are inversely proportional to those in the training data stream. We evaluate our method on MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100 and we show that our method is superior to the state-of-the-art Reservoir Sampling algorithm. 
Our main finding is that the best (in terms of accuracy and forgetting) value of the parameter that determines the distribution of classes in the buffer versus that of the stream depends on statistics of the training data and on the dataset itself. Our work paves the way for further work to learn this parameter in the realistic scenario that it is unknown, thus contributing to the objective of an optimal replay-based continual learning approach that adapts to the specifics of each scenario. Sotirios Nikoloutsopoulos · Iordanis Koutsopoulos · Michalis Titsias 🔗 - Interactive Concept Bottleneck Models (Poster) Concept bottleneck models (CBMs) are interpretable neural networks that first predict labels for human-interpretable concepts relevant to the prediction task, and then predict the final label based on the concept label predictions. We extend CBMs to interactive prediction settings where the model can query a human collaborator for the labels of some concepts. We develop an interaction policy that, at prediction time, chooses which concepts to request a label for so as to maximally improve the final prediction. We demonstrate that a simple policy combining concept prediction uncertainty and influence of the concept on the final prediction achieves strong performance and outperforms a static approach proposed in prior work as well as active feature acquisition methods proposed in the literature. We show that the interactive CBM can achieve accuracy gains of 5-10% with only 5 interactions over competitive baselines on the Caltech-UCSD Birds dataset and the CheXpert dataset. Kushal Chauhan · Rishabh Tiwari · Jan Freyberg · Pradeep Shenoy · Krishnamurthy Dvijotham 🔗 - Achieving Diversity and Relevancy in Zero-Shot Recommender Systems for Human Evaluations (Poster) Recommender systems (RecSys) often require user-behavioral data to learn good preference patterns. However, the user data is often collected by a working RecSys in the first place. 
This creates a gap where we hope to establish general recommendation patterns without relying on user data first, while the performance is then evaluated by real human oracles. On top of that, we aim to introduce diversity in the recommendation results, based on uncertainty principles to yield good trade-offs between recommendation coverage and relevancy. Assuming that we have a corpus of item descriptions for all the items in our recommendation catalog, we propose two methods based on pretrained large language models (LLMs): Bert Corpus Tuning (Bert-CT) and Bert Variational Corpus Tuning (Bert-VarCT). Here, Bert-CT is responsible for adapting Bert to attend to domain-specific word tokens in the corpus of the item descriptions and Bert-VarCT is used to introduce diversity without significant changes in the network designs. We show that both methods achieved our designed goals, measured by data from real humans on a crowd-sourcing platform. Additionally, our approach is general and minimalistic. We release our code for reproducibility and extensibility at https://github.com/awslabs/crowd-coachable-recommendations Tiancheng Yu · Yifei Ma · Anoop Deoras 🔗 - Differentiable User Models (Poster) Probabilistic user modeling is essential for building collaborative AI systems within probabilistic frameworks. However, modern advanced user models, often designed as cognitive behavior simulators, are computationally prohibitive for interactive use in cooperative AI assistants. In this extended abstract, we address this problem by introducing widely-applicable differentiable surrogates for bypassing this computational bottleneck; the surrogates enable using modern behavioral models with online computational cost which is independent of their original computational cost. We show experimentally that modeling capabilities comparable to likelihood-free inference methods are achievable, with over eight orders of magnitude reduction in computational time. 
Finally, we demonstrate how AI-assistants can computationally feasibly use cognitive models in a previously studied menu-search task. Alex Hämäläinen · Mustafa Mert Çelikok · Samuel Kaski 🔗 - End-user-centered Interactive Explanatory Relational Learning with Inductive Logic Programming (Poster) This paper shows how improved interactive interfaces can afford end-users better control over expressive logic-based machine learners in order to help circumvent the problem of overfitting on confounding factors in complex relational data. Prior work has shown such confounders commonly occur in real-world data sets in the form of incidental correlations arising from sampling biases, modelling artifacts, labelling errors or simply due to chance occurrences in the training data. Inductive Logic Programming (ILP) is a logical machine learning methodology that can help users address this problem by providing them with hypotheses that are readily understandable and editable by humans. Moreover, because ILP operates directly on relational data which need not be collapsed into finite feature vectors, ILP potentially enables identification of complex relational confounders, which have not been studied until now. This paper proposes an interactive dashboard to make a state-of-the-art interactive ILP system accessible to end-users without a background in computational logic. We present a proof-of-principle case study which shows how users can intuitively identify and circumvent relational confounders in a new synthetic dataset that we derived from prior work in this field. Oliver Deane 🔗 - Time-Efficient Reward Learning via Visually Assisted Cluster Ranking (Poster) One of the most successful paradigms for reward learning uses human feedback in the form of comparisons. Although these methods hold promise, human comparison labeling is expensive and time consuming, constituting a major bottleneck to their broader applicability. 
Our insight is that we can greatly improve how effectively human time is used in these approaches by batching comparisons together, rather than having the human label each comparison individually. To do so, we leverage data dimensionality-reduction and visualization techniques to provide the human with an interactive GUI displaying the state space, in which the user can label subportions of the state space. Across some simple MuJoCo tasks, we show that this high-level approach holds promise and is able to greatly increase the performance of the resulting agents, provided the same amount of human labeling time. David Zhang · Micah Carroll · Andreea Bobu · Anca Dragan 🔗 - Optimal Behavior Prior: Data-Efficient Human Models for Improved Human-AI Collaboration (Poster) AI agents designed to collaborate with people benefit from models that enable them to anticipate human behavior. However, realistic models tend to require vast amounts of human data, which is often hard to collect. A good prior or initialization could make for more data-efficient training, but what makes for a good prior on human behavior? Our work leverages a very simple assumption: people generally act closer to optimal than to random chance. We show that using optimal behavior as a prior for human models makes these models vastly more data-efficient and able to generalize to new environments. Our intuition is that such a prior enables the training to focus one's precious real-world data on capturing the subtle nuances of human suboptimality, instead of on the basics of how to do the task in the first place. We also show that using these improved human models often leads to better human-AI collaboration performance compared to using models based on real human data alone. 
Mesut Yang · Micah Carroll · Anca Dragan 🔗 - Learning Topological Representation of Sensor Network with Persistent Homology in HCI Systems (Poster) Hand gesture and movement analysis is a crucial learning task in human-computer interaction (HCI) applications. Sensor-based HCI systems simultaneously capture information at multiple locations to track the coordination of different regions of muscles. Based on the fact that there exists a temporal correlation between the regions, the connectivity analysis of sensor signals builds a network. The graph-based approach for analyzing the sensor network has provided novel insight into the learning in HCI, which has not been broadly investigated in hand gesture recognition tasks. This work proposes a topological representation learning scheme as a graph-based approach for sensor network analysis. Through investigation of the topological properties with persistent homology, the spatial-temporal characteristics are well described to build recognition models. Experiments on the NinaPro DB-2, DB-4, DB-5, and DB-7 datasets with sensor networks built with sEMG signal and IMU signal demonstrate exceptional performance of the proposed topological approach. The topological features are effective in graph representation learning with sensor networks used in hand gesture recognition tasks. The proposed work provides a novel learning scheme in HCI systems and human-in-the-loop studies. Yan Yan · Cheng-Dong Li · Jing Xiong · Lei Wang 🔗 - On the Ramifications of Human Label Uncertainty (Poster) Humans exhibit disagreement during data labeling. We term this disagreement human label uncertainty. In this work, we study the ramifications of human label uncertainty (HLU). Our evaluation of existing uncertainty estimation algorithms, in the presence of HLU, indicates the limitations of existing uncertainty metrics and algorithms themselves in response to HLU. 
Meanwhile, we observe undue effects in predictive uncertainty and generalizability. To mitigate the undue effects, we introduce a novel natural scene statistics (NSS) based label dilution training scheme without requiring massive human labels. Specifically, we first select a subset of samples with low perceptual quality ranked by statistical regularities of images. We then assign separate labels to each sample in this subset to obtain a training set with diluted labels. Our experiments and analysis demonstrate that training with NSS-based label dilution alleviates the undue effects caused by HLU. Chen Zhou · Mohit Prabhushankar · Ghassan AlRegib 🔗 - A Study of Human-Robot Handover through Human-Human Object Transfer (Poster) In this preliminary study, we investigate changes in handover behaviour when transferring hazardous objects with the help of the see-through-your-skin (STS) visuotactile sensor. Participants were asked to hand over a hazardous and a safe object (a full cup and an empty cup) while instrumented with a modified STS sensor. Our data shows a clear difference in the length of handover for the full cup vs the empty one, with the former being slower. Sensor data further supports a change in handover behaviour dependent on object risk. The results of this paper motivate a deeper study of tactile factors which could characterize a risky handover, allowing for safer human-robot interactions in the future. Charlotte Morissette · Bobak Baghi · Francois Hogan · Gregory Dudek 🔗 - PyTAIL - Interactive and Incremental Learning of NLP Models with Human in the Loop for Online Data (Poster) Online data streams make training machine learning models hard because of distribution shift and new patterns emerging over time. For natural language processing (NLP) tasks that utilize a collection of features based on lexicons and rules, it is important to adapt these features to the changing data. 
To address this challenge we introduce PyTAIL, a Python library that supports a human-in-the-loop approach to actively training NLP models. PyTAIL enhances generic active learning, which only suggests new instances to label, by also suggesting new features like rules and lexicons to label. Furthermore, PyTAIL is flexible enough for users to accept, reject, or update rules and lexicons as the model is being trained. Finally, we simulate the performance of PyTAIL on existing social media benchmark datasets for text classification. We compare various active learning strategies on these benchmarks. The model closes the gap with as few as 10% of the training data. Finally, we also highlight the importance of tracking evaluation metrics on the remaining data (which is not yet merged with active learning) alongside the test dataset. This highlights the effectiveness of the model in accurately annotating the remaining dataset, which is especially suitable for batch processing of large unlabelled corpora. PyTAIL will be available at https://github.com/socialmediaie/pytail. Shubhanshu Mishra · Jana Diesner 🔗