Workshop on Human and Machine Decisions
Daniel Reichman · Joshua Peterson · Kiran Tomlinson · Annie Liang · Tom Griffiths

Tue Dec 14 06:05 AM -- 03:10 PM (PST)
Event URL: https://sites.google.com/view/whmd2021/

Understanding human decision-making is a key focus of behavioral economics, psychology, and neuroscience with far-reaching applications, from public policy to industry. Recently, advances in machine learning have resulted in better predictive models of human decisions and even enabled new theories of decision-making. On the other hand, machine learning systems are increasingly being used to make decisions that affect people, including hiring, resource allocation, and parole. These lines of work are deeply interconnected: learning what people value is crucial both to predict their own decisions and to make good decisions for them. In this workshop, we will bring together experts from the wide array of disciplines concerned with human and machine decisions to exchange ideas around three main focus areas: (1) using theories of decision-making to improve machine learning models, (2) using machine learning to inform theories of decision-making, and (3) improving the interaction between people and decision-making AIs.

Tue 6:05 a.m. - 6:20 a.m.
Opening remarks
Tue 6:20 a.m. - 6:50 a.m.
Sarit Kraus (Keynote)
Sarit Kraus
Tue 6:50 a.m. - 7:20 a.m.
Drew Fudenberg (Keynote)
Drew Fudenberg
Tue 7:20 a.m. - 7:30 a.m.
Break
Tue 7:30 a.m. - 8:00 a.m.
Duncan Watts (Keynote)
Duncan J Watts
Tue 8:00 a.m. - 9:00 a.m.
Panel I: Human decisions (Panel)
Jennifer Trueblood, Alex Peysakhovich, Angela Yu, Ori Plonsky, Tal Yarkoni, Daniel Bjorkegren
Tue 9:00 a.m. - 9:30 a.m.
Break
Tue 9:30 a.m. - 10:00 a.m.
Colin Camerer (Keynote)
Colin Camerer
Tue 10:00 a.m. - 11:00 a.m.
Keynote speakers Q&A (Panel)
Sarit Kraus, Drew Fudenberg, Duncan J Watts, Colin Camerer, Johan Ugander, Emma Pierson
Tue 11:00 a.m. - 11:10 a.m.

Machine learning-based tools have drawn increasing interest from public policy practitioners, yet our understanding of the effectiveness of such tools when paired with human decision makers is limited. Using a randomized controlled trial, we evaluate the effects of an established algorithmic decision aid tool implemented by a U.S. child welfare agency. Halfway through the trial, we present preliminary evidence on the effects of showing a child’s predicted risk score on child welfare decision outcomes. Child welfare workers are already sensitive to underlying risk as measured by the algorithmic tool. Making the score available to workers, however, appears to further improve the targeting of child welfare visits.

Marie-Pascale Grimon, Christopher Mills
Tue 11:10 a.m. - 11:40 a.m.
Johan Ugander (Keynote)
Johan Ugander
Tue 11:40 a.m. - 12:00 p.m.
Break
Tue 12:00 p.m. - 1:00 p.m.
Panel II: Machine decisions (Panel)
Anca Dragan, Karen Levy, Himabindu Lakkaraju, Ariel Rosenfeld, Maithra Raghu, Irene Y Chen
Tue 1:00 p.m. - 1:10 p.m.

When faced with (automated) assessment rules, individuals can modify their observable features strategically to obtain better decisions. In many situations, decision-makers deliberately keep the underlying predictive model secret to avoid gaming. This forces the decision subjects to rely on incomplete information when making strategic feature modifications. We capture such settings as a game of Bayesian persuasion, in which the decision-maker sends a signal, i.e., an action recommendation, to the decision subject to incentivize them to take desirable actions. We formulate the principal's problem of finding the optimal Bayesian incentive-compatible signaling policy as an optimization problem and characterize it via a linear program. Through this characterization, we observe that while finding a BIC strategy can be simplified dramatically, the computational complexity of solving this linear program is closely tied to (1) the relative size of the agent's action space, and (2) the number of features utilized by the underlying decision rule.

Keegan Harris, Valerie Chen, Joon Kim, Ameet Talwalkar, Hoda Heidari, Steven Wu
Tue 1:10 p.m. - 1:40 p.m.
Emma Pierson (Keynote)
Emma Pierson
Tue 1:40 p.m. - 1:50 p.m.

School districts employing variations on the Gale–Shapley deferred acceptance algorithm assume that households have perfect information and list their preferences over schools truthfully. However, many families submit partial preference lists, whether because of limited resources or a misunderstanding of the mechanism. We investigate the role of defaults in deferred acceptance in alleviating search costs for families.

In San Francisco Unified School District (SFUSD), 11% of the 4,713 students were assigned using distance-based defaults in 2018-19. We study nine variations of the SFUSD assignment system, focusing on how defaults are constructed and how defaults are integrated algorithmically. We observe and discuss the change in the estimated welfare for different populations under the nine variations, and seek input on how to improve and evaluate our approach.

Amel Awadelkarim, Johan Ugander, Itai Ashlagi, Irene Lo
Tue 1:50 p.m. - 2:00 p.m.
Break
Tue 2:00 p.m. - 2:40 p.m.
Poster session I (Poster session)
Tue 2:40 p.m. - 3:20 p.m.
Poster session II (Poster session)
Tue 3:20 p.m. - 3:30 p.m.
Closing remarks
-

We propose neural-symbolic integration for abstract concept explanation and interactive learning. Neural-symbolic integration and explanation allow users and domain experts to learn about the data-driven decision-making process of large neural models. The models are queried using a symbolic logic language. Interaction with the user then confirms or rejects a revision of the neural model using logic-based constraints that can be distilled into the model architecture.

Benedikt Wagner, Artur Garcez
-

The ubiquity of AI leads to situations where humans and AI work together, creating the need for learning-to-defer algorithms that determine how to partition tasks between AI and humans. We work to improve learning-to-defer algorithms when paired with specific individuals by incorporating two fine-tuning algorithms and testing their efficacy using both synthetic and image datasets. We find that fine-tuning can pick up on simple human skill patterns but struggles with nuance, and we suggest future work that uses robust semi-supervised methods to improve learning.

Naveen Raman, Michael Yee
-

In more and more situations, artificially intelligent algorithms have to model humans' (social) preferences, on whose behalf they increasingly make decisions. They can learn these preferences through the repeated observation of human behavior in social encounters. In such a context, do individuals adjust the selfishness or prosociality of their behavior when it is common knowledge that their actions produce various externalities through the training of an algorithm? In an online experiment, we let participants' choices in dictator games train an algorithm. Thereby, they create an externality on the future decision making of an intelligent system that affects future participants. We show that individuals who are aware of the consequences of their training on the well-being of a future generation behave more prosocially, but only when they bear the risk of being harmed themselves by future algorithmic choices. In that case, the externality of artificial intelligence training induces a significantly higher share of egalitarian decisions in the present.

Alicia von Schenk, Marie C Villeval, Victor Klockmann
-

We study the functional task of deep learning image classification models and show that image classification requires extrapolation capabilities. This suggests that new theories have to be developed for the understanding of deep learning, as the current theory assumes models are solely interpolating, leaving many questions about them unanswered. We investigate the pixel space and also the feature spaces extracted from images by trained models (in their hidden layers, including the 64-dimensional feature space in the last hidden layer of ResNet), and also the feature space extracted by wavelets/shearlets. In all these domains, testing samples fall considerably outside the convex hull of training sets, and image classification requires extrapolation. Contrary to the deep learning literature, in cognitive science, psychology, and neuroscience, extrapolation and learning are often studied in tandem. Moreover, many aspects of human visual cognition and behavior are reported to involve extrapolation. We propose a novel extrapolation framework for the mathematical study of deep learning models. In our framework, we use the term extrapolation in this specific way of extrapolating outside the convex hull of the training set (in the pixel space or feature space) but within the specific scope defined by the training data, the same way extrapolation is defined in many studies in cognitive science. We explain that our extrapolation framework can provide novel answers to open research problems about deep learning, including their over-parameterization, their training regime, out-of-distribution detection, etc. We also see that the extent of extrapolation is negligible in learning tasks where deep learning is reported to have no advantage over simple models.

Roozbeh Yousefzadeh, Jessica Mollick
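The convex-hull test underlying this abstract can be posed as a linear-programming feasibility problem: a query point lies inside the hull iff it is a convex combination of training points. The sketch below is an illustration of that standard check (not the authors' code), using SciPy's `linprog` as the solver:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, train_points):
    """Check whether `point` lies in the convex hull of `train_points`
    by searching for weights w >= 0 with sum(w) = 1 and
    train_points.T @ w = point (an LP feasibility problem)."""
    n = train_points.shape[0]
    # Equality constraints: the convex combination reproduces the point,
    # and the weights sum to one.
    A_eq = np.vstack([train_points.T, np.ones(n)])
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.status == 0  # feasible => inside (or on) the hull

train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(in_convex_hull(np.array([0.25, 0.25]), train))  # True: interpolation
print(in_convex_hull(np.array([1.0, 1.0]), train))    # False: extrapolation
```

The same check applies unchanged in higher-dimensional feature spaces such as those discussed in the abstract, though the LP grows with the number of training points.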
-

We demonstrate how representational similarity can be leveraged to improve accuracy in medical image decision-making. In a series of experiments conducted on novices and experts, we aggregate responses made by a single individual on similar images to improve overall accuracy on the task. The similarity between two images was calculated as the Euclidean distance between representations obtained from artificial neural networks. Across our experiments, we observed that this algorithm yields significant improvements for novices but not for experts, suggesting that the two groups rely on different decision-making mechanisms. We observe that experts, unlike novices, make similar decisions on similar images, which indicates that experts are more biased in their errors whereas novices err more randomly.

Eeshan Hasan, Jennifer Trueblood, Quentin Eichbaum, Adam Seegmiller, Charles Stratton
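A minimal sketch of the aggregation idea, assuming each image has a feature vector from a trained network and each response is binary (this is an illustration, not the authors' implementation):

```python
import numpy as np

def aggregate_by_similarity(features, responses, k=3):
    """For each image, replace the raw binary response with a majority
    vote over the responses the same individual gave to its k most
    similar images (similarity = Euclidean distance between features)."""
    features = np.asarray(features, dtype=float)
    responses = np.asarray(responses)
    # Pairwise Euclidean distances between image representations.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    aggregated = np.empty_like(responses)
    for i in range(len(responses)):
        neighbors = np.argsort(dists[i])[:k]  # includes the image itself
        aggregated[i] = int(responses[neighbors].mean() >= 0.5)
    return aggregated

# Toy example: five images forming two clusters in feature space.
feats = [[0.0], [0.1], [0.2], [5.0], [5.1]]
resp = [1, 1, 0, 0, 0]  # one likely slip in the first cluster
print(aggregate_by_similarity(feats, resp, k=3))  # → [1 1 1 0 0]
```

The inconsistent response in the first cluster is smoothed out by its similar neighbors, which is the effect the abstract reports helping novices.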
-

A key aspect of human intelligence is their ability to convey their knowledge to others in succinct forms. However, despite their predictive power, current machine learning models are largely blackboxes, making it difficult for humans to extract useful insights. Focusing on sequential decision-making, we design a novel machine learning algorithm that conveys its insights to humans in the form of interpretable "tips". Our algorithm selects the tip that best bridges the gap in performance between human users and the optimal policy. We evaluate our approach through a series of randomized controlled user studies where participants manage a virtual kitchen. Our experiments show that the tips generated by our algorithm can significantly improve human performance relative to intuitive baselines. In addition, we discuss a number of empirical insights that can help inform the design of algorithms intended for human-AI interfaces. For instance, we find evidence that participants do not simply blindly follow our tips; instead, they combine them with their own experience to discover additional strategies for improving performance.

Hamsa Bastani, Osbert Bastani, Wichinpong Sinchaisri
-

Performance metric elicitation is a type of inverse decision problem where the goal is to learn a loss function for a classification problem using expert comparisons between candidate classifiers. However, for many practical tasks, such an expert can be noisy. We present an approach for learning performance metrics in this setting that can handle general noise models. Our approach takes advantage of the problem's similarity to probabilistic bisection search and uses pairwise comparisons to update a pseudo-belief distribution for the performance metric. Our theoretical results guarantee convergence in practical settings and extend beyond previous results to include multi-expert elicitation. Quantitative comparisons against prior work demonstrate the superiority of our approach.

Zachary Robertson, Hantao Zhang, Sanmi Koyejo
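The probabilistic bisection idea referenced above can be sketched on a discretized belief grid: query the median of a belief density over the unknown threshold, then tilt the density according to the (possibly wrong) answer. This toy version assumes a known noise level `p` and is an illustration, not the paper's algorithm:

```python
import numpy as np

def probabilistic_bisection(oracle, lo=0.0, hi=1.0, p=0.8, steps=30, grid=1000):
    """Locate a threshold t* from noisy comparisons: query the median of
    a belief density over [lo, hi]; each answer multiplies the density by
    p on the indicated side and by 1-p on the other, then renormalizes."""
    xs = np.linspace(lo, hi, grid)
    belief = np.ones(grid) / grid
    for _ in range(steps):
        cdf = np.cumsum(belief)
        m = xs[np.searchsorted(cdf, 0.5)]  # belief median = next query
        if oracle(m):  # oracle claims t* > m
            belief = np.where(xs > m, p * belief, (1 - p) * belief)
        else:
            belief = np.where(xs > m, (1 - p) * belief, p * belief)
        belief /= belief.sum()
    return xs[np.argmax(belief)]

# Noisy oracle around a true threshold of 0.3 (correct w.p. 0.8).
rng = np.random.default_rng(1)
true_t = 0.3
oracle = lambda q: (true_t > q) if rng.random() < 0.8 else (true_t <= q)
print(probabilistic_bisection(oracle))  # estimate near 0.3
```

In metric elicitation the "oracle" is the expert comparing two classifiers; the multi-expert extension the abstract mentions goes beyond this single-belief sketch.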
-

Delayed rewards, which are separated from their causative actions by irrelevant actions, hamper learning in reinforcement learning (RL). Real-world problems in particular often contain such delayed and sparse rewards. Recently, return decomposition for delayed rewards (RUDDER) employed pattern recognition to remove or reduce delay in rewards, which dramatically simplifies the learning task of the underlying RL method. RUDDER was realized using a long short-term memory (LSTM). The LSTM was trained to identify important state-action pair patterns responsible for the return. Reward was then redistributed to these important state-action pairs. However, training the LSTM is often difficult and requires a large number of episodes. In this work, we replace the LSTM with the recently proposed continuous modern Hopfield networks (MHN) and introduce Hopfield-RUDDER. MHN are powerful trainable associative memories with large storage capacity. They require only a few training samples and excel at identifying and recognizing patterns. We use this property of MHN to identify important state-action pairs that are associated with low or high return episodes and directly redistribute reward to them. However, in partially observable environments, Hopfield-RUDDER requires additional information about the history of state-action pairs. Therefore, we evaluate several methods for compressing history and introduce reset-max history, a lightweight history compression using the max-operator in combination with a reset gate. We experimentally show that Hopfield-RUDDER is able to outperform LSTM-based RUDDER on various 1D environments with small numbers of episodes. Finally, we show in preliminary experiments that Hopfield-RUDDER scales to highly complex environments with the Minecraft ObtainDiamond task from the MineRL NeurIPS challenge.

Michael Widrich, Markus Hofmarcher, Vihang Patil, Angela Bitto, Sepp Hochreiter
-

The Reddit forum r/AmITheAsshole hosts discussion of moral issues based on concrete narratives presented by users. Existing analyses of the forum focus on its comments and do not make the underlying data publicly available. In this paper we build a new dataset of comments and also investigate the classification of posts in the forum. Further, we identify textual patterns associated with the provocation of moral judgement by posts, with the expression of moral stance in comments, and with the decisions of trained classifiers of posts and comments.

Ion Stagkos Efstathiadis, Guilherme Paulino-Passos, Francesca Toni
-

We conduct a lottery experiment to assess the predictive importance of simple choice process metrics (SCPMs) in forecasting risky 50/50 gambling decisions using different types of machine learning algorithms in addition to traditional choice modeling approaches. The SCPMs are recorded during a fixed pre-decision phase and are derived from tracking subjects’ eye movements, pupil sizes, skin conductance, and cardiovascular and respiratory signals. Our study demonstrates that SCPMs provide relevant information for predicting gambling decisions; however, we do not find forecasting accuracy to be substantially affected by adding SCPMs to standard choice data. Instead, our results show that forecasting accuracy highly depends on differences in subject-specific risk preferences and is largely driven by including information on lottery design variables. As a key result, we find evidence for dynamic changes in the predictive importance of psychophysiological responses that appear to be linked to habituation and resource-depletion effects. Subjects’ willingness to gamble and choice-revealing arousal signals both decrease as the experiment progresses.

Steffen Mueller, Patrick Ring, Maria Fischer
-

This paper explores the possibility that an RL algorithm can control human goal-directed learning at both behavioral and neural levels. The proposed framework is based on an asymmetric two-player game setting: while a computational model of human RL (a cognitive model) performs a goal-conditioned two-stage Markov decision task, an RL algorithm (a task controller) learns a behavioral policy to drive the key variable (i.e., state prediction error) of the cognitive model to an arbitrarily chosen state, by manipulating the task parameters (i.e., state-action-state transition uncertainty and goal conditions) on a trial-by-trial basis. We fitted the cognitive models individually to 82 human subjects' data, and subsequently used them to train the task controller in two different scenarios, minimizing and maximizing state prediction error, which are known to improve and reduce the motivation for goal-directed learning, respectively. The model permutation analysis revealed a subject-independent task control policy, suggesting that the task controller pre-trained with cognitive models in silico could generalize to actual human subjects without further training. To directly test the efficacy of our framework, we ran fMRI experiments on another 21 human subjects. The behavioral analysis confirmed that the pre-trained task controller successfully manipulates human goal-directed learning. Notably, we found neural effects of the task control on the insular and lateral prefrontal cortex, the cortical regions known to encode state prediction error signals during goal-directed learning. Our framework can be implemented with any RL algorithm, making it possible to guide various types of human-computer interaction.

Jaehoon Shin, Jee Hang Lee, Sang Wan Lee
-

Despite AI’s superhuman performance in a variety of domains, humans are often unwilling to adopt algorithms. The lack of interpretability inherent in many modern-day AI techniques is believed to be hurting algorithm adoption, as users may not trust systems whose decision processes they don’t understand. We investigate this proposition with a novel experiment in which we use an interactive prediction task to analyze the impact of interpretability as well as outcome feedback on trust in AI and performance in the prediction task. We find that interpretability led to no robust improvements in trust, while outcome feedback had a significantly greater and more reliable effect. However, neither factor had more than minimal effects on performance in the task. Our findings suggest that (1) factors receiving significant attention, such as interpretability, may be less effective at increasing trust than factors like outcome feedback, and (2) augmenting human performance via AI systems may not be a simple matter of increasing trust in AI, as increased trust is not always associated with equally sizable improvements in performance. These findings clarify for companies and product designers that providing interpretations alone may not be sufficient to solve challenges around user trust in AI products, while also highlighting that certain other features may be more effective. These findings also invite the research community to not only focus on methods for generating interpretations but also on methods for ensuring that interpretations impact trust and performance in practice, such as how to present interpretations to users.

Daehwan Ahn, Abdullah Almaatouq, Monisha Gulabani, Kartik Hosanagar
-

Measures such as conditional value-at-risk (CVaR) precisely characterize the influence that rare, catastrophic events can exert over decisions. CVaR compounds in complex ways over sequences of decisions -- by averaging out or multiplying -- formalized in recent work [1] as three structurally different approaches. Existing cognitive tasks fail to discriminate these approaches well; here, we provide examples that highlight their unique characteristics, and make formal links to temporal discounting for two of the approaches that are time-consistent. These examples can serve as a basis for future experiments with the broader aim of characterizing (potentially maladaptive) risk attitudes in psychopathological populations.

Christopher Gagne
-

Past research has clearly established that music can affect mood and that mood affects emotional and cognitive processing, and thus decision-making. It follows that if a robot interacting with a person needs to predict the person's behavior, knowledge of the music the person is listening to when acting is a potentially relevant feature. To date, however, there has not been any concrete evidence that an autonomous agent can improve its human-interactive decision-making by taking into account what the person is listening to. This research fills this gap by reporting the results of an experiment in which human participants were required to complete a task in the presence of an autonomous agent while listening to background music. Specifically, the participants drove a simulated car through an intersection while listening to music. The intersection was not empty, as another simulated vehicle, controlled autonomously, was also crossing the intersection in a different direction. Our results clearly indicate that such background information can be effectively incorporated in an agent's world representation in order to better predict people's behavior. We subsequently analyze how knowledge of music impacted both participant behavior and the resulting learned policy.

Elad Liebman, Peter Stone
-

Psychometric functions characterize binary sensory decisions along a single stimulus dimension. However, real-life sensory tasks vary along a greater variety of dimensions (e.g. color, contrast and luminance for visual stimuli). Approaches to characterizing high-dimensional sensory spaces either require strong parametric assumptions about these additional contextual dimensions, or fail to leverage known properties of classical psychometric curves, such as identifiable thresholds and slopes. We overcome both limitations by introducing a semi-parametric model of sensory discrimination that parameterizes performance along a single intensity dimension via a classical logistic function, but uses Gaussian Processes (GPs) to flexibly model logistic parameters across any number of non-intensity dimensions. The use of GPs additionally enables adaptive sampling, avoiding the need for grid sampling or staircase methods, which are intractable in higher dimensions. We show that this semi-parametric method accurately identifies the true high-dimensional psychometric function in fewer samples than competing approaches and offers behaviorally interpretable parameters.

Stephen Keeley, Ben Letham, Michael Shvartsman
-

Modeling human decision making plays a fundamental role in the design of intelligent systems capable of rich interactions. In this paper we consider the task of choice prediction in settings with multiple alternatives. Cognitive models of decision making can successfully replicate and explain behavioral effects involving uncertainty and interactions among alternatives, but are computationally intensive to train. ML approaches excel in terms of choice prediction accuracy, but fail to provide insights on the underlying preference reasoning. We study different degrees of integration of ML and cognitive models for this task. We show, via testing on behavioral data, that our hybrid approach, based on the integration of a neural network and the Multi-attribute Linear Ballistic Accumulator cognitive model, requires significantly less time to train and allows us to capture important cognitive parameters while maintaining accuracy similar to the pure ML approach.

Taher Rahgooy, Jennifer Trueblood, Brent Venable
-

Recent advances in eXplainable Artificial Intelligence have enabled Artificial Intelligence (AI) systems to describe their thought process to human users. Also, given the high performance of AI on i.i.d. test sets, it is interesting to study whether such AIs can work alongside humans and improve the accuracy of user decisions. We conduct a user study on 320 lay and 11 expert users to understand the effectiveness of state-of-the-art attribution methods in assisting humans in ImageNet classification, Stanford Dogs fine-grained classification, and these same two tasks when the input image contains adversarial perturbations. We found that, overall, feature attribution is surprisingly not more effective than showing humans nearest training-set examples. On the hard task of fine-grained dog classification, presenting attribution maps to humans does not help, but instead hurts the performance of human-AI teams compared to AI alone. Our findings encourage the community to rigorously test their methods on downstream human-in-the-loop applications and to rethink the existing evaluation metrics.

Giang Nguyen, Anh Nguyen
-

AI tools intended to assist human decision-making must be able to model the latent and potentially noisy preferences of their users. Pairwise comparisons between alternative items are an efficient means by which preference data can be solicited from users. In the preference learning literature, Gaussian processes have been used to construct flexible, generalizable models of pairwise preferences, but a key drawback of these approaches is runtime: performing Gaussian process inference requires computing a matrix inversion in cubic time. In this work, we introduce a new method for training Gaussian process preference models based on neural networks, for which a forward pass requires only linear time. Our models use Siamese neural network architectures, which enable the prediction of both utility function valuations for individual items as well as pairwise preference probabilities. Using two popular benchmark datasets, we show that our models can achieve predictive accuracy competitive with existing preference learning methods while requiring only a fraction of the time for evaluation.

Rex Chen, Norman Sadeh, Fei Fang
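The pairing of a shared utility network with a logistic (Bradley–Terry-style) link can be sketched as follows. This is an untrained toy model with an assumed one-hidden-layer architecture, omitting the GP component entirely; it only illustrates why a forward pass is linear in the number of items:

```python
import numpy as np

rng = np.random.default_rng(0)

class SiamesePreferenceModel:
    """Shared utility network u(x) applied to each item of a pair;
    the preference probability is P(a preferred to b) =
    sigmoid(u(a) - u(b)), so one forward pass per item suffices."""
    def __init__(self, dim, hidden=16):
        self.W1 = rng.normal(size=(dim, hidden)) / np.sqrt(dim)
        self.w2 = rng.normal(size=hidden) / np.sqrt(hidden)

    def utility(self, x):
        # One hidden layer with tanh nonlinearity, linear readout.
        return np.tanh(np.asarray(x) @ self.W1) @ self.w2

    def prefer_prob(self, a, b):
        return 1.0 / (1.0 + np.exp(-(self.utility(a) - self.utility(b))))

model = SiamesePreferenceModel(dim=4)
a, b = rng.normal(size=4), rng.normal(size=4)
p = model.prefer_prob(a, b)
# By construction P(a > b) + P(b > a) = 1.
print(round(p + model.prefer_prob(b, a), 6))  # → 1.0
```

Because both branches share weights, the model yields item-level utilities and pairwise probabilities from the same parameters, matching the two prediction modes described in the abstract.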
-

In this paper we argue that, despite decades of research in Artificial Intelligence, we do not evaluate Machine Learning models in the right way - neither in research papers nor in industrial deployments. We show why this is the case, propose concepts and metrics that allow us to properly assess the value of a model, and provide insights into what we look for in a "good" ML model.

Fabio Casati, Pierre-André Noël, Jie Yang
-

Humans often use images to make high-stakes decisions. We propose a machine learning approach to analyze the ways in which they err in doing so, leveraging a unique dataset of 16,135,392 human predictions of whether a neighborhood voted for Donald Trump or Joe Biden in the 2020 US election, based on a Google Street View image. We show that by training a machine learning estimator of the Bayes optimal decision for each image, we can provide an actionable decomposition of human error into bias, variance, and noise terms and identify specific features (like pickup trucks) which lead humans astray.

J.D. Zamfirescu-Pereira, Jerry Chen, Emily Wen, Allison Koenecke, Nikhil Garg, Emma Pierson
-

Machine learning-based tools have drawn increasing interest from public policy practitioners, yet our understanding of the effectiveness of such tools when paired with human decision makers is limited. Using a randomized controlled trial, we evaluate the effects of an established algorithmic decision aid tool implemented by a U.S. child welfare agency. Halfway through the trial, we present preliminary evidence on the effects of showing a child’s predicted risk score on child welfare decision outcomes. Child welfare workers are already sensitive to underlying risk as measured by the algorithmic tool. Making the score available to workers, however, appears to further improve the targeting of child welfare visits.

Marie-Pascale Grimon, Christopher Mills
-

When faced with (automated) assessment rules, individuals can modify their observable features strategically to obtain better decisions. In many situations, decision-makers deliberately keep the underlying predictive model secret to avoid gaming. This forces the decision subjects to rely on incomplete information when making strategic feature modifications. We capture such settings as a game of Bayesian persuasion, in which the decision-maker sends a signal, i.e., an action recommendation, to the decision subject to incentivize them to take desirable actions. We formulate the principal's problem of finding the optimal Bayesian incentive-compatible signaling policy as an optimization problem and characterize it via a linear program. Through this characterization, we observe that while finding a BIC strategy can be simplified dramatically, the computational complexity of solving this linear program is closely tied to (1) the relative size of the agent's action space, and (2) the number of features utilized by the underlying decision rule.

Keegan Harris, Valerie Chen, Joon Kim, Ameet Talwalkar, Hoda Heidari, Steven Wu
-

School districts employing variations on the Gale–Shapley deferred acceptance algorithm assume that households have perfect information and list their preferences over schools truthfully. However, many families submit partial preference lists, whether because of limited resources or a misunderstanding of the mechanism. We investigate the role of defaults in deferred acceptance in alleviating search costs for families.

In San Francisco Unified School District (SFUSD), 11% of the 4,713 students were assigned using distance-based defaults in 2018-19. We study nine variations of the SFUSD assignment system, focusing on how defaults are constructed and how defaults are integrated algorithmically. We observe and discuss the change in the estimated welfare for different populations under the nine variations, and seek input on how to improve and evaluate our approach.

Amel Awadelkarim, Johan Ugander, Itai Ashlagi, Irene Lo

Author Information

Daniel Reichman (Worcester Polytechnic Institute)
Joshua Peterson (Princeton University)
Kiran Tomlinson (Cornell University)
Annie Liang (UPenn)
Tom Griffiths (Princeton University)