Workshop
Workshop on Machine Learning Safety
Jacob Steinhardt · Victoria Krakovna · Dan Hendrycks · Nicholas Carlini · Dawn Song

Fri Dec 09 07:00 AM -- 02:00 PM (PST) @ Virtual

Designing systems to operate safely in real-world settings is a topic of growing interest in machine learning. As ML becomes more capable and widespread, long-term and long-tail safety risks will grow in importance. To make the adoption of ML more beneficial, various aspects of safety engineering and oversight need to be proactively addressed by the research community. This workshop will bring together researchers from machine learning communities to focus on research topics in Robustness, Monitoring, Alignment, and Systemic Safety.
* Robustness is designing systems to be reliable in the face of adversaries and highly unusual situations.
* Monitoring is detecting anomalies, malicious use, and discovering unintended model functionality.
* Alignment is building models that represent and safely optimize difficult-to-specify human values.
* Systemic Safety is using ML to address broader risks related to how ML systems are handled, such as cyberattacks, facilitating cooperation, or improving the decision-making of public servants.

Fri 7:00 a.m. - 7:10 a.m. Opening Remarks (Speaker)
Fri 7:10 a.m. - 7:40 a.m. Invited talk #1 (Speaker)
Fri 7:40 a.m. - 8:25 a.m. Morning Poster Session (Poster Session)
Fri 8:25 a.m. - 8:45 a.m. Coffee Break (Break)
Fri 9:15 a.m. - 10:00 a.m. Afternoon Poster Session (Poster Session)
Fri 10:00 a.m. - 10:45 a.m. Lunch (Break)
Fri 10:45 a.m. - 11:15 a.m. Invited talk #3 (Speaker)
Fri 11:15 a.m. - 11:45 a.m. Invited talk #4 (Speaker)
Fri 11:45 a.m. - 12:00 p.m. Coffee Break (Speaker)
Fri 12:00 p.m. - 12:30 p.m. Invited talk #5 (Speaker)
Fri 12:30 p.m. - 1:00 p.m. Invited talk #6 (Speaker)
Fri 1:00 p.m. - 1:55 p.m. Live Panel Discussion with the Invited Speakers (Discussion Panel)
Fri 1:55 p.m. - 2:00 p.m. Closing Remarks (Speaker)

- Formalizing the Problem of Side Effect Regularization (Poster)
AI objectives are often hard to specify properly. Some approaches tackle this problem by regularizing the AI’s side effects: agents must weigh “how much of a mess they make” against an imperfectly specified proxy objective. We propose a formal criterion for side effect regularization via the assistance game framework [Shah et al., 2021]. In these games, the agent solves a partially observable Markov decision process (POMDP) representing its uncertainty about the objective function it should optimize. We consider the setting where the true objective is revealed to the agent at a later time step. We show that this POMDP is solved by trading off the proxy reward with the agent’s ability to achieve a range of future tasks. We empirically demonstrate the reasonableness of our problem formalization via ground-truth evaluation in two gridworld environments.
Alex Turner · Aseem Saxena · Prasad Tadepalli

- Measuring Robustness with Black-Box Adversarial Attack using Reinforcement Learning (Poster)
A measure of robustness against naturally occurring distortions is key to the trustworthiness, safety, and success of machine learning models on deployment.
We investigate an adversarial black-box attack that adds minimum Gaussian noise distortions to input images to make deep learning models misclassify. We used a Reinforcement Learning (RL) agent as a smart hacker to explore the input images and add minimum distortions to the most sensitive regions to induce misclassification. The agent also employs a smart policy to remove noise introduced earlier that has less impact on the trained model at a given state. This novel approach is equivalent to doing a deep tree search to add noise without an exhaustive search, leading to faster and optimal convergence. Also, this adversarial attack method effectively measures the robustness of image classification models with the misclassification-inducing minimum L2 distortion of Gaussian noise, similar to many naturally occurring distortions. Furthermore, the proposed black-box L2 adversarial attack tool beats state-of-the-art competitors in terms of the average number of queries by a significant margin with a 100% success rate while maintaining a very competitive L2 score, despite limiting distortions to Gaussian noise. For the ImageNet dataset, the average number of queries achieved by the proposed method for ResNet-50, Inception-V3, and VGG-16 models is 42%, 32%, and 31% better, respectively, than the state-of-the-art "Square-Attack" approach while maintaining a competitive L2. Demo: https://tinyurl.com/pzrca5fj
Soumyendu Sarkar · Sajad Mousavi · Ashwin Ramesh Babu · Vineet Gundecha · Sahand Ghorbanpour · Alexander Shmakov

- Investigating causal understanding in LLMs (Poster)
We investigate the quality of causal world models of LLMs in very simple settings. We test whether LLMs can identify cause and effect in natural language settings (taken from BigBench) such as “My car got dirty. I washed the car. Question: Which sentence is the cause of the other?” and in multiple other toy settings.
We probe the LLM's world model by changing the presentation of the prompt while keeping the meaning constant, e.g. by changing the order of the sentences or asking the opposite question. Additionally, we test if the model can be “tricked” into giving wrong answers when we present the shot in a different pattern than the prompt. We have three findings. Firstly, larger models yield better results. Secondly, k-shot outperforms one-shot and one-shot outperforms zero-shot in standard conditions. Thirdly, LLMs perform worse in conditions where form and content differ. We conclude that the form of the presentation matters for LLM predictions or, in other words, that LLMs don't solely base their predictions on content. Finally, we detail some of the implications this research has on AI safety.
Marius Hobbhahn · Tom Lieberum · David Seiler

- Reflection Mechanisms as an Alignment Target: A Survey (Poster)
We used Positly to survey roughly 1000 US-based workers about their attitudes on moral questions, conditions under which they would change their moral beliefs, and approval towards different mechanisms for society to resolve moral disagreements. Unsurprisingly, our sample strongly disagreed on contentious object-level moral questions such as whether abortion is immoral. In addition, a substantial fraction of people reported that these beliefs wouldn’t change even if they came to different beliefs about factors we view as morally relevant, such as whether the fetus was conscious in the case of abortion. However, people were generally favorable to the idea of society deciding policies by some means of reflection - such as democracy, a debate between well-intentioned experts, or thinking for a long time. This agreement improves in a hypothetical well-intentioned future society. Surprisingly, favorability remained even when we stipulate that the reflection procedure came to the opposite of the respondents' view on polarizing topics like abortion.
This provides evidence that people may support aligning AIs to a reflection procedure rather than individual beliefs. We tested our findings on a second adversarial survey that actively tries to disprove the finding from the first study. We find that our core results are robust in standard settings but are weakened when the questions are constructed adversarially (e.g. when decisions are made by people who have the opposite of the respondents' moral or political beliefs).
Marius Hobbhahn · Eric Landgrebe · Elizabeth Barnes

- Interpolating Compressed Parameter Subspaces (Poster)
Though distribution shifts have caused growing concern for machine learning scalability, solutions tend to specialize towards a specific type of distribution shift. We find that constructing a Compressed Parameter Subspace (CPS), a geometric structure representing distance-regularized parameters mapped to a set of train-time distributions, can maximize average accuracy over a broad range of distribution shifts concurrently. We show that sampling parameters within a CPS can mitigate backdoor, adversarial, permutation, stylization and rotation perturbations. Regularizing a hypernetwork with CPS can also reduce task forgetting.
Siddhartha Datta · Nigel Shadbolt

- Probabilistically Robust PAC Learning (Poster)
Recently, Robey et al. proposed a notion of probabilistic robustness, which, at a high level, requires a classifier to be robust to most but not all perturbations. They show that for certain hypothesis classes where proper learning under worst-case robustness is not possible, proper learning under probabilistic robustness is possible with sample complexity exponentially smaller than in the worst-case robustness setting. This motivates the question of whether proper learning under probabilistic robustness is always possible. In this paper, we show that this is not the case.
We exhibit examples of hypothesis classes $\mathcal{H}$ with finite VC dimension that are not probabilistically robustly PAC learnable with any proper learning rule.
Vinod Raman · Ambuj Tewari · Unique Subedi

- Multiple Remote Adversarial Patches: Generating Patches based on Diffusion Models for Object Detection using CNNs (Poster)
Adversarial patches can fool object detection systems, which poses a severe threat to machine learning models. Many researchers have focused on strong adversarial patches. Remote adversarial patches, placed outside the target objects, are candidates for strong adversarial patches. This study gives a concrete model of adversarial patches on convolutional neural networks (CNNs), namely a diffusion model. Our diffusion model shows that multiple remote adversarial patches pose severe threats to the YOLOv2 CNN. Our experiment also demonstrates that two remote adversarial patches reduce the average existence probability to 12.81%, whereas Saha et al.'s original single adversarial patch reduced the average existence probability to 50.95%. Moreover, we generate adversarial patches for the SSD architecture. In the SSD architecture, two remote adversarial patches also significantly reduce the average existence probability from 24.52% to 6.12%. With these results, this paper provides a framework for analyzing the effect of adversarial patch attacks.
Kento Oonishi · Tsunato Nakai · Daisuke Suzuki

- Misspecification in Inverse Reinforcement Learning (Poster)
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$. To do this, we need a model of how $\pi$ relates to $R$. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour.
However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. This means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and answer precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function $R$. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models.
Joar Skalse · Alessandro Abate

- Red-Teaming the Stable Diffusion Safety Filter (Poster)
Stable Diffusion is a recent open-source image generation model comparable to proprietary models such as DALL·E, Imagen, or Parti. Stable Diffusion comes with a safety filter that aims to prevent generating explicit images. Unfortunately, the filter is obfuscated and poorly documented. This makes it hard for users to prevent misuse in their applications, and to understand the filter's limitations and improve it. We first show that it is easy to generate disturbing content that bypasses the safety filter. We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content. Based on our analysis, we argue safety measures in future model releases should strive to be fully open and properly documented to stimulate security contributions from the community.
Javier Rando · Daniel Paleka · David Lindner · Lennart Heim · Florian Tramer

- Tracking the Risk of Machine Learning Systems with Partial Monitoring (Poster)
Although efficient at performing specific tasks, Machine Learning Systems (MLSs) remain vulnerable to instabilities such as noise or adversarial attacks. In this work, we aim to track the risk exposure of an MLS to these events. We formulate this problem under the stochastic Partial Monitoring (PM) setting. We focus on two instances of partial monitoring, namely the Apple Tasting and Label Efficient games, that are particularly relevant to our problem. Our review of the practicality of existing algorithms motivates RandCBP, a randomized variation of the deterministic algorithm Confidence Bound (CBP) inspired by recent theoretical developments in the bandits setting. Our preliminary results indicate that RandCBP enjoys the same regret guarantees as its deterministic counterpart CBP and achieves competitive empirical performance on settings of interest, which suggests it could be a suitable candidate for our problem.
Maxime Heuillet · Audrey Durand

- The Reward Hypothesis is False (Poster)
The reward hypothesis is the hypothesis that "all of what we mean by goals and purposes can be well thought of as the maximisation of the expected value of the cumulative sum of a received scalar signal" (Sutton and Barto, 2018). In this paper, we will argue that this hypothesis is false. We will look at three natural classes of reinforcement learning tasks (multi-objective reinforcement learning, risk-averse reinforcement learning, and modal reinforcement learning), and then prove mathematically that these tasks cannot be expressed using any scalar, Markovian reward function.
We thus disprove the reward hypothesis by providing many examples of tasks which are both natural and intuitive to describe, but which are nonetheless impossible to express using reward functions. In the process, we provide necessary and sufficient conditions for when a multi-objective reinforcement learning problem can be reduced to ordinary, scalar reward reinforcement learning. We also call attention to a new class of reinforcement learning problems (namely those we call "modal" problems), which have so far not been given any systematic treatment in the reinforcement learning literature.
Joar Skalse · Alessandro Abate

- Training Time Adversarial Attack Aiming the Vulnerability of Continual Learning (Poster)
Generally, regularization-based continual learning models limit access to the previous task data to imitate the real-world setting, which has memory and privacy issues. However, this introduces a problem in these models: they cannot track the performance on each task. In other words, current continual learning methods are vulnerable to attacks on previous tasks. We demonstrate the vulnerability of regularization-based continual learning methods by presenting a simple task-specific training time adversarial attack that can be used in the learning process of a new task. Training data generated by the proposed attack causes performance degradation on a specific task targeted by the attacker. Experimental results confirm the vulnerability proposed in this paper and demonstrate the importance of developing continual learning models that are robust to adversarial attack.
Gyojin Han · Jaehyun Choi · HyeongGwon Hong · Junmo Kim

- Measuring Reliability of Large Language Models through Semantic Consistency (Poster)
While large pretrained language models (PLMs) demonstrate incredible fluency and performance on many natural language tasks, recent work has shown that well-performing PLMs are very sensitive to which prompts are fed into them.
Even when prompts are semantically identical, language models may give very different answers. When considering safe and trustworthy deployments of PLMs, we would like their outputs to be consistent under prompts that mean the same thing or convey the same intent. While some work has looked into how state-of-the-art PLMs address this need, it has been limited to evaluating lexical equality of single- or multi-word answers and does not address consistency of generative text sequences. In order to understand consistency of PLMs under text generation settings, we develop a measure of semantic consistency that allows the comparison of open-ended text outputs. We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions in the TruthfulQA dataset. We find that our proposed metrics are considerably more consistent than traditional metrics embodying lexical consistency, and also correlate with human evaluation of output consistency to a higher degree.
Harsh Raj · Domenic Rosati · Subhabrata Majumdar

- CUDA: Curriculum of Data Augmentation for Long-tailed Recognition (Poster)
Class imbalance problems frequently occur in real-world tasks, and conventional deep learning algorithms are well known for performance degradation on imbalanced training datasets. To mitigate this problem, many approaches have aimed to balance among given classes by re-weighting or re-sampling training samples. These re-balancing methods increase the impact of minority classes and reduce the influence of majority classes on the output of models. Despite extensive recent studies, no deep analysis has been conducted on determining which classes to augment and how strongly to augment them.
In this study, we propose a simple and efficient novel curriculum, which is designed to find the appropriate per-class strength of data augmentation, called CUDA: CUrriculum of Data Augmentation for long-tailed recognition. CUDA can simply be integrated into existing long-tailed recognition methods. We present the results of experiments showing that CUDA effectively achieves better generalization performance compared to the state-of-the-art method on imbalanced datasets such as CIFAR-100-LT.
Sumyeong Ahn · Jongwoo Ko · Se-Young Yun

- Certified defences hurt generalisation (Poster)
In recent years, much work has been devoted to designing certified defences for neural networks, i.e., methods for learning neural networks that are provably robust to certain adversarial perturbations. Due to the non-convexity of the problem, dominant approaches in this area rely on convex approximations, which are inherently loose. In this paper, we question the effectiveness of such approaches for realistic computer vision tasks. First, we provide extensive empirical evidence to show that certified defences suffer not only worse accuracy but also worse robustness and fairness than empirical defences. We hypothesise that the reason why certified defences suffer in generalisation is (i) the large number of relaxed non-convex constraints and (ii) strong alignment between the adversarial perturbations and the "signal" direction. We provide a combination of theoretical and experimental evidence to support these hypotheses.
Piersilvio De Bartolomeis · Jacob Clarysse · Fanny Yang · Amartya Sanyal

- Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning (Poster)
Automatically discovering failures in vision models under real-world settings remains an open challenge. This work describes how off-the-shelf, large-scale, image-to-text and text-to-image models, trained on vast amounts of data, can be leveraged to automatically find such failures.
We detail a pipeline that demonstrates how we can interrogate classifiers trained on ImageNet to find specific failure cases and discover spurious correlations. We also show that we can scale our approach to generate adversarial datasets targeting specific classifier architectures. This work serves as a proof-of-concept demonstrating the utility of large-scale generative models to automatically discover bugs in vision models in an open-ended manner.
Olivia Wiles · Isabela Albuquerque · Sven Gowal

- Context-Adaptive Deep Neural Networks via Bridge-Mode Connectivity (Poster)
The deployment of machine learning models in safety-critical applications comes with the expectation that such models will perform well over a range of contexts (e.g., a vision model for classifying street signs should work in rural, city, and highway settings under varying lighting/weather conditions). However, these one-size-fits-all models are typically optimized for average case performance, encouraging them to achieve high performance in nominal conditions but exposing them to unexpected behavior in challenging or rare contexts. To address this concern, we develop a new method for training context-dependent models. We extend Bridge-Mode Connectivity (BMC) to train an infinite ensemble of models over a continuous measure of context such that we can sample model parameters specifically tuned to the corresponding evaluation context. We explore the definition of context in image classification tasks through multiple lenses including changes in the risk profile, long-tail image statistics/appearance, and context-dependent distribution shift. We develop novel extensions of the BMC optimization for each of these cases and our experiments demonstrate that model performance can be successfully tuned to context in each scenario.
Nathan Drenkow · Alvin Tan · Clayton Ashcraft · Kiran Karra

- Constraining Low-level Representations to Define Effective Confidence Scores (Poster)
Neural networks are known to fail with high confidence, especially for data that somehow differs from the training distribution. Such unsafe behaviour limits their applicability. To counter that, we show that models offering accurate confidence levels can be defined by adding constraints on their internal representations. To do so, we encode class labels as fixed unique binary vectors, or class codes, and use those to enforce class-dependent activation patterns throughout the model's depth. Resulting predictors are dubbed total activation classifiers (TAC), and TAC is used as an additional component to a base classifier to indicate how reliable a prediction is. Empirically, we show that the resemblance between activation patterns and their corresponding codes results in an inexpensive unsupervised approach for inducing discriminative confidence scores. Namely, we show that TAC is at least as good as state-of-the-art confidence scores extracted from existing models, while strictly improving the model's value on the rejection setting.
Joao Monteiro · Pau Rodriguez · Pierre-Andre Noel · Issam Hadj Laradji · David Vázquez

- On the Robustness of Safe Reinforcement Learning under Observational Perturbations (Poster)
Safe reinforcement learning (RL) trains a policy to maximize the task reward while satisfying safety constraints. While prior works focus on performance optimality, we find that the optimal solutions of many safe RL problems are not robust and safe against observational perturbations. We formally analyze the unique properties of designing effective state adversarial attackers in the safe RL setting. We show that baseline adversarial attack techniques for standard RL tasks are not always effective for safe RL and propose two new approaches - one maximizes the cost and the other maximizes the reward.
One interesting and counter-intuitive finding is that the maximum reward attack is strong, as it can both induce unsafe behaviors and make the attack stealthy by maintaining the reward. We further propose a more effective adversarial training framework for safe RL and evaluate it via comprehensive experiments (video demos are available at https://sites.google.com/view/robustsaferl/home). This paper provides pioneering work investigating the safety and robustness of RL under observational attacks for future safe RL studies.
Zuxin Liu · Zijian Guo · Zhepeng Cen · Huan Zhang · Jie Tan · Bo Li · Ding Zhao

- Improving Zero-shot Generalization and Robustness of Multi-modal Models (Poster)
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reason for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images where the top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically, we use information from parents in the hierarchy to add superclasses to prompts, and use information from children in the hierarchy to devise fine-grained prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets.
For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method consistently improves accuracy on other ImageNet shifted datasets and other model architectures such as LiT. Our proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures. Code for our experiments is open-sourced at link hidden for anonymity.
Yunhao Ge · Jie Ren · Ming-Hsuan Yang · Yuxiao Wang · Andrew Gallagher · Hartwig Adam · Laurent Itti · Balaji Lakshminarayanan · Jiaping Zhao

- Disclosing the Biases in Large Language Models via Reward Structured Questions (Poster)
The success of large language models has been amply demonstrated in recent times. Fine-tuning these models for the specific task at hand results in high-performing models. However, these models also learn biased representations from the data they have been trained on. In particular, several studies recently showed that language models can learn to be biased towards certain genders. Quite recently, several studies tried to eliminate this bias by including human feedback in fine-tuning. In our study, we show that by changing the question asked to the language model, the log probabilities of the bias measured in the responses change dramatically. Furthermore, in several cases the language model ends up providing a completely opposite response. Recent language models fine-tuned on prior gender bias datasets do not resolve the actual problem, but rather alleviate the problem for the dataset on which the model is fine-tuned. We believe our results might lay the foundation for further alignment and safety problems in large language models.
Ezgi Korkmaz

- Dynamic Stochastic Ensemble with Adversarial Robust Lottery Ticket Subnetworks (Poster)
Adversarial attacks are considered an intrinsic vulnerability of CNNs.
Defense strategies designed for attacks have been stuck in the adversarial attack-defense arms race, reflecting the imbalance between attack and defense. Dynamic Defense Framework (DDF) recently changed the passive safety status quo based on the stochastic ensemble model. The diversity of subnetworks, an essential concern in the DDF, can be effectively evaluated by the adversarial transferability between different networks. Inspired by the poor adversarial transferability between subnetworks of scratch tickets with various remaining ratios, we propose a method to realize the dynamic stochastic ensemble defense strategy. We discover the adversarial transferable diversity between robust lottery ticket subnetworks drawn from different basic structures and sparsity. The experimental results suggest that our method achieves better robust and clean recognition accuracy by adversarial transferable diversity, which would decrease the reliability of attacks.
Qi Peng · Wenlin Liu · Qin RuoXi · Libin Hou · Bin Yan · Linyuan Wang

- Exploring Transformer Backbones for Heterogeneous Treatment Effect Estimation (Poster)
Previous works on Treatment Effect Estimation (TEE) are not in widespread use because they are predominantly theoretical, making strong parametric assumptions that are intractable for practical application. Recent works use multilayer perceptrons (MLPs) for modeling causal relationships; however, MLPs lag far behind recent advances in ML methodology, which limits their applicability and generalizability. To extend beyond the single domain formulation and towards more realistic learning scenarios, we explore model design spaces beyond MLPs, i.e., transformer backbones, which provide flexibility where attention layers govern interactions among treatments and covariates to exploit structural similarities of potential outcomes for confounding control. Through careful model design, Transformers as Treatment Effect Estimators (TransTEE) is proposed.
We show empirically that TransTEE can: (1) serve as a general-purpose treatment effect estimator which significantly outperforms competitive baselines on a variety of challenging TEE problems (e.g., discrete, continuous, structured, or dosage-associated treatments) and is applicable both when covariates are tabular and when they consist of structural data (e.g., texts, graphs); (2) yield multiple advantages: compatibility with propensity score modeling, parameter efficiency, robustness to continuous treatment value distribution shifts, explainability in covariate adjustment, and real-world utility in auditing pre-trained language models.
Yifan Zhang · Hanlin Zhang · Zachary Lipton · Li Erran Li · Eric Xing

- Bandits with Costly Reward Observations (Poster)
Many Machine Learning applications are based on clever ways of constructing a large dataset from already existing sources in order to avoid the cost of labeling training examples. However, in settings such as content moderation with rapidly changing distributions without automatic ground-truth feedback, this may not be possible. If an algorithm has to pay for reward information, for example by asking a person for feedback, how does this change the exploration/exploitation tradeoff? We study Bandits with Costly Reward Observations, where a cost needs to be paid in order to observe the reward of the bandit's action. We show the impact of the observation cost on the regret by proving an $\Omega(c^{1/3}T^{2/3})$ lower bound, present a general non-adaptive algorithm which matches the lower bound, and present several competitive adaptive algorithms.
Aaron Tucker · Caleb Biddulph · Claire Wang · Thorsten Joachims

- RobustAugMix: Joint Optimization of Natural and Adversarial Robustness (Poster)
Machine learning models often suffer performance degradation when faced with corrupted data.
In this work, we explore a technique that combines a data augmentation strategy (AugMix) with adversarial training, in order to increase robustness to both natural and adversarial forms of data corruption.
Josue Martinez-Martinez · Olivia Brown

- Pre-training Robust Feature Extractor Against Clean-label Data Poisoning Attacks (Poster)
In the transfer learning paradigm, models pre-trained on large datasets are employed as foundation models in various downstream tasks. However, this paradigm exposes downstream practitioners to data poisoning threats. Poisoning attackers craft malicious samples on foundation models, then inject these samples into re-training datasets to manipulate the behaviors of models at inference. In this work, we propose an upstream defense strategy that significantly reduces the success rate of various data poisoning attacks. Our defense aims to pre-train robust foundation models by reducing adversarial feature distance and increasing inter-category feature distance. Experiments demonstrate the excellent defense performance of the proposed strategy towards state-of-the-art clean-label attacks in the transfer learning setting.
Ting Zhou · Hanshu Yan · Lei Liu · Jingfeng Zhang · Bo Han

- MoAT: Meta-Evaluation of Anti-Malware Trustworthiness (Poster)
Many studies have proposed methods for the automated detection of malware. The benchmarks used for evaluating these methods often vary, hindering a trustworthy comparative analysis of models. We analyzed the evaluation criteria of over 100 malware detection methods from 2018-2022 in order to understand the current state of malware detection. From our study, we devised several criteria for benchmarking future malware detection methods. Our findings indicate that a finer-grained class balance in datasets is necessary to ensure the robustness of models. In addition, a metric robust to distribution shifts, e.g.
AUC, should be used in future studies to prevent the inflation of results in unrealistic distribution regimes. The composition of datasets should also be disclosed in order to ensure a fair comparison of models. To our knowledge, this study is the first to assess the trustworthiness of evaluations from multi-domain malware detection methods. Sharon Lin · Marc Fyrbiak · Christof Paar 🔗 - Cold Posteriors through PAC-Bayes (Poster) []  We investigate the cold posterior effect through the lens of PAC-Bayes generalization bounds. We argue that in the non-asymptotic setting, when the number of training samples is (relatively) small, discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data. Instead, out-of-sample error is better described through a generalization bound. In this context, we explore the connections of the ELBO objective from variational inference and the PAC-Bayes objectives. We note that, while the ELBO and PAC-Bayes objectives are similar, the latter objectives naturally contain a temperature parameter $\lambda$ which is not restricted to be $\lambda=1$. For realistic classification tasks, in the case of Laplace approximations to the posterior, we show how this PAC-Bayesian interpretation of the temperature parameter captures important aspects of the cold posterior effect. Konstantinos Pitas · Julyan Arbel 🔗 - How Sure to Be Safe? Difficulty, Confidence and Negative Side Effects (Poster) []  A principal concern for AI systems is the occurrence of negative side effects, such as a robot cleaner breaking a vase. This is critical when these systems use machine learning models that were trained to maximise performance, without knowledge or feedback about the negative side effects. 
Within Vase World and SafeLife, two safety benchmarking domains, we analyse side effects during operation and demonstrate that their magnitude is influenced by task difficulty. Using two forms of confidence measure, we demonstrate that wrapping existing RL agents with safety policies that activate when the agent's confidence falls below a specified threshold extends the Pareto frontier of both performance and safety. John Burden · José Hernández-Orallo · Sean O hEigeartaigh 🔗 - Towards Defining Deception in Structural Causal Games (Poster) []  Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals. There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a functional definition of deception in structural causal games, grounded in the philosophical literature. We present several examples to establish that our formal definition captures philosophical and commonsense desiderata for deception. Francis Ward 🔗 - System Safety Engineering for Social and Ethical ML Risks: A Case Study (Poster) Governments, industry, and academia have undertaken efforts to identify and mitigate harms in ML-driven systems, with a particular focus on social and ethical risks of ML components in complex sociotechnical systems. However, existing approaches are largely disjointed, ad-hoc and of unknown effectiveness. Systems safety engineering is a well established discipline with a track record of identifying and managing risks in many complex sociotechnical domains. We adopt the natural hypothesis that tools from this domain could serve to enhance risk analyses of ML in its context of use. 
To test this hypothesis, we apply a "best of breed" systems safety analysis, Systems Theoretic Process Analysis (STPA), to a specific high-consequence system with an important ML-driven component, namely the Prescription Drug Monitoring Programs (PDMPs) operated by many US States, several of which rely on an ML-derived risk score. We focus in particular on how this analysis can extend to identifying social and ethical risks and developing concrete design-level controls to mitigate them. Edgar Jatho · Logan Mailloux · Shalaleh Rismani · Eugene Williams · Joshua Kroll 🔗 - Quantifying Misalignment Between Agents (Poster) []  Growing concerns about the AI alignment problem have emerged in recent years, with previous work focusing mostly on (1) qualitative descriptions of the alignment problem; (2) attempting to align AI actions with human interests by focusing on value specification and learning; and/or (3) focusing on either a single agent or on humanity as a singular unit. However, the field as a whole lacks a systematic understanding of how to specify, describe and analyze misalignment among entities, which may include individual humans, AI agents, and complex compositional entities such as corporations, nation-states, and so forth. Prior work on controversy in computational social science offers a mathematical model of contention among populations (of humans). In this paper, we adapt this contention model to the alignment problem, and show how viewing misalignment can vary depending on the population of agents (human or otherwise) being observed as well as the domain or "problem area" in question. Our model departs from value specification approaches and focuses instead on the morass of complex, interlocking, sometimes contradictory goals that agents may have in practice. We discuss the implications of our model and leave more thorough verification for future work. 
Aidan Kierans · Hananel Hazan · Shiri Dori-Hacohen 🔗 - Lower Bounds on 0-1 Loss for Multi-class Classification with a Test-time Attacker (Poster) []  Finding classifiers robust to adversarial examples is critical for their safe deployment. Determining the robustness of the best possible classifier under a given threat model and comparing it to that achieved by state-of-the-art training methods is thus an important diagnostic tool. In this paper, we find achievable information-theoretic lower bounds on loss in the presence of a test-time attacker for multi-class classifiers on any discrete dataset. We provide a general framework for computing lower bounds on 0-1 loss based on solving a linear program (LP). This LP is constructed based on what we introduce as a conflict hypergraph, and we explore different settings in the construction of this hypergraph and their impact on the computed lower bound. Our work enables, for the first time, an analysis of the gap to optimal robustness for classifiers in the multi-class setting. Sihui Dai · Wenxin Ding · Arjun Nitin Bhagoji · Daniel Cullina · Prateek Mittal · Ben Zhao 🔗 - HEAT: Hybrid Energy Based Model in the Feature Space for Out-of-Distribution Detection (Poster) []  In this work, we study the out-of-distribution (OOD) detection problem through the use of the feature space of a pre-trained deep classifier. We show that learning the density of in-distribution (ID) features with an energy-based model (EBM) leads to competitive detection results. However, we found that the non-mixing of MCMC sampling during the EBM's training undermines its detection performance. To overcome this, we introduce HEAT, an energy-based correction of a mixture of class-conditional Gaussian distributions. We show that HEAT obtains favorable results when compared to a strong baseline like the KNN detector on the CIFAR-10/CIFAR-100 OOD detection benchmarks. 
Marc Lafon · Clément Rambour · Nicolas THOME 🔗 - Fine-grain Inference on Out-of-Distribution Data with Hierarchical Classification (Poster) []  Machine learning methods must be trusted to make appropriate decisions in real-world environments, even when faced with out-of-distribution (OOD) samples. Many current approaches simply aim to detect OOD examples and alert the user when an unrecognized input is given. However, when the OOD sample significantly overlaps with the training data, a binary anomaly detection is not interpretable or explainable, and provides little information to the user. We propose a new model for OOD detection that makes predictions at varying levels of granularity: as the inputs become more ambiguous, the model predictions become coarser and more conservative. Randolph Linderman · Jingyang Zhang · Nathan Inkawhich · Hai Li · Yiran Chen 🔗 - Indiscriminate Data Poisoning Attacks on Neural Networks (Poster) []  []  Data poisoning attacks, in which a malicious adversary aims to influence a model by injecting "poisoned" data into the training process, have attracted significant recent attention. In this work, we take a closer look at existing poisoning attacks and connect them with old and new algorithms. By choosing an appropriate loss function for the attacker and optimizing with algorithms that exploit second-order information, we design poisoning attacks that are effective on neural networks. We present efficient implementations by parameterizing the attacker and allowing simultaneous and coordinated generation of tens of thousands of poisoned points, in contrast to existing methods that generate poisoned points one by one. We further perform extensive experiments that empirically explore the effect of data poisoning attacks on deep neural networks. Our paper sets up a new benchmark on the possibility of performing indiscriminate data poisoning attacks on modern neural networks. 
Yiwei Lu · Gautam Kamath · Yaoliang Yu 🔗 - Mitigating Lies in Vision-Language Models (Poster) []  In this work, we bring new insights into the honesty of vision-language models, particularly in visual question answering (VQA). After a thorough revisit of the existing ‘lie’ behavior in pure language models, our work makes an unprecedented extension of ‘lies’ to vision-language models. The results indicate that the lie prefixes have a more obvious misleading effect on vision-language models than on language models. We also propose a novel visual prefix and prove that the consistent vision-language prefix is more threatening to vision-language models. To defend the models from the stated ‘lies’, we put forward an unsupervised framework based on Gaussian mixture modeling and obtain improvements of 3% against the language prefix and 12% against the vision-language prefix. Junbo Li · Xianhang Li · Cihang Xie 🔗 - Risk-aware Bayesian Reinforcement Learning for Cautious Exploration (Poster) []  This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that the safety constraint violations are bounded at any point during learning. Whilst enforcing safety during training might limit the agent's exploration, we propose a new architecture that handles the trade-off between efficient progress in exploration and safety maintenance. As the agent's exploration progresses, we update Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the agent's behavior within the environment by means of Bayesian inference. We then propose a way to approximate moments of the agent's belief about the risk associated with the agent's behavior originating from local action selection. We demonstrate that this approach can be easily coupled with RL, we provide rigorous theoretical guarantees, and we present experimental results to showcase the performance of the overall architecture. 
Rohan Mitta · Hosein Hasanbeig · Daniel Kroening · Alessandro Abate 🔗 - The Expertise Problem: Learning from Specialized Feedback (Poster) []  []  Reinforcement learning from human feedback (RLHF) is a powerful technique for training agents to perform difficult-to-specify tasks. However, human feedback can be noisy, particularly when human teachers lack relevant knowledge or experience. Levels of expertise vary across teachers, and a given teacher may have differing levels of expertise for different components of a task. RLHF algorithms that learn from multiple teachers therefore face an expertise problem: the reliability of a given piece of feedback depends both on the teacher that it comes from and how specialized that teacher is on relevant components of the task. Existing state-of-the-art RLHF algorithms assume that all evaluations come from the same distribution, obscuring this inter- and intra-human variance, and preventing them from accounting for or taking advantage of variations in expertise. We formalize this problem, implement it as an extension of an existing RLHF benchmark, evaluate the performance of a state-of-the-art RLHF algorithm, and explore techniques to improve query and teacher selection. Our key contribution is to demonstrate and characterize the expertise problem, and to provide an open-source implementation for testing future solutions. Oliver Daniels-Koch · Rachel Freedman 🔗 - Cryptographic Auditing for Collaborative Learning (Poster) []  []  Collaborative machine learning paradigms based on secure multi-party computation have emerged as a compelling alternative for sensitive applications in the last few years. These paradigms promise to unlock the potential of important data silos that are currently hard to access and compute across due to privacy concerns and regulatory policies (e.g., health and financial sectors). Although collaborative machine learning provides many privacy benefits, it makes sacrifices in terms of robustness. 
It opens the learning process to the possibility of an active malicious participant who can covertly influence the model’s behavior. As these systems are being deployed for a range of sensitive applications, their robustness is increasingly important. To date, no compelling solution exists that fully addresses the robustness of secure collaborative learning paradigms. As the robustness of these learning paradigms remains an open challenge, it is necessary to augment these systems with measures that strengthen their reliability at deployment time. This paper describes our efforts in developing privacy-preserving auditing mechanisms for secure collaborative learning. We focus on audits that allow tracing the source of integrity issues back to the responsible party, providing a technical path towards accountability in these systems. Hidde Lycklama · Nicolas Küchler · Alexander Viand · Emanuel Opel · Lukas Burkhalter · Anwar Hithnawi 🔗 - Certifiable Metric One Class Learning with adversarially trained Lipschitz Classifier (Poster) []  We propose a new Novelty Detection and One Class classifier, based on the smoothness properties of orthogonal neural networks, and on the properties of the Hinge Kantorovich Rubinstein (HKR) function. The classifier benefits from robustness certificates against $\ell_2$-attacks thanks to the Lipschitz constraint, whilst the HKR loss allows provable approximation of the signed distance function to the boundary of the distribution: the normality score induced by the classifier has a meaningful interpretation in terms of distance to the support. Finally, gradient steps in the input space allow free generation of samples from the one class in a fashion reminiscent of GANs or VAEs. Louis Béthune · Mathieu Serrurier 🔗 - An Adversarial Robustness Perspective on the Topology of Neural Networks (Poster) []  []  In this paper, we investigate the impact of NN topology on adversarial robustness. 
Specifically, we study the graph produced when an input traverses all the layers of a NN, and show that such graphs are different for clean and adversarial inputs. We find that graphs from clean inputs are more centralized around highway edges, whereas those from adversaries are more diffuse, leveraging under-optimized edges. Through experiments on a variety of datasets and architectures, we show that these under-optimized edges are a source of vulnerability and that they can be used to detect adversarial inputs. Morgane Goibert · Elvis Dohmatob · Thomas Ricatte 🔗 - Falsehoods that ML researchers believe about OOD detection (Poster) []  []  An intuitive way to detect out-of-distribution (OOD) data is via the density function of a fitted probabilistic generative model: points with low density may be classed as OOD. But this approach has been found to fail in deep learning settings. In this paper, we list some falsehoods that machine learning researchers believe about density-based OOD detection. Many recent works have proposed likelihood-ratio-based methods to 'fix' the problem. We propose a framework, the OOD proxy framework, to unify these methods, and we argue that likelihood ratio is a principled method for OOD detection and not a mere 'fix'. Finally, we discuss the relationship between domain discrimination and semantics. Andi Zhang · Damon Wischik 🔗 - Ignore Previous Prompt: Attack Techniques For Language Models (Poster) []  []  Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. 
In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject is available at https://github.com/agencyenterprise/PromptInject. Fabio Perez · Ian Ribeiro 🔗 - Towards Adversarial Purification using Denoising AutoEncoders (Poster) []  With the rapid advancement and increased use of deep learning models in image identification, security becomes a major concern to their deployment in safety-critical systems. Deep learning architectures are often susceptible to adversarial attacks, which are typically obtained by making subtle perturbations to normal images; these perturbations are mostly imperceptible to humans, but can seriously confuse state-of-the-art machine learning models. We propose a framework, named APuDAE, leveraging Denoising AutoEncoders (DAEs) to purify these samples by using them in an adaptive way and thus improve the classification accuracy of the target classifier networks. We also show how using DAEs adaptively instead of directly improves classification accuracy further and is more robust to the possibility of designing adaptive attacks to fool them. We demonstrate our results on the MNIST, CIFAR-10, and ImageNet datasets and show how our framework APuDAE provides performance comparable to, and in most cases better than, the baseline methods in purifying adversaries. Dvij Kalaria · Aritra Hazra · Partha Chakrabarti 🔗 - Continual Poisoning of Generative Models to Promote Catastrophic Forgetting (Poster) []  []  Generative models have grown into the workhorse of many state-of-the-art machine learning methods. However, their vulnerability under poisoning attacks has been largely understudied. In this work, we investigate this issue in the context of continual learning, where generative replayers are utilized to tackle catastrophic forgetting. 
By developing a novel customization of dirty-label input-aware backdoor to the online setting, our attacker manages to stealthily promote forgetting while retaining high accuracy at the current task and sustaining strong defenders. Our approach taps into an intriguing property of generative models, namely that they cannot well capture input-dependent triggers. Experiments on four standard datasets corroborate the poisoner's effectiveness. Siteng Kang · Xinhua Zhang 🔗 - Adversarial Attacks on Transformers-Based Malware Detectors (Poster) []  []  Signature-based malware detectors have proven to be insufficient as even a small change in malignant executable code can bypass these signature-based detectors. Many machine learning-based models have been proposed to efficiently detect a wide variety of malware. Many of these models are found to be susceptible to adversarial attacks - attacks that work by generating intentionally designed inputs that can force these models to misclassify. Our work aims to explore vulnerabilities in the current state of the art malware detectors to adversarial attacks. We train a Transformers-based malware detector, carry out adversarial attacks resulting in a misclassification rate of 23.9% and propose defenses that reduce this misclassification rate to half. An implementation of our work can be found at https://github.com/yashjakhotiya/Adversarial-Attacks-On-Transformers. Yash Jakhotiya · Heramb Patil · Jugal Rawlani 🔗 - A Cooperative Reinforcement Learning Environment for Detecting and Penalizing Betrayal (Poster) []  []  In this paper we present a Reinforcement Learning environment that leverages agent cooperation and communication, aimed at detection, learning and ultimately penalizing betrayal patterns that emerge in the behavior of self-interested agents. We provide a description of game rules, along with interesting cases of betrayal and trade-offs that arise. 
Preliminary experimental investigations illustrate a) betrayal emergence, b) deceptive agents outperforming honest baselines and c) betrayal detection based on classification of behavioral features, which surpasses probabilistic detection baselines. Finally, we propose approaches for penalizing betrayal, list enhancements and directions for future work and suggest interesting extensions of the environment towards capturing and exploring increasingly complex patterns of social interactions. Nikiforos Pittaras 🔗 - REAP: A Large-Scale Realistic Adversarial Patch Benchmark (Poster) []  []  Machine learning models are known to be susceptible to adversarial perturbation. One famous attack is the adversarial patch, a sticker with a crafted pattern that makes the model incorrectly predict the object it is placed on. This attack presents a critical threat to cyber-physical systems such as autonomous cars. Despite the significance of the problem, conducting research in this setting has been difficult; evaluating attacks and defenses in the real world is exceptionally costly while synthetic data are unrealistic. In this work, we propose the REAP (REalistic Adversarial Patch) Benchmark, a digital benchmark that allows the user to evaluate patch attacks on real images, and under real-world conditions. Built on top of the Mapillary Vistas dataset, our benchmark contains over 14,000 traffic signs. Each sign is augmented with a pair of geometric and lighting transformations, which can be used to apply a digitally generated patch realistically onto the sign, while matching real-world conditions. Using our benchmark, we perform the first large-scale assessments of adversarial patch attacks under realistic conditions. We release our benchmark publicly at https://github.com/wagner-group/reap-benchmark. 
Nabeel Hingun · Chawin Sitawarin · Jerry Li · David Wagner 🔗 - Adversarial Policies Beat Professional-Level Go AIs (Poster) []  []  We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win-rate against KataGo without search, and a >50% win-rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo; in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes. Our results demonstrate that AI systems which are normally superhuman may still be less robust than humans. Example games are available at https://goattack.alignmentfund.org/ Tony Wang · Adam Gleave · Nora Belrose · Tom Tseng · Michael Dennis · Yawen Duan · Viktor Pogrebniak · Joseph Miller · Sergey Levine · Stuart Russell 🔗 - A Deep Dive into Dataset Imbalance and Bias in Face Identification (Poster) []  []  As the deployment of automated face recognition (FR) systems proliferates, bias in these systems is not just an academic question, but a matter of public concern. Media portrayals often center imbalance as the main source of bias, i.e., that FR models perform worse on images of non-white people or women because these demographic groups are underrepresented in training data. Recent academic research paints a more nuanced picture of this relationship. 
However, previous studies of data imbalance in FR have focused exclusively on the face verification setting, while the face identification setting has been largely ignored, despite being deployed in sensitive applications such as law enforcement. This is an unfortunate omission, as 'imbalance' is a more complex matter in identification; imbalance may arise in not only the training data, but also the testing data, and furthermore may affect the proportion of identities belonging to each demographic group or the number of images belonging to each identity. In this work, we address this gap in the research by thoroughly exploring the effects of each kind of imbalance possible in face identification, and discuss other factors which may impact bias in this setting. Valeriia Cherepanova · Steven Reich · Samuel Dooley · Hossein Souri · John Dickerson · Micah Goldblum · Tom Goldstein 🔗 - Part-Based Models Improve Adversarial Robustness (Poster) []  []  We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks by introducing a part-based model for object classification. We believe that the richer form of annotation helps guide neural networks to learn more robust features without requiring more samples or larger models. Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to simultaneously segment objects into parts and then classify the segmented object. Empirically, our part-based models achieve both higher accuracy and higher adversarial robustness than a ResNet-50 baseline on all three datasets. For instance, the clean accuracy of our part models is up to 15 percentage points higher than the baseline’s, given the same level of robustness. Our experiments indicate that these models also reduce texture bias and yield better robustness against common corruptions and spurious correlations. The code is publicly available at https://github.com/chawins/adv-part-model. 
Chawin Sitawarin · Kornrapat Pongmala · Yizheng Chen · Nicholas Carlini · David Wagner 🔗 - Smoothed-SGDmax: A Stability-Inspired Algorithm to Improve Adversarial Generalization (Poster) []  []  Unlike standard training, deep neural networks can suffer from serious overfitting problems in adversarial settings. Recent research [40,39] suggests that adversarial training can have nonvanishing generalization error even if the sample size $n$ goes to infinity. A natural question arises: can we eliminate the generalization error floor in adversarial training? This paper gives an affirmative answer. First, by an adaptation of information-theoretical lower bound on the complexity of solving Lipschitz-convex problems using randomized algorithms, we establish a minimax lower bound $\Omega(s(T)/n)$ given a training loss of $1/s(T)$ for the adversarial generalization gap, where $T$ is the number of iterations, and $s(T)\rightarrow+\infty$ as $T\rightarrow+\infty$. Next, by observing that the nonvanishing generalization error of existing adversarial training algorithms comes from the non-smoothness of the adversarial loss function, we employ a smoothing technique to smooth the adversarial loss function. Based on the smoothed loss function, we design a smoothed SGDmax algorithm achieving a generalization bound $\mathcal{O}(s(T)/n)$, which eliminates the generalization error floor and matches the minimax lower bound. Experimentally, we show that our algorithm improves adversarial generalization on common datasets. Jiancong Xiao · Jiawei Zhang · Zhiquan Luo · Asuman Ozdaglar 🔗 - Hidden Poison: Machine unlearning enables camouflaged poisoning attacks (Poster) []  We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings when model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that the impact on the model's predictions is minimal. 
The adversary subsequently triggers a request to remove a subset of the introduced points at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset. Jimmy Di · Jack Douglas · Jayadev Acharya · Gautam Kamath · Ayush Sekhari 🔗 - DrML: Diagnosing and Rectifying Vision Models using Language (Poster) []  []  Recent multi-modal contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers, by leveraging the rich information in large-scale image-caption datasets. Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language. The traditional process of diagnosing model behaviors in deployment settings involves labor-intensive data acquisition and annotation. Our proposed method, DrML, can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors, without requiring any visual data. Through a combination of theoretical explanation and empirical verification, we present conditions under which classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality. On a range of image datasets with known error slices, we demonstrate that our method can effectively identify the error slices and influential attributes, and can further use language to rectify failure modes of the classifier. Yuhui Zhang · Jeff Z. 
HaoChen · Shih-Cheng Huang · Kuan-Chieh Wang · James Zou · Serena Yeung 🔗 - Deceiving the CKA Similarity Measure in Deep Learning (Poster) []  []  Understanding the behaviour of trained deep neural networks is a critical step in allowing reliable deployment of these networks in critical applications. One direction for obtaining insights on neural networks is through comparison of their internal representations. Comparing neural representations in neural networks is thus a challenging but important problem, which has been approached in different ways. The Centered Kernel Alignment (CKA) similarity metric, particularly its linear variant, has recently become a popular approach and has been widely used to compare representations of a network's different layers, of architecturally similar networks trained differently, or of models with different architectures trained on the same data. A wide variety of conclusions about similarity and dissimilarity of these various representations have been made using CKA. In this work we present an analysis that formally characterizes CKA sensitivity to a large class of simple transformations, which can naturally occur in the context of modern machine learning. This provides a concrete explanation of CKA sensitivity to outliers and to transformations that preserve the linear separability of the data, an important generalization attribute. Finally we propose an optimization-based approach for modifying representations to maintain functional behaviour while changing the CKA value. Our results illustrate that, in many cases, the CKA value can be easily manipulated without substantial changes to the functional behaviour of the models, and call for caution when leveraging activation alignment metrics. MohammadReza Davari · Stefan Horoi · Amine Natik · Guillaume Lajoie · Guy Wolf · Eugene Belilovsky 🔗 - A Mechanistic Lens on Mode Connectivity (Poster) []  With the rise of pretrained models, fine-tuning has become increasingly important. 
However, naive fine-tuning often does not eliminate a model's sensitivity to spurious cues. To understand and address this limitation, we study the geometry of neural network loss landscapes through the lens of mode-connectivity. We tackle two questions: 1) Are models trained on different distributions mode-connected? 2) Can we fine-tune a pre-trained model to switch modes? We define a notion of mechanistic similarity based on shared invariances and show linearly-connected modes are mechanistically similar. We find naive fine-tuning yields linearly connected solutions and hence is unable to induce relevant invariances. We also propose and validate a method of "mechanistic fine-tuning" based on our gained insights. Ekdeep S Lubana · Eric Bigelow · Robert Dick · David Krueger · Hidenori Tanaka 🔗 - Visual Prompting for Adversarial Robustness (Poster) []  []  In this work, we leverage visual prompting (VP) to improve adversarial robustness of a fixed, pre-trained model at testing time. Compared to conventional adversarial defenses, VP allows us to design universal (i.e., data-agnostic) input prompting templates, which have plug-and-play capabilities at testing time to achieve desired model performance without introducing much computation overhead. Although VP has been successfully applied to improving model generalization, it remains elusive whether and how it can be used to defend against adversarial attacks. We investigate this problem and show that the vanilla VP approach is not effective in adversarial defense since a universal input prompt lacks the capacity for robust learning against sample-specific adversarial perturbations. To circumvent it, we propose a new VP method, termed Class-wise Adversarial Visual Prompting (C-AVP), to generate class-wise visual prompts so as to not only leverage the strengths of ensemble prompts but also optimize their interrelations to improve model robustness. 
Our experiments show that C-AVP outperforms the conventional VP method, with 2.1X standard accuracy gain and 2X robust accuracy gain. Compared to classical test-time defenses, C-AVP also yields a 42X inference time speedup. Aochuan Chen · Peter Lorenz · Yuguang Yao · Pin-Yu Chen · Sijia Liu 🔗 - Identification of the Adversary from a Single Adversarial Example (Poster) []  Deep neural networks have been shown vulnerable to adversarial examples. Even though many defence methods have been proposed to enhance the robustness, it is still a long way toward providing an attack-free method to build a trustworthy machine learning system. In this paper, instead of enhancing the robustness, we take the investigator's perspective and propose a new framework to trace the first compromised model in a forensic investigation manner. Specifically, we focus on the following setting: the machine learning service provider provides models for a set of customers. However, one of the customers conducted adversarial attacks to fool the system. Therefore, the investigator's objective is to identify the first compromised model by collecting and analyzing evidence from only available adversarial examples. To make the tracing viable, we design a random mask watermarking mechanism to differentiate adversarial examples from different models. First, we propose a tracing approach in the data-limited case where the original example is also available. Then, we design a data-free approach to identify the adversary without accessing the original example. Finally, the effectiveness of our proposed framework is evaluated by extensive experiments with different model architectures, adversarial attacks, and datasets. Minhao Cheng · Rui Min 🔗 - Mitigating Dataset Bias by Using Per-sample Gradient (Poster) []  []  The performance of deep neural networks is strongly influenced by the training dataset setup. 
In particular, when attributes having a strong correlation with the target attribute are present, the trained model can provide unintended prejudgments and show significant inference errors (i.e., the dataset bias problem). Various methods have been proposed to mitigate dataset bias, and their emphasis is on weakly correlated samples, called bias-conflicting samples. These methods are based on explicit bias labels provided by humans. However, such methods incur human annotation costs. Recently, several studies have tried to reduce human intervention by utilizing the output space values of neural networks, such as feature space, logits, loss, or accuracy. However, these output space values may be insufficient for the model to understand the bias attributes well. In this study, we propose a debiasing algorithm leveraging per-sample gradients, called PGD (Per-sample Gradient-based Debiasing). PGD comprises three steps: (1) training a model on uniform batch sampling, (2) setting the importance of each sample in proportion to the norm of the sample gradient, and (3) training the model using importance-batch sampling, whose probability is obtained in step (2). Compared with existing baselines for various datasets, the proposed method showed state-of-the-art accuracy for the classification task. Sumyeong Ahn · SeongYoon Kim · Se-Young Yun 🔗 - A General Framework for Safe Decision Making: A Convex Duality Approach (Poster) []  We study the problem of online interaction in general decision making problems, where the objective is not only to find optimal strategies, but also to satisfy some safety guarantees, expressed in terms of costs accrued. We propose a theoretical framework to address such problems and present BAN-SOLO, a UCB-like algorithm that, in an online interaction with an unknown environment, attains sublinear regret of order O(T^{1/2}) and plays safely with high probability at each iteration. 
At its core, BAN-SOLO relies on tools from convex duality to manage environment exploration while satisfying the safety constraints imposed by the problem. Martino Bernasconi · Federico Cacciamani · Nicola Gatti · Francesco Trovò 🔗 - A Unifying Framework for Online Safe Optimization (Poster) []  We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints. The goal of the decision maker is to maximize their total reward, while at the same time achieving small cumulative constraints violation across the $T$ rounds. We present the first best-of-both-world type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown stochastic model, and in the case in which they are selected at each round by an adversary. Our algorithm is the first to provide guarantees in the adversarial setting with respect to the optimal fixed strategy that satisfies the long-term constraints. In particular, it guarantees a $\rho/(1+\rho)$ fraction of the optimal reward and sublinear regret, where $\rho$ is a feasibility parameter related to the existence of strictly feasible solutions. Our framework employs traditional regret minimizers as black-box components. Therefore, by instantiating it with an appropriate choice of regret minimizers it can handle the full-feedback as well as the bandit-feedback setting. Moreover, it allows the decision maker to seamlessly handle scenarios with non-convex rewards and constraints. We show how our framework can be applied in the context of budget-management mechanisms for repeated auctions in order to guarantee long-term constraints that are not packing (e.g., ROI constraints). 
Matteo Castiglioni · Andrea Celli · Alberto Marchesi · Giulia Romano · Nicola Gatti 🔗 - Targeted Adversarial Self-Supervised Learning (Poster) []  []  Recently, unsupervised adversarial training (AT) has been extensively studied to attain robustness with the models trained upon unlabeled data. To this end, previous studies have applied existing supervised adversarial training techniques to self-supervised learning (SSL) frameworks. However, all have resorted to untargeted adversarial learning, as it is unclear how to obtain targeted adversarial examples in the SSL setting, which lacks label information. In this paper, we propose a novel targeted adversarial training method for the SSL frameworks. Specifically, we propose a target selection algorithm for the adversarial SSL frameworks; it is designed to select the most confusing sample for each given instance based on similarity and entropy, and perturb the given instance toward the selected target sample. Our method is readily applicable to general SSL frameworks that only use positive pairs. We validate our method on benchmark datasets, on which it obtains superior robust accuracies, outperforming existing unsupervised adversarial training methods. Minseon Kim · Hyeonjeong Ha · Sooel Son · Sung Ju Hwang 🔗 - Canary in a Coalmine: Better Membership Inference with Ensembled Adversarial Queries (Poster) []  []  As industrial applications are increasingly automated by machine learning models, enforcing personal data ownership and intellectual property rights requires tracing training data back to their rightful owners. Membership inference algorithms approach this problem by using statistical techniques to discern whether a target sample was included in a model's training set. However, existing methods only utilize the unaltered target sample or simple augmentations of the target to compute statistics. Such a sparse sampling of the model's behavior carries little information, leading to poor inference capabilities. 
In this work, we use adversarial tools to directly optimize for queries that are discriminative and diverse. Our improvements achieve significantly more accurate membership inference than existing methods, especially in offline scenarios and in the low false-positive regime which is critical in legal settings. Yuxin Wen · Arpit Bansal · Hamid Kazemi · Eitan Borgnia · Micah Goldblum · Jonas Geiping · Tom Goldstein 🔗 - Can Large Language Models Truly Follow your Instructions? (Poster) []  In this work, to test the capabilities of large language models on truly following the given instructions, we evaluate 9 common NLP benchmarks with negated instructions on (1) pretrained LMs (OPT & GPT-3) of varying sizes (125M - 175B), (2) LMs further pretrained to generalize to novel instructions (InstructGPT), (3) LMs provided with few-shot examples, and (4) LMs fine-tuned specifically on negated instructions; all LM types perform worse on negated instructions as they scale and show a huge gap relative to human performance when comparing the average score on both original and negated instructions. By highlighting a critical limitation of existing LMs and methods, we urge the community to develop new approaches to developing LMs that actually follow the given instructions in order to prevent catastrophic consequences that may occur if we prematurely endow LMs with real-world responsibilities. Joel Jang · Seonghyeon Ye · Minjoon Seo 🔗 - Broken Neural Scaling Laws (Poster) []  We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, or upstream performance varies) for each task within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. 
This set includes large-scale vision and unsupervised language tasks, diffusion generative modeling of images, arithmetic, and reinforcement learning. When compared to other functional forms for neural scaling behavior, this functional form yields extrapolations of scaling behavior that are considerably more accurate (root mean squared log error of its extrapolations are 0.86 times that of previous state-of-the-art on average) on this set. Moreover, this functional form accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Code is available at https://github.com/ethancaballero/brokenneuralscaling_laws Ethan Caballero · kshitij Gupta · Irina Rish · David Krueger 🔗 - Do Domain Generalization Methods Generalize Well? (Poster) []  []  Domain Generalization (DG) methods use data from multiple related source domains to learn models whose performance does not degrade on unseen domains at test time. Many DG algorithms rely on reducing the divergence between the source distributions in a representation space to potentially align unseen domains close to the sources. These algorithms are motivated by the analytical works that explain generalization to unseen domains based on their distributional distance (e.g., Wasserstein distance) to the sources. However, we show that the accuracy of a DG model varies significantly on unseen domains equidistant from the sources in the learned representation space. This makes it hard to gauge the generalization performance of DG models only based on their performance on benchmark datasets. 
Thus, we study the worst-case loss of a DG model at a particular distance from the sources and propose an evaluation methodology based on distributionally robust optimization that efficiently computes the worst-case loss on all distributions within a Wasserstein ball around the sources. Our results show that models trained with popular DG methods incur a high worst-case loss even close to the sources, which shows their lack of generalization to unseen domains. Moreover, we observe a large gap between the worst-case and the empirical losses of distributions at the same distance, showing the performance of the DG models on benchmark datasets is not representative of their performance on unseen domains. Thus, our (target) data-independent and worst-case loss-based methodology highlights the poor generalization performance of current DG models and provides insights beyond empirical evaluation on benchmark datasets for improving these models. Akshay Mehra · Bhavya Kailkhura · Pin-Yu Chen · Jihun Hamm 🔗 - What You See is What You Get: Principled Deep Learning via Distributional Generalization (Poster) []  []  Having similar behavior at train-time and test-time, what we call a "What You See Is What You Get (WYSIWYG)" property, is desirable in machine learning. However, models trained with standard stochastic gradient descent (SGD) are known to not capture it. Their behaviors such as subgroup performance, or adversarial robustness can be very different during training and testing. We show that Differentially-Private (DP) training provably ensures the high-level WYSIWYG property, which we quantify using a notion of Distributional Generalization (DG). Applying this connection, we introduce new conceptual tools for designing deep-learning methods by reducing generalization concerns to optimization ones: to mitigate unwanted behavior at test time, it is provably sufficient to mitigate this behavior on the train datasets. 
By applying this novel design principle, which bypasses “pathologies” of SGD, we construct simple algorithms that are competitive with SOTA in several distributional robustness applications, significantly improve the privacy vs. disparate impact tradeoff of DP-SGD, and mitigate robust overfitting in adversarial training. Finally, we also improve on known theoretical bounds relating DP, stability, and distributional generalization. Bogdan Kulynych · Yao-Yuan Yang · Yaodong Yu · Jaroslaw Blasiok · Preetum Nakkiran 🔗 - Adversarial poisoning attacks on reinforcement learning-driven energy pricing (Poster) []  []  Reinforcement learning (RL) has emerged as a strong candidate for implementing complex controls in energy systems, such as energy pricing in microgrids. But what happens when some of the microgrid controllers are compromised by a malicious entity? We demonstrate a novel attack in RL. Our attack perturbs each trajectory to reverse the direction of the estimated gradient. We demonstrate that if data from a small fraction of microgrid controllers is adversarially perturbed, the learning of the RL agent can be significantly slowed or (with larger perturbations) caused to operate at a loss. Prosumers also face higher energy costs, use their batteries less, and suffer from higher peak demand when the pricing aggregator is adversarially poisoned. We address this vulnerability with a “defense” module; i.e., a “robustification” of RL algorithms against this attack. Our defense identifies the trajectories with the largest influence on the gradient and removes them from the training data. Sam Gunn · Doseok Jang · Orr Paradise · Lucas Spangher · Costas J Spanos 🔗 - OOD Detection with Class Ratio Estimation (Poster) []  []  Density-based Out-of-distribution (OOD) detection has recently been shown to be unreliable for the task of detecting OOD images. Various density ratio based approaches have achieved good empirical performance. 
However, these methods typically lack a principled probabilistic modelling explanation. In this work, we propose to unify density ratio based methods under a novel energy-based model framework that allows us to view the density ratio as the unnormalized density of an implicit semantic distribution. Further, we propose to directly estimate the density ratio of a data sample through class ratio estimation, which can achieve competitive OOD detection results without training any deep generative models. Our approach enables a simple yet effective path towards solving OOD detection problems in the image domain. Mingtian Zhang · Andi Zhang · Tim Xiao · Yitong Sun · Steven McDonagh 🔗 - Alignment as a Dynamic Process (Poster) []  []  Most learning AIs today have exogenously given and fixed aims which they gradually learn to optimize for. It has been an assumption in alignment research that artificial general intelligences of the kind that could pose an X-risk would too. On this assumption, value alignment becomes the task of finding the right set of aims before we allow the agent to act. However, an agent can also have aims that fundamentally change during their lifetime. The task of aligning such agents is not one of specifying a set of aims, but of designing a meta-function that guides the agent’s developing aims to an equilibrium that produces behaviour aligned with our human values. If artificial general intelligences would possess such dynamic aims, then this has significant implications for the kind of alignment research we should conduct today. In this paper, I argue that there is a substantial probability that artificial general intelligences would have such dynamic aims, and in response I articulate an agenda for dynamic alignment research. Paul de Font-Reaulx 🔗 - The Use of Non-epistemic Values to Account for Bias in Automated Decision Making (Poster) []  []  We consider the algorithmic shortlist problem of how to rank a list of choices for a decision. 
As the choices on a ballot are as important as the votes themselves, the decisions of who to hire, who to insure, or who to admit, are directly dependent on who is considered, who is categorized, or who meets the threshold for admittance. We frame this problem as one requiring additional non-epistemic context that we use to normalize expected values, and propose a computational model for this context based on a social-psychological model of affect in social interactions. Jesse Hoey · Gabrielle Chan · Mathieu Doucet · Christopher Risi · Freya Zhang 🔗 - Few-Shot Transferable Robust Representation Learning via Bilevel Attacks (Poster) []  []  Existing adversarial learning methods assume the availability of a large amount of data from which we can generate adversarial examples. However, in an adversarial meta-learning setting, the model needs to learn transferable robust representations for unseen domains with only a few adversarial examples, which is a very difficult goal to achieve even with a large amount of data. To tackle such a challenge, we propose a novel adversarial self-supervised meta-learning framework with bilevel attacks which aims to learn robust representations that can generalize across tasks and domains. Specifically, in the inner loop, we update the parameters of the given encoder by taking inner gradient steps using two different sets of augmented samples, and generate adversarial examples for each view by maximizing the instance classification loss. Then, in the outer loop, we meta-learn the encoder parameter to maximize the agreement between the two adversarial examples, which enables it to learn robust representations. We experimentally validate the effectiveness of our approach on unseen domain adaptation tasks, on which it achieves impressive performance. 
Specifically, our method significantly outperforms the state-of-the-art meta-adversarial learning methods on few-shot learning tasks, as well as self-supervised learning baselines in standard learning settings with large-scale datasets. Minseon Kim · Hyeonjeong Ha · Sung Ju Hwang 🔗 - What 'Out-of-distribution' Is and Is Not (Poster) []  []  Researchers want to generalize robustly to ‘out-of-distribution’ (OOD) data. Unfortunately, this term is used ambiguously causing confusion and creating risk—people might believe they have made progress on OOD data and not realize this progress only holds in limited cases. We critique a standard definition of OOD—difference-in-distribution—and then disambiguate four meaningful types of OOD data: transformed-distributions, related-distributions, complement-distributions, and synthetic-distributions. We describe how existing OOD datasets, evaluations, and techniques fit into this framework. We provide a template for researchers to carefully present the scope of distribution shift considered in their work. Sebastian Farquhar · Yarin Gal 🔗 - Adversarial Robustness of Deep Inverse Reinforcement Learning (Poster) []  Reinforcement learning research experienced substantial jumps in its progress after the first achievement on utilizing deep neural networks to approximate the state-action value function in high-dimensional states. While deep reinforcement learning algorithms are currently being employed in many different tasks from industrial control to biomedical applications, the fact that an MDP has to provide a clear reward function limits the tasks that can be achieved via reinforcement learning. In this line of research, some studies proposed to directly learn a policy from observing expert trajectories (i.e. imitation learning), and others proposed to learn a reward function from the expert demonstrations (i.e. inverse reinforcement learning). 
In this paper we will focus on robustness and vulnerabilities of deep imitation learning and deep inverse reinforcement learning policies. Furthermore, we will lay out non-robust features learnt by the deep inverse reinforcement learning policies. We conduct experiments in the Arcade Learning Environment (ALE), and compare the non-robust features learnt by the deep inverse reinforcement learning algorithms to vanilla trained deep reinforcement learning policies. We hope that our study can provide a basis for the future discussions on the robustness of both deep inverse reinforcement learning and deep reinforcement learning. Ezgi Korkmaz 🔗 - Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation (Poster) []  []  We introduce a method to measure uncertainty in large language models. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. We show that measuring uncertainty in natural language is challenging because of semantic equivalence—different sentences can mean the same thing. To overcome these challenges we introduce semantic entropy—an entropy which incorporates linguistic invariances created by shared meanings. Our method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf language models. In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines. Lorenz Kuhn · Yarin Gal · Sebastian Farquhar 🔗 - On Outlier Exposure with Generative Models (Poster) []  []  While Outlier Exposure reliably increases the performance of Out-of-Distribution detectors, it requires a set of available outliers during training. 
In this paper, we propose Generative Outlier Exposure (GOE), which alleviates the need for available outliers by using generative models to sample synthetic outliers from low-density regions of the data distribution. The approach requires no modification of the generator, works on image and text data, and can be used with pre-trained models. We demonstrate the effectiveness of generated outliers on several image and text datasets, including ImageNet. Konstantin Kirchheim · Frank Ortmeier 🔗 - An Efficient Framework for Monitoring Subgroup Performance of Machine Learning Systems (Poster) []  []  Monitoring machine learning systems post deployment is critical to ensure the reliability of the systems. Of particular importance is the problem of monitoring the performance of machine learning systems across all the data subgroups (subpopulations). In practice, this process could be prohibitively expensive as the number of data subgroups grows exponentially with the number of input features, and the process of labelling data to evaluate each subgroup's performance is costly. In this paper, we propose an efficient framework for monitoring subgroup performance of machine learning systems. Specifically, we aim to find the data subgroup with the worst performance using a limited number of labeled data. We mathematically formulate this problem as an optimization problem with an expensive black-box objective function, and then suggest using Bayesian optimization to solve this problem. Our experimental results on various real-world datasets and machine learning systems show that our proposed framework can retrieve the worst-performing data subgroup effectively and efficiently. Huong Ha 🔗 - Spectral Robustness Analysis of Deep Imitation Learning (Poster) []  Deep reinforcement learning algorithms enabled learning functioning policies in MDPs with complex state representations. 
Following these advancements, deep reinforcement learning policies have been deployed in many diverse settings. However, a line of research argued that in certain settings building a reward function can be more complicated than learning it. Hence, several studies proposed different methods to learn a reward function by observing trajectories of a functioning policy (i.e. inverse reinforcement learning). Following this line of research, several studies proposed to directly learn a functioning policy by solely observing trajectories of an expert (i.e. imitation learning). In this paper, we propose a novel method to analyze the spectral robustness of deep neural policies. We conduct several experiments in the Arcade Learning Environment, and demonstrate that simple vanilla trained deep reinforcement learning policies are more robust than deep inverse reinforcement learning policies. We believe that our method provides a comprehensive analysis on the policy robustness and can help in understanding the fundamental properties of different training techniques. Ezgi Korkmaz 🔗 - Interpretable Reward Learning via Differentiable Decision Trees (Poster) []  []  There is an increasing interest in learning rewards and models of human intent from human feedback. However, many methods use blackbox learning methods that, while expressive, are hard to interpret. We propose a novel method for learning expressive and interpretable reward functions from preference feedback using differentiable decision trees. We test our algorithm on two test domains, demonstrating the ability to learn interpretable reward functions from both low- and high-dimensional visual state inputs. Furthermore, we provide preliminary evidence that the tree structure of our learned reward functions is useful in determining the extent to which a reward function is aligned with human preferences. Akansha Kalra · Daniel S. 
Brown 🔗 - Steering Large Language Models using APE (Poster) []  []  By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model. Due to the lack of knowledge of how LLMs work, most effective prompts have been handcrafted by humans through a demanding trial and error process. To reduce the human effort involved in this alignment process, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. We treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate how well the selected instruction can steer the model to desired behavior, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. Moreover, we show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness. Please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer. Yongchao Zhou · Andrei Muresanu · Ziwen Han · Keiran Paster · Silviu Pitis · Harris Chan · Jimmy Ba 🔗 - A Multi-Level Framework for the AI Alignment Problem (Poster) []  []  AI alignment considers how we can encode AI systems in a way that is compatible with human values. The normative side of this problem asks what moral values or principles, if any, we should encode in AI. To this end, we present a framework to consider the question at four levels: Individual, Organizational, National, and Global. 
We aim to illustrate how AI alignment is made up of value alignment problems at each of these levels, where values at each level affect the others and effects can flow in either direction. We outline key questions and considerations of each level and demonstrate an application of this framework to the topic of AI content moderation. Betty L Hou · Brian Green 🔗 - Error Resilient Deep Neural Networks using Neuron Gradient Statistics (Poster) []  []  Deep neural networks (DNNs) have been widely adopted in daily life with applications ranging from face recognition to recommender systems. However, the specialized hardware used to run these systems is vulnerable to errors in computation that adversely impact accuracy. Conventional error tolerance methods cannot easily be used here due to their substantial overhead and the need to modify training algorithms to accommodate error resilience. To address this issue, this paper presents a novel approach taking advantage of the statistics of neurons’ gradients with respect to their neighbors to identify and suppress erroneous neuron values. The approach is modular and is combined with an accurate, low-overhead error detection mechanism to ensure it is used only when needed, further reducing its effective cost. Deep learning models can be trained using conventional algorithms and our error correction module is fit to a trained DNN, achieving comparable or superior performance relative to baseline error correction methods. Results are presented with emphasis on scalability with regard to dataset and network size, as well as different network architectures. Chandramouli Amarnath · Abhijit Chatterjee · Kwondo Ma · Mohamed Mejri 🔗 - Aligning Robot Representations with Humans (Poster) []  []  As robots are increasingly deployed in real-world environments, a key question becomes how to best teach them to accomplish tasks that humans want. 
In this work, we argue that current robot learning approaches suffer from representation misalignment, where the robot's learned task representation does not capture the human's true representation. We propose that because humans will be the ultimate evaluator of task performance in the world, it is crucial that we explicitly focus our efforts on aligning robot representations with humans, in addition to learning the downstream task. We advocate that current representation learning approaches in robotics can be studied under a single unifying formalism: the representation alignment problem. We mathematically operationalize this problem, define its key desiderata, and situate current robot learning methods within this formalism. Andreea Bobu · Andi Peng · Pulkit Agrawal · Julie A Shah · Anca Dragan 🔗 - Deep Reinforcement Learning Policies Learn Shared Adversarial Directions Across MDPs (Poster) []  The use of deep neural networks as function approximators has led to striking progress for reinforcement learning algorithms and applications. Yet the knowledge we have on decision boundary geometry and the loss landscape of neural policies is still quite limited. In this paper, we propose a framework to investigate the decision boundary and loss landscape similarities across states and across MDPs. We conduct experiments in various games from Arcade Learning Environment, and discover that high sensitivity directions for neural policies are correlated across MDPs. We argue that these high sensitivity directions support the hypothesis that non-robust features are shared across training environments of reinforcement learning agents. We believe our results reveal fundamental properties of the environments used in deep reinforcement learning training, and represent a tangible step towards building robust and reliable deep reinforcement learning agents. 
Ezgi Korkmaz 🔗 - Instance-Aware Observer Network for Out-of-Distribution Object Segmentation (Poster) []  []  Recent works on predictive uncertainty estimation have shown promising results on Out-Of-Distribution (OOD) detection for semantic segmentation. However, these methods struggle to precisely locate the point of interest in the image, i.e., the anomaly. This limitation is due to the difficulty of fine-grained prediction at the pixel level. To address this issue, we build upon the recent ObsNet approach by providing object instance knowledge to the observer. We extend ObsNet by harnessing an instance-wise mask prediction. We use an additional, class agnostic, object detector to filter and aggregate observer predictions. Finally, we predict a unique anomaly score for each instance in the image. We show that our proposed method accurately disentangles in-distribution objects from OOD objects on three datasets. Victor Besnier · Andrei Bursuc · Alexandre Briot · David Picard 🔗 - A general framework for reward function distances (Poster) []  []  In reward learning, it is helpful to be able to measure distances between reward functions, for example to evaluate learned reward models. Using simple metrics such as L^2 distances is not ideal because reward functions that are equivalent in terms of their optimal policies can nevertheless have high L^2 distance. EPIC and DARD are distances specifically designed for reward functions that address this by being invariant under certain transformations that leave optimal policies unchanged. However, EPIC and DARD are designed in an ad-hoc manner, only consider a subset of relevant reward transformations, and suffer from serious pathologies in some settings. In this paper, we define a general class of reward function distance metrics, of which EPIC is a special case. This framework lets us address all these issues with EPIC and DARD, and allows for the development of reward function distance metrics in a more principled manner. 
Erik Jenner · Joar Skalse · Adam Gleave 🔗 - Certifiable Robustness Against Patch Attacks Using an ERM Oracle (Poster) []  []  Consider patch attacks, where at test-time an adversary manipulates a test image with a patch in order to induce a targeted mis-classification. We consider a recent defense to patch attacks, Patch-Cleanser (Xiang et al., 2022). The Patch-Cleanser algorithm requires a prediction model to have a “two-mask correctness” property, meaning that the prediction model should correctly classify any image when any two blank masks replace portions of the image. To this end, Xiang et al. (2022) learn a prediction model to be robust to two-mask operations by augmenting the training set by adding pairs of masks at random locations of training images, and performing empirical risk minimization (ERM) on the augmented dataset. However, in the non-realizable setting when no predictor is perfectly correct on all two-mask operations on all images, we exhibit an example where ERM fails. To overcome this challenge, we propose a different algorithm that provably learns a predictor robust to all two-mask operations using an ERM oracle, based on prior work by Feige et al. (2015a). Kevin Stangl · Avrim Blum · Omar Montasser · Saba Ahmadi 🔗 - On the Adversarial Robustness of Vision Transformers (Poster) []  Following the success in advancing natural language processing and understanding, transformers are expected to bring revolutionary changes to computer vision. This work provides a comprehensive study on both empirical and certified robustness of vision transformers (ViTs), with analysis that casts light on creating models that resist adversarial attacks. We find that ViTs possess better empirical and certified adversarial robustness when compared with various baselines. 
In our frequency study, we show features learned by ViTs contain fewer high-frequency patterns which tend to have spurious correlation, and there is a high correlation between how much the model learns high-frequency features and its robustness against different frequency-based perturbations. Moreover, modern CNN designs that borrow techniques from ViTs including activation function, layer norm, larger kernel size to imitate the global attention, and patchify the images as inputs, etc., could help bridge the performance gap between ViTs and CNNs not only in terms of performance, but also certified and empirical adversarial robustness. Introducing convolutional or tokens-to-token blocks for learning high-frequency features in ViTs can improve classification accuracy but at the cost of adversarial robustness. Rulin Shao · Zhouxing Shi · Jinfeng Yi · Pin-Yu Chen · Cho-Jui Hsieh 🔗 - Unified Probabilistic Neural Architecture and Weight Ensembling Improves Model Robustness (Poster) []  []  Robust machine learning models with accurately calibrated uncertainties are crucial for safety-critical applications. Probabilistic machine learning and especially the Bayesian formalism provide a systematic framework to incorporate robustness through the distributional estimates and reason about uncertainty. Recent works have shown that approximate inference approaches that take the weight space uncertainty of neural networks to generate ensemble prediction are the state-of-the-art. However, architecture choices have mostly been ad hoc, which essentially ignores the epistemic uncertainty from the architecture space. To this end, we propose a Unified probabilistic architecture and weight ensembling Neural Architecture Search (UraeNAS) that leverages advances in probabilistic neural architecture search and approximate Bayesian inference to generate ensembles from the joint distribution of neural network architectures and weights. 
The proposed approach showed a significant improvement both with in-distribution (0.86% in accuracy, 42% in ECE) CIFAR-10 and out-of-distribution (2.43% in accuracy, 30% in ECE) CIFAR-10-C compared to the baseline deterministic approach. Sumegha Premchandar · Sanket Jantre · Prasanna Balaprakash · Sandeep Madireddy 🔗 - All’s Well That Ends Well: Avoiding Side Effects with Distance-Impact Penalties (Poster) []  Misspecifying the reward function of a reinforcement learning agent may cause catastrophic side effects. In this work, we investigate distance-impact penalties: a general-purpose auxiliary reward based on a state-distance measure that captures, and thus can be used to penalise, side effects. We prove that the size of the penalty depends only on an agent's final impact on the environment. Distance-impact penalties are scalable, general, and immediately compatible with model-free algorithms. We analyse the sensitivity of an agent's behaviour to the choice of penalty, expanding results about reward-shaping, proving sufficient and necessary conditions for policy-optimality to be invariant to misspecification, and providing error bounds for optimal policies. Finally, we empirically investigate distance-impact penalties in a range of grid-world environments, demonstrating their ability to prevent side effects whilst permitting task completion. Charlie Griffin · Joar Skalse · Lewis Hammond · Alessandro Abate 🔗 - System III: Learning with Domain Knowledge for Safety Constraints (Poster) []  Reinforcement learning agents naturally learn from extensive exploration. Exploration is costly and can be unsafe in safety-critical domains. This paper proposes a novel framework for incorporating domain knowledge to help guide safe exploration and boost sample efficiency. 
Previous approaches impose constraints, such as regularisation parameters in neural networks, that rely on large sample sets and often are not suitable for safety-critical domains where agents should almost always avoid unsafe actions. In our approach, called System III, which is inspired by psychologists' notions of the brain's System I and System II, we represent domain expert knowledge of safety in the form of first-order logic. We evaluate the satisfaction of these constraints via p-norms in state vector space. In our formulation, constraints are analogous to hazards, objects, and regions of state that have to be avoided during exploration. We evaluated the effectiveness of the proposed method on OpenAI's Gym and Safety-Gym environments. In all tasks, including classic Control and Safety Games, we show that our approach results in safer exploration and improved sample efficiency. Fazl Barez · Hosein Hasanbeig · Alessandro Abate 🔗 - Can Active Sampling Reduce Causal Confusion in Offline Reinforcement Learning? (Poster) []  []  Causal confusion is a phenomenon where an agent learns a policy that reflects imperfect spurious correlations in the data. Such a policy may appear to be optimal during training if most of the training data contains these spurious correlations. This problem gets exacerbated in domains such as robotics with potentially large gaps between open- and closed-loop performance of an agent. In such cases, a causally confused model may appear to perform well according to open-loop metrics but fail catastrophically when deployed in the real world. In this paper, we conduct the first study of causal confusion in offline reinforcement learning. We hypothesize that selectively sampling data points that help disambiguate the underlying causal mechanisms of the environment may alleviate causal confusion. 
To investigate this hypothesis, we consider a set of simulated setups to study causal confusion and the ability of active sampling schemes to reduce its effects. We provide empirical evidence that random and active sampling schemes are able to consistently reduce causal confusion as training progresses and that active sampling is able to do so more efficiently than uniform sampling. Gunshi Gupta · Tim G. J. Rudner · Rowan McAllister · Adrien Gaidon · Yarin Gal 🔗 - Boundary Adversarial Examples Against Adversarial Overfitting (Poster) []  []  Standard adversarial training approaches suffer from robust overfitting where the robust accuracy decreases when models are adversarially trained for too long. The origin of this problem is still unclear and conflicting explanations have been reported, i.e., memorization effects induced by large loss data or because of small loss data and growing differences in loss distribution of training samples as the adversarial training progresses. Consequently, several mitigation approaches including early stopping, temporal ensembling and weight perturbations on small loss data have been proposed to mitigate the effect of robust overfitting. However, a side effect of these strategies is a larger reduction in clean accuracy compared to standard adversarial training. In this paper, we investigate if these mitigation approaches are complementary to each other in improving adversarial training performance. We further propose the use of helper adversarial examples that can be obtained with minimal cost in the adversarial example generation, and show how they increase the clean accuracy in the existing approaches without compromising the robust accuracy. 
Muhammad Zaid Hameed · Beat Buesser 🔗 - Two-Turn Debate Does Not Help Humans Answer Hard Reading Comprehension Questions (Poster) []  []  The use of language-model-based question-answering systems to aid humans in completing difficult tasks is limited, in part, by the unreliability of the text these systems generate. Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer options, where one is correct and the other is incorrect, allows human judges to perform more accurately, even when one of the arguments is unreliable and deceptive. If this is helpful, we may be able to increase our justified trust in language-model-based systems by asking them to produce these arguments where needed. Previous research has shown that just a single turn of arguments in this format is not helpful to humans. However, as debate settings are characterized by a back-and-forth dialogue, we follow up on previous results to test whether adding a second round of counter-arguments is helpful to humans. We find that, regardless of whether they have access to arguments or not, humans perform similarly on our task. These findings suggest that, in the case of answering reading comprehension questions, debate is not a helpful format. Alicia Parrish · Harsh Trivedi · Nikita Nangia · Jason Phang · Vishakh Padmakumar · Amanpreet Singh Saimbhi · Samuel Bowman 🔗 - Panning for Gold in Federated Learning: Targeted Text Extraction under Arbitrarily Large-Scale Aggregation (Poster) []  As federated learning (FL) matures, privacy attacks against FL systems in turn become more numerous and complex. Attacks on language models have progressed from recovering single sentences in simple classification tasks to recovering larger parts of user data. 
Current attacks against federated language models are sequence-agnostic and aim to extract as much data as possible from an FL update - often at the expense of fidelity for any particular sequence. Because of this, current attacks fail to extract any meaningful data under large-scale aggregation. In realistic settings, an attacker cares most about a small portion of user data that contains sensitive personal information, for example sequences containing the phrase "my credit card number is ...". In this work, we propose the first attack on FL that achieves targeted extraction of sequences that contain privacy-critical phrases, whereby we employ maliciously modified parameters to allow the transformer itself to filter relevant sequences from aggregated user data and encode them in the gradient update. Our attack can effectively extract sequences of interest even against extremely large-scale aggregation. Hong-Min Chu · Jonas Geiping · Liam Fowl · Micah Goldblum · Tom Goldstein 🔗 - Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety (Poster) []  []  Large language models (LLMs) have exploded in popularity in the past few years and have achieved undeniably impressive results on benchmarks as varied as question answering and text summarization. We provide a simple new prompting strategy that leads to yet another supposedly “super-human” result, this time outperforming humans at common sense ethical reasoning (as measured by accuracy on a subset of the ETHICS dataset). Unfortunately, we find that relying on average performance to judge capabilities can be highly misleading. LLM errors differ systematically from human errors in ways that make it easy to craft adversarial examples, or even perturb existing examples to flip the output label. 
We also observe signs of inverse scaling with model size on some examples, and show that prompting models to “explain their reasoning” often leads to alarming justifications of unethical actions. Our results highlight how human-like performance does not necessarily imply human-like understanding or reasoning. Josh Albrecht · Ellie Kitanidis · Abraham Fetterman 🔗 - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (Poster) []  Research in mechanistic interpretability seeks to explain behaviors of ML models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task that requires logical reasoning: indirect object identification (IOI). Our explanation encompasses 28 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches including causal interventions and projections. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria--faithfulness, completeness and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, opening opportunities to scale our understanding to both larger models and more complex tasks. 
Kevin Wang · Alexandre Variengien · Arthur Conmy · Buck Shlegeris · Jacob Steinhardt 🔗 - From plane crashes to algorithmic harm: applicability of safety engineering frameworks for responsible ML (Poster) []  []  Inappropriate design and deployment of machine learning (ML) systems leads to negative downstream social and ethical impact -- described here as social and ethical risks -- for users, society and the environment. Despite the growing need to regulate ML systems, current processes for assessing and mitigating risks are disjointed and inconsistent. We interviewed 30 industry practitioners on their current social and ethical risk management practices, and collected their first reactions on adapting safety engineering frameworks into their practice -- namely, System Theoretic Process Analysis (STPA) and Failure Mode and Effects Analysis (FMEA). Our findings suggest STPA/FMEA can provide appropriate structure toward social and ethical risk assessment and mitigation processes. However, we also find nontrivial challenges in integrating such frameworks in the fast-paced culture of the ML industry. We call on the ML research community to strengthen existing frameworks and assess their efficacy, ensuring that ML systems are safer for all people. Shalaleh Rismani · Renee Shelby · Andrew Smart · Edgar Jatho · Joshua Kroll · AJung Moon · Negar Rostamzadeh 🔗 - Best of Both Worlds: Towards Adversarial Robustness with Transduction and Rejection (Poster) []  []  Both transduction and rejection have emerged as key techniques to enable stronger defenses against adversarial perturbations, but existing work has not investigated the combination of transduction and rejection. Our theoretical analysis shows that combining the two can potentially lead to better guarantees than using transduction or rejection alone. 
Based on the analysis, we propose a defense algorithm that learns a transductive classifier with the rejection option and also propose a strong adaptive attack for evaluating our defense. The experimental results on MNIST and CIFAR-10 show that it has strong robustness, outperforming existing baselines, including those using only transduction or rejection. Nils Palumbo · Yang Guo · Xi Wu · Jiefeng Chen · Yingyu Liang · Somesh Jha 🔗 - c-MBA: Adversarial Attack for Cooperative MARL Using Learned Dynamics Model (Poster) []  []  In recent years, a proliferation of methods were developed for cooperative multi-agent reinforcement learning (c-MARL). However, the robustness of c-MARL agents against adversarial attacks has been rarely explored. In this paper, we propose to evaluate the robustness of c-MARL agents via a model-based approach, named c-MBA. Our proposed attack can craft much stronger adversarial state perturbations of c-MARL agents to lower total team rewards than existing model-free approaches. Our numerical experiments on two representative MARL benchmarks illustrate the advantage of our approach over other baselines: our model-based attack consistently outperforms other baselines in all tested environments. Nhan H Pham · Lam Nguyen · Jie Chen · Thanh Lam Hoang · Subhro Das · Lily Weng 🔗 - Adversarial Attacks on Feature Visualization Methods (Poster) []  The internal functional behavior of trained Deep Neural Networks is notoriously difficult to interpret. Feature visualization approaches are one set of techniques used to interpret and analyze trained deep learning models. On the other hand, interpretability methods themselves may be subject to deception. In particular, we consider the idea of an adversary manipulating a model for the purpose of deceiving the interpretation. Focusing on the popular feature visualizations associated with CNNs we introduce an optimization framework for modifying the outcome of feature visualization methods. 
Michael Eickenberg · Eugene Belilovsky · Jonathan Marty 🔗 - Embedding Reliability: On the Predictability of Downstream Performance (Poster) []  []  In (self-)supervised (pre-)training, such as in contrastive learning, often a network is presented with correspondent (positive) and non-correspondent (negative) pairs of datapoints, and is trained to find an embedding vector for each datapoint, i.e., a representation, which can be further fine-tuned for various downstream tasks. To safely deploy these models in critical decision-making systems, it is crucial to equip them with a measure of their reliability. Here we study whether such measures can be quantified for a datapoint in a meaningful way. In other words, we explore if the downstream performance on a given datapoint is predictable, directly from a few characteristics of its pre-trained embedding. We study whether this goal can be achieved by directly estimating the distribution of the training data in the embedding space, and accounting for the local consistency of the representations. Our experiments show that these notions of reliability often strongly correlate with a datapoint's downstream accuracy. Shervin Ardeshir · Navid Azizan 🔗 - On The Fragility of Learned Reward Functions (Poster) []  Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning have mainly focused on the performance of policies trained alongside the reward function. This practice, however, may fail to detect learned rewards that are not capable of training new policies from scratch and thus do not capture the intended behavior. Our work focuses on demonstrating and studying the causes of these relearning failures in the domain of preference-based reward learning. 
We demonstrate with experiments in tabular and continuous control environments that the severity of relearning failures can be sensitive to changes in reward model design and the trajectory dataset composition. Based on our findings, we emphasize the need for more retraining-based evaluations in the literature. Lev McKinney · Yawen Duan · David Krueger · Adam Gleave 🔗 - Decepticons: Corrupted Transformers Breach Privacy in Federated Learning for Language Models (Poster) []  Privacy is a central tenet of Federated learning (FL), in which a central server trains models without centralizing user data. However, gradient updates used in FL can leak user information. While most industrial uses of FL are for text applications (e.g. keystroke prediction), the majority of attacks on user privacy in FL have focused on simple image classifiers and threat models that assume honest execution of the FL protocol from the server. We propose a novel attack that reveals private user text by deploying malicious parameter vectors, and which succeeds even with mini-batches, multiple users, and long sequences. Unlike previous attacks on FL, the attack exploits characteristics of both the Transformer architecture and the token embedding, separately extracting tokens and positional embeddings to retrieve high-fidelity text. Liam Fowl · Jonas Geiping · Steven Reich · Yuxin Wen · Wojciech Czaja · Micah Goldblum · Tom Goldstein 🔗 - Adversarial Robustness for Tabular Data through Cost and Utility Awareness (Poster) []  Many machine learning applications (credit scoring, fraud detection, etc.) use data in the tabular domains. Adversarial examples can be especially damaging for these applications. Yet, existing works on adversarial robustness mainly focus on machine-learning models in the image and text domains. We argue that due to the differences between tabular data and images or text, existing threat models are inappropriate for tabular domains. 
These models do not capture that cost can be more important than imperceptibility, nor that the adversary could ascribe different value to the utility obtained from deploying different adversarial examples. We show that due to these differences the attack and defence methods used for images and text cannot be directly applied to the tabular setup. We address these issues by proposing new cost and utility-aware threat models tailored to capabilities and constraints of attackers targeting tabular domains. We show that our approach is effective on two tabular datasets corresponding to applications for which adversarial examples can have economic and social implications. Klim Kireev · Bogdan Kulynych · Carmela Troncoso 🔗 - Epistemic Side Effects & Avoiding Them (Sometimes) (Poster) []  []  AI safety research has investigated the problem of negative side effects -- undesirable changes made by AI systems in pursuit of an underspecified objective. However, the focus has been on physical side effects, such as a robot breaking a vase while moving. In this paper we introduce the notion of epistemic side effects, unintended changes made to the knowledge or beliefs of agents, and describe a way to avoid negative epistemic side effects in reinforcement learning, in some cases. Toryn Klassen · Parand Alizadeh Alamdari · Sheila McIlraith 🔗 - Improving the Robustness of Conditional Language Models by Detecting and Removing Input Noise (Poster) []  The evaluation of conditional language modeling tasks such as abstractive summarization typically uses test data that is identically distributed as training. In real-world practice, documents to be summarized may contain input noise caused by text extraction artifacts or data pipeline bugs. The robustness of model performance under distribution shift caused by such noise is relatively under-studied. 
We present a large empirical study quantifying the sometimes severe loss in performance (up to 12 ROUGE-1 points) from different types of input noise for a range of datasets and model sizes. We then propose a light-weight method for detecting and removing such noise in the input during model inference without requiring any extra training or auxiliary models, which effectively mitigates the loss in performance, recovering up to 11 ROUGE-1 points. Kundan Krishna · Yao Zhao · Jie Ren · Balaji Lakshminarayanan · Jiaming Luo · Mohammad Saleh · Peter Liu 🔗 - Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks (Poster) []  []  Deep neural networks (DNNs) are powerful, but they can make mistakes that pose risks. A model performing well on a test set does not imply safety in deployment, so it is important to have additional evaluation tools to understand flaws. Adversarial examples can help reveal weaknesses, but they are often difficult for a human to interpret or draw generalizable conclusions from. Some previous works have addressed this by studying human-interpretable attacks. We build on these with three contributions. First, we introduce a method termed Search for Natural Adversarial Features Using Embeddings (SNAFUE) which offers a fully-automated method for finding "copy/paste" attacks in which one natural image can be pasted into another in order to induce an unrelated misclassification. Second, we use this to red team an ImageNet classifier and identify hundreds of easily-describable sets of vulnerabilities. Third, we compare this approach with other interpretability tools by attempting to rediscover trojans. Our results suggest that SNAFUE can be useful for interpreting DNNs and generating adversarial data for them. 
Stephen Casper · Kaivalya Hariharan · Dylan Hadfield-Menell 🔗 - On Representation Learning Under Class Imbalance (Poster) []  Unlike carefully curated academic benchmarks, real-world datasets are often highly class-imbalanced, especially in safety-critical scenarios. Through extensive empirical investigation, we study a number of foundational learning behaviors for various models such as neural networks, gradient-boosted decision trees, and SVMs under class imbalance across a range of domains. Motivated by our observation that re-balancing class-imbalanced training data is ineffective, we show that several simple techniques for improving representation learning are effective in this setting: (1) self-supervised pre-training is insensitive to imbalance and can be used for feature learning before fine-tuning on labels; (2) Bayesian inference is effective because neural networks are especially underspecified under class imbalance; (3) flatness-seeking regularization pulls decision boundaries away from minority samples, especially when we seek minima that are particularly flat on the minority samples’ loss. Ravid Shwartz-Ziv · Micah Goldblum · Yucen Li · C. Bayan Bruss · Andrew Gordon Wilson 🔗 - Neural Autoregressive Refinement for Self-Supervised Anomaly Detection in Accelerator Physics (Poster) []  We propose a novel data refinement (DR) scheme that relies on neural autoregressive flows (NAF) for self-supervised anomaly detection. Flow-based models allow us to explicitly learn the probability density and thus can assign accurate likelihoods to normal data which makes it usable to detect anomalies. The proposed NAF-DR method is achieved by efficiently generating random samples from latent space and transforming them into feature space along with likelihoods via invertible mapping. The augmented samples incorporated with normal samples are used to train a better detector to approach decision boundaries. 
Compared with random transformations, NAF-DR can be interpreted as a likelihood-oriented data augmentation that is more efficient and robust. Extensive experiments show that our approach outperforms existing baselines on multiple tabular and time series datasets, and one real-world application in accelerator physics, significantly improving accuracy and robustness over the state-of-the-art baselines. Jiaxin Zhang 🔗 - Robust Representation Learning for Group Shifts and Adversarial Examples (Poster) []  []  Despite the high performance achieved by deep neural networks on various tasks, extensive research has demonstrated that small tweaks in the inputs can lead to failure in the model's predictions. This issue affecting deep neural networks has led to a number of methods to improve model robustness, including adversarial training and distributionally robust optimization. Although both of these two methods are geared towards learning robust models, they have essentially different motivations: adversarial training attempts to train deep neural networks against perturbations, while distributionally robust optimization aims to improve model performance on the most difficult uncertain distributions. In this work, we propose an algorithm that combines adversarial training and group distributionally robust optimization to improve robust representation learning. Experiments on three image benchmark datasets illustrate that the proposed method achieves superior results on robust metrics without sacrificing much of the standard measures. Ming-Chang Chiu · Xuezhe Ma 🔗 - DP-InstaHide: Data Augmentations Provably Enhance Guarantees Against Dataset Manipulations (Poster) []  []  Data poisoning and backdoor attacks manipulate training data to induce security breaches in a victim model. These attacks can be provably deflected using differentially private (DP) training methods, although this comes with a sharp decrease in model performance. 
The InstaHide method has recently been proposed as an alternative to DP training that leverages supposed privacy properties of the mixup augmentation, although without rigorous guarantees. In this paper, we rigorously show that $k$-way mixup provably yields at least $k$ times stronger DP guarantees than a naive DP mechanism, and we observe that this enhanced privacy guarantee is a strong foundation for building defenses against poisoning. Eitan Borgnia · Jonas Geiping · Valeriia Cherepanova · Liam Fowl · Arjun Gupta · Amin Ghiasi · Furong Huang · Micah Goldblum · Tom Goldstein 🔗 - Geometric attacks on batch normalization (Poster) []  []  Constructing adversarial examples usually requires labels, which provide a loss gradient to construct the example. We show that for batch normalized architectures, intermediate latents that are produced after a batch normalization step suffice to produce adversarial examples using an intermediate loss solely utilizing angular deviations, without any label. We motivate our loss through the geometry of batch normed representations and concentration on a known hypersphere. Our losses build on and expand intermediate latent based attacks that usually require labels. The success of our method implies that leakage of intermediate representations may suffice to create a security breach for deployed models, which persists even when the model is transferred to downstream usage. We further show that removal of batch norm weakens our attack significantly, suggesting that batch norm's contribution to adversarial vulnerability may be understood by analyzing such attacks. Amur Ghose · Apurv Gupta · Yaoliang Yu · Pascal Poupart 🔗 - Improving Adversarial Robustness via Joint Classification and Multiple Explicit Detection Classes (Poster) []  []  This work concerns the development of deep networks that are certifiably robust to adversarial attacks. 
Joint robust classification-detection was recently introduced as a certified defense mechanism, where adversarial examples are either correctly classified or assigned to the "abstain" class. In this work, we show that such a provable framework can be extended to networks with multiple explicit abstain classes, where the adversarial examples are adaptively assigned to those. While naively adding multiple abstain classes can lead to "model degeneracy", we propose a regularization approach and a training method to counter this degeneracy by promoting full use of the multiple abstain classes. Our experiments demonstrate that the proposed approach consistently achieves favorable standard vs. robust verified accuracy tradeoffs, outperforming state-of-the-art algorithms for various choices of number of detection classes. Sina Baharlouei · Fatemeh Sheikholeslami · Meisam Razaviyayn · J. Zico Kolter 🔗 - On the Abilities of Mathematical Extrapolation with Implicit Models (Poster) []  []  Deep neural networks excel on a variety of different tasks, often surpassing human intelligence. However, when presented with out-of-distribution data, these models tend to break down even on the simplest tasks. In this paper, we compare the robustness of implicitly-defined and classical deep learning models on a series of mathematical extrapolation tasks, where the models are tested with out-of-distribution samples during inference time. Throughout our experiments, implicit models greatly outperform classical deep learning networks that overfit the training distribution. We present implicit models as a safer deep learning framework for generalization due to their flexible and selective structure. Implicit models, with potentially unlimited depth, not only adapt well to out-of-distribution data but also understand the underlying structure of inputs much better. 
Juliette Decugis · Alicia Tsai · Ashwin Ganesh · Max Emerling · Laurent El Ghaoui 🔗 - Netflix and Forget: Fast Severance From Memorizing Training Data in Recommendations (Poster) []  Suppose a person who has streamed rom-coms exclusively with their significant other suddenly breaks up. Consider an expecting mom who has shopped for baby clothes and then miscarries. Their streaming and shopping recommendations, however, do not necessarily update, serving as unhappy reminders of their loss. One approach is to implement the Right To Be Forgotten for recommendation systems built from user data, with the goal of updating downstream recommendations to reflect the removal without incurring the cost of re-training. Inspired by solutions to the original Netflix challenge [Koren, 2009], we develop Unlearn-ALS, which is more aggressively forgetful of select data than fine-tuning. In theory, it is consistent with retraining without model degradation. Empirically, it shows fast convergence, and can be applied directly to any bi-linear models regardless of the training procedure. Xinlei XU · Jiankai Sun · Xin Yang · Yuanshun Yao · Chong Wang 🔗 - Evaluating Worst Case Adversarial Weather Perturbations Robustness (Poster) []  []  Several algorithms are proposed to improve the robustness of deep neural networks against adversarial perturbations beyond $\ell_p$ cases, e.g., weather perturbations. However, evaluations of existing robust training algorithms are over-optimistic. 
This is in part due to the lack of a standardized evaluation protocol across various robust training algorithms, leading to ad-hoc methods that test robustness on either random perturbations or the adversarial samples from generative models that are used for robust training, which is either uninformative of the worst case, or is heavily biased. In this paper, we identify such evaluation bias in these existing works and propose the first standardized and fair evaluation that compares various robust training algorithms by using physics simulators for common adverse weather effects, i.e., rain and snow. With this framework, we evaluated several existing robust training algorithms on two streetview classification datasets (BIC_GSV, Places365) and show the evaluation bias in experiments. Yihan Wang · Yunhao Ba · Howard Zhang · Huan Zhang · Achuta Kadambi · Stefano Soatto · Alex Wong · Cho-Jui Hsieh 🔗 - Introspection, Updatability, and Uncertainty Quantification with Transformers: Concrete Methods for AI Safety (Poster) []  []  When deploying Transformer networks, we seek the ability to introspect the predictions against instances with known labels; update the model without a full re-training; and provide reliable uncertainty quantification over the predictions. We demonstrate that these properties are achievable via recently proposed approaches for approximating deep neural networks with instance-based metric learners, at varying resolutions of the input, and the associated Venn-ADMIT Predictor for constructing prediction sets. We consider a challenging (but non-adversarial) task: Zero-shot sequence labeling (i.e., feature detection) in a low-accuracy, class-imbalanced, covariate-shifted setting while requiring a high confidence level. 
Allen Schmaltz · Danielle Rasooly 🔗 - BAAT: Towards Sample-specific Backdoor Attack with Clean Labels (Poster) []  []  Recent studies revealed that the training process of deep neural networks (DNNs) is vulnerable to backdoor attacks if third-party training resources are adopted. Among all different types of existing attacks, sample-specific backdoor attacks (SSBAs) are probably the most advanced and malicious methods, since they can easily bypass most of the existing defenses. In this paper, we reveal that SSBAs are not stealthy enough due to their poisoned-label nature, where users can discover anomalies if they check the image-label relationship. Besides, we also show that extending existing SSBAs to the ones under the clean-label setting based on poisoning samples from only the target class has minor effects. Inspired by the decision process of humans, we propose to adopt attribute as the trigger to design the sample-specific backdoor attack with clean labels (dubbed BAAT). Experimental results on benchmark datasets verify the effectiveness and stealthiness of BAAT. Yiming Li · Mingyan Zhu · Chengxiao Luo · Haiqing Weng · Yong Jiang · Tao Wei · Shu-Tao Xia 🔗 - Avoiding Calvinist Decision Traps using Structural Causal Models (Poster) []  Causal Decision Theory (CDT) is a popular choice among practical decision theorists. While its successes and failings have been extensively studied, a less investigated topic is how CDT's choices hinge on the theory of causation used. The most common interpretation, temporal CDT, understands causation as a description of physical processes ordered in time. Another emerging view comes from the graphical framework of Structural Causal Models (SCM), which sees causation in terms of constraints on sources of variation in a system. We present an adversarial scheme where a CDT agent facing a Bandit problem can be tricked into sub-optimal choices, if it follows temporal CDT. 
We then propose an axiom to ground the orientation of arrows in the causal graph of a decision problem. In doing so, we resolve an ambiguity in the theory of SCMs, and underscore the importance of agent-perspectives, which have been largely ignored in the causal inference literature. We also demonstrate how this structural CDT avoids our adversarial trap, and outperforms temporal CDT in a series of canonical decision problems. Arvind Raghavan 🔗 - Out-Of-Distribution Detection Is Not All You Need (Poster) []  []  The usage of deep neural networks in critical systems is limited by our ability to guarantee their correct behavior. Runtime monitors are components aiming to identify unsafe predictions before they can lead to catastrophic consequences. Several recent works on runtime monitoring have focused on out-of-distribution (OOD) detection, i.e., identifying inputs that are different from the training data. In this work, we argue that OOD detection is not a well-suited framework to design efficient runtime monitors and that it is more relevant to evaluate monitors based on their ability to discard incorrect predictions. We discuss the conceptual differences with OOD and conduct extensive experiments on popular datasets to show that: 1. good OOD results can give a false impression of safety, 2. comparison under the OOD setting does not allow identifying the best monitor to detect errors. Joris Guerin · Kevin Delmas · Raul S Ferreira · Jérémie Guiochet 🔗 - Revisiting Robustness in Graph Machine Learning (Poster) []  []  Many works show that node-level predictions of Graph Neural Networks (GNNs) are unrobust to small, often termed adversarial, changes to the graph structure. However, because manual inspection of a graph is difficult, it is unclear if the studied perturbations always preserve a core assumption of adversarial examples: that of unchanged semantic content. 
To address this problem, we introduce a more principled notion of an adversarial graph, which is aware of semantic content change. Using Contextual Stochastic Block Models (CSBMs) and real-world graphs, our results uncover: i) for a majority of nodes the prevalent perturbation models include a large fraction of perturbed graphs violating the unchanged semantics assumption; ii) surprisingly, all assessed GNNs show over-robustness, that is, robustness beyond the point of semantic change. We find this to be a complementary phenomenon to adversarial robustness related to the small degree of nodes and their class membership dependence on the neighbourhood structure. Lukas Gosch · Daniel Sturm · Simon Geisler · Stephan Günnemann 🔗 - Deep Reinforcement Learning Policies in the Frequency Domain (Poster) []  Reinforcement learning policies based on deep neural networks are vulnerable to imperceptible adversarial perturbations to their inputs, in much the same way as neural network image classifiers. Recent work has proposed several methods for adversarial training for deep reinforcement learning agents to improve robustness to adversarial perturbations. In this paper, we study the effects of adversarial training on the neural policy learned by the agent. In particular, we compare the Fourier spectrum of minimal perturbations computed for both adversarially trained and vanilla trained neural policies. Via experiments in the OpenAI Atari environments we show that minimal perturbations computed for adversarially trained policies are more focused on lower frequencies in the Fourier domain, indicating a higher sensitivity of these policies to low frequency perturbations. We believe our results can be an initial step towards understanding the relationship between adversarial training and different notions of robustness for neural policies. 
Ezgi Korkmaz 🔗 - Assistance with large language models (Poster) []  []  A core part of AI alignment is training AI systems to be helpful, or more generally, to interact with humans appropriately. We look at this problem in the context of large language models. Past works have focused on training these models to perform specific tasks, or follow instructions. In contrast, we believe helpfulness requires back-and-forth interaction between the AI and the human it is trying to assist. Here, we consider a multi-step interaction in which a human asks a question, and the AI has an opportunity to ask a clarifying question to resolve ambiguities before responding. The assistance framework formalizes the idea of an AI which aims to maximize the human's reward but is ignorant of the human reward function. Past works solved toy assistance environments using exact POMDP solvers as well as deep reinforcement learning. We apply a behavioral cloning approach, and fine-tune GPT-3 such that it can respond to clear input questions directly, clarify the intent behind vague input questions, and respond based on the clarification it receives. We show that this approach leads to quantitative improvements in answer accuracy compared to a baseline that cannot ask for clarifications. While the assistance framework assumes the correct behavior of an AI is to infer and maximize a human's reward, our approach can be used to learn any interaction protocol between the AI and the human. We believe exploring interaction protocols that are easy to learn robustly and can be used to "bootstrap" further alignment is a promising direction for future research. 
Dmitrii Krasheninnikov · Egor Krasheninnikov · David Krueger 🔗 - Policy Resilience to Environment Poisoning Attack on Reinforcement Learning (Poster) []  []  This paper investigates policy resilience to training-environment poisoning attacks on reinforcement learning (RL) policies, with the goal of recovering the deployment performance of a poisoned RL policy. Because policy resilience is an add-on concern to RL algorithms, it must be resource-efficient, time-conserving, and widely applicable without compromising the performance of RL algorithms. This paper proposes such a policy-resilience mechanism based on the idea of sharing environment knowledge. We summarize policy resilience as three stages: preparation, diagnosis, and recovery. Specifically, we design the mechanism as a federated architecture coupled with a meta-learning approach, pursuing an efficient extraction and sharing of environment knowledge. With the shared knowledge, a poisoned agent can quickly identify the deployment condition and accordingly recover its policy performance. We empirically evaluate the resilience mechanism for both model-based and model-free RL algorithms, showing its effectiveness and efficiency in restoring the deployment performance of a poisoned policy. Hang Xu · Zinovi Rabinovich 🔗 - Image recognition time for humans predicts adversarial vulnerability for models (Poster) []  The success of adversarial attacks and the performance tradeoffs made by adversarial defense methods have both traditionally been evaluated on image test sets constructed from a randomly sampled held out portion of a training set. Mayo et al. 2022 [1] measured the difficulty of the ImageNet and ObjectNet test sets by measuring the minimum viewing time required for an object to be recognized on average by a human, finding that these test sets are heavily skewed towards containing mostly easy, quickly recognized images. 
While difficult images that require longer viewing times to be recognized are uncommon in test sets, they are both common and critically important to the real world performance of vision models. In this work, we investigated the relationship between adversarial robustness and viewing time difficulty. Measuring the AUC of accuracy vs attack strength (epsilon), we find that easy, quickly recognized images are more robust to adversarial attacks than difficult images, which require several seconds of viewing time to recognize. Additionally, adversarial defense methods improve models' robustness to adversarial attacks on easy images significantly more than on hard images. We propose that the distribution of image difficulties should be carefully considered and controlled for both when measuring the effectiveness of adversarial attacks and when analyzing the clean accuracy vs robustness tradeoff made by adversarial defense methods. David Mayo · Jesse Cummings · Xinyu Lin · Boris Katz · Andrei Barbu 🔗 - Rational Multi-Objective Agents Must Admit Non-Markov Reward Representations (Poster) []  []  This paper considers intuitively appealing axioms for rational, multi-objective agents and derives an impossibility from which one concludes that such agents must admit non-Markov reward representations. The axioms include the von Neumann-Morgenstern axioms, Pareto indifference, and dynamic consistency. We tie this result to irrational procrastination behaviors observed in humans, and show how the impossibility can be resolved by adopting a non-Markov aggregation scheme. Our work highlights the importance of non-Markov rewards for reinforcement learning and outlines directions for future work. Silviu Pitis · Duncan Bailey · Jimmy Ba 🔗 - Runtime Monitors for Operational Design Domains of Black-Box ML-Models (Poster) []  []  Autonomous systems are increasingly relying on machine learning (ML) components to perform a variety of complex tasks in perception, prediction, and control. 
To guarantee the safety of ML-based autonomous systems, it is important to capture their operational design domain (ODD), i.e., the conditions under which using the ML components does not endanger the safety of the system. In this paper, we present a framework for learning runtime monitors for ODDs of autonomous systems with black-box ML components. A runtime monitor of an ODD predicts, based on a sequence of monitorable observations, whether the system is about to exit its ODD. We particularly investigate the learning of optimal monitors based on counterexample-guided refinement and conformance testing. We evaluate our approach on a case study from the domain of autonomous driving. Hazem Torfah · Sanjit A. Seshia 🔗 - Unifying Grokking and Double Descent (Poster) []  []  Building a principled understanding of generalization in deep learning requires unifying disparate observations under a single conceptual framework. Previous work has studied grokking, a training dynamic in which a sustained period of near-perfect training performance and near-chance test performance is eventually followed by generalization, as well as the superficially similar double descent. These topics have so far been studied in isolation. We hypothesize that grokking and double descent can be understood as instances of the same learning dynamics within a framework of pattern learning speeds, and that this framework also applies when varying model capacity instead of optimization steps. We confirm some implications of this hypothesis empirically, including demonstrating model-wise grokking. Xander Davies · Lauro Langosco · David Krueger 🔗 - Assessing Robustness of Image Recognition Models to Changes in the Computational Environment (Poster) []  []  Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and TPUs for fast, timely processing. 
Failure in real-time image recognition tasks can occur due to incorrect mapping on hardware accelerators, which may lead to timing uncertainty and incorrect behavior. In addition, the increasing demand for optimal performance has led to progress towards the optimization of different neural network operations, such as operator fusion. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess the performance and impact of such optimizations, and explore their effectiveness. In this paper we conduct robustness analysis of four popular image recognition models with the ImageNet dataset, assessing the impact of the compiler optimizations applied, utilizing different Deep Learning frameworks and executing on hardware devices of varying capabilities. We report the impact in terms of misclassifications and inference time across varying settings. Nikolaos Louloudakis · Perry Gibson · José Cano · Ajitha Rajan 🔗 - Fake It Until You Make It : Towards Accurate Near-Distribution Novelty Detection (Poster) []  []  We aim for image-based novelty detection. Despite considerable progress, existing models either fail or face a dramatic drop under the so-called "near-distribution" setup, where the differences between normal and anomalous samples are subtle. We first demonstrate existing methods could experience up to 20% decrease in their AUCs in the near-distribution setting. Next, we propose to exploit a score-based generative model to produce synthetic near-distribution anomalous data. Our model is then fine-tuned to distinguish such data from the normal samples. We make quantitative as well as qualitative evaluation of this strategy, and compare the results with a variety of GAN-based models. 
Effectiveness of our method for both near-distribution and standard novelty detection is assessed through extensive experiments on datasets in diverse applications such as medical images, object classification, and quality control. This reveals that our method significantly improves upon existing models, and consistently decreases the gap between the near-distribution and standard novelty detection AUCs by a considerable amount. Hossein Mirzaei · Mohammadreza Salehi · Sajjad Shahabi · Efstratios Gavves · Cees Snoek · Mohammad Sabokrou · Mohammad Hossein Rohban 🔗 - Revisiting Hyperparameter Tuning with Differential Privacy (Poster) []  []  Hyperparameter tuning is a common practice in the application of machine learning but is a typically ignored aspect in the literature on privacy-preserving machine learning due to its negative effect on the overall privacy parameter. In this paper, we aim to tackle this fundamental yet challenging problem by providing an effective hyperparameter tuning framework with differential privacy. The proposed method allows us to adopt a broader hyperparameter search space and even to perform a grid search over the whole space, since its privacy loss parameter is independent of the number of hyperparameter candidates. Interestingly, it instead correlates with the utility gained from hyperparameter searching, revealing an explicit and mandatory trade-off between privacy and utility. Theoretically, we show that the additional privacy loss incurred by hyperparameter tuning is upper-bounded by the square root of the gained utility. However, we note that the additional privacy loss bound would empirically scale like the square root of the logarithm of the utility term, benefiting from the design of the doubling step. Youlong Ding · Xueyang Wu 🔗