Workshop on Distribution Shifts: New Frontiers with Foundation Models
Rebecca Roelofs · Fanny Yang · Hongseok Namkoong · Masashi Sugiyama · Jacob Eisenstein · Pang Wei Koh · Shiori Sagawa · Tatsunori Hashimoto · Yoonho Lee
Room R06-R09 (level 2)
Tagline: This workshop focuses on distribution shifts in the context of foundation models. Distribution shifts---where a model is deployed on a data distribution different from what it was trained on---pose significant robustness challenges in real-world ML applications. Such shifts are often unavoidable in the wild and have been shown to substantially degrade model performance in a wide range of applications. For example, models can systematically fail when tested on patients from different hospitals or people from different demographics. Training models that are robust to such distribution shifts is a rapidly growing area of interest in the ML community, and the goal of our workshop is to foster discussions and further research on distribution shifts. In the context of distribution shifts, our workshop this year focuses on foundation models: large pretrained models that can be adapted for a wide range of tasks. Foundation models open up an exciting new frontier in the study of distribution shifts, raising open research questions such as how pre-training improves robustness, how to finetune foundation models for increased robustness, how to leverage foundation models’ generative capabilities for robustness, and how to handle discrepancies between standard pre-training distributions and downstream distributions of interest. We aim to facilitate discussions around these topics by bringing together researchers working on distribution shifts and foundation models.
Schedule
Fri 7:00 a.m. - 7:10 a.m. | Opening Remarks
Fri 7:10 a.m. - 7:35 a.m. | Invited Talk 1 (Invited Talk) | Peng Cui
Fri 7:35 a.m. - 8:00 a.m. | Invited Talk 2 (Invited Talk) | Kate Saenko
Fri 8:00 a.m. - 8:30 a.m. | Coffee Break (Break)
Fri 8:30 a.m. - 10:00 a.m. | Poster Session
Fri 10:00 a.m. - 11:15 a.m. | Lunch Break (Break)
Fri 11:15 a.m. - 11:25 a.m. | TiC-CLIP: Continual Training of CLIP Models (Oral)
Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TIC-DataComp, TIC-YFCC, and TIC-RedCaps, with over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first use our benchmarks to curate various dynamic evaluations to measure the temporal robustness of existing models. We show OpenAI’s CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021--2022 compared with more recently trained models in the OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by $2.5\times$ when compared to the standard practice of retraining from scratch.
Saurabh Garg · Mehrdad Farajtabar · Hadi Pouransari · Raviteja Vemulapalli · Sachin Mehta · Oncel Tuzel · Vaishaal Shankar · Fartash Faghri
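The rehearsal recipe in the abstract above (resume from the last checkpoint and replay a slice of older data alongside the newest timestamp) can be sketched roughly as follows; the model, datasets, optimizer, and 1:1 replay mix are illustrative assumptions, not the authors' training setup.

```python
# Minimal sketch of rehearsal-based continual training: resume from the last
# checkpoint and mix new-timestamp data with replayed old data instead of
# retraining from scratch. Model, loss, and replay ratio are toy assumptions.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def continual_step(model, old_data, new_data, epochs=1, lr=1e-4):
    """One time step: keep training `model` (the last checkpoint) on new + replayed data."""
    mixed = ConcatDataset([new_data, old_data])          # rehearsal: replay the old data
    loader = DataLoader(mixed, batch_size=32, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
    return model

# Toy usage with random tensors standing in for timestamped image-text batches.
model = torch.nn.Linear(16, 4)
old = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
new = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
model = continual_step(model, old, new)
```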
Fri 11:25 a.m. - 11:35 a.m. | LLM Routing with Benchmark Datasets (Oral)
There is a rapidly growing number of open-source Large Language Models (LLMs) and benchmark datasets to compare them. While some models dominate these benchmarks, no single model typically achieves the best accuracy in all tasks and use cases. In this work, we address the challenge of selecting the best LLM out of a collection of models for new tasks. We propose a new formulation for the problem, in which benchmark datasets are repurposed to learn a ``router'' model for this LLM selection, and we show that this problem can be reduced to a collection of binary classification tasks. We demonstrate the utility and limitations of learning model routers from various benchmark datasets.
Tal Shnitzer · Anthony Ou · Mírian Silva · Kate Soule · Yuekai Sun · Justin Solomon · Neil Thompson · Mikhail Yurochkin
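The reduction to binary classification described above can be illustrated with a minimal sketch: one classifier per candidate LLM predicts whether that model will answer a given input correctly, and new inputs are routed to the model with the highest predicted success probability. The embeddings, correctness labels, and model names below are synthetic placeholders.

```python
# Sketch: one binary "will this LLM get it right?" classifier per candidate model,
# then route each query to the argmax. Features and labels are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_bench = rng.normal(size=(500, 32))                  # benchmark prompt embeddings
correct = {"model_a": rng.integers(0, 2, 500),        # per-model correctness on the benchmark
           "model_b": rng.integers(0, 2, 500)}

routers = {name: LogisticRegression(max_iter=1000).fit(X_bench, y)
           for name, y in correct.items()}

def route(x):
    # Pick the model whose router predicts the highest probability of success.
    scores = {name: clf.predict_proba(x.reshape(1, -1))[0, 1] for name, clf in routers.items()}
    return max(scores, key=scores.get)

print(route(rng.normal(size=32)))
```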
Fri 11:35 a.m. - 11:45 a.m. | Does CLIP’s generalization performance mainly stem from high train-test similarity? (Oral)
Foundation models like CLIP are trained on hundreds of millions of samples and effortlessly generalize to new tasks and inputs. Out of the box, CLIP shows stellar zero-shot and few-shot capabilities on a wide range of out-of-distribution (OOD) benchmarks, which prior works attribute mainly to today's large and comprehensive training dataset (like LAION). However, it is questionable how meaningful terms like out-of-distribution generalization are for CLIP as it seems likely that web-scale datasets like LAION simply contain many samples that are similar to common OOD benchmarks originally designed for ImageNet. To test this hypothesis, we retrain CLIP on pruned LAION splits that replicate ImageNet’s train-test similarity with respect to common OOD benchmarks. While we observe a performance drop on some benchmarks, surprisingly, CLIP’s overall performance remains high. This shows that high train-test similarity is insufficient to explain CLIP’s performance.
Prasanna Mayilvahanan · Thaddäus Wiedemer · Evgenia Rusak · Matthias Bethge · Wieland Brendel
Fri 11:45 a.m. - 11:55 a.m. | Domain constraints improve risk prediction when outcome data is missing (Oral)
Machine learning models often predict the outcome resulting from a human decision. For example, if a doctor tests a patient for disease, will the patient test positive? A challenge is that the human decision censors the outcome data: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model of this setting that estimates risk for both tested and untested patients. To aid model estimation, we propose two domain-specific constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that the constraints can improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model can identify suboptimalities in test allocation and that the prevalence constraint increases the plausibility of inferences.
Sidhika Balachandar · Nikhil Garg · Emma Pierson
Fri 11:55 a.m. - 12:05 p.m. | OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection (Oral)
Out-of-Distribution (OOD) detection is critical for the reliable operation of open-world intelligent systems. Despite the emergence of an increasing number of OOD detection methods, the evaluation inconsistencies present challenges for tracking the progress in this field. OpenOOD v1 initiated the unification of the OOD detection evaluation but faced limitations in scalability and scope. In response, this paper presents OpenOOD v1.5, a significant improvement from its predecessor that ensures accurate and standardized evaluation of OOD detection methodologies at large scale. Notably, OpenOOD v1.5 extends its evaluation capabilities to large-scale datasets (ImageNet) and foundation models (e.g., CLIP and DINOv2), and expands its scope to investigate full-spectrum OOD detection which considers semantic and covariate distribution shifts at the same time. This work also contributes in-depth analysis and insights derived from comprehensive experimental results, thereby enriching the knowledge pool of OOD detection methodologies. With these enhancements, OpenOOD v1.5 aims to drive advancements and offer a more robust and comprehensive evaluation benchmark for OOD detection research.
Jingyang Zhang · Jingkang Yang · Pengyun Wang · Haoqi Wang · Yueqian Lin · Haoran Zhang · Yiyou Sun · Xuefeng Du · Kaiyang Zhou · Wayne Zhang · Yixuan Li · Ziwei Liu · Yiran Chen · Hai Li
Fri 12:05 p.m. - 12:15 p.m. | SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore (Oral)
The legality of training language models (LMs) on copyrighted or otherwise restricted data is under intense debate. However, as we show, model performance significantly degrades if trained only on low-risk text (e.g., out-of-copyright books or government documents), due to its limited size and domain coverage. We present SILO, a new language model that manages this risk-performance tradeoff during inference. SILO is built by (1) training a parametric LM on the Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text and (2) augmenting it with a more general and easily modifiable nonparametric datastore (e.g., containing copyrighted books or news) that is only queried during inference. The datastore allows use of high-risk data without training on it, supports sentence-level data attribution, and enables data producers to opt out from the model by removing content from the store. These capabilities can foster compliance with data-use regulations such as the fair use doctrine in the United States and the GDPR in the European Union. Our experiments show that the parametric LM struggles on its own with domains not covered by OLC. However, access to the datastore greatly improves out-of-domain performance, closing 90% of the performance gap with an LM trained on the Pile, a more diverse corpus with mostly high-risk text. We also analyze which nonparametric approach works best, where the remaining errors lie, and how performance scales with datastore size. Our results suggest that it is possible to build high-quality language models while mitigating legal risk.
Sewon Min · Suchin Gururangan · Eric Wallace · Weijia Shi · Hannaneh Hajishirzi · Noah Smith · Luke Zettlemoyer
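Querying a nonparametric datastore only at inference time is typically realized as a kNN-LM-style interpolation between the parametric next-token distribution and a distribution built from retrieved neighbors. The sketch below shows that interpolation on toy data; the retrieval scheme, temperature, and mixing weight `lam` are assumptions rather than SILO's exact configuration.

```python
# Sketch of kNN-style datastore interpolation at inference time (one possible
# nonparametric approach; the weight `lam` and toy data are assumptions).
import numpy as np

def knn_datastore_probs(query, keys, next_tokens, vocab_size, k=4, temp=1.0):
    """Turn the k nearest datastore entries into a distribution over the vocabulary."""
    d = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(d)[:k]
    w = np.exp(-d[nn] / temp)
    p = np.zeros(vocab_size)
    np.add.at(p, next_tokens[nn], w)                  # put mass on the neighbors' next tokens
    return p / p.sum()

def silo_style_probs(p_lm, query, keys, next_tokens, lam=0.3):
    p_knn = knn_datastore_probs(query, keys, next_tokens, vocab_size=len(p_lm))
    return lam * p_knn + (1.0 - lam) * p_lm           # datastore influences inference only

rng = np.random.default_rng(0)
p_lm = rng.dirichlet(np.ones(10))                     # parametric LM next-token distribution
keys = rng.normal(size=(100, 8))                      # datastore keys (contexts)
next_tokens = rng.integers(0, 10, 100)                # datastore values (next tokens)
print(silo_style_probs(p_lm, rng.normal(size=8), keys, next_tokens))
```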
Fri 12:15 p.m. - 12:40 p.m. | Invited Talk 3 (Invited Talk) | Aditi Raghunathan
Fri 12:40 p.m. - 1:05 p.m. | Invited Talk 4 (Invited Talk) | Hoifung Poon
Fri 1:05 p.m. - 1:30 p.m. | Coffee Break (Break)
Fri 1:30 p.m. - 2:00 p.m. | Invited Talk 5 (Invited Talk) | Ludwig Schmidt
Fri 2:00 p.m. - 2:50 p.m. | Panel Discussion
Fri 2:50 p.m. - 3:00 p.m. | Closing Remarks
The SVHN Dataset Is Deceptive for Probabilistic Generative Models Due to a Distribution Mismatch (Poster)
The Street View House Numbers (SVHN) dataset is a popular benchmark dataset in deep learning. Originally designed for digit classification tasks, the SVHN dataset has been widely used as a benchmark for various other tasks including generative modeling. However, with this work, we aim to warn the community about an issue of the SVHN dataset as a benchmark for generative modeling tasks: we discover that the official training and test sets of the SVHN dataset are not drawn from the same distribution. We empirically show that this distribution mismatch has little impact on the classification task (which may explain why this issue has not been detected before), but it severely affects the evaluation of probabilistic generative models, such as Variational Autoencoders and diffusion models. As a workaround, we propose to mix and re-split the official training and test set when SVHN is used for tasks other than classification. We publish a new split and the corresponding indices we used to create it.
Tim Xiao · Johannes Zenn · Robert Bamler
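The proposed workaround (pool the official training and test sets, then re-split) takes only a few lines; the split ratio and seed below are illustrative and do not reproduce the authors' published split.

```python
# Sketch: pool the official SVHN train and test splits, then re-split with a fixed
# seed so both halves come from the same mixture. Ratio and seed are assumptions.
import numpy as np

def resplit(n_train, n_test, test_fraction=0.1, seed=0):
    idx = np.random.default_rng(seed).permutation(n_train + n_test)
    cut = int(len(idx) * test_fraction)
    return idx[cut:], idx[:cut]        # indices into the concatenated (train + test) dataset

new_train_idx, new_test_idx = resplit(73257, 26032)   # official SVHN split sizes
print(len(new_train_idx), len(new_test_idx))
```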
Understanding Catastrophic Forgetting in Language Models via Implicit Inference (Poster)
We lack a systematic understanding of the effects of fine-tuning (via methods such as instruction-tuning or reinforcement learning from human feedback), particularly on tasks outside the narrow fine-tuning distribution. In a simplified scenario, we demonstrate that improving performance on fine-tuning tasks comes at the expense of other pretraining capabilities. We hypothesize that models implicitly infer the task of the prompt and that fine-tuning skews this inference towards fine-tuning tasks. We find that artificially making the task look farther from the fine-tuning distribution while requiring the same capability can recover some of the pretraining capabilities on our synthetic setup. Since real fine-tuning distributions are predominantly English, we apply conjugate prompting to recover pretrained capabilities in LLMs by simply translating the prompts to different languages. This allows us to recover the in-context learning abilities lost via instruction tuning, and more concerningly, recover harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT. |
Suhas Kotha · Jacob Springer · Aditi Raghunathan
Predicting the Performance of Foundation Models via Agreement-on-the-Line (Poster)
Estimating out-of-distribution performance is critical to safely deploying machine learning models. Baek et al. showed that the phenomenon "agreement-on-the-line" (AGL) can be a reliable method for predicting OOD accuracy of models in an ensemble of CNNs trained from scratch. The current practice is to lightly fine-tune foundation models, but it is unclear whether such fine-tuning can yield the sufficiently diverse models needed for AGL based methods to work. In this paper, we develop methods for reliably applying AGL based OOD estimation to fine-tuned foundation models. In particular, we first study the case of fine-tuning a single foundation model, where we extensively show how different types of randomness contribute to the AGL of the resulting model sets; we find, somewhat surprisingly, that it is typically possible to obtain strong agreement via random initialization of the linear head alone. Next, we study how multiple foundation models, pretrained on different data sets but fine-tuned on the same task may produce agreement; we show, again rather surprisingly, that the diversity of such models is already sufficient and not too disparate for them to all lie on the same agreement line. In total, these methods enable reliable and efficient estimation of OOD accuracy for fine-tuned foundation models, without leveraging any labeled OOD data. |
Rahul Saxena · Aman Mehra · Taeyoun Kim · Christina Baek · J. Zico Kolter · Aditi Raghunathan
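A minimal sketch of agreement-on-the-line style OOD estimation, as used in the abstract above: pairwise agreement on ID and OOD data (which needs no OOD labels) is used to fit a line in probit scale, and each model's ID accuracy is mapped through that line to predict its OOD accuracy. The probit transform, linear fit, and synthetic predictions are assumptions, not the paper's exact recipe.

```python
# Sketch of AGL-based OOD accuracy estimation: fit the ID-vs-OOD agreement line
# (no OOD labels needed), then map each model's ID accuracy through it.
import numpy as np
from scipy.stats import norm

def pairwise_agreement(preds):
    n = len(preds)
    return np.array([(preds[i] == preds[j]).mean()
                     for i in range(n) for j in range(i + 1, n)])

def estimate_ood_accuracy(id_preds, ood_preds, id_acc):
    agr_id = norm.ppf(np.clip(pairwise_agreement(id_preds), 1e-4, 1 - 1e-4))
    agr_ood = norm.ppf(np.clip(pairwise_agreement(ood_preds), 1e-4, 1 - 1e-4))
    slope, intercept = np.polyfit(agr_id, agr_ood, deg=1)     # agreement-on-the-line fit
    return norm.cdf(slope * norm.ppf(np.clip(id_acc, 1e-4, 1 - 1e-4)) + intercept)

rng = np.random.default_rng(0)
id_preds = [rng.integers(0, 10, 1000) for _ in range(5)]      # per-model ID predictions
ood_preds = [rng.integers(0, 10, 1000) for _ in range(5)]     # per-model OOD predictions
print(estimate_ood_accuracy(id_preds, ood_preds, id_acc=np.array([0.80, 0.75, 0.82, 0.70, 0.78])))
```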
Can Transformer Models Generalize Via In-Context Learning Beyond Pretraining Data? (Poster)
Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL) -- to perform new tasks when prompted with unseen input-output examples without any explicit model training. In this work, we study how effectively transformers can generalize beyond their pretraining data mixture, comprised of one or multiple function classes, to identify and learn new functions in-context which are outside the pretraining distribution. To investigate this question in a controlled setting, we focus on the transformers' ability to in-context learn functions from simulated data. While these models do well at generalizing to new functions within the pretrained function class, when presented with tasks or functions which are out-of-distribution from their pretraining data, we demonstrate various failure modes of transformers. Together, our results suggest that the impressive ICL abilities of high-capacity transformer models may be more closely tied to the coverage of their pretraining data mixtures than inductive biases that create fundamental generalization capabilities.
Steve Yadlowsky · Lyric Doshi · Nilesh Tripuraneni
Iteratively Refined Behavior Regularization for Offline Reinforcement Learning (Poster)
One of the fundamental challenges for offline reinforcement learning (RL) is ensuring robustness to the data distribution. Whether the data originates from a near-optimal policy or not, we anticipate that an algorithm should demonstrate its ability to learn an effective control policy that seamlessly aligns with the inherent distribution of offline data. Unfortunately, behavior regularization, a simple yet effective offline RL algorithm, tends to struggle in this regard. In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration. Our key observation is that by iteratively refining the reference policy used for behavior regularization, conservative policy update guarantees gradual improvement, while also implicitly avoiding querying out-of-sample actions to prevent catastrophic learning failures. We prove that in the tabular setting this algorithm is capable of learning the optimal policy covered by the offline dataset, commonly referred to as the in-sample optimal policy. We then explore several implementation details of the algorithm when function approximations are applied. The resulting algorithm is easy to implement, requiring only a few lines of code modification to existing methods. Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks, clearly demonstrating its superiority over behavior regularization.
Xiaohan Hu · Yi Ma · Chenjun Xiao · YAN ZHENG · Jianye Hao
Probing the Equivariance of Image Embeddings (Poster)
Probes are small networks that predict properties of underlying data from embeddings, and they provide a targeted way to illuminate the information in embeddings. While analysis with probes has become standard in NLP, there has been less exploration in vision. Our goal is to understand the invariance vs. equivariance of popular image embeddings (e.g., MAE, SimCLR, or CLIP) under certain distribution shifts. By doing so, we investigate what visual aspects from the raw images are encoded into the embeddings by these foundation models. Our probing is based on a systematic transformation prediction task that measures the visual content of embeddings along many axes, including neural style transfer, recoloring, icon/text overlays, noising, and blurring. Surprisingly, six embeddings (including SimCLR) encode enough non-semantic information to identify dozens of transformations. We also consider a generalization task, where we group similar transformations and hold out several for testing. Image-text models (CLIP, ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN, MAE). Our results show that embeddings from foundation models are equivariant and encode more non-semantic features than a supervised baseline. Hence, their OOD generalization abilities are not due to invariance to such distribution shifts. |
Cyrus Rashtchian · Charles Herrmann · Chun-Sung Ferng · Ayan Chakrabarti · Dilip Krishnan · Deqing Sun · Da-Cheng Juan · Andrew Tomkins
Exploring Generalisability of Self-Distillation with No Labels for SAR-Based Vegetation Prediction (Poster)
In this work we pre-train a DINO-ViT based model using two Synthetic Aperture Radar datasets (S1GRD or GSSIC) across three regions (China, Conus, Europe). We fine-tune the models on smaller labeled datasets to predict vegetation percentage, and empirically study the connection between the embedding space of the models and their ability to generalize across diverse geographic regions and to unseen data. For S1GRD, embedding spaces of different regions are clearly separated, while GSSIC's overlaps. Positional patterns remain during fine-tuning, and greater distances in embeddings often result in higher errors for unfamiliar regions. With this, our work increases our understanding of generalizability for self-supervised models applied to remote sensing. |
Laura Martínez-Ferrer · Anna Jungbluth · Joseph Alejandro Gallego Mejia · Matthew Allen · Francisco Dorr · Freddie Kalaitzis · Raul Ramos-Pollán
AutoFT: Robust Fine-Tuning by Optimizing Hyperparameters on OOD Data (Poster)
Foundation models encode a rich representation that can be adapted to a desired task by fine-tuning on task-specific data. However, fine-tuning a model on one particular data distribution often compromises the model's original performance on other distributions. Current methods for robust fine-tuning utilize various hand-crafted regularization techniques to constrain the fine-tuning process towards the base foundation model. Yet, it is hard to directly specify what characteristics of the foundation model to retain during fine-tuning, as this is influenced by the complex interplay between the pre-training, fine-tuning, and evaluation distributions. We propose AutoFT, a data-driven method for guiding foundation model adaptation: optimizing hyperparameters for fine-tuning with respect to post-adaptation performance on a small out-of-distribution (OOD) validation set. We find that when optimizing hyperparameters for OOD generalization, it is especially beneficial to use a highly expressive hyperparameter space such as per-layer learning rates and loss weight coefficients. Our evaluation demonstrates state-of-the-art performance on OOD distributions unseen during fine-tuning and hyperparameter optimization.
Caroline Choi · Yoonho Lee · Annie Chen · Allan Zhou · Aditi Raghunathan · Chelsea Finn
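The data-driven recipe above amounts to searching an expressive hyperparameter space and scoring each candidate by post-fine-tuning accuracy on a small OOD validation set. A rough sketch follows, with `fine_tune` and `evaluate` as placeholders for a real training and evaluation pipeline and random search standing in for whatever optimizer the authors actually use.

```python
# Sketch: search per-layer learning rates and loss weights, keep the configuration
# with the best post-fine-tuning score on a small OOD validation set.
import random

def sample_config(n_layers=12):
    return {"layer_lrs": [10 ** random.uniform(-6, -3) for _ in range(n_layers)],
            "ce_weight": random.uniform(0.1, 1.0),
            "reg_weight": random.uniform(0.0, 1.0)}

def autoft(base_model, task_data, ood_val, fine_tune, evaluate, trials=50):
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = sample_config()
        model = fine_tune(base_model, task_data, cfg)
        score = evaluate(model, ood_val)           # OOD validation performance drives the search
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg

# Toy usage with stand-in callables, just to show the control flow.
cfg = autoft(base_model=None, task_data=None, ood_val=None,
             fine_tune=lambda m, d, c: c,
             evaluate=lambda m, v: -sum(m["layer_lrs"]), trials=10)
print(cfg["ce_weight"])
```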
Turn Down the Noise: Leveraging Diffusion Models for Test-time Adaptation via Pseudo-label Ensembling (Poster)
The goal of test-time adaptation is to adapt a source-pretrained model to a target domain without relying on any source data. Typically, this is either done by updating the parameters of the model (model adaptation) using inputs from the target domain or by modifying the inputs themselves (input adaptation). However, methods that modify the model suffer from the issue of compounding noisy updates whereas methods that modify the input need to adapt to every new data point from scratch while also struggling with certain distribution shifts. We introduce D-TAPE (Diffusion infused Test-time Adaptation via Pseudo-label Ensembling) which leverages a pre-trained diffusion model to project the target domain images closer to the source domain and iteratively updates the model via a pseudo-label ensembling scheme. D-TAPE combines the advantages of model and input adaptations while mitigating their shortcomings. Our experiments on CIFAR-10C demonstrate D-TAPE's superiority, outperforming the strongest baseline by an average of 1.7% across 15 diverse corruptions and surpassing the strongest input adaptation baseline by an average of 18%. |
Mrigank Raman · Rohan Shah · Akash Kannan · Pranit Chawla
Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models (Poster)
We consider the problem of online finetuning the parameters of a language model at test time, also known as dynamic evaluation. While it is generally known that this approach improves the overall predictive performance, especially when considering distributional shift between training and evaluation data, we here emphasize the perspective that online adaptation turns parameters into temporally changing states and provides a form of context-length extension with memory in weights, more in line with the concept of memory in neuroscience. We pay particular attention to the speed of adaptation (in terms of sample efficiency), sensitivity to overall distributional drift, and computational overhead for performing gradient computation and parameter updates. Our empirical study provides insights on when online adaptation is particularly interesting. We highlight that with online adaptation the conceptual distinction between in-context learning and finetuning blurs: Both are methods to condition the model on previously observed tokens.
Amal Rannen-Triki · Jorg Bornschein · Razvan Pascanu · Alexandre Galashov · Michalis Titsias · Marcus Hutter · András György · Yee Whye Teh
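Dynamic evaluation as described above can be sketched as a loop that first scores each incoming chunk of test tokens and then takes a gradient step on it, so the weights act as a slowly updated memory. The toy model, chunking, and learning rate are assumptions.

```python
# Sketch of dynamic evaluation: predict a chunk of test tokens, then adapt the
# parameters on that same chunk before moving on.
import torch

def dynamic_eval(model, token_stream, chunk_size=32, lr=1e-5):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_loss = 0.0
    for i in range(0, len(token_stream) - chunk_size, chunk_size):
        x = token_stream[i:i + chunk_size].unsqueeze(0)
        y = token_stream[i + 1:i + chunk_size + 1].unsqueeze(0)
        logits = model(x)                                  # evaluate first...
        loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), y.flatten())
        total_loss += loss.item()
        opt.zero_grad(); loss.backward(); opt.step()       # ...then adapt on what was just seen
    return total_loss

# Toy usage: an embedding + linear head standing in for a language model.
vocab = 100
model = torch.nn.Sequential(torch.nn.Embedding(vocab, 64), torch.nn.Linear(64, vocab))
print(dynamic_eval(model, torch.randint(0, vocab, (1024,))))
```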
Tackling Concept Shift in Text Classification using Entailment-style modeling (Poster)
Pre-trained language models (PLMs) have seen tremendous success in text classification (TC) problems in the context of Natural Language Processing (NLP). In many real-world text classification tasks, the class definitions being learned do not remain constant but rather change with time - this is known as concept shift. Most techniques for handling concept shift rely on retraining the old classifiers with the newly labelled data. However, given the amount of training data required to fine-tune large DL models for the new concepts, the associated labelling costs can be prohibitively expensive and time-consuming. In this work, we propose a reformulation, converting vanilla classification into an entailment-style problem that requires significantly less data to re-train the text classifier to adapt to new concepts. We demonstrate the effectiveness of our proposed method on both real-world and synthetic datasets, achieving absolute F1 gains of up to 7% and 40%, respectively, in few-shot settings. Further, upon deployment, our solution also helped save 75% of labeling costs overall.
Sumegh Roychowdhury · Siva Rajesh Kasa · Karan Gupta · Prasanna Srinivasa Murthy · Alok Chandra
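The entailment-style reformulation can be illustrated by the data transformation itself: each (text, class) example becomes one (premise, hypothesis, entailed/not-entailed) pair per candidate class, so an NLI-style model can be re-trained on far fewer labels when class definitions drift. The hypothesis template and toy examples below are assumptions, not the authors' template.

```python
# Sketch: convert vanilla classification data into entailment pairs.
from typing import List, Tuple

def to_entailment(texts: List[str], labels: List[str], class_names: List[str]
                  ) -> List[Tuple[str, str, int]]:
    pairs = []
    for text, label in zip(texts, labels):
        for cls in class_names:
            hypothesis = f"This text is about {cls}."
            pairs.append((text, hypothesis, int(cls == label)))   # 1 = entailed, 0 = not
    return pairs

examples = to_entailment(
    ["battery drains within an hour", "refund has not arrived"],
    ["hardware issue", "billing issue"],
    ["hardware issue", "billing issue"],
)
for premise, hypothesis, y in examples:
    print(y, "|", premise, "=>", hypothesis)
```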
Reliable Test-Time Adaptation via Agreement-on-the-Line (Poster)
Test-time adaptation (TTA) methods aim to improve robustness to distribution shifts by adapting models using unlabeled data from the shifted test distribution. However, there remain unresolved challenges that undermine the reliability of TTA, which include difficulties in evaluating TTA performance, miscalibration after TTA, and unreliable hyperparameter tuning for adaptation. In this work, we make a notable and surprising observation that TTAed models strongly show the agreement-on-the-line phenomenon (Baek et al., 2022) across a wide range of distribution shifts. We find such linear trends occur consistently in a wide range of models adapted with various hyperparameters, and persist in distributions where the phenomenon fails to hold in vanilla models (i.e., before adaptation). We leverage these observations to make TTA methods more reliable from three perspectives: (i) estimating OOD accuracy (without labeled data) to determine when TTA helps and when it hurts, (ii) calibrating TTAed models again without any labeled data, and (iii) reliably determining hyperparameters for TTA without any labeled validation data. Through extensive experiments, we demonstrate that various TTA methods can be precisely evaluated, both in terms of their improvements and degradations. Moreover, our proposed methods on unsupervised calibration and hyperparameter tuning for TTA achieve results close to the ones assuming access to ground-truth labels, in both OOD accuracy and calibration error.
Eungyeup Kim · Mingjie Sun · Aditi Raghunathan · J. Zico Kolter
Reward Model Underspecification in Language Model Alignment (Poster)
Reward models play a key role in aligning language model applications towards human preferences. However, this setup can create a dynamic in which the policy model has the incentive to exploit errors in the reward model to achieve high reward. This means that the success of reward-based alignment depends on the ability of reward models to transfer to new distributions created by the aligned policy model. We show that reward models are \emph{underspecified}, in the sense that models that perform similarly in-distribution can yield very different rewards on policy model outputs. These differences propagate to the aligned policies, which we show to be heavily influenced by the random seed used during \emph{pretraining} of the reward model. We show that even a simple alignment strategy --- best-of-$n$ reranking --- creates a semi-adversarial dynamic between the policy and reward models, promoting outputs on which the reward models are more likely to disagree. Finally, we show that a simple ensembling strategy can help to address this issue.
Jacob Eisenstein · Jonathan Berant · Chirag Nagpal · Alekh Agarwal · Ahmad Beirami · Alexander D'Amour · Krishnamurthy Dvijotham · Katherine Heller · Stephen Pfohl · Deepak Ramachandran
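A rough sketch of best-of-$n$ reranking and of one possible ensembling fix: generate $n$ candidates, score each with several independently trained reward models, and rerank by a pessimistic ensemble score instead of a single reward model. The toy generator, reward models, and the min-aggregation rule are assumptions, not the paper's exact strategy.

```python
# Sketch: best-of-n reranking with an ensemble of reward models, using the most
# pessimistic ensemble member to reduce exploitation of any single reward model.
import random

def best_of_n(prompt, generate, reward_models, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    def ensemble_score(text):
        return min(rm(prompt, text) for rm in reward_models)   # worst case over the ensemble
    return max(candidates, key=ensemble_score)

# Toy usage with stand-in "reward models" that agree only up to noise.
random.seed(0)
rms = [lambda p, t, b=b: len(t) / 40 + random.gauss(0, 0.1) + b for b in (0.0, 0.05, -0.05)]
print(best_of_n("explain distribution shift",
                lambda p: "answer " * random.randint(1, 5), rms))
```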
Learning Causally-Aware Representations of Multi-Agent Interactions (Poster)
Modeling spatial-temporal interactions between neighboring agents is at the heart of multi-agent problems such as motion forecasting and crowd navigation. Despite notable progress, it remains unclear to which extent modern representations can capture the causal relationships behind agent interactions. In this work, we take an in-depth look at the causal awareness of the learned representations, from computational formalism to controlled simulations to real-world practice. First, we cast doubt on the notion of non-causal robustness studied in the recent CausalAgents benchmark. We show that recent representations are already partially resilient to perturbations of non-causal agents, and yet modeling indirect causal effects involving mediator agents remains challenging. Further, we introduce a simple but effective regularization approach leveraging causal annotations of varying granularity. Through controlled experiments, we find that incorporating finer-grained causal annotations not only leads to higher degrees of causal awareness but also yields stronger out-of-distribution robustness. Finally, we extend our method to a sim-to-real causal transfer framework by means of cross-domain multi-task learning, which boosts generalization in practical settings even without real-world annotations. We hope our work provides more clarity to the challenges and opportunities of learning causally-aware representations in the multi-agent context while making a first step towards a practical solution. |
Yuejiang Liu · Ahmad Rahimi · Po-Chien Luan · Frano Rajič · Alexandre Alahi
Fusing Models with Complementary Expertise (Poster)
Training AI models that generalize across tasks and domains has long been among the open problems driving AI research. The emergence of Foundation Models made it easier to obtain expert models for a given task, but the heterogeneity of data that may be encountered at test time often means that any single expert is insufficient. We consider the Fusion of Experts (FoE) problem of fusing outputs of expert models with complementary knowledge of the data distribution and formulate it as an instance of supervised learning. Our method is applicable to both discriminative and generative tasks and leads to significant performance improvements in image and text classification, text summarization, multiple-choice QA, and automatic evaluation of generated text. We also extend our method to the "frugal" setting where it is desired to reduce the number of expert model evaluations at test time. |
Hongyi Wang · Felipe Maia Polo · Yuekai Sun · Souvik Kundu · Eric Xing · Mikhail Yurochkin
On Mitigating Shortcut Learning for Fair Chest X-ray Classification under Distribution Shift (Poster)
As machine learning models reach human-level performance on many real-world medical imaging tasks, it is crucial to consider the mechanisms they may be using to make such predictions. Prior work has demonstrated the surprising ability of deep learning models to recover demographic information from chest X-rays. This suggests that disease classification models could potentially be utilizing these demographics as shortcuts, leading to previously observed performance gaps between demographic groups. In this work, we start by investigating whether chest X-ray models indeed use demographic information as shortcuts when classifying four different diseases. Next, we apply five existing methods for tackling spurious correlations, and examine performance and fairness both for the original dataset and five external hospitals. Our results indicate that shortcut learning can be corrected to remedy in-distribution fairness gaps, though this reduction often does not transfer under domain shift. We also find trade-offs between fairness and other important metrics, raising the question of whether it is beneficial to remove such shortcuts in the first place.
Yuzhe Yang · Haoran Zhang · Dina Katabi · Marzyeh Ghassemi
Are all classes created equal? Domain Generalization for Domain-Linked Classes (Poster)
Domain generalization (DG) focuses on transferring domain-invariant knowledge from multiple source domains (available at train time) to an $\textit{a priori}$ unseen target domain(s). This task implicitly assumes that a class of interest is expressed in multiple source domains ($\textit{domain-shared}$), which helps break the spurious correlations between domain and class and enables domain-invariant learning. However, we observe that this results in extremely poor generalization performance for classes only expressed in a specific domain ($\textit{domain-linked}$). To this end, we develop a contrastive and fairness based algorithm -- $\texttt{FOND}$ -- to learn generalizable representations for these domain-linked classes by transferring useful representations from domain-shared classes. We perform rigorous experiments against popular baselines across benchmark datasets to demonstrate that given a sufficient number of domain-shared classes $\texttt{FOND}$ achieves SOTA results for domain-linked DG.
Kimathi Kaai · Saad Hossain · Sirisha Rambhatla
Discovering environments with XRM (Poster)
Successful out-of-distribution generalization requires environment annotations. Unfortunately, these are resource-intensive to obtain, and their relevance to model performance is limited by the expectations and perceptual biases of human annotators. Therefore, to enable robust AI systems across applications, we must develop algorithms to automatically discover environments inducing broad generalization. Current proposals, which divide examples based on their training error, suffer from one fundamental problem. These methods add hyper-parameters and early-stopping criteria that are impossible to tune without a validation set with human-annotated environments, the very information subject to discovery. In this paper, we propose Cross-Risk-Minimization (XRM) to address this issue. XRM trains two twin networks, each learning from one random half of the training data, while imitating confident held-out mistakes made by its sibling. XRM provides a recipe for hyper-parameter tuning, does not require early-stopping, and can discover environments for all training and validation data. Domain generalization algorithms built on top of XRM environments achieve oracle worst-group-accuracy, solving a long-standing problem in out-of-distribution generalization. |
Mohammad Pezeshki · Diane Bouchacourt · Mark Ibrahim · Nicolas Ballas · Pascal Vincent · David Lopez-Paz
Data Filtering Networks (Poster)
Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI’s WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data. |
Alex Fang · Albin Madappally Jose · Amit Jain · Ludwig Schmidt · Alexander Toshev · Vaishaal Shankar
Skill-Mix: A Flexible and Expandable Family of Evaluations for AI Models (Poster)
With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. This capability to combine skills plays an important role in (human) pedagogy and also in a recent paper on emergence phenomena (Arora & Goyal, 2023). A new evaluation, Skill-Mix, is introduced to measure this capability. Using a list of $N$ skills the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text it has not seen in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model. Administering a version of Skill-Mix to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities ---including suspected cases of ``cramming for the leaderboard''--- that are not captured by their ranking on popular LLM leaderboards. Our methodology can flexibly change to future models and model capabilities, by expanding the set of skills being tested and increasing $k$. By publicly releasing the Skill-Mix methodology, we hope it may grow into an eco-system of open evaluations for AI capabilities, including in multi-modal settings. These may serve as more trustworthy gauges of model capabilities than current leaderboards.
Dingli Yu · Simran Kaur · Arushi Gupta · Jonah Brown-Cohen · Anirudh Goyal · Sanjeev Arora
Pseudo-Calibration: Improving Predictive Uncertainty Estimation in Domain Adaptation (Poster)
Unsupervised domain adaptation (UDA) improves model accuracy in an unlabeled target domain using a labeled source domain. However, UDA models often lack calibrated predictive uncertainty on target data, posing risks in safety-critical applications. In this paper, we address this under-explored challenge with Pseudo-Calibration (PseudoCal), a novel post-hoc calibration framework. In contrast to prior approaches, we consider UDA calibration as a target-domain specific unsupervised problem rather than a \emph{covariate shift} problem across domains. With a synthesized labeled pseudo-target set that captures the structure of the real target, we turn the unsupervised calibration problem into a supervised one, readily solvable with \emph{temperature scaling}. Extensive empirical evaluation across 5 diverse UDA scenarios involving 10 UDA methods, including unsupervised fine-tuning of foundation models such as CLIP, consistently demonstrates the superior performance of PseudoCal over alternative calibration methods. |
Dapeng Hu · Jian Liang · Xinchao Wang · Chuan Sheng Foo
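Once a labeled pseudo-target set has been synthesized, the remaining step described above is ordinary temperature scaling on that set, as sketched below; the construction of the pseudo-target set itself is method-specific and is represented here only by placeholder logits and labels.

```python
# Sketch: standard temperature scaling fitted on a (pseudo-)labeled set; the learned
# temperature T is then used to divide target-domain logits at prediction time.
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor, steps=200, lr=0.01):
    log_t = torch.zeros(1, requires_grad=True)            # optimize T > 0 via log-parameterization
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

pseudo_logits = torch.randn(512, 10) * 3                  # placeholder pseudo-target logits
pseudo_labels = pseudo_logits.argmax(dim=1)               # placeholder pseudo-labels
T = fit_temperature(pseudo_logits, pseudo_labels)
print("temperature:", T)
```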
Ask Your Shift if Pre-Training is Right for You (Poster)
Pre-training is a widely used approach to develop models that are robust to distribution shifts. However, in practice, its effectiveness varies: fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others (compared to training from scratch). In this work, we seek to characterize the failure modes that pre-training can and cannot address. In particular, we focus on two possible failure modes of models under distribution shift: poor extrapolation (e.g., they cannot generalize to a different domain) and biases in the training data (e.g., they rely on spurious features). Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases. After providing theoretical motivation and empirical evidence for this finding, we explore an implication for developing robust models: fine-tuning on a (very) small, non-diverse but de-biased dataset can result in significantly more robust models than fine-tuning on a large and diverse but biased dataset. |
Benjamin Cohen-Wang · Joshua Vendrow · Aleksander Madry
Maximum Likelihood Estimation is All You Need for Well-Specified Covariate Shift (Poster)
A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization---generalizing to target data whose distribution differs from that of source data. Despite its significant importance, the fundamental question of ``what are the most effective algorithms for OOD generalization'' remains open even under the standard setting of covariate shift. This paper addresses this fundamental question by proving that, surprisingly, classical Maximum Likelihood Estimation (MLE) purely using source data (without any modification) achieves the minimax optimality for covariate shift under the well-specified setting. This result holds for a very large class of parametric models, including but not limited to linear regression, logistic regression, and phase retrieval, and does not require any boundedness condition on the density ratio. This paper further complements the study by proving that for the misspecified setting, MLE can perform poorly, and the Maximum Weighted Likelihood Estimator (MWLE) emerges as minimax optimal in specific scenarios, outperforming MLE.
Jiawei Ge · Shange Tang · Jianqing Fan · Cong Ma · Chi Jin
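For reference, the two estimators contrasted above can be written as follows, with $p_S$ and $p_T$ the source and target covariate densities and $w(x) = p_T(x)/p_S(x)$ the density ratio; this is the standard formulation rather than notation taken from the paper:

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p_{\theta}(y_i \mid x_i), \qquad \hat{\theta}_{\mathrm{MWLE}} = \arg\max_{\theta} \sum_{i=1}^{n} w(x_i)\, \log p_{\theta}(y_i \mid x_i).$$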
Simplifying and Stabilizing Model Selection in Unsupervised Domain Adaptation (Poster)
Existing model selection methods for unsupervised domain adaptation (UDA) often struggle to maintain stable performance across diverse UDA methods and UDA scenarios, frequently resulting in suboptimal or even the worst hyperparameter choices. This instability limitation poses severe risks to the safe deployment of UDA models in practical scenarios, significantly impairing the practicality and reliability of these selection approaches. To address this challenge, we introduce a novel ensemble-based validation approach called EnsV, aiming to simplify and stabilize model selection in UDA. EnsV relies solely on predictions of unlabeled target data without making any assumptions about distribution shifts, offering high simplicity and versatility. Additionally, EnsV is built upon an off-the-shelf ensemble that is theoretically guaranteed to outperform the worst candidate model, ensuring high stability. In our experiments, we benchmark EnsV against eight competitive model selection approaches, evaluating its performance across 12 UDA methods, 5 diverse UDA benchmarks, and 5 popular UDA scenarios. The results consistently highlight EnsV as a highly simple, versatile, and stable choice for practical model selection in UDA scenarios.
Dapeng Hu · Romy Luo · Jian Liang · Chuan Sheng Foo
Context-Aware Meta-Learning (Poster)
Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach---without meta-training or fine-tuning---exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks. |
Christopher Fifty · Dennis Duan · Ronald Junkins · Ehsan Amid · Jure Leskovec · Christopher Ré · Sebastian Thrun
Transfer Learning, Reinforcement Learning for Adaptive Control Optimization under Distribution Shift (Poster)
Many control systems rely on a pipeline of machine learning models and hand-coded rules to make decisions. However, due to changes in the operating environment, these rules require constant tuning to maintain optimal system performance. Reinforcement learning (RL) can automate the online optimization of rules based on incoming data. However, RL requires extensive training data and exploration, which limits its application to new rules or those with sparse data. Here, we propose a transfer learning approach called Learning from Behavior Prior (LBP) to enable fast, sample-efficient RL optimization by transferring knowledge from an expert controller. We demonstrate this approach by optimizing the rule thresholds in a simulated control pipeline across differing operating conditions. Our method converges 5x faster than vanilla RL, with greater robustness to distribution shift between the expert and target environments. LBP reduces negative impacts during live training, enabling automated optimization even for new controllers.
Pankaj Rajak · Wojciech Kowalinski · Fei Wang
Context is Environment (Poster)
Two lines of work are taking center stage in AI research. On the one hand, increasing efforts are being made to build models that generalize out-of-distribution (OOD). Unfortunately, a hard lesson so far is that no proposal convincingly outperforms a simple empirical risk minimization baseline. On the other hand, large language models (LLMs) have erupted as algorithms able to learn in-context, generalizing on-the-fly to the eclectic contextual circumstances. We argue that context is environment, and posit that in-context learning holds the key to better domain generalization. Via extensive theory and experiments, we show that paying attention to context (unlabeled examples as they arrive) allows our proposed In-Context Risk Minimization (ICRM) algorithm to zoom in on the test environment risk minimizer, leading to significant OOD performance improvements.
Sharut Gupta · David Lopez-Paz · Stefanie Jegelka · Kartik Ahuja
HePCo: Data-Free Heterogeneous Prompt Consolidation for Continual Federated Learning (Poster)
In this paper, we focus on the important yet understudied problem of Continual Federated Learning (CFL), where a server communicates with a set of clients to incrementally learn new concepts over time without sharing or storing any data. The complexity of this problem is compounded by challenges from both the Continual and Federated Learning perspectives. Specifically, models trained in a CFL setup suffer from catastrophic forgetting which is exacerbated by data heterogeneity across clients. Existing attempts at this problem tend to impose large overheads on clients and communication channels or require access to stored data which renders them unsuitable for real-world use due to privacy. We study this problem in the context of Foundation Models and showcase their effectiveness in mitigating forgetting while minimizing overhead costs and without requiring access to any stored data. We achieve this by leveraging a prompting based approach and proposing a novel and lightweight generation and distillation scheme to aggregate client models at the server. Our approach outperforms both existing methods and our own baselines by more than 7\% on challenging image-classification benchmarks while significantly reducing communication and client-level computation costs. |
Shaunak Halbe · James S Smith · Junjiao Tian · Zsolt Kira
A Nearest Neighbor-Based Concept Drift Detection Strategy for Reliable Condition Monitoring (Poster)
Condition monitoring is one of the most prominent industrial use cases for machine learning today. As condition monitoring applications are commonly developed using static training datasets, their long-term performance is vulnerable to concept drift in the form of time-dependent changes in environmental and operating conditions as well as data quality problems or sensor drift. When the data distribution changes, machine learning models can fail catastrophically. We show that two-sample tests of homogeneity, which form the basis of most of the available concept drift detection strategies, fail in this domain, as the live data is highly correlated and does not follow the assumption of being independent and identically distributed (i.i.d.) that is often made in academia. We propose a novel drift detection approach called Localized Reference Drift Detection (LRDD) to address this challenge by refining the reference set for the two-sample tests. We demonstrate the performance of the proposed approach in a preliminary evaluation on a tool condition monitoring case study.
Nicolas Jourdan
Improving Domain Generalization in Contrastive Learning via Domain-Aware Temperature Control (Poster)
Pre-training with contrastive learning is a powerful method for learning from sparsely labeled data. However, performance can drop considerably when there is a shift in the distribution of data available during training and test time. We study this phenomenon in the domain generalization setting in which the training data come from multiple domains, and the test data come from an unseen domain. We present a new method for contrastive learning that incorporates domain labels to increase the domain invariance of learned representations, leading to improved out-of-distribution generalization. Our method adjusts the temperature parameter in the InfoNCE loss -- which controls the relative weighting of negative pairs -- using the likelihood that a negative sample comes from the same domain as the anchor. This upweights pairs from more similar domains, forcing the model to discriminate samples based on domain-irrelevant features. To assess domain similarity, we train a domain discriminator on the learned embeddings -- critically, this allows us to adapt the weighting as the amount of domain information in the embedding space changes. Through preliminary experiments on a variant of the MNIST dataset, we demonstrate that our method yields better out-of-distribution performance compared to baselines, especially in regimes of high label sparsity (e.g., 1\%). Furthermore, our method concurrently maintains strong in-distribution task performance, greatly outperforming baselines on this measure. |
Robert Lewis · Katie Matton · Rosalind Picard · John Guttag
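One way to realize the temperature adjustment described above is to give each negative pair its own temperature, scaled by a domain discriminator's probability that the negative shares the anchor's domain. The sketch below follows that idea; the specific scaling rule and toy tensors are assumptions, not the authors' formulation.

```python
# Sketch of an InfoNCE loss with per-negative temperatures driven by a domain
# discriminator: negatives judged likely to share the anchor's domain get a lower
# effective temperature (i.e., more weight).
import torch
import torch.nn.functional as F

def domain_aware_info_nce(anchor, positive, negatives, p_same_domain,
                          base_temp=0.1, scale=0.5):
    """anchor, positive: (d,); negatives: (n, d); p_same_domain: (n,) in [0, 1]."""
    anchor, positive = F.normalize(anchor, dim=0), F.normalize(positive, dim=0)
    negatives = F.normalize(negatives, dim=1)
    pos_logit = (anchor @ positive) / base_temp
    neg_temps = base_temp * (1.0 - scale * p_same_domain)     # similar-domain negatives upweighted
    neg_logits = (negatives @ anchor) / neg_temps
    logits = torch.cat([pos_logit.unsqueeze(0), neg_logits])
    return -F.log_softmax(logits, dim=0)[0]

loss = domain_aware_info_nce(torch.randn(64), torch.randn(64), torch.randn(16, 64),
                             p_same_domain=torch.rand(16))
print(loss.item())
```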
Stochastic linear dynamics in parameters to deal with Neural Networks plasticity loss (Poster)
Plasticity loss has become an active topic of interest in the continual learning community. Briefly, when faced with non-stationary data, standard gradient descent loses, over time, the ability to train. It can take different subtle forms, from the inability of the network to generalize to its inability to optimize the training objective, and can have different causes like ill-conditioning or the saturation of activation functions. In this work we focus on the inability of neural networks to optimize due to saturating activations, which particularly affects online reinforcement learning settings, where the learning process itself creates a non-stationary setting even if the environment is kept fixed. Recent works have proposed to address this problem by dynamically resetting units that seem inactive, allowing them to be tuned further. We explore an alternative approach based on stochastic linear dynamics in parameters, which allows us to model non-stationarity and provides a mechanism to adaptively and stochastically drift the parameters towards the prior, implementing a form of soft parameter reset.
Alexandre Galashov · Michalis Titsias · Razvan Pascanu · Yee Whye Teh · Maneesh Sahani
Can Transformers In-Context Learn Task Mixtures? (Poster)
In-context learning (ICL) refers to the ability of Large Language Models (LLMs) to perform new tasks by conditioning on input-output samples without any parameter updates. Previous work has established that, in a controlled setting, transformers can optimally perform ICL for tasks from a single task family, here a single function class, when they are pretrained on example tasks from that family. Using this setting, we probe the relationship between the pretraining data mixtures and downstream ICL performance. In particular, we empirically explore the ability of pretrained transformers to \textit{select a family of tasks} (i.e. amongst distinct function classes) and \textit{perform learning within that task family} (i.e. learn a function within a function class), all in-context. We show, for pretraining task mixtures balanced across task families, the cost of unsupervised downstream ICL task-family selection is near-zero. For task families rarely seen in pretraining, downstream ICL learning curves exhibit complex, task-dependent non-monotonic behavior. We also characterize the benefit of conditional pretraining in this simplified model, showing how task-family instructions can reduce the overhead of in-context task-family selection. |
Nilesh Tripuraneni · Lyric Doshi · Steve Yadlowsky
Evolving Domain Adaptation of Pretrained Language Models for Text Classification (Poster)
Pre-trained language models have shown impressive performance in various text classification tasks. However, the performance of these models is highly dependent on the quality and domain of the labeled examples. In dynamic real-world environments, text data content naturally evolves over time, leading to a natural $\textit{evolving domain shift}$. Over time, this continuous temporal shift impairs the performance of static models, as their training becomes increasingly outdated. To address this issue, we propose two dynamic buffer-based adaptation strategies: one utilizes self-training with pseudo-labeling, and the other employs a tuning-free, in-context learning approach for large language models (LLMs). We validate our methods with extensive experiments on two longitudinal real-world social media datasets, demonstrating their superiority compared to unadapted baselines. Furthermore, we introduce a COVID-19 vaccination stance detection dataset, serving as a benchmark for evaluating pre-trained language models within evolving domain adaptation settings.
Yun-Shiuan Chuang · Rheeya Uppaal · Yi Wu · Luhang Sun · Makesh Narsimhan Sreedhar · Sijia Yang · Timothy T Rogers · Junjie Hu
Robustness May be More Brittle than We Think under Different Degrees of Distribution Shifts (Poster)
Out-of-distribution (OOD) generalization is a complicated problem due to the idiosyncrasies of possible distribution shifts between training and test domains. Most benchmarks employ diverse datasets to address the issue; however, the degree of the distribution shift between the training domains and the test domains of each dataset remains largely fixed. Our study delves into a more nuanced evaluation setting that covers a broad range of shift degrees. We show that the robustness of neural networks can be quite brittle and inconsistent under different shift degrees, and therefore one should be more cautious in drawing conclusions from evaluations under a limited set of degrees. In addition, we find that CLIP, a representative of vision-language foundation models, can be sensitive to even minute distribution shifts of novel downstream tasks. This suggests that while pre-training may improve downstream in-distribution performance, it could have minimal or even adverse effects on generalization in certain OOD scenarios of the downstream task. |
Kaican Li · Yifan Zhang · Lanqing Hong · Zhenguo Li · Nevin L. Zhang 🔗 |
-
|
On selective classification under distribution shift
(
Poster
)
>
link
SlidesLive Video
This paper addresses the problem of selective classification for deep neural networks, where a model is allowed to abstain from low-confidence predictions to avoid potential errors. Specifically, we investigate whether the selective classification performance of ImageNet classifiers is robust to distribution shift. Motivated by the intriguing observation in recent work that many classifiers appear to have a ``broken'' confidence estimator, we start by evaluating methods to fix this issue. We focus on so-called post-hoc methods, which replace the confidence estimator of a given classifier without retraining or modifying it, thus being practically appealing. We perform an extensive experimental study of many existing and proposed confidence estimators applied to 84 pre-trained ImageNet classifiers available from popular repositories. Our results show that a simple $p$-norm normalization of the logits, followed by taking the maximum logit as the confidence estimator, can lead to considerable gains in selective classification performance, completely fixing the pathological behavior observed in many classifiers. As a consequence, the selective classification performance of any classifier becomes almost entirely determined by its corresponding accuracy. Then, we show these results are consistent under distribution shift: a method that enhances performance in the in-distribution scenario also provides similar gains under distribution shift. Moreover, although a slight degradation in selective classification performance is observed under distribution shift, this can be explained by the drop in accuracy of the classifier, together with the slight dependence of selective classification performance on accuracy.
|
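For concreteness, a small sketch of the post-hoc estimator discussed above ($p$-norm normalization of the logits followed by the maximum logit), together with a toy risk-coverage computation; the exact normalization details in the paper may differ.

```python
import numpy as np

def max_logit_pnorm(logits, p=2, eps=1e-12):
    """Confidence = largest logit after p-norm normalization of the logit vector."""
    logits = np.asarray(logits, dtype=float)
    norm = np.linalg.norm(logits, ord=p, axis=-1, keepdims=True)
    return (logits / (norm + eps)).max(axis=-1)

def selective_risk_coverage(confidence, correct, threshold):
    """Abstain below the threshold; report coverage and risk on accepted samples."""
    accept = confidence >= threshold
    coverage = accept.mean()
    risk = 1.0 - correct[accept].mean() if accept.any() else 0.0
    return coverage, risk

# toy usage with random logits standing in for a classifier's outputs
logits = np.random.randn(1000, 10)
correct = (logits.argmax(1) == np.random.randint(0, 10, 1000))
print(selective_risk_coverage(max_logit_pnorm(logits), correct, threshold=0.5))
```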
Luís Felipe Cattelan · Danilo Silva 🔗 |
-
|
Understanding subgroup performance differences of fair predictors using causal models
(
Poster
)
>
link
A common evaluation paradigm compares the performance of a machine learning model across subgroups to assess properties related to fairness. In this work, we argue that distributional differences across subgroups can render this approach misleading. We consider this as a source of confounding that can lead to differences in performance across subgroups even if the model predicts the label of interest as well as possible for each subgroup. We show that these differences in model performance can be anticipated and characterized based on the causal structure of the data generating process and the choices made during the model fitting procedure (e.g. whether subgroup membership is used as a predictor). We demonstrate how to construct alternative evaluation procedures that control for this source of confounding during evaluation by implicitly matching the distribution of confounding variables across subgroups. We emphasize that the selection of appropriate control variables requires domain knowledge and selection of contextually inappropriate control variables can produce misleading results. |
Stephen Pfohl · Natalie Harris · Chirag Nagpal · David Madras · Vishwali Mhasawade · Olawale Salaudeen · Katherine Heller · Sanmi Koyejo · Alexander D'Amour 🔗 |
-
|
AutoVP: An Automated Visual Prompting Framework and Benchmark
(
Poster
)
>
link
Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach for adapting pre-trained vision models to various downstream image-classification tasks. However, there has hitherto been little systematic study of the design space of VP and no clear benchmark for evaluating its performance. To bridge this gap, we propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with 12 downstream image-classification tasks that can serve as a holistic VP-performance benchmark. Our design space covers 1) the joint optimization of the prompts; 2) the selection of pre-trained models, including image classifiers and text-image encoders; and 3) model output mapping strategies, including nonparametric and trainable label mapping. Our extensive experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin, with up to a 6.7% improvement in accuracy, and attains a maximum performance increase of 27.5% over the linear-probing (LP) baseline. AutoVP thus makes a two-fold contribution: serving both as an efficient tool for hyperparameter tuning on VP design choices, and as a comprehensive benchmark that can reasonably be expected to accelerate VP’s development. |
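A minimal visual-prompting sketch in PyTorch for illustration: a trainable frame of pixels placed around a resized input image, which is one point in the design space described above. AutoVP's joint prompt optimization and output-mapping choices are not reproduced here, and the sizes below are placeholders.

```python
import torch
import torch.nn as nn

class PadPrompter(nn.Module):
    """Trainable pixel frame added around the (resized) input image."""
    def __init__(self, image_size=224, pad=16):
        super().__init__()
        inner = image_size - 2 * pad
        self.top = nn.Parameter(torch.zeros(1, 3, pad, image_size))
        self.bottom = nn.Parameter(torch.zeros(1, 3, pad, image_size))
        self.left = nn.Parameter(torch.zeros(1, 3, inner, pad))
        self.right = nn.Parameter(torch.zeros(1, 3, inner, pad))

    def forward(self, x):
        # x: (B, 3, inner, inner) downstream image resized to fit inside the frame
        b = x.shape[0]
        middle = torch.cat([self.left.expand(b, -1, -1, -1),
                            x,
                            self.right.expand(b, -1, -1, -1)], dim=3)
        return torch.cat([self.top.expand(b, -1, -1, -1),
                          middle,
                          self.bottom.expand(b, -1, -1, -1)], dim=2)

prompt = PadPrompter()
out = prompt(torch.randn(4, 3, 192, 192))   # -> (4, 3, 224, 224), fed to a frozen model
```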
Hsi-Ai Tsao · Lei Hsiung · Pin-Yu Chen · Sijia Liu · Tsung-Yi Ho 🔗 |
-
|
Beyond Top-Class Agreement: Using Divergences to Forecast Performance under Distribution Shift
(
Poster
)
>
link
Knowing if a model will generalize to data `in the wild' is crucial for safe deployment. To this end, we study model disagreement notions that consider the full predictive distribution, specifically disagreement based on Hellinger distance and Kullback–Leibler divergence. We find that divergence-based scores provide better test error estimates and detection rates on out-of-distribution data compared to their top-1 counterparts. Experiments involve standard vision and foundation models. |
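A short sketch of the divergence-based disagreement scores described above, computed between two models' full predictive distributions on an unlabeled batch; the aggregation into a test-error forecast is simplified here to a plain mean.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between predictive distributions, row-wise."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def hellinger_distance(p, q):
    """Hellinger distance between predictive distributions, row-wise."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2, axis=-1))

def disagreement_score(probs_a, probs_b, kind="hellinger"):
    """Divergence-based disagreement between two models, averaged over a batch;
    higher values suggest larger test error under shift."""
    fn = hellinger_distance if kind == "hellinger" else kl_divergence
    return float(np.mean(fn(probs_a, probs_b)))

# toy usage: two softmax outputs over the same unlabeled batch
a = np.random.dirichlet(np.ones(10), size=256)
b = np.random.dirichlet(np.ones(10), size=256)
print(disagreement_score(a, b), disagreement_score(a, b, kind="kl"))
```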
Mona Schirmer · Dan Zhang · Eric Nalisnick 🔗 |
-
|
Continual Learning with Low Rank Adaptation
(
Poster
)
>
link
SlidesLive Video
Recent work has shown that pretrained transformers achieve impressive performance when fine-tuned with data from the downstream problem of interest. However, they struggle to retain that performance when the data characteristics change. In this paper, we focus on continual learning, where a pre-trained transformer is updated to perform well on new data, while retaining its performance on data it was previously trained on. Earlier works have tackled this primarily through methods inspired by prompt tuning. We question this choice, and investigate the applicability of Low Rank Adaptation (LoRA) to continual learning. On a range of domain-incremental learning benchmarks, our LoRA-based solution, CoLoR, yields state-of-the-art performance, while still being as parameter efficient as the prompt tuning based methods. |
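A minimal LoRA sketch for a single linear layer, illustrating the parameter-efficient update CoLoR builds on; per-task adapter management and the rest of the CoLoR recipe are omitted, and the rank/scaling values are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze the pretrained weight, learn a low-rank additive update."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # keep pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(2, 768))    # only lora_a / lora_b receive gradients
```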
Martin Wistuba · Prabhu Teja Sivaprasad · Lukas Balles · Giovanni Zappella 🔗 |
-
|
Adaptive Sharpness-Aware Pruning for Robust Sparse Networks
(
Poster
)
>
link
Robustness and compactness are two essential attributes of deep learning models that are deployed in the real world. The goals of robustness and compactness may seem to be at odds, since robustness requires generalization across domains, while the process of compression exploits specificity in one domain. We introduce \textit{Adaptive Sharpness-Aware Pruning (AdaSAP)}, which unifies these goals through the lens of network sharpness. The AdaSAP method produces sparse networks that are robust to input variations which are \textit{unseen at training time}. We achieve this by strategically incorporating weight perturbations in order to optimize the loss landscape. This allows the model to be both primed for pruning and regularized for improved robustness. AdaSAP improves the robust accuracy of pruned models on classification and detection over recent methods by up to +6\% on OOD datasets, over a wide range of compression ratios, pruning criteria, and architectures. |
Anna Bair · Hongxu Yin · Maying Shen · Pavlo Molchanov · Jose M. Alvarez 🔗 |
-
|
Bilevel Optimization to Learn Training Distributions for Language Modeling under Domain Shift
(
Poster
)
>
link
Language models trained on very large web corpora have become a central piece of modern language processing. In this paradigm, the large, heterogeneous training set rarely matches the distribution of the application domain. This work considers modifying the training distribution when one can observe a small sample of data reflecting the test conditions. We propose an algorithm based on a recent formulation of this problem as an online bilevel optimization problem. We show that this approach compares favorably with alternative strategies from the domain adaptation literature. |
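As a rough sketch of adapting the training distribution with a small target sample, the code below performs a one-step gradient-alignment update of per-domain weights; this is a common simplification of online bilevel reweighting, not the paper's algorithm, and all names are illustrative.

```python
import torch
import torch.nn as nn

def flat_grad(loss, params):
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def reweight_step(model, loss_fn, domain_batches, target_batch, weights, lr_w=0.1):
    """Move weight toward domains whose gradient aligns with the gradient
    on the small target sample (exponentiated-gradient update)."""
    params = [p for p in model.parameters() if p.requires_grad]
    xt, yt = target_batch
    g_target = flat_grad(loss_fn(model(xt), yt), params)
    scores = []
    for xd, yd in domain_batches:
        g_d = flat_grad(loss_fn(model(xd), yd), params)
        scores.append(torch.dot(g_d, g_target))
    new_w = weights * torch.exp(lr_w * torch.stack(scores))
    return new_w / new_w.sum()

# toy usage with a linear model and three training domains
model = nn.Linear(16, 2)
loss_fn = nn.CrossEntropyLoss()
domains = [(torch.randn(32, 16), torch.randint(0, 2, (32,))) for _ in range(3)]
target = (torch.randn(8, 16), torch.randint(0, 2, (8,)))
print(reweight_step(model, loss_fn, domains, target, torch.ones(3) / 3))
```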
David Grangier · Pierre Ablin · Awni Hannun 🔗 |
-
|
Geometry-Calibrated DRO: Combating Over-Pessimism with Free Energy Implications
(
Poster
)
>
link
Distributionally Robust Optimization (DRO) optimizes the worst-case risk within an uncertainty set to resist distribution shifts. However, DRO suffers from over-pessimism, leading to low-confidence predictions, poor parameter estimation, and poor generalization in practice. In this work, we uncover one probable root cause of over-pessimism: excessive focus on noisy samples. To alleviate the impact of noise, we incorporate data geometry into calibration terms in DRO, resulting in our novel Geometry-Calibrated DRO (GCDRO) \emph{for regression}. We establish that our risk objective aligns with the Helmholtz free energy in statistical physics, a connection that could extend to standard DRO methods. Leveraging gradient flow in Wasserstein space, we develop an approximate minimax optimization algorithm with a bounded error ratio and elucidate how our approach mitigates the effect of noisy samples. |
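To make the free-energy connection concrete, here is a tiny illustration of the standard log-mean-exp (tilted) risk that KL-constrained DRO reduces to; GCDRO's geometry-based calibration terms are not shown, and the temperature values are arbitrary.

```python
import numpy as np

def tilted_risk(losses, beta):
    """Free-energy-style risk: (1/beta) * log mean exp(beta * loss).
    As beta -> 0 this recovers the mean loss; larger beta upweights hard/noisy
    samples, which is the over-pessimism the paper seeks to calibrate."""
    losses = np.asarray(losses, dtype=float)
    m = losses.max()  # stabilize the log-sum-exp
    return m + np.log(np.mean(np.exp(beta * (losses - m)))) / beta

losses = np.random.exponential(size=1000)
for beta in (0.1, 1.0, 5.0):
    print(beta, tilted_risk(losses, beta))
```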
Jiashuo Liu · Jiayun Wu · Tianyu Wang · Hao Zou · Peng Cui 🔗 |
-
|
Channel Selection for Test-Time Adaptation Under Distribution Shift
(
Poster
)
>
link
To ensure robustness and generalization to real-world scenarios, test-time adaptation has been recently studied as an approach to adjust models to a new data distribution during inference. Test-time batch normalization is a simple and popular method that achieved compelling performance on domain shift benchmarks by recalculating batch normalization statistics on test batches. However, in many practical applications this technique is vulnerable to label distribution shifts. We propose to tackle this challenge by only selectively adapting channels in a deep network, minimizing drastic adaptation that is sensitive to label shifts. We find that adapted models significantly improve the performance compared to the baseline models and counteract unknown label shifts. |
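A minimal sketch of channel-selective test-time batch-norm adaptation: recompute statistics on the test batch only for a chosen subset of channels, keeping source statistics elsewhere. Which channels to select is the paper's contribution and is not implemented here; the channel list below is arbitrary.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def selective_bn_adapt(bn: nn.BatchNorm2d, test_batch: torch.Tensor, channels):
    """Replace running statistics with test-batch statistics for selected channels only."""
    batch_mean = test_batch.mean(dim=(0, 2, 3))
    batch_var = test_batch.var(dim=(0, 2, 3), unbiased=False)
    idx = torch.as_tensor(channels, dtype=torch.long)
    bn.running_mean[idx] = batch_mean[idx]
    bn.running_var[idx] = batch_var[idx]

bn = nn.BatchNorm2d(64).eval()
x = torch.randn(32, 64, 8, 8)
selective_bn_adapt(bn, x, channels=[0, 3, 7])   # adapt only a subset of channels
out = bn(x)
```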
Pedro Vianna · Muawiz Chaudhary · An Tang · Guy Cloutier · Guy Wolf · Michael Eickenberg · Eugene Belilovsky 🔗 |
-
|
Better than Balancing: Debiasing through Data Attribution
(
Poster
)
>
link
Spurious correlations in the training data can cause serious problems for machine learning deployment. However, common debiasing approaches which intervene on the training procedure (e.g., by adjusting the loss) can be especially sensitive to regularization and hyperparameter selection. In this paper, we advocate for a data-based perspective on model debiasing by directly targeting the root causes of the bias within the training data itself. Specifically, we leverage data attribution techniques to isolate specific examples that disproportionately drive reliance on the spurious correlation. We find that removing these training examples can efficiently debias the final classifier. Moreover, our method requires no additional hyperparameters, and does not require group annotations for the training data. |
Saachi Jain · Kimia Hamidieh · Kristian Georgiev · Marzyeh Ghassemi · Aleksander Madry 🔗 |
-
|
Enhancing Robustness of Foundation Model Representations under Provenance-related Distribution Shifts
(
Poster
)
>
link
SlidesLive Video
Foundation models are a current focus of attention in both industry and academia. While they have shown their capabilities in a variety of tasks, in-depth research is required to determine their robustness to distribution shift when used as a basis for supervised machine learning. This is especially important in the context of clinical data, with particular limitations related to data accessibility, lack of pretraining materials, and limited availability of high-quality annotations. In this work, we examine the stability of models based on representations from foundation models under distribution shift. We focus on confounding by provenance, a form of distribution shift that emerges in the context of multi-institutional datasets when there are differences in source-specific language use and class distributions. Using a sampling strategy that synthetically induces varying degrees of distribution shift, we evaluate the extent to which representations from foundation models result in predictions that are inherently robust to confounding by provenance. Additionally, we examine the effectiveness of a straightforward confounding adjustment method inspired by Pearl's conception of backdoor adjustment. Results indicate that while foundation models do show some out-of-the-box robustness to confounding-by-provenance related distribution shifts, this can be considerably improved through adjustment. These findings suggest a need for deliberate adjustment of predictive models using representations from foundation models in the context of source-specific distributional differences. |
Xiruo Ding · Zhecheng Sheng · Brian Hur · Feng Chen · Serguei Pakhomov · Trevor Cohen 🔗 |
-
|
Towards Global, General-Purpose Pretrained Geographic Location Encoders
(
Poster
)
>
link
Geographic location is essential for modeling tasks in climate-related fields ranging from ecology to the Earth system sciences. Here, a meaningful feature representation of locations is highly helpful as a description that encodes location-specific aspects. However, obtaining such a representation is challenging and requires an algorithm to distill semantic information of one location from available data. To address this challenge, we introduce GeoCLIP, a global, general-purpose geographic location encoder that provides vector embeddings summarizing the characteristics of a given location for convenient usage in diverse downstream tasks. We show that GeoCLIP embeddings, pretrained on multi-spectral Sentinel-2 satellite data, can be used for various predictive out-of-domain tasks, including temperature prediction and animal recognition in imagery, and outperform existing competing approaches. This demonstrates the potential of general-purpose location encoders and opens the door to learning meaningful representations of our planet from the vast, varied, and largely untapped modalities of geospatial data. |
Konstantin Klemmer · Esther Rolf · Caleb Robinson · Lester Mackey · Marc Rußwurm 🔗 |
-
|
HyperNetwork Approximating Future Parameters for Time Series Forecasting under Temporal Drifts
(
Poster
)
>
link
SlidesLive Video
Models for time series forecasting require the ability to extrapolate from previous observations. Yet extrapolation is challenging, especially when data spanning several periods is subject to temporal drift, with each period following a different distribution. To address this problem, we propose HyperGPA, a hypernetwork that generates a target model's parameters that are expected to work well (i.e., be an optimal model) for each period. HyperGPA discovers an underlying hidden dynamics that causes temporal drifts over time, and generates the model parameters for a target period, aided by the structures of computational graphs. In comprehensive evaluations, we show that target models whose parameters are generated by HyperGPA are up to 64.1\% more accurate than baselines. |
Jaehoon Lee · Chan Kim · Gyumin Lee · Haksoo Lim · Jeongwhan Choi · Kookjin Lee · Dongeun Lee · Sanghyun Hong · Noseong Park 🔗 |
-
|
LCA-on-the-Line: Benchmarking Out of Distribution Generalization with Class Taxonomies
(
Poster
)
>
link
SlidesLive Video
In this paper, we address the challenge of assessing model generalization under Out-of-Distribution (OOD) conditions. We reintroduce the Least Common Ancestor (LCA) distance, a metric that has been largely overshadowed since the ImageNet era. By leveraging the WordNet hierarchy, we utilize the LCA to measure the taxonomic distance between labels and predictions, presenting it as a benchmark for model generalization. The LCA metric proves especially robust in comparison to previous state-of-the-art metrics when evaluating diverse models, including both vision-only and vision-language models on natural distribution shift datasets. To validate our benchmark's efficacy, we perform an extensive empirical study on 75 models spanning five distinct ImageNet-OOD datasets. Our findings reveal a strong linear correlation between in-domain ImageNet LCA scores and OOD Top-1 performance across ImageNet-S/R/A/ObjectNet. This discovery gives rise to a novel evaluation framework termed "LCA-on-the-Line", facilitating unified and consistent assessments across a broad spectrum of models and datasets. Besides introducing an evaluative tool, we also delve into the intricate ties between the LCA metric and model generalization. By aligning model predictions more closely with the WordNet hierarchy and refining prompt engineering in zero-shot vision-language models, we offer tangible strategies to improve model generalization. Challenging the prevailing notion that LCA offers no added evaluative value over top-1 accuracy, our research provides invaluable insights and actionable techniques to enhance model robustness and generalization across various tasks and scenarios. |
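One simple way to compute an LCA-style taxonomic distance over WordNet with NLTK is sketched below (it requires the WordNet corpus to be downloaded); the paper's exact formulation over the ImageNet hierarchy may differ.

```python
# May require: pip install nltk; then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def lca_distance(label: str, prediction: str) -> int:
    """One simple variant of an LCA-based severity score: how far the least common
    ancestor of the two noun synsets sits below the deeper of the two concepts."""
    s1 = wn.synsets(label, pos=wn.NOUN)[0]
    s2 = wn.synsets(prediction, pos=wn.NOUN)[0]
    lca = s1.lowest_common_hypernyms(s2)[0]
    return max(s1.max_depth(), s2.max_depth()) - lca.max_depth()

print(lca_distance("dog", "cat"))    # small: the LCA is close to both concepts
print(lca_distance("dog", "truck"))  # large: the LCA sits far up the hierarchy
```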
Jia Shi · Gautam Rajendrakumar Gare · Jinjin Tian · Siqi Chai · Zhiqiu Lin · Arun Balajee Vasudevan · Di Feng · Francesco Ferroni · Shu Kong · Deva Ramanan 🔗 |
-
|
Towards General-Purpose In-Context Learning Agents
(
Poster
)
>
link
Reinforcement Learning (RL) algorithms are usually hand-crafted, driven by the research and engineering of humans. An alternative approach is to automate this research process via meta-learning. A particularly ambitious objective is to automatically discover new RL algorithms from scratch that use in-context learning to learn-how-to-learn entirely from data while also generalizing to a wide range of environments. Those RL algorithms are implemented entirely in neural networks, by conditioning on previous experience from the environment, without any explicit optimization-based routine at meta-test time. To achieve generalization, this requires a broad task distribution of diverse and challenging environments. Our Transformer-based Generally Learning Agents (GLAs) are an important first step in this direction. Our GLAs are meta-trained using supervised learning techniques on an offline dataset with experiences from RL environments that is augmented with random projections to generate task diversity. During meta-testing our agents perform in-context meta-RL on entirely different robotic control problems such as Reacher, Cartpole, or HalfCheetah that were not in the meta-training distribution. |
Louis Kirsch · James Harrison · Daniel Freeman · Jascha Sohl-Dickstein · Jürgen Schmidhuber 🔗 |
-
|
Do Transformers Parse while Predicting the Masked Word?
(
Poster
)
>
link
Pre-trained language models have been shown to encode linguistic structures like parse trees in their embeddings while being trained unsupervised. Some doubts have been raised about whether the models are doing parsing or only some computation weakly correlated with it. Concretely: (a) Is it possible to explicitly describe transformers with realistic embedding dimensions, number of heads, etc. that are capable of doing parsing ---or even approximate parsing? (b) Why do pre-trained models capture parsing structure? This paper takes a step toward answering these questions in the context of generative modeling with PCFGs. We show that masked language models like BERT or RoBERTa of moderate sizes can approximately execute the Inside-Outside algorithm for the English PCFG (Marcus et al., 1993). We also show that the Inside-Outside algorithm is optimal for masked language modeling loss on the PCFG-generated data. We conduct probing experiments on models pre-trained on PCFG-generated data to show that this not only allows recovery of approximate parse trees, but also recovers marginal span probabilities computed by the Inside-Outside algorithm, which suggests an implicit bias of masked language modeling towards this algorithm. |
Haoyu Zhao · Abhishek Panigrahi · Rong Ge · Sanjeev Arora 🔗 |
-
|
Retrieval-based Language Models Using a Multi-domain Datastore
(
Poster
)
>
link
Retrieval-based language models (LMs) can generalize well to unseen test domains, but typically assume access to a datastore of examples from the target domain. It remains an open question whether these models are robust with more general datastores, which may include out-of-domain data or cover multiple different test domains. In this paper, we study this question by constructing a multi-domain datastore, using a kNN-LM approach. We first show that, on domains that are part of the multi-domain datastore, the model is comparable to or even better than the model with an oracle test-domain datastore. We also find that, on domains that are unseen during training and not part of the datastore, using a multi-domain datastore consistently outperforms an oracle single-domain datastore. Together, our results show that kNN-LM is highly robust at out-of-distribution generalization and can effectively target many domains at once, without the oracle domain knowledge assumptions included in all previous work. |
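A toy sketch of the kNN-LM interpolation underlying the multi-domain datastore idea: retrieve the nearest context vectors from a pooled datastore and mix their next-token distribution with the base LM. Datastore construction, approximate indexing, and hyperparameters are omitted or assumed here.

```python
import numpy as np

def knn_lm_next_token_probs(lm_probs, query, keys, values, vocab_size,
                            k=8, temperature=1.0, lam=0.25):
    """Mix a kNN next-token distribution (from retrieved contexts) with the LM's."""
    dists = np.linalg.norm(keys - query, axis=1)   # L2 distance to all datastore entries
    nn_idx = np.argsort(dists)[:k]
    weights = np.exp(-dists[nn_idx] / temperature)
    weights = weights / weights.sum()
    knn_probs = np.zeros(vocab_size)
    for w, tok in zip(weights, values[nn_idx]):    # values hold next-token ids
        knn_probs[tok] += w
    return lam * knn_probs + (1.0 - lam) * lm_probs

# toy datastore pooled from several domains
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(1000, 32)), rng.integers(0, 100, size=1000)
lm_probs = np.full(100, 1.0 / 100)
p = knn_lm_next_token_probs(lm_probs, rng.normal(size=32), keys, values, vocab_size=100)
print(p.sum())  # ~1.0
```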
Rulin Shao · Sewon Min · Luke Zettlemoyer · Pang Wei Koh 🔗 |
-
|
Continually Adapting Optimizers Improve Meta-Generalization
(
Poster
)
>
link
Meta-learned optimizers increasingly outperform analytical handcrafted optimizers such as SGD and Adam. On some tasks, however, they fail to generalize strongly, underperforming handcrafted methods. Then one can fall back on handcrafted methods through a guard, to combine the efficiency benefits of learned optimizers and the guarantees of analytical methods. At some point in the iterative optimization process, however, such guards may make the learned optimizer incompatible with the remaining optimization, and thus useless for further progress. Our novel method Meta Guard keeps adapting the learned optimizer to the target optimization problem. It experimentally outperforms other baselines, adapting to new tasks during training. |
Wenyi Wang · Louis Kirsch · Francesco Faccio · Mingchen Zhuge · Jürgen Schmidhuber 🔗 |
-
|
Connect Later: Improving Fine-tuning for Robustness with Targeted Augmentations
(
Poster
)
>
link
Models trained on a labeled source domain (e.g., bright, nearby astronomical objects) often generalize poorly when deployed on an out-of-distribution (OOD) target domain (e.g., faint, distant objects). In the domain adaptation setting where unlabeled target data is available, self-supervised pretraining (e.g., masked autoencoding or contrastive learning) is a promising method to mitigate this performance drop. Pretraining improves OOD error when the generic data augmentations used (e.g., masking or cropping) connect the source and target domains, which may be far apart in the input space. In this paper, we show on real-world tasks that standard fine-tuning after pretraining does not consistently improve OOD error over just supervised learning on labeled source data. To better leverage pretraining for distribution shifts, we propose Connect Later: after pretraining with generic augmentations to learn good representations within the source and target domains, fine-tune with targeted augmentations designed with knowledge of the distribution shift to better connect the domains. Connect Later improves average OOD error over standard fine-tuning and supervised learning with targeted augmentations on 3 real-world datasets: astronomical time-series classification (AstroClassification) by 12%, redshift prediction for astronomical time-series (Redshifts) by 0.03 RMSE (11% relative), and wildlife species identification (iWildCam-WILDS) by 0.9%, achieving the state-of-the-art on AstroClassification and on iWildCam-WILDS with ResNet-50. |
Helen Qu · Sang Michael Xie 🔗 |
-
|
Confidence-Based Model Selection: When to Take Shortcuts in Spurious Settings
(
Poster
)
>
link
Effective machine learning models learn both robust features that directly determine the outcome of interest (e.g., an object with wheels is more likely to be a car), and shortcut features (e.g., an object on a road is more likely to be a car). The latter can be a source of error under distributional shift, when the correlations change at test-time. The prevailing sentiment in the robustness literature is to avoid such correlative shortcut features and learn robust predictors. However, while robust predictors perform better on worst-case distributional shifts, they often sacrifice accuracy on majority subpopulations. In this paper, we argue that shortcut features should not be entirely discarded. Instead, if we can identify the subpopulation to which an input belongs, we can adaptively choose among models with different strengths to achieve high performance on both majority and minority subpopulations. We propose COnfidence-baSed MOdel Selection (COSMOS), where we observe that model confidence can effectively guide model selection. Notably, COSMOS does not require any target labels or group annotations, either of which may be difficult to obtain or unavailable. We evaluate COSMOS on four datasets with spurious correlations, each with multiple test sets with varying levels of data distribution shift. We find that COSMOS achieves 2-5% lower average regret across all subpopulations, compared to using only robust predictors or other model aggregation methods. |
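The per-input selection rule can be sketched very compactly: given softmax outputs from a robust model and a shortcut-reliant model, route each example to whichever model is more confident. COSMOS's full procedure may involve more than this, so treat the snippet as an illustration only.

```python
import numpy as np

def confidence_based_select(probs_robust, probs_shortcut):
    """Per-example model selection by maximum softmax confidence;
    requires no labels or group annotations."""
    conf_r = probs_robust.max(axis=1)
    conf_s = probs_shortcut.max(axis=1)
    use_shortcut = conf_s > conf_r
    preds = np.where(use_shortcut,
                     probs_shortcut.argmax(axis=1),
                     probs_robust.argmax(axis=1))
    return preds, use_shortcut

# toy usage with two models' softmax outputs on the same batch
pr = np.random.dirichlet(np.ones(5), size=128)
ps = np.random.dirichlet(np.ones(5), size=128)
preds, mask = confidence_based_select(pr, ps)
print(preds[:10], mask.mean())
```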
Annie Chen · Yoonho Lee · Amrith Setlur · Sergey Levine · Chelsea Finn 🔗 |
-
|
Two-stage LLM Fine-tuning with Less Specialization and More Generalization
(
Poster
)
>
link
Pretrained large language models (LLMs) are general purpose problem solvers applicable to a diverse set of tasks with prompts. They can be further improved towards a specific task by fine-tuning on a specialized dataset. However, fine-tuning usually makes the model narrowly specialized on this dataset with reduced general in-context learning performance, which is undesirable whenever the fine-tuned model needs to handle additional tasks where no fine-tuning data is available. In this work, we first demonstrate that fine-tuning on a single task indeed decreases LLMs' general in-context learning performance. We discover one important cause of such forgetting, format specialization, where the model overfits to the format of the fine-tuned task. We further show that format specialization happens at the very beginning of fine-tuning. To solve this problem, we propose Prompt Tuning with MOdel Tuning (ProMoT), a simple yet effective two-stage fine-tuning framework that reduces format specialization and improves generalization. ProMoT offloads task-specific format learning into additional and removable parameters by first doing prompt tuning and then fine-tuning the model itself with this soft prompt attached. With experiments on several fine-tuning tasks and 8 in-context evaluation tasks, we show that ProMoT achieves comparable performance on fine-tuned tasks to standard fine-tuning, but with much less loss of in-context learning performance across a broad range of out-of-domain evaluation tasks. More importantly, ProMoT can even enhance generalization on in-context learning tasks that are semantically related to the fine-tuned task, e.g. ProMoT on En-Fr translation significantly improves performance on other language pairs, and ProMoT on NLI improves performance on summarization. Experiments also show that ProMoT can improve the generalization performance of multi-task training. |
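A toy two-stage sketch of the ProMoT idea on a small stand-in model: stage 1 trains only a soft prompt with the backbone frozen, stage 2 tunes the backbone with the learned prompt attached and frozen. The architecture and hyperparameters are invented for illustration; the paper applies this recipe to LLMs.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Stand-in model: a soft prompt prepended to the input embeddings."""
    def __init__(self, vocab=1000, dim=64, prompt_len=8, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)                               # (B, T, dim)
        p = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)  # (B, P, dim)
        _, h = self.encoder(torch.cat([p, x], dim=1))
        return self.head(h[-1])

def set_stage(model: PromptedEncoder, stage: int):
    for name, param in model.named_parameters():
        if stage == 1:
            param.requires_grad = (name == "prompt")    # stage 1: prompt tuning only
        else:
            param.requires_grad = (name != "prompt")    # stage 2: model tuning, prompt frozen

model = PromptedEncoder()
set_stage(model, 1)   # learn the task-format prompt first
set_stage(model, 2)   # then fine-tune the model with the prompt attached
logits = model(torch.randint(0, 1000, (4, 12)))
```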
Yihan Wang · Si Si · Daliang Li · MICHAL LUKASIK · Felix Yu · Cho-Jui Hsieh · Inderjit Dhillon · Sanjiv Kumar 🔗 |
-
|
An Empirical Study of Uncertainty Estimation Techniques for Detecting Drift in Data Streams
(
Poster
)
>
link
SlidesLive Video
In safety-critical domains such as autonomous driving and medical diagnosis, the reliability of machine learning models is crucial. One significant challenge to reliability is concept drift, which can cause model deterioration over time. Traditionally, drift detectors rely on true labels, which are often scarce and costly. This study conducts a comprehensive empirical evaluation of using uncertainty values as substitutes for error rates in detecting drifts, aiming to alleviate the reliance on labeled post-deployment data. We examine five uncertainty estimation methods in conjunction with the ADWIN detector across seven real-world datasets. Our results reveal that while the SWAG method exhibits superior calibration, the overall accuracy in detecting drifts is not notably impacted by the choice of uncertainty estimation method, with even the most basic method demonstrating competitive performance. These findings offer valuable insights into the practical applicability of uncertainty-based drift detection in real-world, safety-critical applications. |
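A simplified, self-contained stand-in for label-free drift monitoring on a stream of uncertainty values; the study uses ADWIN, whereas this sketch replaces it with a reference-vs-recent-window comparison purely to convey the idea, and all thresholds are arbitrary.

```python
import numpy as np
from collections import deque

class UncertaintyDriftMonitor:
    """Flag drift when the mean uncertainty of a recent window deviates
    from a reference window by more than a threshold (no labels needed)."""
    def __init__(self, ref_size=500, win_size=100, threshold=0.1):
        self.reference = deque(maxlen=ref_size)
        self.window = deque(maxlen=win_size)
        self.threshold = threshold

    def update(self, uncertainty: float) -> bool:
        self.window.append(uncertainty)
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(uncertainty)
            return False
        drift = abs(np.mean(self.window) - np.mean(self.reference)) > self.threshold
        if drift:                      # reset the reference after a detected drift
            self.reference.clear()
        return drift

monitor = UncertaintyDriftMonitor()
stream = np.concatenate([np.random.beta(2, 8, 600), np.random.beta(8, 2, 200)])
flags = [monitor.update(u) for u in stream]   # e.g. predictive entropy per prediction
print(sum(flags))
```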
Anton Winter · Nicolas Jourdan · Tristan Wirth · Volker Knauthe · Arjan Kuijper 🔗 |
-
|
Outlier-Robust Group Inference via Gradient Space Clustering
(
Poster
)
>
link
Traditional machine learning models focus on achieving good performance on the overall training distribution, but they often underperform on minority groups. Existing methods can improve the worst-group performance, but they can have several limitations: (i) they require group annotations, which are often expensive and sometimes infeasible to obtain, and/or (ii) they are sensitive to outliers. Most related works fail to solve these two issues simultaneously as they focus on conflicting perspectives of minority groups and outliers. We address the problem of learning group annotations in the presence of outliers by clustering the data in the space of gradients of the model parameters. We show that data in the gradient space has a simpler structure while preserving information about minority groups and outliers, making it suitable for standard clustering methods like DBSCAN. Extensive experiments demonstrate that our method significantly outperforms the state of the art in terms of downstream worst-group performance. |
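A rough sketch of gradient-space clustering using last-layer per-example gradients and DBSCAN; restricting to the final linear layer is a simplifying assumption here, and the eps/min_samples values are arbitrary.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import DBSCAN

def last_layer_gradients(head: nn.Linear, features, labels):
    """Per-example gradients of the loss w.r.t. the final linear layer,
    used as a cheap proxy for the gradient space to be clustered."""
    loss_fn = nn.CrossEntropyLoss()
    grads = []
    for x, y in zip(features, labels):
        head.zero_grad()
        loss = loss_fn(head(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        g = torch.cat([head.weight.grad.flatten(), head.bias.grad.flatten()])
        grads.append(g.detach().numpy())
    return np.stack(grads)

head = nn.Linear(32, 4)
feats = torch.randn(200, 32)
labels = torch.randint(0, 4, (200,))
G = last_layer_gradients(head, feats, labels)
groups = DBSCAN(eps=0.5, min_samples=5).fit_predict(G)   # label -1 marks outliers
print(np.unique(groups))
```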
Yuchen Zeng · Kristjan Greenewald · Luann Jung · Kangwook Lee · Justin Solomon · Mikhail Yurochkin 🔗 |
-
|
Towards Calibrated Robust Fine-Tuning of Vision-Language Models
(
Poster
)
>
link
SlidesLive Video
While fine-tuning unleashes the potential of a pre-trained model on a specific task, it trades off the model’s generalization capability on out-of-distribution (OOD) datasets. To mitigate this, robust fine-tuning aims to ensure performance on OOD datasets as well as on the in-distribution (ID) dataset for which the model is tuned. However, another criterion for reliable machine learning (ML), confidence calibration, is overlooked despite increasing demand for it in real-world high-stakes ML applications (e.g. autonomous driving). First, we raise concerns about the calibration of fine-tuned vision-language models (VLMs) by showing that naive fine-tuning and even state-of-the-art robust fine-tuning methods hurt the calibration of pre-trained VLMs, especially on OOD datasets. To address this, we provide a simple approach, called calibrated robust fine-tuning (CaRot), that incentivizes calibration and robustness on both ID and OOD datasets. Empirical results on ImageNet-1K distribution shift evaluation verify the effectiveness of our method. |
Changdae Oh · Mijoo Kim · Hyesu Lim · Junhyeok Park · Euiseog Jeong · Zhi-Qi Cheng · Kyungwoo Song 🔗 |