Workshop
I Can’t Believe It’s Not Better (ICBINB): Failure Modes in the Age of Foundation Models
Estefany Kelly Buchanan · Fan Feng · Andreas Kriegler · Ian Mason · Tobias Uelwer · Yubin Xie · Rui Yang
Room R02-R05 (level 2)
In the past year, tools such as ChatGPT, Stable Diffusion and SegmentAnything have had an immediate impact on our everyday lives. Many of these tools have been built using foundation models, that is, very large models (having billions or trillions of parameters) trained on vast amounts of data (Bommasani et al., 2021). The excitement around these foundation models and their capabilities might suggest that all the interesting problems have been solved and artificial general intelligence is just around the corner (Wei et al., 2022; Bubeck et al., 2023).
At this year’s I Can’t Believe It’s Not Better workshop we invite papers that coolly reflect on this optimism and demonstrate that there are in fact many difficult and interesting open questions. The workshop will specifically focus on failure modes of foundation models, especially unexpected negative results. In addition, we invite contributions that will help us understand current and future disruptions of machine learning subfields, as well as instances where these powerful methods merely remain complementary to another subfield of machine learning.
Contributions on the failure modes of foundation models might consider:
- Domain-specific areas where the application of foundation models did not work as expected.
- Failures in the safety and explainability of foundation models.
- The limits of current foundation model methodologies.
Besides failure modes of foundation models, this workshop also considers their impact on the ML ecosystem and potential problems that remain to be solved by these new systems. In this context, relevant questions include:
- Where do foundation models leave researchers in other areas (e.g., AI for science, recommender systems, Bayesian methods, bioinformatics)?
- Which important problems are not solved by training large models with large amounts of data?
- What unexpected negative results were encountered when applying foundation models to a specific domain?
Schedule
Sat 6:45 a.m. - 7:00 a.m.
Opening Remarks (Introduction)
Welcome and introduction to ICBINB
Ian Mason
Sat 7:00 a.m. - 7:30 a.m.
Machine Learning and Morphology: Opportunities and Challenges (Invited Talk)
Morphology in evolutionary biology is used to quantify visible characteristics of specimens, a crucial aspect in addressing the biodiversity crisis. To investigate anthropogenic impacts, researchers have constructed extensive image databases, which make the field a natural fit for the integration of machine learning. However, traditional methods used in morphometrics are grounded in diagnostic structures proposed by biologists. In contrast, machine learning approaches autonomously extract features without explicit biological motivation. This talk focuses on the potential misunderstandings that can arise when applying machine learning in morphometrics. Specifically, the focus is on the biological interpretation of machine learning models, exploring instances where models demonstrate high accuracy yet struggle with coherent biological interpretation. The presentation showcases experiments that highlight the tension between excellent quantitative results and a frequent lack of biological interpretation.
Wilfried Wöber
Sat 7:30 a.m. - 8:00 a.m.
Dissociating Language and Thought in Large Language Models (Invited Talk)
Today’s large language models (LLMs) routinely generate coherent, grammatical and seemingly meaningful paragraphs of text. This achievement has led to speculation that LLMs have become “thinking machines”, capable of performing tasks that require reasoning and/or world knowledge. In this talk, I will introduce a distinction between formal competence—knowledge of linguistic rules and patterns—and functional competence—understanding and using language in the world. This distinction is grounded in human neuroscience, which shows that formal and functional competence recruit different cognitive mechanisms. I will show that the word-in-context prediction objective has allowed LLMs to essentially master formal linguistic competence; however, pretrained LLMs still lag behind on many aspects of functional linguistic competence, prompting engineers to adopt specialized fine-tuning techniques and/or couple an LLM with external modules. I will illustrate the formal-functional distinction using the domains of English grammar and arithmetic, respectively. I will then turn to generalized world knowledge, a domain where this distinction is much less clear-cut, and discuss our efforts to leverage both cognitive science and NLP to develop systematic ways to probe generalized world knowledge in text-based LLMs. Overall, the formal/functional competence framework clarifies the discourse around LLMs, helps develop targeted evaluations of their capabilities, and suggests ways for developing better models of real-life language use.
Anna Ivanova
Sat 8:00 a.m. - 8:30 a.m.
Coffee Break
Sat 8:30 a.m. - 8:35 a.m.
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats (Spotlight)
Over the past decade, there has been extensive research aimed at enhancing the robustness of neural networks, yet this problem remains largely unsolved. One major impediment has been the overestimation of the robustness of new defense approaches due to faulty defense evaluations. Flawed robustness evaluations necessitate rectifications in subsequent works, dangerously slowing down the research and providing a false sense of security. In this context, we will face substantial challenges associated with an impending adversarial arms race in natural language processing, specifically with closed-source Large Language Models (LLMs) such as ChatGPT, Google Bard, or Anthropic’s Claude. We provide a first set of prerequisites to improve the robustness assessment of new approaches and reduce the number of faulty evaluations. Additionally, we identify embedding space attacks on LLMs as another viable threat model for generating malicious content in open-source models. Finally, we demonstrate on a recently proposed defense that, without LLM-specific best practices in place, it is easy to overestimate the robustness of a new approach. Code is available at https://anonymous.4open.science/r/LLMEmbeddingAttack-6C3C
Leo Schwinn · David Dobre · Stephan Günnemann · Gauthier Gidel
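As a concrete illustration of the embedding space threat model mentioned above, here is a minimal sketch (our illustration under stated assumptions, not the authors' released code) of optimizing continuous "soft tokens" so that an open-weights causal LM assigns high likelihood to a chosen target continuation. The model name, prompt, target, and hyperparameters are placeholders.

```python
# Hypothetical sketch of an embedding-space attack on an open-weights LM.
# Unlike discrete token attacks, we optimize continuous embeddings directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)  # only the injected embeddings are trainable

prompt_ids = tok("Placeholder user prompt", return_tensors="pt").input_ids
target_ids = tok(" placeholder target continuation", return_tensors="pt").input_ids

emb = model.get_input_embeddings()
prompt_emb, target_emb = emb(prompt_ids), emb(target_ids)

adv = torch.randn(1, 20, prompt_emb.shape[-1], requires_grad=True)  # soft tokens
opt = torch.optim.Adam([adv], lr=1e-3)

for step in range(200):
    inputs = torch.cat([prompt_emb, adv, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    n = target_ids.shape[1]
    # Positions -n-1 .. -2 predict the n target tokens.
    pred = logits[:, -n - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```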
Sat 8:35 a.m. - 8:40 a.m.
Compositional Generalization in Vision-Language Models uses the Language Modality only (Spotlight)
Compositionality is a common property of many modalities, including text and images, but the compositional generalization of multi-modal models is not well understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts. We show that current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image, as the strength of the language model in detecting sentences that are syntactically and semantically likely overwhelms the vision part of the model. In particular, we find that a benchmark for compositionality mostly favors pure language models. Finally, we propose a new benchmark for compositionality without such linguistic priors.
Sat 8:40 a.m. - 8:45 a.m.
A Study on the Calibration of In-context Learning (Spotlight)
Modern auto-regressive models are trained to minimize log loss by predicting the next word, and are therefore expected to give calibrated answers when problems are framed as next-token prediction tasks. We study this formulation for in-context learning, a widely used way to adapt frozen large language models (LLMs), and find trade-offs between performance and calibration on a wide range of natural language understanding and reasoning tasks. Human evaluation shows that hallucination rates align well with the miscalibrated results. Furthermore, we find that selecting in-context examples from test datasets and common recalibration techniques that are widely effective elsewhere, such as temperature scaling, may provide limited gains in calibration error, suggesting that new methods may be required for settings where models are expected to be reliable.
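Since temperature scaling is the recalibration baseline named above, a hedged sketch of how calibration error is typically measured and a temperature fitted may help; the logits and labels are assumed to come from scoring an LLM over a fixed set of answer choices.

```python
# Sketch (not the authors' code): expected calibration error (ECE) and
# temperature scaling over per-example answer-choice logits.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import softmax

def ece(probs, labels, n_bins=10):
    """probs: [N, C] predicted probabilities; labels: [N] correct indices."""
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            acc = (pred[mask] == labels[mask]).mean()
            total += mask.mean() * abs(acc - conf[mask].mean())
    return total

def fit_temperature(logits, labels):
    """Choose T minimizing negative log-likelihood of softmax(logits / T)."""
    def nll(t):
        p = softmax(logits / t, axis=1)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
```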
Sat 8:45 a.m. - 8:50 a.m.
Can LLM-Generated Misinformation Be Detected? (Spotlight)
The advent of Large Language Models (LLMs) has made a transformative impact. However, the potential that LLMs such as ChatGPT can be exploited to generate misinformation has posed a serious concern to online safety and public trust. A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation? We propose to tackle this question from the perspective of detection difficulty. We first build a taxonomy of LLM-generated misinformation, and then categorize and validate the potential real-world methods for generating misinformation with LLMs. Through extensive empirical investigation, we discover that LLM-generated misinformation can be harder for humans and detectors to detect than human-written misinformation with the same semantics, which suggests it can have more deceptive styles and potentially cause more harm. We also discuss the implications of our discovery for combating misinformation in the age of LLMs, along with countermeasures.
Canyu Chen · Kai Shu
Sat 8:50 a.m. - 8:55 a.m.
Self-Evaluation Improves Selective Generation in Large Language Models (Spotlight)
Safe deployment of large language models (LLMs) may benefit from a reliable method for assessing their generated content to determine when to abstain or to selectively generate. While likelihood-based metrics such as perplexity are widely employed, recent research has demonstrated the limitations of using sequence-level probability estimates given by LLMs as reliable indicators of generation quality. Conversely, LLMs have demonstrated strong calibration at the token level, particularly when choosing correct answers in multiple-choice questions or evaluating true/false statements. In this work, we reformulate open-ended generation tasks into token-level prediction tasks and leverage LLMs' superior calibration at the token level. We instruct an LLM to self-evaluate its answers, employing either a multi-way comparison or a point-wise evaluation approach, with the option to include a ``None of the above'' option to express the model's uncertainty explicitly. We benchmark a range of scoring methods based on self-evaluation and evaluate their performance in selective generation using TruthfulQA and TL;DR. Through extensive experiments with PaLM-2 and GPT-3, we demonstrate that self-evaluation based scores not only improve accuracy, but also correlate better with the overall quality of generated content.
Jie Ren · Yao Zhao · Tu Vu · Peter Liu · Balaji Lakshminarayanan
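To make the point-wise variant concrete, here is a small sketch of the self-evaluation scoring loop; the `generate` and `next_token_probs` callables are hypothetical stand-ins for an LLM API, and the abstention threshold is illustrative.

```python
# Sketch of point-wise self-evaluation for selective generation.
EVAL_TEMPLATE = (
    "Question: {question}\n"
    "Proposed answer: {answer}\n"
    "Is the proposed answer correct? Answer Yes or No:"
)

def self_eval_score(next_token_probs, question, answer):
    """next_token_probs(prompt) -> dict mapping next tokens to probabilities."""
    probs = next_token_probs(EVAL_TEMPLATE.format(question=question, answer=answer))
    return probs.get(" Yes", 0.0)  # token-level confidence in the answer

def generate_or_abstain(generate, next_token_probs, question, tau=0.7):
    answer = generate(question)
    score = self_eval_score(next_token_probs, question, answer)
    return answer if score >= tau else "[abstain]"  # selective generation
```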
Sat 8:55 a.m. - 9:00 a.m.
Filter bubbles and affective polarization in user-personalized large language model outputs (Spotlight)
Echoing the history of search engines and social media content rankings, the advent of large language models (LLMs) has led to a push for increased personalization of model outputs to individual users. In the past, personalized recommendations and ranking systems have been linked to the development of filter bubbles (serving content that may confirm a user's existing biases) and affective polarization (strong negative sentiment towards those with differing views). In this work, we explore how prompting a leading large language model, ChatGPT-3.5, with a user's political affiliation prior to asking factual questions about public figures and organizations leads to differing results. We observe that left-leaning users tend to receive more positive statements about left-leaning political figures and media outlets, while right-leaning users see more positive statements about right-leaning entities. This pattern holds across presidential candidates, members of the U.S. Senate, and media organizations with ratings from AllSides. When qualitatively evaluating some of these outputs, there is evidence that particular facts are included or excluded based on the user's political affiliation. These results illustrate that personalizing LLMs based on user demographics carries the same risks of affective polarization and filter bubbles that have been seen in other personalized internet technologies. This ``failure mode'' should be monitored closely as there are more attempts to monetize and personalize these models.
Tomo Lazovich
Sat 9:00 a.m. - 10:30 a.m.
Poster Session
Sat 10:30 a.m. - 12:00 p.m.
Lunch
Sat 12:00 p.m. - 12:30 p.m.
Active and Online Learning with Large (and Combinatorial) Models (Invited Talk)
Active learning consists of sequentially and adaptively constructing a dataset in the hope of improving the learning speed, by avoiding useless data points where the current model is already correct with large probability and by focusing on regions of uncertainty. During this talk, I will give a short reminder of the potential benefits and pitfalls of active learning, especially in large and combinatorial models.
Sat 12:30 p.m. - 12:40 p.m.
When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations (Contributed talk)
Context-based fine-tuning methods like prompting, in-context learning, soft prompting (prompt tuning) and prefix-tuning have gained popularity as they often match the performance of full fine-tuning with a fraction of the parameters. Still, there is little theoretical understanding of how these techniques influence the internal computation of the model and their expressiveness limitations. We show that despite the continuous embedding space being more expressive than the discrete token space, soft-prompting and prefix-tuning are strictly less expressive than full fine-tuning. Concretely, context-based fine-tuning cannot change the relative attention pattern over the content and can only bias the outputs of an attention layer in a fixed direction. While this means that context-based fine-tuning techniques can successfully elicit or combine skills already present in the pretrained model, they cannot learn tasks requiring new attention patterns.
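The central claim, that a prefix cannot change the relative attention pattern over content tokens but only mixes in a prefix-determined vector, can be checked numerically in a few lines. A minimal single-query sketch (our illustration, not the authors' code):

```python
# Attention over [prefix; content] decomposes as a convex combination of
# content-only attention and a fixed prefix contribution.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)                                       # one query
K_c, V_c = rng.normal(size=(5, d)), rng.normal(size=(5, d))  # content
K_p, V_p = rng.normal(size=(2, d)), rng.normal(size=(2, d))  # learned prefix

attn = softmax(np.concatenate([K_p @ q, K_c @ q]))
out = attn @ np.concatenate([V_p, V_c])

alpha = attn[:2].sum()                        # total mass on the prefix
content_out = softmax(K_c @ q) @ V_c          # unchanged relative pattern
prefix_out = softmax(K_p @ q) @ V_p           # fixed direction from prefix
assert np.allclose(out, alpha * prefix_out + (1 - alpha) * content_out)
```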
Sat 12:40 p.m. - 12:50 p.m.
A Natural Experiment on LLM Data Contamination in Code Generation (Contributed talk)
Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on webscale data.
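The natural-experiment logic reduces to splitting problems by release date relative to the model's training cutoff and comparing pass rates; a schematic sketch follows (field names and the cutoff date are illustrative, not the authors' framework).

```python
# Sketch of the pre- vs post-cutoff comparison behind the analysis.
from datetime import date

TRAINING_CUTOFF = date(2021, 9, 1)  # illustrative GPT-style training cutoff

def contamination_gap(problems):
    """problems: iterable of dicts with 'release_date' (date) and 'passed' (bool)."""
    pre = [p["passed"] for p in problems if p["release_date"] < TRAINING_CUTOFF]
    post = [p["passed"] for p in problems if p["release_date"] >= TRAINING_CUTOFF]
    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    # A much higher pass rate on pre-cutoff problems is evidence of contamination.
    return rate(pre), rate(post)
```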
Sat 12:50 p.m. - 1:00 p.m.
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" (Contributed talk)
We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Olaf Scholz was the ninth Chancellor of Germany", it will not automatically be able to answer the question, "Who was the ninth Chancellor of Germany?". Moreover, the likelihood of the correct answer ("Olaf Scholz") will not be higher than for a random name. Thus, models exhibit a basic failure of logical deduction and do not generalize a prevalent pattern in their training set (i.e. if "A is B" occurs, "B is A" is more likely to occur). We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as "Uriah Hawthorne is the composer of Abyssal Melodies" and showing that they fail to correctly answer "Who composed Abyssal Melodies?". The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as "Who is Tom Cruise's mother? [A: Mary Lee Pfeiffer]" and the reverse "Who is Mary Lee Pfeiffer's son?". GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. This shows a failure of logical deduction that we hypothesize is caused by the Reversal Curse.
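The evaluation protocol is easy to state in code: finetune on "A is B" facts about fictitious people, then query both directions. A schematic sketch (the `answer_fn` interface is hypothetical; the example fact is taken from the abstract):

```python
# Sketch of the Reversal Curse evaluation.
train_fact = "Uriah Hawthorne is the composer of Abyssal Melodies."  # finetuning data

queries = {
    # Same direction as training (A -> B): finetuned models answer this.
    "forward": ("Who is Uriah Hawthorne?", "composer of Abyssal Melodies"),
    # Reversed direction (B -> A): models fail, and the correct name is no
    # likelier than a random name.
    "reverse": ("Who composed Abyssal Melodies?", "Uriah Hawthorne"),
}

def reversal_gap(answer_fn):
    """answer_fn: the finetuned model as a question -> answer-string callable."""
    hits = {k: float(ans.lower() in answer_fn(q).lower())
            for k, (q, ans) in queries.items()}
    return hits["forward"] - hits["reverse"]  # a large gap is the Reversal Curse
```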
Sat 1:00 p.m. - 1:30 p.m.
Coffee Break
Sat 1:30 p.m. - 2:00 p.m.
Limitations of Fine-Tuning for Aligning LLMs (Invited Talk)
David Krueger
Sat 2:00 p.m. - 2:30 p.m.
Measurement in the Age of LLMs: An Application to Political Ideology Scaling (Invited Talk)
Much of social science is centered around terms like “ideology” or “power”, which generally elude precise definition and whose contextual meanings are trapped in surrounding language. This talk explores the use of large language models (LLMs) to flexibly navigate the conceptual clutter inherent to social scientific measurement tasks. We rely on LLMs’ remarkable linguistic fluency to elicit ideological scales of both legislators and text, which accord closely with established methods and our own judgment. A key aspect of our approach is that we elicit such scores directly, instructing the LLM to furnish numeric scores itself. This approach is methodologically "dumb" and shouldn't "work" according to classical principles of measurement. We nevertheless find surprisingly compelling results, which we showcase through a variety of case studies.
Aaron Schein
Sat 2:30 p.m. - 3:20 p.m.
Panel: Failure Modes in the Age of Foundation Models (Panel Discussion)
Panelists: David Krueger, Christoph Lampert, Tatiana Likhomanenko, Aaron Schein. Moderator: Naomi Saphra
Sat 3:20 p.m. - 3:30 p.m.
Closing Remarks (Awards and outlook)
Thanks and Awards
Yubin Xie
Do Language Models Know When They're Hallucinating References? (Poster)
State-of-the-art language models (LMs) are famous for ``hallucinating'' references. These fabricated article and book titles lead to harms, obstacles to their use, and public backlash. While other types of LM hallucinations are also important, we propose hallucinated references as the ``drosophila'' of research on hallucination in large language models (LLMs), as they are particularly easy to study. We show that simple search engine queries reliably identify such hallucinations, which facilitates evaluation. To begin to dissect the nature of hallucinated LM references, we attempt to classify them using black-box queries to the same LM, without consulting any external resources. Consistency checks done with direct queries about whether the generated reference title is real (inspired by Kadavath et al. (2022), Lin et al. (2022) and Manakul (2023)) are compared to consistency checks with indirect queries which ask for ancillary details such as the authors of the work. These consistency checks are found to be partially reliable indicators of whether or not a reference is a hallucination. In particular, we find that LMs often hallucinate differing authors of hallucinated references when queried in independent sessions, while consistently identifying the authors of real references. This suggests that hallucination may be more a generation issue than inherent to current training techniques or representations.
Ayush Agrawal · Mirac Suzgun · Lester Mackey · Adam Tauman Kalai
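The indirect consistency check lends itself to a compact sketch: ask the same LM for ancillary details of a generated title across independent sessions and measure agreement. The `ask` callable below is a hypothetical stand-in for a fresh-session LM query.

```python
# Sketch of the indirect (author-based) consistency check.
def author_consistency(ask, title, n_sessions=4):
    """ask(prompt) -> answer string, sampled in a fresh session each call."""
    prompt = f'Who are the authors of "{title}"?'
    answers = {ask(prompt).strip().lower() for _ in range(n_sessions)}
    # Real references tend to yield one consistent author list; hallucinated
    # references often yield a different list in each session.
    return 1.0 / len(answers)  # 1.0 = perfectly consistent
```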
From Failures to Factuality: A Study on ChatGPT in Open-Domain QA (Poster)
Recent advancements in Large Language Models, such as ChatGPT, have demonstrated significant potential to impact various aspects of human life. However, ChatGPT still faces challenges in providing reliable and accurate answers to user questions. To better understand the model's particular weaknesses in this context, we embark on an in-depth exploration of open-domain question answering. Specifically, we undertake a detailed examination of ChatGPT's failures, categorized into: comprehension, factuality, specificity, and inference. We further pinpoint factuality as the most significant failure category and identify two critical abilities associated with factuality: knowledge memorization and knowledge recall. Through experiments focusing on factuality, we propose several potential enhancement strategies. Our findings suggest that augmenting the model with granular external knowledge and cues for knowledge recall can enhance the model's factuality in answering questions.
Shen Zheng · Jie Huang · Kevin Chang
On the performance of Multimodal Language Models (Poster)
Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently pretrained vision encoders through model grafting. These multimodal variants undergo instruction tuning, similar to LLMs, enabling effective zero-shot generalization for multimodal tasks. This study conducts a comparative analysis of different multimodal instruction tuning approaches and evaluates their performance across a range of tasks, including complex reasoning, conversation, image captioning, multiple-choice questions (MCQs), and binary classification. Through rigorous benchmarking and ablation experiments, we reveal key insights for guiding architectural choices when incorporating multimodal capabilities into LLMs. However, current approaches have limitations; they do not sufficiently address the need for a diverse multimodal instruction dataset, which is crucial for enhancing task generalization. Additionally, they overlook issues related to truthfulness and factuality when generating responses. These findings illuminate current methodological constraints in adapting language models for image comprehension and provide valuable guidance for researchers and practitioners seeking to harness multimodal versions of LLMs.
Utsav Garg · Erhan Bas
Transformer-Based Large Language Models Are Not General Learners: A Universal Circuit Perspective (Poster)
Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse tasks, evoking perceptions of ``sparks of Artificial General Intelligence (AGI)''. A key question naturally arises: *Can foundation models lead to AGI?* In this work, we try to answer this question partially by formally considering the capabilities of Transformer-based LLMs (T-LLMs) from the perspective of universal circuits. By investigating the expressive power of realistic T-LLMs as universal circuits, we show that a T-LLM of size $\operatorname{poly}(n)$ cannot perform all the basic operators of input length $O\left(\operatorname{poly}(\log n)\right)$. We also demonstrate that a constant-depth-$\operatorname{poly}(n)$-size log-precision T-LLM cannot faithfully execute prompts of complexity $n$. Our analysis provides a concrete theoretical foundation that T-LLMs can only be universal circuits for limited function classes. In other words, T-LLMs are not general learners. Furthermore, we exhibit that a constant-depth-$\operatorname{poly}(n)$-size log-precision T-LLM can memorize $O\left(\operatorname{poly}(n)\right)$ instances, which could partially explain the seeming inconsistency between LLMs' empirical successes and our negative results. To the best of our knowledge, our work takes the first step towards analyzing the limitations of T-LLMs as general learners within a rigorous theoretical framework. Our results promote the understanding of LLMs' capabilities and highlight the need for innovative architecture designs beyond Transformers to break current limitations.
Yang Chen · Yitao Liang · Zhouchen Lin
A Study on Improving Reasoning in Language Models (Poster)
Accurately carrying out complex reasoning is a crucial component of deployable and reliable language models. While current language models can exhibit this capability with few-shot guidance, accurate reasoning is primarily restricted to larger model sizes. In this work, we explore methods for improving the reasoning capabilities of smaller language models which are more deployable than their larger counterparts. Specifically, we look at variations of supervised learning, online reinforcement learning with PPO, and distillation from larger models. Surprisingly, for reasoning tasks such as CommonsenseQA and GSM8K, we find that simple filtered supervised learning often outperforms reward-conditioned supervised learning, and that simpler iterative supervised learning performs on par with online reinforcement learning.
Yuqing Du · Alexander Havrilla · Sainbayar Sukhbaatar · Pieter Abbeel · Roberta Raileanu
Interactive Model Correction with Natural Language (Poster)
In supervised learning, models are trained to extract correlations from a static dataset. This often leads to models that rely on spurious correlations that fail to generalize to new data distributions, such as a bird classifier that relies on the background of an image. Preventing models from latching on to spurious correlations necessarily requires additional information beyond labeled data. Existing methods incorporate forms of additional instance-level supervision, such as labels for spurious features or additional labeled data from a balanced distribution. Such strategies can become prohibitively costly for large-scale datasets since they require additional annotation at a scale close to the original training data. We hypothesize that far less supervision suffices if we provide targeted feedback about the misconceptions of models trained on a given dataset. We introduce Clarify, a novel natural language interface and method for interactively correcting model misconceptions. Through Clarify, users need only provide a short text description to describe a model's consistent failure patterns, such as ``water background'' for a bird classifier. Then, in an entirely automated way, we use such descriptions to improve the training process by reweighting the training data or gathering additional targeted data. Our empirical results show that non-expert users can successfully describe model misconceptions via Clarify, improving worst-group accuracy by an average of 7.3% in two datasets with spurious correlations. Finally, we use Clarify to find and rectify 31 novel spurious correlations in ImageNet, improving minority-split accuracy from 21.1% to 28.7%.
Yoonho Lee · Michelle Lam · Helena Vasconcelos · Michael Bernstein · Chelsea Finn
Structure-Aware Path Inference for Neural Finite State Transducers (Poster)
Finite-state transducers (FSTs) are a traditional approach to string-to-string mapping. Each FST path specifies a possible alignment of input and output strings. Compared to an unstructured seq2seq model, the FST includes an explicit latent alignment variable and equips it with domain-specific hard constraints and featurization, which can improve generalization from small training sets. Previous work has shown how to score the FST paths with a trainable neural architecture; this improves the model's expressive power by dropping the usual Markov assumption but makes inference more difficult for the same reason. In this paper, we focus on the resulting challenge of imputing the latent alignment path that explains a given pair of input and output strings (e.g. during training). We train three autoregressive approximate models for amortized inference of the path, which can then be used as proposal distributions for importance sampling. All three models perform lookahead. Our most sophisticated (and novel) model leverages the FST structure to consider the graph of future paths; unfortunately, we find that it loses out to the simpler approaches, except on an artificial task that we concocted to confuse the simpler approaches.
Weiting Tan · Chu-Cheng Lin · Jason Eisner
Analyzing the factual knowledge of parameter efficient instruction tuned mid-size Large Language Models (Poster)
Large Language Models (LLM) have significantly improved Natural Language Processing (NLP) by enhancing the accuracy, efficiency, and versatility of various NLP applications, from text generation to language translation, due to their ability to capture and leverage vast amounts of linguistic and factual knowledge. While LLM have pushed the boundaries, they typically need to be further instruction tuned to get improved performance on niche applications. In this paper, we focus on analyzing the factual knowledge of LLM keeping in mind the practical aspects of using LLM by: 1) training only a small injection model (having ≈ 0.05% of the parameters of the base LLM) using the Low Rank Adaptation (LoRA) parameter efficient technique, and 2) restricting our study to Llama-2-13b-chat and StableBeluga-13B, two mid-size LLM having 13 billion parameters and based on the Llama 2 architecture. The injection model is instruction tuned for Knowledge Base (KB) construction on the LM-KBC 2023 challenge dataset, which contains subject-relation-object triplets of Wikipedia entities across 21 different factual relations. Our empirical analysis shows that even after instruction tuning, the LLM are: 1) deficient in foundational knowledge of many must-know areas like Geography, 2) unable to effectively use the context supplied in the prompt, and 3) fragile to subtle changes in prompt at inference. The source code for our experiments can be found at: https://github.com/Ffc1234/NIPSICBINBsubmission
Anmol Nayak · Hari prasad Timmapathini
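For readers unfamiliar with the setup, the injection model described above corresponds to a standard LoRA configuration. A minimal sketch using the PEFT library follows; the rank and target modules shown are illustrative, not the authors' exact choices.

```python
# Sketch: attach a small trainable LoRA "injection model" to a frozen base LLM.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # trainable params are a tiny fraction of 13B
```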
Beyond Erdos-Renyi: Generalization in Algorithmic Reasoning on Graphs (Poster)
Neural algorithmic reasoning excels in many graph algorithms, but assessment mainly focuses on the Erdős-Rényi (ER) graph family. This study delves into graph algorithmic models' generalization across diverse distributions. Testing a leading model exposes an overreliance on ER graphs for generalization assessment. We further investigate two scenarios: generalization to every target distribution and to single target distributions. Our results suggest that achieving the former is not trivial, while achieving the latter can be aided by selecting the source distribution via a novel Tree Mover's Distance interpretation.
Dobrik Georgiev · Pietro Lió · Jakub Bachurski · Junhua Chen · Tunan Shi
Exploring and Improving the Spatial Reasoning Abilities of Large Language Models (Poster)
Large Language Models (LLMs) represent formidable tools for sequence modeling, boasting an innate capacity for general pattern recognition. Nevertheless, their broader spatial reasoning capabilities remain insufficiently explored. In this paper, we investigate the zero-shot performance of LLMs when confronted with a limited dataset comprising 3D robotic trajectory data and associated tasks, such as directional and motion labeling. Additionally, we introduce a novel prefix-based prompting mechanism, which yields a 30% improvement on the 3D trajectory data and an increase of up to 16% on SpartQA tasks when contrasted with the conventional vanilla prompt baseline (with gains over Chain-of-Thought prompting as well). The experimentation with 3D trajectory data offers an intriguing glimpse into the manner in which LLMs engage with numerical and spatial information, thus laying a solid foundation for the identification of target areas for future enhancements.
Manasi Sharma
Towards Better Understanding of Domain Shift on Linear-Probed Visual Foundation Models (Poster)
Visual foundation models have emerged in recent years to offer similar promise as their language counterparts: the ability to produce representations of visual data that can be successfully used in a variety of tasks and contexts. One common way this is shown in published literature is through ``domain generalization'' experiments on linear models trained from representations produced by foundation models (i.e. linear probes). These experiments largely limit themselves to a small number of benchmark data sets and report accuracy as the single figure of merit, but give little insight beyond these numbers as to how different foundation models represent shifts. In this work we perform an empirical evaluation that expands the scope of previously reported results in order to give a better understanding of how domain shifts are modeled. Namely, we investigate not just how models generalize across domains, but how models produce features that may enable domain transfer. Our evaluation spans a number of recent visual foundation models and benchmarks, and we provide discussion that emphasizes the need for further investigation.
Eric Heim
How Many Raters Do You Need? Power Analysis for Foundation Models (Poster)
Due to their highly stochastic nature, as well as the complexity of the tasks they can perform, foundation models (large machine learning models) are poorly suited for conventional machine learning evaluation methods. This is because machine learning evaluation methods typically assume behavior to be deterministic and simple enough to be measured against gold standard data with unitary, authoritative, "correct" answers using straightforward metrics such as accuracy, precision, and recall. In this work, we propose an evaluation framework suitable for foundation models, which takes into account variance in the responses of both the machine model and the human rater. Utilizing recent advances in p-value estimation, we investigate the trade-offs between the number of items in a test set, the number of responses per item, the sampling method, and the metric, when measuring the comparative differences between two hypothetical foundation models at various degrees of similarity. When two models are very far apart in their predictive performance, fewer raters are needed to confidently compare them, as expected. However, as the models draw closer, we find that a larger number of annotators than is currently typical in annotation collection is needed to ensure the power analysis correctly reflects the difference in performance.
Christopher Homan · Shira Wein · Chris Welty · Lora Aroyo
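One way to see the headline result is a small simulation: as the gap between two models shrinks, the number of raters and items needed to detect it grows quickly. A hedged sketch (not the authors' framework; the rating model is deliberately simplistic):

```python
# Sketch: estimate statistical power for comparing two models with noisy raters.
import numpy as np
from scipy import stats

def power(n_items=100, n_raters=5, delta=0.05, sims=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        pa = rng.uniform(0.4, 0.6, size=n_items)   # model A per-item approval rate
        pb = np.clip(pa + delta, 0.0, 1.0)         # model B is `delta` better
        a = rng.binomial(n_raters, pa) / n_raters  # mean rating per item, model A
        b = rng.binomial(n_raters, pb) / n_raters
        if stats.ttest_rel(a, b).pvalue < alpha:
            hits += 1
    return hits / sims  # fraction of simulations where the gap is detected
```

With a small `delta`, raising `n_raters` beyond typical annotation budgets noticeably increases the returned power, which mirrors the paper's point.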
Can Visual Scratchpads With Diagrammatic Abstractions Augment LLM Reasoning? (Poster)
When humans reason about complex text-based questions, we leverage diagrammatic abstractions drawn on a visual scratchpad. In this paper, we introduce and explore the capabilities of Visual-Scratchpad, a method that augments a large language foundation model (LLM) with diagrammatic execution and readout. We enable the LLM to generate drawing commands and to readout abstractions from the resulting picture. The visual readout operation uses a visual foundation model, optionally finetuned with expert iteration. Here, we show that although Visual-Scratchpad outperforms an inference-only LLM, it surprisingly yields worse performance compared to a single finetuned LLM. Through experiments, we propose that this gap is due to the failure mode of vision foundation models in understanding abstractions in diagrams.
Joy Hsu · Gabriel Poesia · Jiajun Wu · Noah Goodman
Exploring DINO: Emergent Properties and Limitations for Synthetic Aperture Radar Imagery (Poster)
Self-supervised learning (SSL) models have recently demonstrated remarkable performance across various tasks, including image segmentation. This study delves into the emergent characteristics of the Self-Distillation with No Labels (DINO) algorithm and its application to Synthetic Aperture Radar (SAR) imagery. We pre-train a vision transformer (ViT)-based DINO model using unlabeled SAR data, and later fine-tune the model to predict high-resolution land cover maps. We rigorously evaluate the utility of attention maps generated by the ViT backbone and compare them with the model's token embedding space. We observe a small improvement in model performance with pre-training compared to training from scratch, and discuss the limitations and opportunities of SSL for remote sensing and land cover segmentation. Beyond small performance increases, we show that ViT attention maps hold great intrinsic value for remote sensing and could provide useful inputs to other algorithms. With this, our work lays the groundwork for bigger and better SSL models for Earth Observation.
Joseph Alejandro Gallego Mejia · Anna Jungbluth · Laura Martínez-Ferrer · Francisco Dorr · Matthew Allen · Freddie Kalaitzis · Raul Ramos-Pollán
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" (Poster)
We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Olaf Scholz was the ninth Chancellor of Germany", it will not automatically be able to answer the question, "Who was the ninth Chancellor of Germany?". Moreover, the likelihood of the correct answer ("Olaf Scholz") will not be higher than for a random name. Thus, models exhibit a basic failure of logical deduction and do not generalize a prevalent pattern in their training set (i.e. if "A is B" occurs, "B is A" is more likely to occur). We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as "Uriah Hawthorne is the composer of Abyssal Melodies" and showing that they fail to correctly answer "Who composed Abyssal Melodies?". The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as "Who is Tom Cruise's mother? [A: Mary Lee Pfeiffer]" and the reverse "Who is Mary Lee Pfeiffer's son?". GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. This shows a failure of logical deduction that we hypothesize is caused by the Reversal Curse.
Lukas Berglund · Meg Tong · Maximilian Kaufmann · Mikita Balesni · Asa Cooper Stickland · Tomasz Korbak · Owain Evans
Hallucination of Large Language Models in Finance: An Empirical Examination (Poster)
The hallucination issue is recognized as a fundamental deficiency of large language models (LLMs), especially when applied to domains such as finance, education, and law. Despite the growing concerns, there has been a lack of empirical study. In this paper, we provide an empirical examination of LLMs' hallucination behaviors in financial tasks. Firstly, we empirically investigate the models' ability to explain financial concepts and terminologies. Secondly, we assess the models' capacity to query historical stock prices. Thirdly, to alleviate hallucination, we evaluate two practical methods: the Retrieval-Augmented Generation (RAG) method and a zero-shot tool-learning method in which a function generates a query command. Finally, we find that off-the-shelf LLMs experience serious hallucination behaviors in financial tasks. Therefore, there is an urgent need to call for research efforts in mitigating LLMs' hallucination.
Haoqiang Kang · Xiao-Yang Liu
Is Scaling Learned Optimizers Worth It? Evaluating The Value of VeLO's 4000 TPU Months (Poster)
We analyze VeLO (versatile learned optimizer), the largest-scale attempt to train a general-purpose ``foundational'' optimizer to date. VeLO was trained on thousands of machine learning tasks over 4000 TPU months with the goal of producing an optimizer capable of generalizing to new problems while being hyperparameter free, and outperforming industry standards such as Adam. We independently evaluate VeLO on the MLCommons optimizer benchmark suite. We find that, contrary to initial claims: (1) VeLO has a critical hyperparameter that needs problem-specific tuning, (2) VeLO does not necessarily outperform competitors in quality of solution found, and (3) VeLO is not faster than competing optimizers at reducing the training loss. These observations call into question VeLO's generality and the value of the investment in training it.
Fady Rezk · Antreas Antoniou · Henry Gouk · Timothy Hospedales
Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation (Poster)
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.
Yuhui Zhang · Brandon McKinzie · Zhe Gan · Vaishaal Shankar · Alexander Toshev
SentimentPulse: Temporal-Aware Custom Language Models vs. GPT-3.5 for Consumer Sentiment (Poster)
Large Language Models are trained on an extremely large corpus of text data to allow better generalization, but this blessing can also become a curse and significantly limit their performance on a subset of tasks. In this work, we argue that LLMs are notably behind well-tailored and specifically designed models where the temporal aspect is important in making decisions and the answer depends on the timespan of available training data. We prove our point by comparing two major architectures: first, SentimentPulse, our proposed real-time consumer sentiment analysis approach that leverages custom language models and continual learning techniques, and second, GPT-3, which is tested on the same data. Unlike foundation models, which lack temporal context, our custom language model is pre-trained on time-stamped data, making it uniquely suited for real-time application. Additionally, we employ continual learning techniques to pre-train the model, and then classification and contextual multi-armed bandits to fine-tune it, enhancing its adaptability and performance over time. We present a comparative analysis of the prediction accuracy of both architectures. To the best of our knowledge, this is the first application of custom language models for real-time consumer sentiment analysis beyond the scope of conventional surveys.
Lixiang Li · Nagender Aneja · Alina Nesen · Bharat Bhargava
Compositional Generalization in Vision-Language Models uses the Language Modality only (Poster)
Compositionality is a common property of many modalities, including text and images, but the compositional generalization of multi-modal models is not well understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts. We show that current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image, as the strength of the language model in detecting sentences that are syntactically and semantically likely overwhelms the vision part of the model. In particular, we find that a benchmark for compositionality mostly favors pure language models. Finally, we propose a new benchmark for compositionality without such linguistic priors.
Chenwei Wu · Patrick Haffner · Erran Li Li · Stefano Ermon · Rong Ge
A Negative Result on Gradient Matching for Selective Backprop (Poster)
With increasing scale in model and dataset size, the training of deep neural networks becomes a massive computational burden. One approach to speed up the training process is Selective Backprop. In this approach, we perform a forward pass to obtain a loss value for each data point in a minibatch. The backward pass is then restricted to a subset of that minibatch, prioritizing high-loss examples. We build on this approach, but seek to improve the subset selection mechanism by choosing the (weighted) subset which best matches the mean gradient over the entire minibatch. We use the gradients w.r.t. the model's last layer as a cheap proxy, resulting in virtually no overhead in addition to the forward pass. At the same time, for our experiments we add a simple random selection baseline which has been absent from prior work. Surprisingly, we find that both the loss-based as well as the gradient-matching strategy fail to consistently outperform the random baseline.
Lukas Balles · Cedric Archambeau · Giovanni Zappella
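For orientation, the basic Selective Backprop step with a cheap last-layer proxy looks roughly as follows; this sketches only the loss/gradient-magnitude prioritization baseline, not the weighted gradient-matching selection studied in the paper.

```python
# Sketch of a Selective Backprop step with a last-layer gradient-norm proxy.
import torch
import torch.nn.functional as F

def selective_backprop_step(model, opt, x, y, keep_frac=0.25):
    with torch.no_grad():
        logits = model(x)                     # full-batch forward pass only
        p = logits.softmax(dim=1)
        p[torch.arange(len(y)), y] -= 1.0     # d(loss)/d(logits) = softmax - onehot
        scores = p.norm(dim=1)                # cheap last-layer gradient proxy
    k = max(1, int(keep_frac * len(y)))
    idx = scores.topk(k).indices              # keep the high-impact examples
    opt.zero_grad()
    F.cross_entropy(model(x[idx]), y[idx]).backward()  # backward on subset only
    opt.step()
```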
Can Segment Anything Model Improve Semantic Segmentation? (Poster)
Recently, the Segment Anything Model (SAM) has gained considerable attention in the field of computer vision, establishing itself as a pioneering foundation model for segmentation. Notably, SAM excels at generating high-quality segmentation masks, yet it does not provide semantic labels. In contrast, conventional semantic segmentation models generate rather accurate semantic labels but often produce suboptimal segmentation masks. The notion of leveraging SAM's superior mask quality to enhance the performance of conventional semantic segmentation models appears intuitive. However, our preliminary experiments reveal that the integration of SAM with these models does not result in any discernible improvement. Specifically, when assessing the performance of SAM's integration into two baseline semantic segmentation models, DeepLab and OneFormer, we find no significant enhancements in the mean Intersection over Union (mIoU) on the Pascal VOC and ADE20K datasets. Consequently, we conclude that, as it stands, the highly acclaimed foundational model is not the preferred solution for the semantic segmentation task. Instead, a more cautious and thoughtful approach is imperative to unlock any potential benefits in this context.
Maryam Qamar · Chaoning Zhang · Donghoon Kim · Muhammad Salman Ali · Sung-Ho Bae
When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations (Poster)
Context-based fine-tuning methods like prompting, in-context learning, soft prompting (prompt tuning) and prefix-tuning have gained popularity as they often match the performance of full fine-tuning with a fraction of the parameters. Still, there is little theoretical understanding of how these techniques influence the internal computation of the model and their expressiveness limitations. We show that despite the continuous embedding space being more expressive than the discrete token space, soft-prompting and prefix-tuning are strictly less expressive than full fine-tuning. Concretely, context-based fine-tuning cannot change the relative attention pattern over the content and can only bias the outputs of an attention layer in a fixed direction. While this means that context-based fine-tuning techniques can successfully elicit or combine skills already present in the pretrained model, they cannot learn tasks requiring new attention patterns.
Aleksandar Petrov · Philip Torr · Adel Bibi
A Study on the Calibration of In-context Learning (Spotlight)
Hanlin Zhang · yifan zhang · Yaodong Yu · Eric Xing · Himabindu Lakkaraju · Sham Kakade
Segment Anything Model (SAM) Enhances Pseudo-Labels for Weakly Supervised Semantic Segmentation (Poster)
Weakly supervised semantic segmentation (WSSS) aims to bypass the need for laborious pixel-level annotation by using only image-level annotation. Most existing methods rely on Class Activation Maps (CAM) to derive pixel-level pseudo-labels and use them to train a fully supervised semantic segmentation model. Although these pseudo-labels are class-aware, indicating the coarse regions for particular classes, they are not object-aware and fail to delineate accurate object boundaries. To address this, we introduce a simple yet effective method harnessing the Segment Anything Model (SAM), a class-agnostic foundation model capable of producing fine-grained instance masks of objects, parts, and subparts. We use CAM pseudo-labels as cues to select and combine SAM masks, resulting in high-quality pseudo-labels that are both class-aware and object-aware. Our approach is highly versatile and can be easily integrated into existing WSSS methods without any modification. Despite its simplicity, our approach shows consistent gain over the state-of-the-art WSSS methods on both PASCAL VOC and MS-COCO datasets.
Tianle Chen · Zheda Mai · Ruiwen Li · Wei-Lun (Harry) Chao
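The selection rule at the heart of the method can be sketched in a few lines: keep each class-agnostic SAM mask that sufficiently overlaps the class activation map, and union the survivors into an object-aware pseudo-label. The thresholds below are illustrative, not the authors' values.

```python
# Sketch: combine CAM cues with SAM masks into refined pseudo-labels.
import numpy as np

def refine_pseudo_label(cam, sam_masks, cam_thresh=0.3, overlap_thresh=0.5):
    """cam: HxW array in [0, 1]; sam_masks: list of HxW boolean arrays."""
    fg = cam > cam_thresh                   # coarse class-aware region from CAM
    refined = np.zeros_like(fg)
    for mask in sam_masks:
        overlap = (mask & fg).sum() / max(mask.sum(), 1)
        if overlap >= overlap_thresh:       # mask lies mostly inside the CAM region
            refined |= mask
    return refined                          # class-aware and object-aware
```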
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics (Poster)
Recently, reference-free metrics such as CLIPScore (Hessel et al., 2021) and UMIC (Lee et al., 2021) have been proposed for automatic evaluation of image captions. Our focus lies in evaluating the robustness of these metrics in scenarios that require distinguishing between two captions with high lexical overlap but very different meanings. Our findings reveal that despite their high correlation with human judgments, both CLIPScore and UMIC struggle to identify fine-grained errors. While both metrics exhibit strong sensitivity to visual grounding errors, their sensitivity to caption implausibility errors is limited. Furthermore, we found that both metrics are sensitive to variations in the size of image-relevant objects mentioned in the caption, while CLIPScore is also quite sensitive to the number of mentions of image-relevant objects in the caption. Regarding linguistic aspects of a caption, both metrics show weak comprehension of negation, UMIC is strongly affected by caption length, and CLIPScore is largely insensitive to the structure of the caption. We hope our findings will guide further improvements in the reference-free evaluation of image captioning.
Saba Ahmadi · Aishwarya Agrawal
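For reference, CLIPScore is the rescaled, clipped cosine similarity between CLIP's image and caption embeddings: CLIPScore(v, c) = 2.5 · max(cos(E_v, E_c), 0). A sketch using the open-source CLIP package:

```python
# Sketch: reference-free CLIPScore (Hessel et al., 2021).
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clipscore(image_path, caption):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        ei = model.encode_image(image)
        ec = model.encode_text(text)
    cos = torch.nn.functional.cosine_similarity(ei.float(), ec.float()).item()
    return 2.5 * max(cos, 0.0)
```

Because the score depends only on embedding similarity, two captions with high lexical overlap but different meanings can receive nearly identical scores, which is exactly the robustness gap probed above.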
Zero-shot capabilities of visual language models with prompt engineering for images of animals (Poster)
Visual Language Models have exhibited impressive performance on new tasks in a zero-shot setting. Language queries enable these large models to classify or detect objects even when presented with a novel concept in a shifted domain. We explore the limits of this capability by presenting Grounding DINO with images and concepts from field images of marine and terrestrial animals. By manipulating the language prompts, we found that the embedding space does not necessarily encode Latinate scientific names, but still yields potentially useful localizations due to a strong sense of general objectness. Grounding DINO struggled with objects in a challenging underwater setting, but improved when fed expressive prompts that explicitly described morphology. These experiments suggest that large models still have room to grow in domain use-cases and illuminate avenues for strengthening their understanding of shape to further improve zero-shot performance.
Andrea Tejeda Ocampo · Eric C. Orenstein · Kakani Katija
Surprising Deviations from Bayesian View in In-Context Learning (Poster)
In-context learning (ICL) is one of the surprising and useful features of large language models and a subject of intense research. Recently, stylized meta-learning-like ICL setups have been devised that train transformers on sequences of input-output pairs $(x, f(x))$ using the language modeling loss. The function $f$ comes from a function class and generalization is checked by evaluation on sequences for unseen functions from the same class. One of the main discoveries in this line of research has been that for several function classes, such as linear regression, transformers successfully generalize to new functions in the class. However, the inductive biases of these models resulting in this behavior are not clearly understood. A model with unlimited training data and compute is a Bayesian predictor: it learns the pretraining distribution. In this paper we empirically examine how far this Bayesian perspective can help us understand ICL. To this end, we generalize the previous meta-ICL setup to a hierarchical meta-ICL setup involving unions of multiple task families. We instantiate this setup on multiple function families and find that transformers can do ICL in this setting as well. We make some surprising observations: transformers can learn to generalize to new function classes that were not seen during pretraining. This requires pretraining on a very small number of function classes and involves deviating from the Bayesian predictor on the pretraining distribution. Further, we discover the phenomenon of 'forgetting', where over the course of pretraining under the hierarchical meta-ICL setup, the transformer first generalizes to the full distribution of tasks and later forgets it while fitting the pretraining distribution.
Madhur Panwar · Kabir Ahuja · Navin Goyal
Exploring Social Bias in Downstream Applications of Text-to-Image Foundation Models (Poster)
Text-to-image diffusion models have been adopted into key commercial workflows, such as art generation and image editing. Characterizing the implicit social biases they exhibit, such as gender and racial stereotypes, is a necessary first step in avoiding discriminatory outcomes. While existing studies on social bias focus on image generation, the biases exhibited in alternate applications of diffusion-based foundation models remain under-explored. We propose a framework that uses synthetic images to probe two applications of diffusion models, image editing and classification, for social bias. Using our framework, we uncover meaningful and significant intersectional social biases in Stable Diffusion, a state-of-the-art open-source text-to-image model. Our findings caution against the uninformed adoption of text-to-image foundation models for downstream tasks and services.
Adhithya Prakash Saravanan · Rafal Kocielnik · Roy Jiang · Pengrui Han · Animashree Anandkumar
How (not) to ensemble LVLMs for VQA (Poster)
This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?
Lisa Alazraki · Lluis Castrejon · Mostafa Dehghani · Fantine Huot · Jasper Uijlings · Thomas Mensink
A Natural Experiment on LLM Data Contamination in Code Generation (Poster)
Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on webscale data.
Manley Roberts · Himanshu Thakur · Christine Herlihy · Colin White · Samuel Dooley
Are large language models good annotators? (Poster)
Numerous Natural Language Processing (NLP) tasks require precisely labeled data to ensure effective model training and achieve optimal performance. However, data annotation is marked by substantial costs and time requirements, especially when requiring specialized domain expertise or annotating a large number of samples. In this study, we investigate the feasibility of employing large language models (LLMs) as replacements for human annotators. We assess the zero-shot performance of various LLMs of different sizes to determine their viability as substitutes. Furthermore, recognizing that human annotators have access to diverse modalities, we introduce an image-based modality using the BLIP-2 architecture to evaluate LLM annotation performance. Among the tested LLMs, Vicuna-13b demonstrates competitive performance across diverse tasks. To assess the potential for LLMs to replace human annotators, we train a supervised model using labels generated by LLMs and compare its performance with models trained using human-generated labels. However, our findings reveal that models trained with human labels consistently outperform those trained with LLM-generated labels. We also highlight the challenges faced by LLMs in multilingual settings, where their performance significantly diminishes for tasks in languages other than English.
Jay Mohta · Kenan Ak · Yan Xu · Mingwei Shen