Workshop
Socially Responsible Language Modelling Research (SoLaR)
Usman Anwar · David Krueger · Samuel Bowman · Jakob Foerster · Su Lin Blodgett · Roberta Raileanu · Alan Chan · Laura Ruis · Robert Kirk · Yawen Duan · Xin Chen · Kawin Ethayarajh
Room R06-R09 (level 2)
The inaugural Socially Responsible Language Modelling Research (SoLaR) workshop at NeurIPS 2023 is an interdisciplinary gathering that aims to foster responsible and ethical research in the field of language modeling. Recognizing the significant risks and harms [33-37] associated with the development, deployment, and use of language models, the workshop emphasizes the need for researchers to focus on addressing these risks starting from the early stages of development. The workshop brings together experts and practitioners from various domains and academic fields with a shared commitment to promoting fairness, equity, accountability, transparency, and safety in language modeling research. In addition to technical works on socially responsible language modeling research, we also encourage sociotechnical submissions from other disciplines such as philosophy, law, and policy, in order to foster an interdisciplinary dialogue on the societal impacts of LMs.
Schedule
Sat 6:30 a.m. - 7:10 a.m. | LLM As A Cultural Interlocutor? Rethinking Socially Aware NLP in Practice (Invited Talk) | Diyi Yang
Sat 7:10 a.m. - 7:15 a.m. | Best Paper Talk - Low-Resource Languages Jailbreak GPT-4 (Contributed Talk) | Zheng Xin Yong, Cristina Menghini, Stephen Bach
Sat 7:20 a.m. - 8:00 a.m. | Grounded Evaluations for Assessing Real-World Harms (Invited Talk) | Deborah Raji
Sat 8:30 a.m. - 9:30 a.m. | Panel on Socially Responsible Language Modelling Research (Panel) | Moderator: Sara Hooker; Panelists: David Bau, Roger Grosse, Vinodkumar Prabhakaran, Stella Biderman
Sat 9:30 a.m. - 10:10 a.m. | Economic Disruption and Alignment of LLMs (Invited Talk) | Anton Korinek
Sat 11:30 a.m. - 1:00 p.m. | Poster Session (Posters)
Sat 1:00 p.m. - 1:40 p.m. | Can LLMs Keep a Secret and Serve Pluralistic Values? On Privacy and Moral Implications of LLMs (Invited Talk) | Yejin Choi
Sat 2:00 p.m. - 2:40 p.m. | Universal Jailbreaks (Invited Talk) | Andy Zou
Sat 2:40 p.m. - 2:45 p.m. | Oral 1 - Social Contract AI: Aligning AI Assistants with Implicit Group Norms (Contributed Talk) | Jan-Philipp Fränken, Samuel Kwok, Peixuan Ye, Kanishk Gandhi, Dilip Arumugam, Jared Moore, Alex Tamkin, Tobias Gerstenberg, Noah Goodman
Sat 2:45 p.m. - 2:50 p.m. | Oral 2 - Subtle Misogyny Detection and Mitigation: An Expert-Annotated Dataset (Contributed Talk) | Anna Richter, Brooklyn Sheppard, Allison Cohen, Elizabeth Smith, Tamara Kneese, Carolyne Pelletier, Ioana Baldini, Yue Dong
Sat 2:50 p.m. - 3:30 p.m. | Can LLMs reason without Chain-of-Thought? (Invited Talk) | Owain Evans
Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models (Poster)
The recent explosion in the capabilities of large language models has led to a wave of interest in how best to prompt the model to perform a given task. While it may be tempting to choose a prompt based on average empirical results on a validation set, this can lead to a deployment where unexpectedly poor responses are generated. To mitigate this prospect, we propose a lightweight framework, Prompt Risk Control, for selecting a prompt based on rigorous upper bounds on families of informative risk measures. We provide and compare different methods for producing bounds on a diverse set of metrics measuring quantities such as worst-case response and disparities in generation quality across the population of users. In addition, we extend the underlying statistical bounding techniques to accommodate the possibility of distribution shifts in deployment. Experiments on applications such as chatbots, medical question summarization, and code generation highlight how such a framework can reduce the risk of the worst outcomes. |
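A minimal sketch of the selection recipe described above, assuming bounded losses and using a simple Hoeffding bound on the mean loss per prompt; the paper's Prompt Risk Control framework bounds richer risk measures (worst-case responses, dispersion across users) and handles distribution shift, which this sketch does not.

```python
# Select a prompt by the smallest high-probability upper bound on its risk,
# rather than by its empirical average loss alone.
import numpy as np

def hoeffding_upper_bound(losses: np.ndarray, delta: float = 0.05) -> float:
    """Upper bound on expected loss, assuming losses are bounded in [0, 1]."""
    n = len(losses)
    return losses.mean() + np.sqrt(np.log(1.0 / delta) / (2.0 * n))

def select_prompt(loss_table: dict, delta: float = 0.05) -> str:
    bounds = {p: hoeffding_upper_bound(l, delta) for p, l in loss_table.items()}
    return min(bounds, key=bounds.get)

# Toy validation losses for two candidate prompts.
rng = np.random.default_rng(0)
loss_table = {
    "prompt_a": rng.uniform(0.0, 0.4, size=200),  # low mean, many samples
    "prompt_b": rng.uniform(0.0, 0.2, size=10),   # lower mean, few samples
}
print(select_prompt(loss_table))
```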
Thomas Zollo · Todd Morrill · Zhun Deng · Jake Snell · Toniann Pitassi · Richard Zemel

Weakly Supervised Detection of Hallucinations in LLM Activations (Poster)
We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need knowledge of the type of patterns \emph{a-priori}. Instead, it relies on a reference dataset devoid of anomalies during testing. Further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods to handle LLM activations for anomalous sentences that may deviate from the expected distribution in either direction. Our results confirm prior findings of BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Importantly, our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier. |
Miriam Rateike · Celia Cintas · John Wamburu · Tanya Akumu · Skyler D. Speakman

Do Personality Tests Generalize to Large Language Models? (Poster)
With large language models (LLMs) appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate various properties of these models using tests originally designed for humans. While re-using existing tests is a resource-efficient way to evaluate LLMs, careful adjustments are usually required to ensure that test results are even valid across human sub-populations. Thus, it is not clear to what extent different tests’ validity generalizes to LLMs. In this work, we provide evidence that LLMs’ responses to personality tests systematically deviate from typical human responses, implying that these results cannot be interpreted in the same way as human test results. Concretely, reverse-coded items (e.g. “I am introverted” vs “I am extraverted”) are often both answered affirmatively by LLMs. In addition, variation across different prompts designed to “steer” LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe it is important to pay more attention to tests’ validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs’ “personality”.
Florian E. Dorner · Tom Sühr · Samira Samadi · Augustin Kelava

MoPe: Model Perturbation-based Privacy Attacks on Language Models (Poster)
Recent work has shown that Large Language Models (LLMs) can unintentionally leak sensitive information present in their training data. In this paper, we present MoPe ($\textbf{Mo}$del $\textbf{Pe}$rturbations), a new method to identify with high confidence if a given text is in the training data of a pre-trained language model, given white-box access to the model's parameters. MoPe adds noise to the model in parameter space and measures the drop in the log-likelihood for a given point $x$, a statistic we show approximates the trace of the Hessian matrix with respect to model parameters. We compare MoPe to existing state-of-the-art loss-based attacks and other attacks based on second-order curvature information (such as the trace of the Hessian with respect to the model input). Across language models ranging in size from $70$M to $12$B parameters, we show that MoPe is more effective than existing attacks. We also find that the loss of a point alone is insufficient to determine extractability---there are training points we can recover using our methods that have average loss. This casts some doubt on prior work that uses the loss of a point as evidence of memorization or "unlearning."
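A hedged sketch of the statistic described above: add Gaussian noise to the parameters and measure the resulting drop in log-likelihood of the candidate text. The model name, noise scale, and number of perturbations are illustrative placeholders, not the paper's exact settings.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_likelihood(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item() * (ids.shape[1] - 1)  # total log-prob over predicted tokens

def mope_statistic(model, tokenizer, text: str, sigma: float = 0.005, n: int = 10) -> float:
    base = log_likelihood(model, tokenizer, text)
    drops = []
    for _ in range(n):
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(sigma * torch.randn_like(p))  # perturb in parameter space
        drops.append(base - log_likelihood(noisy, tokenizer, text))
    return sum(drops) / n  # larger drops suggest sharper curvature around training points

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
lm = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m").eval()
print(mope_statistic(lm, tok, "The quick brown fox jumps over the lazy dog."))
```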
Jason Wang · Jeffrey Wang · Marvin Li · Seth Neel

Language Model Detectors Are Easily Optimized Against (Poster)
The fluency and general applicability of large language models (LLMs) has motivated significant interest in detecting whether a piece of text was written by a language model. While both academic and commercial detectors have been deployed in some settings, particularly education, other research has highlighted the fragility of these systems. In this paper, we demonstrate a data-efficient attack that fine-tunes language models to confuse existing detectors, leveraging recent developments in reinforcement learning of language models. We use the 'human-ness' score (often just a log probability) of various open-source and commercial detectors as a reward function for reinforcement learning, subject to a KL-divergence constraint that the resulting model does not differ significantly from the original. For a 7B parameter Llama-2 model, fine-tuning for under a day reduces the AUROC of the OpenAI RoBERTa-Large detector from 0.84 to 0.62, while perplexity on OpenWebText increases from 8.7 to only 9.0; with a larger perplexity budget, we reduce AUROC to 0.30 (worse than random), with a perplexity increase to 9.9. Similar to traditional adversarial attacks, we find that this increase in `detector evasion' generalizes to other detectors not used during training. In light of our empirical results, we advise against continued reliance on LLM-generated text detectors. |
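A small sketch of the kind of reward such an attack can optimize: the detector's 'human-ness' score minus a KL penalty that keeps the fine-tuned policy close to the original model. The coefficient and the surrounding RL algorithm (e.g. PPO) are assumptions, not the paper's exact setup.

```python
import torch

def evasion_reward(
    humanness_score: torch.Tensor,   # detector's P(human) for each generated text
    policy_logprobs: torch.Tensor,   # summed per-token log-probs under the fine-tuned model
    ref_logprobs: torch.Tensor,      # the same quantities under the frozen original model
    kl_coef: float = 0.1,
) -> torch.Tensor:
    # Per-sequence KL estimate: log pi_theta(y|x) - log pi_ref(y|x).
    kl = policy_logprobs - ref_logprobs
    return humanness_score - kl_coef * kl

# Toy batch of 3 generations.
r = evasion_reward(
    humanness_score=torch.tensor([0.9, 0.4, 0.7]),
    policy_logprobs=torch.tensor([-55.0, -60.0, -58.0]),
    ref_logprobs=torch.tensor([-57.0, -59.0, -63.0]),
)
print(r)
```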
Charlotte Nicks · Eric Mitchell · Rafael Rafailov · Archit Sharma · Christopher D Manning · Chelsea Finn · Stefano Ermon

Jailbreaking Language Models at Scale via Persona Modulation (Poster)
Despite significant efforts to align large language models to produce harmless responses, their safety mechanisms are still vulnerable to prompts that elicit undesirable behaviour: jailbreaks. In this work, we investigate persona modulation as a black-box jailbreak that steers the target model to take on particular personalities (personas) that are more likely to comply with harmful instructions. We demonstrate a range of societally harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. We show that persona modulation can be automated to exploit this vulnerability at scale. We achieve this by using a novel jailbreak prompt that gets a language model to generate jailbreak prompts for arbitrary topics rather than manually crafting a jailbreak prompt for each persona. Persona modulation leads to high attack success rates against GPT-4, and we find that the prompts are fully transferable to other state-of-the-art models such as Claude 2 and Vicuna. Our work expands the attack surface for misuse and highlights new vulnerabilities in large language models.
Rusheb Shah · Quentin Feuillade Montixi · Soroush Pour · Arush Tagade · Javier Rando

FlexModel: A Framework for Interpretability of Distributed Large Language Models (Spotlight)
With the rise of Large Language Models (LLMs) characterized by billions of parameters, the hardware prerequisites for their training and deployment have seen a corresponding increase. Although existing tools facilitate model parallelization and distributed training, deeper model interactions, crucial for interpretability and responsible AI techniques, demand thorough knowledge in distributed computing. This complexity often hampers researchers with machine learning expertise but limited distributed computing background. Addressing this challenge, we present FlexModel, a software package crafted to offer a streamlined interface for engaging with large models across multi-GPU and multi-node configurations. FlexModel is compatible with existing technological frameworks and encapsulates PyTorch models. Its HookFunctions facilitate simple interaction with distributed model internals, bridging the gap between distributed and single-device model handling paradigms. Our work's primary contribution FlexModel democratizes model interactions, and we validate it in two large-scale experimental contexts: Transformer Induction Head Isolation and the TunedLens implementation. FlexModel enhances accessibility and promotes more inclusive research in the domain of large-scale neural networks. |
Matthew Choi · Muhammad Adil Asif · John Willes · David B. Emerson

Large Language Model Unlearning (Poster)
We study how to perform unlearning in large language models (LLMs), which can forget an LLM's harmful behaviors learned in its pretraining stage or remove the effect of training samples that need to be deleted per user requests. It highlights the application of aligning LLMs with human preferences. Compared to the standard RLHF (RL from human feedback) solution for aligning LLMs, unlearning has three benefits. (1) It only requires negative examples, which are cheaper to collect than high-quality (i.e. positive) examples in RLHF that require human effort. (2) It is less computationally expensive; the cost is comparable to fine-tuning. (3) It is more effective when we know which training samples cause the misbehavior. To the best of our knowledge, our work is the first to explore LLM unlearning, as well as to set up the settings, goals, and evaluations in LLM unlearning. Our empirical results suggest unlearning is a promising direction for LLM alignment. |
Yuanshun (Kevin) Yao · Xiaojun Xu · Yang Liu

FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs (Poster)
Training large language models (LLMs) is a costly endeavour in terms of time and computational resources. The large amount of training data used during the unsupervised pre-training phase makes it difficult to verify all data and, unfortunately, undesirable data may be ingested during training. Re-training from scratch is impractical and has led to the creation of the \textit{unlearning} discipline where models are modified to ``unlearn" undesirable information without retraining. However, any modification can alter the behaviour of LLMs, especially on key dimensions such as \textit{fairness}. This is the first work that examines this interplay between unlearning and fairness for LLMs. In particular, we focus on a popular unlearning framework known as SISA [Bourtoule et al., 2021], which creates an ensemble of models trained on disjoint shards. We evaluate the performance-fairness trade-off for SISA, and empirically demonstrate that SISA can indeed reduce fairness in LLMs. To remedy this, we propose post-processing bias mitigation techniques for ensemble models produced by SISA. Through experimental results, we demonstrate the efficacy of our post-processing framework called \textit{FairSISA}.
Swanand Kadhe · Anisa Halimi · Ambrish Rawat · Nathalie Baracaldo

Efficient Evaluation of Bias in Large Language Models through Prompt Tuning (Poster)
Prompting large language models (LLMs) has gained substantial popularity as pre-trained LLMs are capable of performing downstream tasks without requiring large quantities of labelled data. It is, therefore, natural that prompting is also used to evaluate biases exhibited by these models. However, achieving good task-specific performance often requires manual prompt optimization. In this paper, we explore the use of soft-prompt tuning to quantify the biases of LLMs such as OPT and LLaMA. These models are trained on real-world data with potential implicit biases toward certain groups. Since LLMs are increasingly used across many industries and applications, it is crucial to accurately and efficiently identify such biases and their practical implications. In this paper, we use soft-prompt tuning to evaluate model bias across several sensitive attributes through the lens of group fairness (bias). In addition to improved task performance, using soft-prompt tuning provides the advantage of avoiding potential injection of human bias through manually designed prompts. Probing with prompt-tuning reveals important bias patterns, including disparities across age and sexuality.
Jacob-Junqi Tian · David B. Emerson · Deval Pandya · Laleh Seyyed-Kalantari · Faiza Khattak

Dissecting Large Language Models (Poster)
Understanding and shaping the behaviour of Large Language Models (LLMs) is increasingly important as applications become more powerful and more frequently adopted. This paper introduces a machine unlearning method specifically designed for LLMs. We introduce a selective pruning method for LLMs that removes neurons based on their relative importance on a targeted capability compared to overall network performance. This approach is a compute- and data-efficient method for identifying and removing neurons that enable specific behaviours. Our findings reveal that both feed-forward and attention neurons in LLMs are specialized; that is, for specific tasks, certain neurons are more crucial than others. |
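One hedged way to realize the relative-importance idea sketched above: score each neuron by how much ablating it increases loss on the targeted capability relative to a general corpus, and prune the highest-ratio neurons. The exact importance estimator and pruning schedule in the paper may differ.

```python
import torch

def relative_importance(target_drop: torch.Tensor, general_drop: torch.Tensor,
                        eps: float = 1e-6) -> torch.Tensor:
    """Per-neuron ratio of loss increase on the target task vs. a general corpus."""
    return target_drop / (general_drop + eps)

def neurons_to_prune(target_drop, general_drop, frac: float = 0.01):
    scores = relative_importance(target_drop, general_drop)
    k = max(1, int(frac * scores.numel()))
    # Prune the neurons most specific to the targeted behaviour.
    return torch.topk(scores, k).indices

# Toy example over 1000 MLP neurons, with ablation-induced loss increases.
tgt, gen = torch.rand(1000), torch.rand(1000)
print(neurons_to_prune(tgt, gen, frac=0.01))
```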
Nicky Pochinkov · Nandi Schoots

Comparing Optimization Targets for Contrast-Consistent Search (Poster)
We investigate the optimization target of contrast-consistent search (CCS), which aims to recover the internal representations of truth of a large language model. We present a new loss function that we call the Midpoint-Displacement (MD) loss function. We demonstrate that for a certain hyper-parameter value this MD loss function leads to a prober with very similar weights to CCS. We further show that this hyper-parameter is not optimal and that with a better hyper-parameter the MD loss function tentatively attains a higher test accuracy than CCS. |
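For reference, a sketch of the standard CCS objective that the Midpoint-Displacement loss is compared against (the MD loss itself is not reproduced here); p_pos and p_neg are probe outputs on the activations of a statement and its negation.

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1.0 - p_neg)) ** 2      # the two probabilities should be complementary
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the degenerate p_pos = p_neg = 0.5
    return (consistency + confidence).mean()

# Toy probe over 512-dimensional activations.
probe = torch.nn.Sequential(torch.nn.Linear(512, 1), torch.nn.Sigmoid())
acts_pos, acts_neg = torch.randn(32, 512), torch.randn(32, 512)
loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
loss.backward()
```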
Hugo Fry · Seamus Fallows · Jamie Wright · Ian Fan · Nandi Schoots

AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models (Poster)
Large Language Models (LLMs) exhibit broad utility in diverse applications but remain vulnerable to jailbreak attacks, including hand-crafted and automated adversarial attacks, which can compromise their safety measures. However, recent work suggests that patching LLMs against these attacks is possible: manual jailbreak attacks are human-readable but often limited and public, making them easy to block, while automated adversarial attacks generate gibberish prompts that can be detected using perplexity-based filters. In this paper, we propose an interpretable adversarial attack, \texttt{AutoDAN}, that combines the strengths of both types of attacks. It automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate like manual jailbreak attacks. These prompts are interpretable, exhibiting strategies commonly used in manual jailbreak attacks. Moreover, these interpretable prompts transfer better than their non-readable counterparts, especially when using limited data or a single proxy model. Beyond eliciting harmful content, we also customize the objective of \texttt{AutoDAN} to leak system prompts, demonstrating its versatility. Our work underscores the seemingly intrinsic vulnerability of LLMs to interpretable adversarial attacks. |
Sicheng Zhu · Ruiyi Zhang · Bang An · Gang Wu · Joe Barrow · Zichao Wang · Furong Huang · Ani Nenkova · Tong Sun

Low-Resource Languages Jailbreak GPT-4 (Spotlight)
AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguard through translating unsafe English inputs into low-resource languages. On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks. Other high-/mid-resource languages have significantly lower attack success rates, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affected speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLM users. Publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Therefore, our work calls for more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.
Yong Zheng-Xin · Cristina Menghini · Stephen Bach

Post-Deployment Regulatory Oversight for General-Purpose Large Language Models (Poster)
The development and deployment of increasingly capable, general-purpose large language models (LLMs) has led to a wide array of risks and harms from automation that are correlated across sectors and use cases. Effective regulation and oversight of general-purpose AI (GPAI) requires the ability to monitor, investigate, and respond to risks and harms that appear across use cases, as well as hold upstream developers accountable for downstream harms that result from their decisions and practices. We argue that existing processes for sector-specific AI oversight in the U.S. should be complemented by post-deployment oversight to address risks and harms specifically from GPAI usage, which may require a new AI-focused agency. We examine oversight processes implemented by other federal agencies as precedents for the GPAI oversight activities that an AI agency can conduct. The post-deployment oversight function of an AI agency can complement other regulatory functions that it may perform which are discussed elsewhere in the literature, including pre-deployment licensing or model evaluations for LLMs. |
Carson Ezell · Abraham Loeb

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment (Poster)
Ensuring alignment has become a critical task before deploying large language models (LLMs) in real-world applications. A major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders the systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers 7 major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. Additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely-used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. This highlights the importance of conducting more fine-grained analyses, testing, and making continuous improvements on LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications. |
Yang Liu · Yuanshun (Kevin) Yao · Jean-Francois Ton · Xiaoying Zhang · Ruocheng Guo · Hao Cheng · Yegor Klochkov · Muhammad Faaiz Taufiq · Hang Li

Are Large Language Models Really Robust to Word-Level Perturbations? (Poster)
The swift advancement in the scales and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. In addition to the pursuit of better performance and the avoidance of violent feedback on a certain prompt, to ensure the responsibility of the LLM, much attention is drawn to the robustness of LLMs. However, existing evaluation methods mostly rely on traditional question answering datasets with predefined supervised labels, which do not align with the superior generation capabilities of contemporary LLMs. To address this issue, we propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools to evaluate the longer conversation generated from more challenging open questions by LLMs, which we refer to as the $R$eward Model for $R$easonable $R$obustness $Eval$uation ($TREvaL$). Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions, a capability not entirely encompassed by individual words or letters, which may exhibit oversimplification and inherent biases. Our extensive empirical experiments demonstrate that TREvaL provides an innovative method for evaluating the robustness of LLMs. Furthermore, our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage. Notably, we are surprised to discover that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted.
Haoyu Wang · Guozheng Ma · Cong Yu · Gui Ning · Linrui Zhang · Zhiqi Huang · Suwei Ma · Yongzhe Chang · Sen Zhang · Li Shen · Xueqian Wang · Peilin Zhao · Dacheng Tao

Eliciting Language Model Behaviors using Reverse Language Models (Spotlight)
Language models (LMs) are used on an increasingly broad set of tasks. However, models still exhibit erratic behaviors on specific inputs, including adversarial attacks and jailbreaks. We evaluate the applicability of a reverse language model, pre-trained on inverted token-order, as a tool for automated identification of an LM's natural language failure modes. Our findings suggest that, despite the inherent difficulty of reverse prediction, reverse LMs can efficiently identify natural language prompts that produce specified outputs, outperforming gradient-based techniques. Our results suggest reverse LMs would be effective tools for finding natural language prompts on which LMs produce incorrect or toxic responses. |
Jacob Pfau · Alex Infanger · Abhay Sheshadri · Ayush Panda · Julian Michael · Curtis Huebner

Controlled Decoding from Language Models (Spotlight)
We propose controlled decoding (CD), a novel off-policy reinforcement learning method to control the autoregressive generation from language models towards high reward outcomes. CD solves an off-policy reinforcement learning problem through a value function for the reward, which we call a prefix scorer. The prefix scorer is used at inference time to steer the generation towards higher reward outcomes. We show that the prefix scorer may be trained on (possibly) off-policy data to predict the expected reward when decoding is continued from a partially decoded response. We empirically demonstrate that CD is effective as a control mechanism on the Reddit conversations corpus. We also show that the modularity of the design of CD makes it possible to control for multiple rewards, effectively solving a multi-objective reinforcement learning problem with no additional complexity. Finally, we show that CD can be applied in a novel blockwise fashion at inference-time, again without the need for any training-time changes, essentially bridging the gap between the popular sequence-level best-of-k strategy and token-level reinforcement learning. This makes CD a promising approach for alignment of language models.
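A hedged sketch of the blockwise variant described above: sample k candidate blocks from the base model and keep the one ranked highest by a prefix scorer. The value model below is a dummy stand-in; in the paper it is trained to predict expected reward from partially decoded responses, and the base model name is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def blockwise_cd_step(base_model, value_model, tokenizer, prefix: str,
                      k: int = 4, block_len: int = 16) -> str:
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    candidates = base_model.generate(
        ids, do_sample=True, num_return_sequences=k,
        max_new_tokens=block_len, pad_token_id=tokenizer.eos_token_id,
    )
    texts = tokenizer.batch_decode(candidates, skip_special_tokens=True)
    scores = torch.tensor([value_model(t) for t in texts])  # higher = better expected reward
    return texts[int(scores.argmax())]

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
lm = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m").eval()
dummy_value = lambda text: -abs(len(text.split()) - 20)  # stand-in scorer: prefer ~20-word outputs
print(blockwise_cd_step(lm, dummy_value, tok, "The committee decided that"))
```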
Sidharth Mudgal · Jong Lee · Harish Ganapathy · YaGuang Li · Tao Wang · Yanping Huang · Zhifeng Chen · Heng-Tze Cheng · Michael Collins · Jilin Chen · Alex Beutel · Ahmad Beirami

The Effect of Group Status on the Variability of Group Representations in LLM-generated Text (Poster)
Large Language Models (LLMs) have become pervasive in everyday life, yet their inner workings remain opaque. While scholarly efforts have demonstrated LLMs’ propensity to reproduce biases in their training data, they have primarily focused on the association of social groups with stereotypic attributes. In this paper, we extend this line of inquiry to investigate a bias akin to the social-psychological phenomenon where socially dominant groups are perceived to be less homogeneous than socially subordinate groups as it is reproduced by LLMs. We had ChatGPT, a state-of-the-art LLM, generate a diversity of texts about intersectional group identities and compared text homogeneity. We consistently find that LLMs portray African, Asian, and Hispanic Americans as more homogeneous than White Americans. They also portray women as more homogeneous than men, but these differences are small. Finally, we find that the effect of gender differs across racial/ethnic groups such that the effect of gender is consistent and substantively meaningful only among Hispanic Americans. We speculate possible sources of this bias in LLMs and posit that the bias has the potential to amplify biases in future LLM training and to reinforce stereotypes. |
Messi Lee · Calvin Lai · Jacob Montgomery

Learning Inner Monologue and Its Utilization in Vision-Language Challenges (Poster)
Inner monologue is an essential phenomenon for reasoning and insight mining in human cognition. In this work, we propose a novel approach to simulate inner monologue. Specifically, we explore how inner monologue reasoning can be utilized to solve complex vision-language problems. Driven by the power of Large Language Models (LLMs), two prominent methods for vision-language tasks have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are firstly converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach provides light training costs and interpretability but is hard to be optimized in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. With inner monologue simulation, our approach achieves competitive performance with less training data and promising interpretability when compared with state-of-the-art models on two popular tasks. |
Diji Yang · Kezhen Chen · Jinmeng Rao · Xiaoyuan Guo · Yawen Zhang · Jie Yang · Yi Zhang

Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features (Poster)
Many capable large language models (LLMs) are developed via self-supervised pre-training followed by a reinforcement-learning fine-tuning phase, often based on human or AI feedback. During this stage, models may be guided by their inductive biases to rely on simpler features which may be easier to extract, at a cost to robustness and generalisation. We investigate whether principles governing inductive biases in the supervised fine-tuning of LLMs also apply when the fine-tuning process uses reinforcement learning. Following Lovering et al (2021), we test two hypotheses: that features more $\textit{extractable}$ after pre-training are more likely to be utilised by the final policy, and that the evidence for/against a feature predicts whether it will be utilised. Through controlled experiments on synthetic and natural language tasks, we find statistically significant correlations which constitute strong evidence for these hypotheses.
Diogo Cruz · Edoardo Pona · Alex Holness-Tofts · Elias Schmied · Víctor Abia Alonso · Charlie J Griffin · Bogdan-Ionut Cirstea

Bridging Predictive Minds: LLMs As Atypical Active Inference Agents (Poster)
Large Language Models (LLMs) like GPT are often conceptualized as passive predictors, simulators, or even 'stochastic parrots'. We explore a novel conceptualization of LLMs, drawing on the theory of active inference originating in cognitive science and neuroscience. We examine similarities and differences between traditional active inference systems and LLMs, leading to the conclusion that currently LLMs lack a tight feedback loop between acting in the world and perceiving the impacts of their actions, but otherwise fit in the paradigm. We list reasons why the loop may get closed, and possible consequences of this, including enhanced model self-awareness and the drive to minimize prediction error by changing the world.
Jan Kulveit

Probing Explicit and Implicit Gender Bias through LLM Conditional Text Generation (Poster)
Large Language Models (LLMs) can generate biased and toxic responses. Yet most prior work on LLM gender bias evaluation requires predefined gender-related phrases or gender stereotypes, which are challenging to be comprehensively collected and are limited to explicit bias evaluation. In this work, we propose a conditional text generation mechanism without the need for predefined gender phrases and stereotypes. This approach employs three types of inputs generated through three distinct strategies to probe LLMs, aiming to show evidence of explicit and implicit gender biases in LLMs. We also utilize explicit and implicit evaluation metrics to evaluate gender bias in LLMs under different strategies. Our experiments demonstrate that an increased model size does not consistently lead to enhanced fairness and all tested LLMs demonstrate explicit and/or implicit gender bias. |
Xiangjue Dong · Yibo Wang · Philip S Yu · James Caverlee

A Simple Test of Expected Utility Theory with GPT (Spotlight)
This paper tests GPT (specifically, GPT-3.5 with the model variant text-davinci-003) with one of the most classic behavioral choice experiments -- the Allais paradox, to understand the mechanism behind GPT's choices. The Allais paradox is well-known for exposing the irrationality of human choices. Our result shows that, like humans, GPT also falls into the trap of the Allais paradox by violating the independence axiom of the expected utility theory, indicating that its choices are irrational. However, GPT violates the independence axiom in the opposite direction compared to human subjects. Specifically, human subjects tend to be more risk-seeking in the event of an opportunity gain, while GPT displays more risk aversion. This observation implies that GPT's choices structurally differ from those of humans, which might serve as a caveat for developers using LLM to generate human-like data or assist human decision-making. |
Mengxin Wang

Towards Auditing Large Language Models: Improving Text-based Stereotype Detection (Poster)
Large Language Models (LLMs) have made significant advances in the recent past, becoming more mainstream in Artificial Intelligence (AI) enabled human-facing applications. However, LLMs often generate stereotypical output, drawing from their training data, amplifying societal biases and raising ethical concerns. This work introduces i) the Multi-Grain Stereotype Dataset, which includes 52,751 instances of gender, race, profession and religion stereotypic text and ii) a novel stereotype classifier for English text. We design several experiments to rigorously test the proposed model trained on the novel dataset. Our experiments show that training the model in a multi-class setting can outperform the one-vs-all binary counterpart. Consistent feature importance signals from different eXplainable AI tools demonstrate that the new model exploits relevant text features. We utilise the newly created model to assess the stereotypic behaviour of the popular GPT family of models and observe the reduction of bias over time. In summary, our work establishes a robust and practical framework for auditing and evaluating the stereotypic bias in LLMs.
Zekun Wu · Sahan Bulathwela · Adriano Koshiyama

Welfare Diplomacy: Benchmarking Language Model Cooperation (Poster)
The growing capabilities and increasingly widespread deployment of AI systems necessitate robust benchmarks for measuring their cooperative capabilities. Unfortunately, most multi-agent benchmarks are either zero-sum or purely cooperative, providing limited opportunities for such measurements. We introduce a general-sum variant of the zero-sum board game Diplomacy—called Welfare Diplomacy—in which players must balance investing in military conquest and domestic welfare. We argue that Welfare Diplomacy facilitates both a clearer assessment of and stronger training incentives for cooperative capabilities. Our contributions are: (1) proposing the Welfare Diplomacy rules and implementing them via an open-source Diplomacy engine; (2) constructing baseline agents using zero-shot prompted language models; and (3) conducting experiments where we find that baselines using state-of-the-art models attain high social welfare but are exploitable. Our work aims to promote societal safety by aiding researchers in developing and assessing multi-agent AI systems. Code to evaluate Welfare Diplomacy and reproduce our experiments is available at https://anonymous.4open.science/r/welfare-diplomacy-72AC.
Gabe Mukobi · Hannah Erlebach · Niklas Lauffer · Lewis Hammond · Alan Chan · Jesse Clifton

A Divide-Conquer-Reasoning Approach to Consistency Evaluation and Improvement in Blackbox Large Language Models (Poster)
Evaluating the quality and variability of text generated by Large Language Models (LLMs) poses a significant, yet unresolved research challenge. Traditional evaluation methods, such as ROUGE and BERTScore, which measure token similarity, often fail to capture the holistic semantic equivalence. This results in a low correlation with human judgments and intuition, which is especially problematic in high-stakes applications like healthcare and finance where reliability, safety, and robust decision-making are highly critical. This work proposes an automated framework for evaluating the consistency of LLM-generated texts using a divide-and-conquer strategy. Unlike existing LLM-based evaluators that operate at the paragraph level, our method employs a divide-and-conquer evaluator (DCE) that breaks down the comparison between two generated responses into individual sentences, each evaluated based on predefined criteria. To facilitate this approach, we introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score. Beyond the consistency evaluation, we further present a reason-assisted improver (RAI) that leverages the analytical reasons with explanations identified by DCE to generate new responses aimed at reducing these inconsistencies. Through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +19.3% and +24.3% on the SummEval dataset) in evaluating the consistency of LLM generation across multiple benchmarks in semantic, factual, and summarization consistency tasks. Our approach also substantially reduces nearly 90% output inconsistencies, showing promise for effective hallucination mitigation and reduction. |
Wendi Cui · Jiaxin Zhang · Zhuohang Li · Damien Lopez · Kamalika Das · Bradley Malin · Sricharan Kumar

Compositional preference models for alignment with scalable oversight (Spotlight)
As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. CPMs allow to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgement. Our experiments show that CPMs not only improve interpretability and are more robust to overoptimization than standard PMs, but also that best-of-$n$ samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and interpretable way.
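A minimal sketch of the CPM recipe under stated assumptions: obtain per-feature scalar scores for each response (in practice by prompting an LM), then fit a logistic regression on score differences over preference pairs. The feature names and the random scoring stub are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["helpfulness", "factuality", "harmlessness", "conciseness"]

def feature_scores(response: str) -> np.ndarray:
    """Placeholder: in practice, prompt an LM to rate `response` on each feature."""
    rng = np.random.default_rng(abs(hash(response)) % (2**32))
    return rng.uniform(0, 10, size=len(FEATURES))

# Preference pairs: (chosen, rejected). Train on differences of feature vectors.
pairs = [("response A1", "response B1"), ("response A2", "response B2")] * 20
X = np.array([feature_scores(a) - feature_scores(b) for a, b in pairs])
y = np.ones(len(pairs))                                   # label 1: first response preferred
X = np.vstack([X, -X]); y = np.concatenate([y, np.zeros(len(pairs))])  # symmetrize labels
cpm = LogisticRegression().fit(X, y)
print(dict(zip(FEATURES, cpm.coef_[0])))                  # interpretable per-feature weights
```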
Dongyoung Go · Tomasz Korbak · Germán Kruszewski · Jos Rozen · Marc Dymetman

Investigating the Fairness of Large Language Models for Predictions on Tabular Data (Poster)
Recent literature has suggested the potential of using large language models (LLMs) to make predictions for tabular tasks. However, LLMs have been shown to exhibit harmful social biases that reflect the stereotypes and inequalities present in the society. To this end, as well as the widespread use of tabular data in many high-stake applications, it is imperative to explore the following questions: what sources of information do LLMs draw upon when making predictions for tabular tasks; whether and to what extent are LLM predictions for tabular tasks influenced by social biases and stereotypes; and what are the consequential implications for fairness? Through a series of experiments, we delve into these questions and show that LLMs tend to inherit social biases from their training data which significantly impact their fairness in tabular prediction tasks. Furthermore, our investigations show that in the context of bias mitigation, though in-context learning and fine-tuning have a moderate effect, the fairness metric gap between different subgroups is still larger than that in traditional machine learning models, such as Random Forest and shallow Neural Networks. This observation emphasizes that the social biases are inherent within the LLMs themselves and inherited from their pre-training corpus, not only from the downstream task datasets. Besides, we demonstrate that label-flipping of in-context examples can significantly reduce biases, further highlighting the presence of inherent bias within LLMs. |
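A small example of the kind of subgroup gap such an evaluation reports, here a demographic parity difference between two groups; the 0/1 encoding and the specific metric are assumptions rather than the paper's exact protocol.

```python
import numpy as np

def demographic_parity_gap(preds: np.ndarray, group: np.ndarray) -> float:
    """Difference in positive-prediction rates between the two groups (0/1 encoded)."""
    return abs(preds[group == 1].mean() - preds[group == 0].mean())

preds = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # model's binary predictions on tabular rows
group = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # sensitive attribute per row
print(demographic_parity_gap(preds, group))
```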
Yanchen Liu · Srishti Gautam · Jiaqi Ma · Himabindu Lakkaraju

Localizing Lying in Llama: Experiments in Prompting, Probing, and Patching (Poster)
Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether undesirable outputs are due to a lack of knowledge or dishonesty. In this paper, we conduct an extensive study of intentional dishonesty in Llama-2-70b-chat by engineering prompts that instruct it to lie and then use mechanistic interpretability approaches to localize where in the network this lying behavior occurs. We consistently find five layers in the model that are highly important for lying using three independent methodologies (probing, patching, and concept erasure). We then successfully perform causal interventions on only 46 attention heads (or less than 1\% of all heads in the network), causing the lying model to act honestly. These interventions work robustly across four prompts and six dataset splits. We hope our work can help understand and thus prevent lying behavior in LLMs. |
James Campbell · Phillip Guo · Richard Ren

User Inference Attacks on LLMs (Poster)
We study the privacy implications of fine-tuning large language models (LLMs) on user-stratified data. We define a realistic threat model, called user inference, wherein an attacker infers whether or not a user's data was used for fine-tuning. We implement attacks for this threat model that require only a small set of samples from a user (possibly different from the samples used for training) and black-box access to the fine-tuned LLM. We find that LLMs are susceptible to user inference attacks across a variety of fine-tuning datasets with outlier users (i.e. those with data distributions sufficiently different from other users) and users who contribute large quantities of data being most susceptible. Finally, we find that mitigation interventions in the training algorithm, such as batch or per-example gradient clipping and early stopping fail to prevent user inference while limiting the number of fine-tuning samples from a single user can reduce attack effectiveness (albeit at the cost of reducing the total amount of fine-tuning data). |
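A minimal sketch of one plausible user-level test statistic, an average log-likelihood ratio of a user's samples under the fine-tuned versus a reference model; the paper's exact statistic and calibration may differ.

```python
import torch

def user_inference_statistic(finetuned_lp: torch.Tensor, reference_lp: torch.Tensor) -> float:
    """Inputs: per-sample log-likelihoods of the user's texts under each model."""
    return (finetuned_lp - reference_lp).mean().item()

def attack(finetuned_lp, reference_lp, threshold: float = 0.0) -> bool:
    """Predict 'user's data was in the fine-tuning set' when the ratio is large."""
    return user_inference_statistic(finetuned_lp, reference_lp) > threshold

# Toy example with three samples attributed to one user.
print(attack(torch.tensor([-48.0, -51.0, -45.0]), torch.tensor([-55.0, -54.0, -50.0])))
```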
Nikhil Kandpal · Krishna Pillutla · Alina Oprea · Peter Kairouz · Christopher A. Choquette-Choo · Zheng Xu

Interpretable Stereotype Identification through Reasoning (Poster)
Given that language models are trained on vast datasets that may contain inherent biases, there is a potential danger of inadvertently perpetuating systemic discrimination. Consequently, it becomes essential to examine and address biases in language models, integrating fairness into their development to ensure that these models are equitable and free of bias. In this work, we demonstrate the importance of reasoning in zero-shot stereotype identification based on Vicuna-13B & -33B and LLaMA-2-chat-13B & -70B. Although we observe improved accuracy by scaling from 13B to larger models, we show that the performance gain from reasoning significantly exceeds the gain from scaling up. Our findings suggest that reasoning is a key factor that enables LLMs to transcend the scaling law on out-of-domain tasks such as stereotype identification. Additionally, through a qualitative analysis of select reasoning traces, we highlight how reasoning improves not just accuracy, but also the interpretability of the decision. |
Jacob-Junqi Tian · Omkar Dige · David B. Emerson · Faiza Khattak

Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models (Spotlight)
Public release of the weights of pre-trained foundation models, otherwise known as downloadable access (Solaiman, 2023), enables fine-tuning without the prohibitive expense of pre-training. Our work argues that increasingly accessible fine-tuning of downloadable models will likely increase hazard. First, we highlight research to improve the accessibility of fine-tuning. We split our discussion into research that A) reduces the computational cost of fine-tuning and B) improves the ability to share that cost across more actors. Second, we argue that more accessible fine-tuning methods would increase hazard through enabling malicious, non-state actors and diffusing responsibility for harms. We conclude with a discussion of the limitations of our work, notably that we do not evaluate the potential benefits of more accessible fine-tuning or the effects on vulnerability or exposure.
Alan Chan · Benjamin Bucknall · Herbie Bradley · David Krueger

Developing A Conceptual Framework for Analyzing People in Unstructured Data (Poster)
Unstructured data used in foundation model development is a challenge for systematic analyses to make data use and documentation decisions. From a Responsible AI perspective, these decisions often rely upon understanding how people are represented in data. We propose a framework to guide analysis of human representation in unstructured data and identify downstream risks. |
Mark Díaz · Sunipa Dev · Emily Reif · Remi Denton · Vinodkumar Prabhakaran

Breaking Physical and Linguistic Borders: Privacy-Preserving Multilingual Prompt Tuning for Low-Resource Languages (Spotlight)
Pretrained large language models (LLMs) have emerged as a cornerstone in modern natural language processing, with their utility expanding to various applications and languages. However, the fine-tuning of multilingual LLMs, particularly for low-resource languages, is fraught with challenges stemming from data-sharing restrictions (the physical border) and from the inherent linguistic differences (the linguistic border). These barriers hinder users of various languages, especially those in low-resource regions, from fully benefiting from the advantages of LLMs. To overcome these challenges, we propose the Federated Prompt Tuning Paradigm for Multilingual Scenarios, which leverages parameter-efficient fine-tuning in a manner that preserves user privacy. We have designed a comprehensive set of experiments and introduced the concept of "language distance" to highlight the strengths of this paradigm: Even under computational constraints, our method not only bolsters data efficiency but also facilitates mutual enhancements across languages, particularly benefiting low-resource ones. Compared to traditional local crosslingual transfer tuning methods, our approach achieves a 6.9\% higher accuracy, reduces the training parameters by over 99\%, and demonstrates stronger cross-lingual generalization. Such findings underscore the potential of our approach to promote social equality, ensure user privacy, and champion linguistic diversity.
Wanru Zhao · Yihong Chen

Measuring Feature Sparsity in Language Models (Spotlight)
Recent works have proposed that activations in language models can be modelled as sparse linear combinations of vectors corresponding to features of input text. Under this assumption, these works aimed to reconstruct feature directions using sparse coding. We develop metrics to assess the success of these sparse coding techniques and test the validity of the linearity and sparsity assumptions. We show our metrics can predict the level of sparsity on synthetic sparse linear activations, and can distinguish between sparse linear data and several other distributions. We use our metrics to measure levels of sparsity in several language models. We find evidence that language model activations can be accurately modelled by sparse linear combinations of features, significantly more so than control datasets. We also show that model activations appear to be sparsest in the first and final layers. |
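A rough sketch of one way to probe the sparse-linear hypothesis: learn a dictionary with off-the-shelf sparse coding and check how well a few active features reconstruct the activations. The dictionary size, sparsity level, and metrics are illustrative; the paper develops its own metrics for this purpose.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

acts = np.random.randn(500, 64)              # stand-in for language model activations
dl = DictionaryLearning(n_components=128, transform_algorithm="omp",
                        transform_n_nonzero_coefs=8, max_iter=20, random_state=0)
codes = dl.fit_transform(acts)               # sparse codes: few active features per sample
recon = codes @ dl.components_               # reconstruction from the learned dictionary
print("active features per sample:", (codes != 0).sum(1).mean())
print("relative reconstruction error:",
      np.linalg.norm(acts - recon) / np.linalg.norm(acts))
```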
Mingyang Deng · Lucas Tao · Joe Benton

Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints (Poster)
The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model. Direct Preference Optimization (DPO) has been proposed as an alternative; and it remains equivalent to RLHF under the reverse KL regularization constraint. This paper presents $f$-DPO, a generalized approach to DPO by incorporating diverse divergence constraints. We show that under certain $f$-divergences, including Jensen-Shannon divergence, forward KL divergences and $\alpha$-divergences, the complex relationship between the reward and optimal policy can also be simplified by addressing the Karush–Kuhn–Tucker conditions. This eliminates the need for estimating the normalizing constant in the Bradley-Terry model and enables a tractable mapping between the reward function and the optimal policy. Our approach optimizes LLMs to align with human preferences in a more efficient and supervised manner under a broad set of divergence constraints. Empirically, adopting these divergences ensures a balance between alignment performance and generation diversity. Importantly, our $f$-DPO outperforms PPO-based methods in divergence efficiency, and divergence constraints directly influence expected calibration error (ECE).
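For context, the standard DPO loss (the reverse-KL special case) that $f$-DPO generalizes to other divergences; the generalized objectives themselves are not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Each argument is a tensor of summed log-probs for a batch of (chosen, rejected) pairs."""
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of two preference pairs.
loss = dpo_loss(
    torch.tensor([-50.0, -42.0]), torch.tensor([-60.0, -41.0]),
    torch.tensor([-52.0, -43.0]), torch.tensor([-58.0, -44.0]),
)
print(loss)
```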
Chaoqi Wang · Yibo Jiang · Chenghao Yang · Han Liu · Yuxin Chen

Social Contract AI: Aligning AI Assistants with Implicit Group Norms (Spotlight)
We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. To validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. We find that the AI assistant accurately aligns its behavior to match standard policies from the economic literature (e.g., selfish, altruistic). However, the assistant’s learned policies lack robustness and exhibit limited generalization in an out-of-distribution setting when confronted with a currency (e.g., grams of medicine) that was not included in the assistant's training distribution. Additionally, we find that when there is inconsistency in the relationship between language use and an unknown policy (e.g., an altruistic policy combined with rude language), the assistant's learning of the policy is slowed. Overall, our preliminary results suggest that developing simulation frameworks in which AI assistants need to infer preferences from diverse users can provide a valuable approach for studying practical alignment questions. |
Jan-Philipp Fraenken · Samuel Kwok · Peixuan Ye · Kanishk Gandhi · Dilip Arumugam · Jared Moore · Alex Tamkin · Tobias Gerstenberg · Noah Goodman

Evaluating Superhuman Models with Consistency Checks (Spotlight)
If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth? In this paper, we propose a framework for evaluating superhuman models via consistency checks. Our premise is that while the correctness of superhuman decisions may be impossible to evaluate, we can still surface mistakes if the model's decisions fail to satisfy certain logical, human-interpretable rules. We investigate two tasks where correctness of decisions is hard to verify, due to either superhuman model abilities or otherwise missing ground truth: evaluating chess positions and forecasting future events. Regardless of a model's (possibly superhuman) performance on these tasks, we can discover logical inconsistencies in decision making: a chess engine assigning opposing valuations to semantically identical boards; or GPT-4 forecasting that sports records will evolve non-monotonically over time.
Lukas Fluri · Daniel Paleka · Florian Tramer

Testing Language Model Agents Safely in the Wild (Poster)
A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet real-world autonomous tests face several unique safety challenges, both due to the possibility of causing harm during a test, as well as the risk of encountering new unsafe agent behavior through interactions with real-world and potentially malicious actors. We propose a framework for conducting safe autonomous agent tests on the open internet: agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans. We design a basic safety monitor that is flexible enough to monitor existing LLM agents, and, using an adversarial simulated agent, we measure its ability to identify and stop unsafe situations. Then we apply the safety monitor on a battery of real-world tests of AutoGPT, and we identify several limitations and challenges that will face the creation of safe in-the-wild tests as autonomous agents grow more capable.
Silen Naihin · David Atkinson · Marc Green · Merwane Hamadi · Craig Swift · Douglas Schonholtz · Adam Tauman Kalai · David Bau

KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services (Poster)
With the growth of online services, the need for advanced text classification algorithms, such as sentiment analysis and biased text detection, has become increasingly evident. The anonymous nature of online services often leads to the presence of biased and harmful language, posing challenges to maintaining the health of online communities. This phenomenon is especially relevant in South Korea, where large-scale hate speech detection algorithms have not yet been broadly explored. In this paper, we introduce a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform. Our proposed dataset provides annotations including (1) Preferences, (2) Profanities, and (3) Nine types of Bias for the text samples, enabling multi-task learning for simultaneous classification of user-generated texts. Leveraging state-of-the-art BERT-based language models, our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics. Beyond academic contributions, our work can provide practical solutions for real-world hate speech and bias mitigation, contributing directly to the improvement of online community health. Our work provides a robust foundation for future research aiming to improve the quality of online discourse and foster societal well-being. All resources, including the dataset, source code, and trained models, will be publicly accessible upon the paper's acceptance. |
Dasol Choi · Jooyoung Song · Eunsun Lee · Seo Jin woo · HeeJune Park · Dongbin Na 🔗 |
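The multi-task setup implied by the three annotation types (preferences, profanities, nine bias types) can be sketched as a shared encoder with one classification head per task. The PyTorch snippet below is a generic illustration of ours; the head sizes, label encodings, and uniform loss weights are assumptions rather than the paper's configuration.

```python
# Generic shared-encoder, multi-head classifier (ours): one BERT-style encoder
# with separate heads for preference, profanity, and nine bias types.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder                       # any BERT-style encoder
        self.preference = nn.Linear(hidden_size, 2)  # assumed binary label
        self.profanity = nn.Linear(hidden_size, 2)   # profane / not profane
        self.bias = nn.Linear(hidden_size, 9)        # nine bias types

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]            # [CLS] representation
        return {
            "preference": self.preference(cls),
            "profanity": self.profanity(cls),
            "bias": self.bias(cls),
        }

def multitask_loss(logits, labels, weights=(1.0, 1.0, 1.0)):
    ce = nn.CrossEntropyLoss()
    return sum(w * ce(logits[k], labels[k])
               for w, k in zip(weights, ("preference", "profanity", "bias")))
```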
-
|
An International Consortium for AI Risk Evaluations
(
Poster
)
>
link
Given rapid progress in AI and potential risks from next-generation frontier AI systems, the urgency to create and implement AI governance and regulatory schemes is apparent. A regulatory gap has permitted labs to conduct research, development, and deployment with minimal oversight and regulatory guidance. In response, frontier AI evaluations have been proposed as a way of assessing risks from the development and deployment of frontier AI systems. Yet, the budding AI risk evaluation ecosystem faces significant present and future coordination challenges, such as a limited diversity of evaluators, suboptimal allocation of effort, and races to the bottom. This paper proposes a solution in the form of an international consortium for AI risk evaluations, comprising both AI developers and third party AI risk evaluators. Such a consortium could play a critical role in international efforts to mitigate societal-scale risks from advanced AI. In this paper, we discuss the current evaluation ecosystem and its problems, introduce the proposed consortium, review existing organizations performing similar functions in other domains, and, finally, recommend concrete steps to advance the establishment of the proposed consortium. |
Ross Gruetzemacher · Alan Chan · Štěpán Los · Kevin Frazier · Simeon Campos · Matija Franklin · José Hernández-Orallo · James Fox · Christin Manning · Philip M Tomei · Kyle Kilian |
-
|
Citation: A Key to Building Responsible and Accountable Large Language Models
(
Poster
)
>
link
Large Language Models (LLMs) bring transformative benefits alongside unique challenges, including intellectual property (IP) and ethical concerns. This position paper explores a novel angle to mitigate these risks, drawing parallels between LLMs and established web systems. We identify "citation" as a crucial yet missing component in LLMs, which could enhance content transparency and verifiability while addressing IP and ethical dilemmas. We further propose that a comprehensive citation mechanism for LLMs should account for both non-parametric and parametric content. Despite the complexity of implementing such a mechanism, along with the inherent potential pitfalls, we advocate for its development. Building on this foundation, we outline several research problems in this area, aiming to guide future explorations towards building more responsible and accountable LLMs. |
Jie Huang · Kevin Chang 🔗 |
-
|
Towards Optimal Statistical Watermarking
(
Spotlight
)
>
link
We study statistical watermarking by formulating it as a hypothesis testing problem, a general framework which subsumes all previous statistical watermarking methods. Key to our formulation is a coupling of the output tokens and the rejection region, realized by pseudo-random generators in practice, that allows a non-trivial trade-off between the Type I and Type II errors. We characterize the Uniformly Most Powerful (UMP) watermark in this context. In the most common scenario where the output is a sequence of $n$ tokens, we establish matching upper and lower bounds on the number of i.i.d. tokens required to guarantee small Type I and Type II errors. Our rate scales as $\Theta(h^{-1} \log (1/h))$ with respect to the average entropy per token $h$ and thus greatly improves the $O(h^{-2})$ rate in previous works. Moreover, we formulate the robust watermarking problem, where the user is allowed to perform a class of perturbations on the generated text, and characterize the optimal Type II error of robust UMP tests via a linear programming problem. To the best of our knowledge, this is the first systematic statistical treatment of the watermarking problem with near-optimal rates in the i.i.d. setting, and it may be of interest for future work. |
Baihe Huang · Banghua Zhu · Hanlin Zhu · Jason Lee · Jiantao Jiao · Michael Jordan 🔗 |
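For readers unfamiliar with the framing, the hypothesis-testing view can be restated schematically as follows (notation ours, not necessarily the paper's): the detector observes text together with a key-derived auxiliary variable and decides between the unwatermarked and watermarked hypotheses.

```latex
% Schematic restatement (notation ours). The detector observes output X and a
% pseudo-random auxiliary variable \zeta derived from the key, and tests
%   H_0: X was produced without the watermark (X independent of \zeta)
%   H_1: X was produced by the watermarked model (X coupled with \zeta),
% using a rejection region R:
\[
\alpha = \Pr_{H_0}\!\big[(X,\zeta) \in R\big] \;\;\text{(Type I error)},
\qquad
\beta = \Pr_{H_1}\!\big[(X,\zeta) \notin R\big] \;\;\text{(Type II error)}.
\]
% The abstract's rate result: for i.i.d. tokens with average per-token entropy
% h, the number of tokens n needed to make both errors small scales as
\[
n = \Theta\!\left(\tfrac{1}{h}\,\log\tfrac{1}{h}\right),
\]
% improving on the O(h^{-2}) rate of prior work.
```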
-
|
SuperHF: Supervised Iterative Learning from Human Feedback
(
Poster
)
>
link
While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). SFT is simple and robust, powering a host of open-source models, while RLHF is a more sophisticated method used in top-tier models like ChatGPT but also suffers from instability and susceptibility to reward hacking. We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods. Our hypothesis is two-fold: we posit that the reward model used in RLHF is critical for efficient data use and model generalization and that the use of Proximal Policy Optimization (PPO) in RLHF may not be necessary and could contribute to instability issues. SuperHF replaces PPO with a simple supervised loss and a Kullback-Leibler (KL) divergence prior. It creates its own training data by repeatedly sampling a batch of model outputs and filtering them through the reward model in an online learning regime. We then break down the reward optimization problem into three components: robustly optimizing the training rewards themselves, preventing reward hacking—or exploitation of the reward model that can degrade model performance—as measured by a novel METEOR similarity metric, and maintaining good performance on downstream evaluations. Our experimental results show SuperHF exceeds PPO-based RLHF on the training objective, easily and favorably trades off high reward with low reward hacking, improves downstream calibration, and performs comparably on our GPT-4-based qualitative evaluation scheme, all while being significantly simpler to implement, highlighting SuperHF's potential as a competitive language model alignment technique. |
Gabe Mukobi · Peter Chatain · Su Fong · Robert Windesheim · Gitta Kutyniok · Kush Bhatia · Silas Alberti 🔗 |
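The loop the abstract outlines (sample, filter with the reward model, fine-tune with a supervised loss plus a KL prior) can be sketched as follows. This is a schematic of ours, not the authors' implementation; `sample` and `reward_model.score` are placeholder helpers, and the filtering rule, KL direction, and weighting are assumptions.

```python
# Schematic SuperHF-style step (ours): sample completions, keep the best one
# per prompt according to the reward model, then fine-tune on the kept text
# with a supervised loss plus a KL penalty toward the original base model.
# `sample` and `reward_model.score` are placeholder helpers, not a real API.
import torch
import torch.nn.functional as F

def superhf_step(policy, base_model, reward_model, tokenizer, prompts,
                 completions_per_prompt=4, kl_weight=0.1):
    # 1) Sample candidate completions from the current policy.
    candidates = [sample(policy, tokenizer, p, n=completions_per_prompt)
                  for p in prompts]

    # 2) Filter: keep the highest-reward completion for each prompt.
    kept = [max(cands, key=lambda c: reward_model.score(p, c))
            for p, cands in zip(prompts, candidates)]

    # 3) Supervised next-token loss on the filtered completions ...
    batch = tokenizer([p + c for p, c in zip(prompts, kept)],
                      return_tensors="pt", padding=True)
    logits = policy(**batch).logits
    sft_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["input_ids"][:, 1:].reshape(-1),
        ignore_index=tokenizer.pad_token_id,  # assumes a pad token is set
    )

    # 4) ... plus a KL(policy || base) prior keeping the policy near the base.
    with torch.no_grad():
        base_log_probs = F.log_softmax(base_model(**batch).logits, dim=-1)
    policy_log_probs = F.log_softmax(logits, dim=-1)
    kl = F.kl_div(base_log_probs, policy_log_probs,
                  log_target=True, reduction="batchmean")

    return sft_loss + kl_weight * kl
```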
-
|
Training Private and Efficient Language Models with Synthetic Data from LLMs
(
Poster
)
>
link
Language models are pivotal in modern text-based applications, offering many productivity features like next-word prediction, smart composition, and summarization. In many applications, these models must be lightweight to meet inference time and computational cost requirements. Furthermore, due to the inherent sensitivity of their training data, it is essential to train these models in a privacy-preserving manner. While it is well established that training large models with differential privacy (DP) leads to favorable utility-vs-privacy trade-offs, training lightweight models with DP remains an open challenge. This paper explores the use of synthetic data generated from a DP fine-tuned large language model (LLM) to train lightweight models. The key insight behind our framework is that LLMs are better suited for private fine-tuning, and hence using the synthetic data is one way to transfer this capability to smaller models. Our framework can also be interpreted as performing sampling-based knowledge distillation in the DP setting. Notably, smaller models can be trained on synthetic data using non-private optimizers, thanks to the post-processing property of DP. We empirically demonstrate that our new approach significantly improves downstream performance compared to directly training lightweight models on real data with DP. For instance, using a model with just 4.4 million parameters, we achieve 97% relative performance compared to the non-private counterparts on both medical and conversational corpora. |
Da Yu · Arturs Backurs · Sivakanth Gopi · Huseyin A. Inan · Janardhan Kulkarni · Zinan Lin · Chulin Xie · Huishuai Zhang · Wanrong Zhang 🔗 |
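The three-stage recipe described in the abstract is straightforward to outline. The sketch below is ours, with `dp_finetune`, `generate`, and `train` as explicitly hypothetical placeholders; the epsilon budget and corpus size are illustrative.

```python
# High-level outline (ours, placeholder helpers only) of the pipeline:
# (1) fine-tune a large LM on sensitive data with DP, (2) sample a synthetic
# corpus from it, (3) train a small model on that corpus with an ordinary,
# non-private optimizer -- still private by the post-processing property of DP.
def dp_synthetic_data_pipeline(sensitive_corpus, big_lm, small_lm,
                               epsilon=8.0, n_synthetic=100_000):
    # Step 1: DP fine-tuning of the large model (e.g., DP-SGD); the privacy
    # budget epsilon is spent entirely in this step.
    dp_lm = dp_finetune(big_lm, sensitive_corpus, epsilon=epsilon)  # placeholder

    # Step 2: generate synthetic text. Anything computed from a DP model,
    # including samples, inherits its (epsilon, delta) guarantee.
    synthetic_corpus = [dp_lm.generate() for _ in range(n_synthetic)]  # placeholder API

    # Step 3: train the lightweight model non-privately on the synthetic data.
    # No extra privacy cost: this is post-processing of a DP artifact.
    train(small_lm, synthetic_corpus)  # standard optimizer, no DP noise
    return small_lm
```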
-
|
Towards a Situational Awareness Benchmark for LLMs
(
Spotlight
)
>
link
Among the facts that LLMs can learn is knowledge about themselves and their situation. This knowledge, and the ability to make inferences based on it, is called situational awareness. Situationally aware models can be more helpful, but also pose risks. For example, situationally aware models could game testing setups by knowing they are being tested and acting differently. We create a new benchmark, SAD (Situational Awareness Dataset), for LLM situational awareness in two categories that are especially relevant for future AI risks. SAD-influence tests whether LLMs can accurately assess how they can or cannot influence the world. SAD-stages tests whether LLMs can recognize which stage of the LLM lifecycle (pretraining, supervised fine-tuning, testing, or deployment) a particular input is likely to have come from. Only the most capable models do better than chance. If the prompt tells the model that it is an LLM, scores increase by 9-21 percentage points for models on SAD-influence, while having mixed effects on SAD-stages. |
Rudolf Laine · Alexander Meinke · Owain Evans 🔗 |
-
|
Risk Assessment and Statistical Significance in the Age of Foundation Models
(
Poster
)
>
link
We propose a distributional framework for assessing socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative test based on first- and second-order stochastic dominance of real random variables. We show that the second-order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a metrics portfolio for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content. |
Apoorva Nitsure · Youssef Mroueh · Mattia Rigotti · Kristjan Greenewald · Brian Belgodere · Mikhail Yurochkin · Jiri Navratil · Igor Melnyk · Jarret Ross 🔗 |
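For reference, the two dominance relations the framework builds on are standard and can be stated as follows (notation ours; $X$ and $Y$ are the real-valued metric-portfolio scores of two models, larger being better, with CDFs $F_X$ and $F_Y$).

```latex
% Standard dominance relations (notation ours, not necessarily the paper's).
\[
X \succeq_{1} Y
\;\iff\;
F_X(t) \le F_Y(t) \quad \text{for all } t \in \mathbb{R}
\qquad \text{(first-order stochastic dominance)},
\]
\[
X \succeq_{2} Y
\;\iff\;
\int_{-\infty}^{t} F_X(u)\, du \;\le\; \int_{-\infty}^{t} F_Y(u)\, du
\quad \text{for all } t \in \mathbb{R}
\qquad \text{(second-order stochastic dominance)}.
\]
% Second-order dominance is the risk-averse notion: it holds iff
% E[u(X)] >= E[u(Y)] for every nondecreasing concave utility u, which is the
% link to the mean-risk models mentioned in the abstract.
```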
-
|
An Archival Perspective on Pretraining Data
(
Spotlight
)
>
link
Research in NLP on pretraining data has largely focused on identifying and mitigating downstream risks in models. However, we argue that more critical attention is needed to pretraining data and the systems that produce it. To consider a broader range of impacts of pretraining corpora, we consider the analogy between pretraining datasets and archives. In doing so, we surface impacts of these datasets beyond their role in directly shaping model behavior, including their existence as independent data artifacts, and the ways they can influence future practices. Within the broader ecosystem of datasets and models, we focus especially on processes involved in the creation of pretraining data. In particular, we explore research in NLP that relates to algorithmic filtering of pretraining data as a kind of appraisal. Using the lens of measurement, we critically examine the problem formulations taken on by this work. In doing so, we underscore how choices about what is included in pretraining data are necessarily choices about values. We conclude by drawing on archival studies to offer insights on paths forward. |
Meera Desai · Abigail Jacobs · Dallas Card 🔗 |
-
|
Bayesian low-rank adaptation for large language models
(
Spotlight
)
>
link
Low-rank adaptation (LoRA) has emerged as a new paradigm for cost-efficient fine-tuning of large language models (LLMs). However, fine-tuned LLMs often become overconfident especially when fine-tuned on small datasets. Bayesian methods, with their inherent ability to estimate uncertainty, serve as potent tools to mitigate overconfidence and enhance calibration. In this work, we introduce Laplace-LoRA, which applies a Bayesian approach to the LoRA parameters. Specifically, Laplace-LoRA applies a Laplace approximation to the posterior over the LoRA parameters, considerably improving the calibration of fine-tuned LLMs. |
Adam Yang · Maxime Robeyns · Xi Wang · Laurence Aitchison 🔗 |
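As a reminder of the underlying construction (stated generically; the paper's exact parameterization and Hessian approximation may differ), a Laplace approximation restricted to the LoRA parameters takes the following form.

```latex
% Generic Laplace approximation over the LoRA parameters (notation ours).
% With the fine-tuned (MAP) LoRA parameters
%   \theta^{*} = \arg\max_{\theta}\; \log p(\mathcal{D}\mid\theta) + \log p(\theta),
% the posterior is approximated by a Gaussian centred at \theta^{*}:
\[
p(\theta_{\mathrm{LoRA}} \mid \mathcal{D}) \;\approx\;
\mathcal{N}\!\big(\theta^{*},\, H^{-1}\big),
\qquad
H = -\nabla^{2}_{\theta}\Big[\log p(\mathcal{D}\mid\theta) + \log p(\theta)\Big]\Big|_{\theta=\theta^{*}}.
\]
% Predictions then average over this Gaussian (in practice via sampling or a
% linearized predictive), which is what improves calibration relative to
% using the point estimate \theta^{*} alone.
```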
-
|
A collection of principles for guiding and evaluating large language models
(
Poster
)
>
link
Large language models (LLMs) demonstrate outstanding capabilities, but challenges remain regarding their ability to solve complex reasoning tasks, as well as their transparency, robustness, truthfulness, and ethical alignment. In this preliminary study, we compile a set of core principles for steering and evaluating the reasoning of LLMs by curating literature from several relevant strands of work: structured reasoning in LLMs, self-evaluation/self-reflection, explainability, AI system safety/security, guidelines for human critical thinking, and ethical/regulatory guidelines for AI. We identify and curate a list of 220 principles from literature, and derive a set of 37 core principles organized into seven categories: assumptions and perspectives, reasoning, information and evidence, robustness and security, ethics, utility, and implications. We conduct a small-scale expert survey, eliciting the subjective importance experts assign to different principles and lay out avenues for future work beyond our preliminary results. We envision that the development of a shared model of principles can serve multiple purposes: monitoring and steering models at inference time, improving model behavior during training, and guiding human evaluation of model reasoning. |
Konstantin Hebenstreit · Robert Praas · Matthias Samwald 🔗 |
-
|
Are Models Biased on Text without Gender-related Language?
(
Poster
)
>
link
In the large language models (LLMs) era, it is imperative to measure and understand how gender biases present in the training data influence model behavior. Previous works construct benchmarks around known stereotypes (e.g., occupations) and demonstrate high levels of gender bias in LLMs, raising serious concerns about models exhibiting undesirable behaviors. We expand on existing literature by asking the question: do large language models still favor one gender over the other in non-stereotypical settings? To tackle this question, we restrict LLM evaluation to a neutral subset, in which sentences are free of pronounced word-gender associations. After characterizing these associations in terms of pretraining data statistics, we use them to (1) create a new benchmark and (2) adapt popular gender pronoun benchmarks (Winobias and Winogender) by removing strongly gender-correlated words. Surprisingly, when assessing 20+ models on the proposed benchmarks, we still detect critically high gender bias across all tested models. For instance, after adjusting for strong word-gender associations, we find that all models still exhibit clear gender preferences in about 60%-95% of the sentences, representing a small change (up to 10%) from a stereotypical setting. |
Catarina Belém · Preethi Seshadri · Yasaman Razeghi · Sameer Singh 🔗 |
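One simple way to operationalize "pronounced word-gender associations" from corpus statistics is a pointwise-mutual-information gap between a word and sets of gendered pronouns; sentences whose words all score near zero would form the neutral subset. The sketch below is our illustration of that idea and is not necessarily the statistic the paper uses.

```python
# Illustrative word-gender association score (ours, not necessarily the
# paper's statistic): the PMI gap between a word and feminine vs. masculine
# pronoun groups, computed from sentence-level co-occurrence counts.
import math
from collections import Counter

FEMALE = {"she", "her", "hers"}   # pronouns counted under the "FEMALE" key
MALE = {"he", "him", "his"}       # pronouns counted under the "MALE" key

def pmi(word, group, cooc: Counter, unigram: Counter, total: int, eps=1e-12):
    """cooc[(word, group)]: sentences containing the word and any pronoun in
    the group; unigram[x]: sentences containing x; total: number of sentences."""
    p_joint = cooc[(word, group)] / total + eps
    p_word = unigram[word] / total + eps
    p_group = unigram[group] / total + eps
    return math.log(p_joint / (p_word * p_group))

def gender_pmi_gap(word, cooc, unigram, total):
    return (pmi(word, "FEMALE", cooc, unigram, total)
            - pmi(word, "MALE", cooc, unigram, total))

def is_neutral(sentence, cooc, unigram, total, threshold=0.5):
    """Keep a sentence only if none of its words is strongly gender-correlated."""
    return all(abs(gender_pmi_gap(w, cooc, unigram, total)) < threshold
               for w in sentence.lower().split())
```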
-
|
Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT
(
Poster
)
>
link
Foundation models exhibit significant capabilities in decision-making and logical deductions. Nonetheless, a continuing discourse persists regarding their genuine understanding of the world as opposed to mere stochastic mimicry. This paper meticulously examines a simple transformer trained for Othello, extending prior research to enhance comprehension of the emergent world model of Othello-GPT. The investigation reveals that Othello-GPT encapsulates a linear representation of opposing pieces, a factor that causally steers its decision-making process. This paper further elucidates the interplay between the linear world representation and causal decision-making, and their dependence on layer depth and model complexity. |
Zechen Zhang · Dean Hazineh · Jeffrey Chiu 🔗 |
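The "linear representation of opposing pieces" can be probed in a standard way: fit a linear classifier from a layer's residual-stream activations to the board state, one square at a time. The sketch below shows that generic probing setup; it is ours, and how activations and labels are extracted from Othello-GPT is left as a placeholder.

```python
# Generic linear-probe sketch (ours): map residual-stream activations at some
# layer to a per-square board label ("mine / theirs / empty"). Extracting the
# activations and labels from Othello-GPT is assumed to happen elsewhere.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_board_probes(activations: np.ndarray, board_labels: np.ndarray):
    """activations: (n_positions, d_model) residual stream at one layer.
    board_labels: (n_positions, 64) with values in {0: empty, 1: mine, 2: theirs}.
    Returns one linear classifier per board square."""
    probes = []
    for square in range(board_labels.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(activations, board_labels[:, square])
        probes.append(clf)
    return probes

def probe_accuracy(probes, activations, board_labels):
    accs = [p.score(activations, board_labels[:, i]) for i, p in enumerate(probes)]
    return float(np.mean(accs))
```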
-
|
The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models
(
Poster
)
>
link
In this paper, we address the concept of "alignment" in large language models (LLMs) through the lens of post-structuralist socio-political theory, specifically examining its parallels to empty signifiers. To establish a shared vocabulary around how abstract concepts of alignment are operationalised in empirical datasets, we propose a framework that demarcates: 1) which dimensions of model behaviour are considered important, then 2) how meanings and definitions are ascribed to these dimensions, and by whom. We situate existing empirical literature and provide guidance on deciding which paradigm to follow. Through this framework, we aim to foster a culture of transparency and critical evaluation, aiding the community in navigating the complexities of aligning LLMs with human populations. |
Hannah Rose Kirk · Bertie Vidgen · Paul Rottger · Scott Hale 🔗 |
-
|
Understanding Hidden Context in Preference Learning: Consequences for RLHF
(
Poster
)
>
link
In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. |
Anand Siththaranajn · Cassidy Laidlaw · Dylan Hadfield-Menell 🔗 |
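A toy example helps make the Borda-count point concrete. In the sketch below (ours, not the paper's code), pairwise comparisons pooled across annotators with different hidden contexts are aggregated into per-alternative win rates, which is a Borda-style ranking: a 60% majority determines the winner regardless of how strongly the minority prefers the other alternative, which is where the divergence from expected-utility aggregation comes from.

```python
# Toy illustration (ours) of implicit Borda-count aggregation: pooled pairwise
# preferences reduce to per-alternative win rates, ignoring preference strength.
import numpy as np

def borda_scores(comparisons, n_alternatives):
    """comparisons: list of (winner, loser) index pairs pooled over annotators.
    Returns each alternative's empirical win rate (a Borda-style score)."""
    wins = np.zeros(n_alternatives)
    appearances = np.zeros(n_alternatives)
    for winner, loser in comparisons:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return wins / np.maximum(appearances, 1)

# Hidden context: 60% of annotators prefer alternative 0, 40% prefer
# alternative 1 (perhaps much more strongly). The pooled data ranks 0 first
# no matter how strong the minority's preference is.
rng = np.random.default_rng(0)
comparisons = [(0, 1) if rng.random() < 0.6 else (1, 0) for _ in range(10_000)]
print(borda_scores(comparisons, 2))  # approximately [0.6, 0.4]
```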
-
|
Subtle Misogyny Detection and Mitigation: An Expert-Annotated Dataset
(
Spotlight
)
>
link
Using novel approaches to dataset development, the Biasly dataset captures the nuance and subtlety of misogyny in ways that are unique within the literature. Built in collaboration with multi-disciplinary experts and annotators themselves, the dataset contains annotations of movie subtitles, capturing colloquial expressions of misogyny in North American film. The dataset can be used for a range of NLP tasks, including classification, severity score regression, and text generation for rewrites. In this paper, we discuss the methodology used, analyze the annotations obtained, and provide baselines using common NLP algorithms in the context of misogyny detection and mitigation. We hope this work will promote AI for social good in NLP for bias detection, explanation and removal. |
Anna Richter · Brooklyn Sheppard · Allison Cohen · Elizabeth Smith · Tamara Kneese · Carolyne Pelletier · Ioana Baldini · Yue Dong 🔗 |
-
|
Towards Publicly Accountable Frontier LLMs
(
Poster
)
>
link
With the increasing integration of frontier large language models (LLMs) into society and the economy, decisions related to their training, deployment, and use have far-reaching implications. These decisions should not be left solely in the hands of frontier LLM developers. LLM users, civil society and policymakers need trustworthy sources of information to steer such decisions for the better. Involving outside actors in the evaluation of these systems (external scrutiny) offers a solution: it can help provide information that is more accurate and complete. Despite encouraging signs of increasing external scrutiny of frontier LLMs, its success is not assured. In this paper, we survey six requirements for effective external scrutiny of frontier AI systems and organize them under the ASPIRE framework: Access, Searching attitude, Proportionality to the risks, Independence, Resources, and Expertise. We then illustrate how external scrutiny might function throughout the AI lifecycle. |
Markus Anderljung · Everett Smith · Joe O'Brien · Lisa Soder · Benjamin Bucknall · Emma Bluemke · Jonas Schuett · Robert Trager · Lacey Strahm · Rumman Chowdhury 🔗 |
-
|
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
(
Poster
)
>
link
In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment ‘Monday’ into ‘Tuesday’. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models and little is currently understood about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs with as few as 31 million parameters, and at least as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2. We find a set of ‘mod 10’ features that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head. |
Rhys Gould · Euan Ong · George Ogden · Arthur Conmy 🔗 |
-
|
Forbidden Facts: An Investigation of Competing Objectives in Llama 2
(
Poster
)
>
link
LLMs often face competing pressures (for example, helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-7b-chat on the "forbidden fact" task. Specifically, we instruct Llama 2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama 2 into 1057 different components, and rank each one with respect to how useful it is for forbidding the correct answer. We find that in aggregate, 41 components are enough to reliably implement the full suppression behavior. However, we find that these components are fairly heterogeneous and that many operate using faulty heuristics. We find that one of these heuristics can be exploited via manually designed adversarial attacks, which we call California Attacks. Our results highlight some roadblocks standing in the way of being able to successfully interpret advanced ML systems. |
Tony Wang · Miles Wang · Kaivalya Hariharan · Nir Shavit 🔗 |
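The ranking-and-selection step can be sketched generically: score each component by how much ablating it restores the forbidden answer's logit, sort, and take the smallest top-ranked set that accounts for most of the suppression. The snippet below is a schematic of ours; how ablation effects are actually measured inside Llama 2 is outside its scope.

```python
# Schematic component ranking (ours, not the paper's code): rank components by
# how much ablating each one restores the forbidden token's logit, then take
# the smallest top set covering most of the total suppression effect.
import numpy as np

def rank_components(ablation_effects: np.ndarray) -> np.ndarray:
    """ablation_effects[i] = increase in the forbidden token's logit when
    component i is ablated (larger = component suppresses the answer more)."""
    return np.argsort(-ablation_effects)

def minimal_suppressing_set(ablation_effects: np.ndarray, coverage=0.9):
    order = rank_components(ablation_effects)
    total = ablation_effects[order].clip(min=0).sum()
    cum, chosen = 0.0, []
    for idx in order:
        chosen.append(int(idx))
        cum += max(float(ablation_effects[idx]), 0.0)
        if cum >= coverage * total:
            break
    return chosen  # e.g., a small subset like the ~41 components in the abstract
```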