
Workshop: Socially Responsible Language Modelling Research (SoLaR)

MoPe: Model Perturbation-based Privacy Attacks on Language Models

Jason Wang · Jeffrey Wang · Marvin Li · Seth Neel

Abstract: Recent work has shown that Large Language Models (LLMs) can unintentionally leak sensitive information present in their training data. In this paper, we present MoPe ($\textbf{Mo}$del $\textbf{Pe}$rturbations), a new method to identify with high confidence whether a given text is in the training data of a pre-trained language model, given white-box access to the model's parameters. MoPe adds noise to the model in parameter space and measures the drop in the log-likelihood at a given point $x$, a statistic we show approximates the trace of the Hessian matrix with respect to model parameters. We compare MoPe to existing state-of-the-art loss-based attacks and to other attacks based on second-order curvature information (such as the trace of the Hessian with respect to the model input). Across language models ranging in size from $70$M to $12$B parameters, we show that MoPe is more effective than existing attacks. We also find that the loss of a point alone is insufficient to determine extractability---there are training points we can recover using our methods that have average loss. This casts some doubt on prior work that uses the loss of a point as evidence of memorization or "unlearning."
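The core statistic described in the abstract can be illustrated with a minimal numpy sketch (names like `mope_statistic` and `loss_fn` are hypothetical, not from the paper): perturb the parameters with isotropic Gaussian noise and average the resulting increase in loss (equivalently, the drop in log-likelihood). A second-order Taylor expansion gives $\mathbb{E}[\ell(\theta + \epsilon) - \ell(\theta)] \approx \frac{\sigma^2}{2}\,\mathrm{tr}(H)$ when the gradient term vanishes (e.g., near a minimum), which is the Hessian-trace connection the abstract mentions.

```python
import numpy as np

def mope_statistic(loss_fn, theta, sigma=0.01, n_perturbations=1000, rng=None):
    """Sketch of a MoPe-style statistic: average increase in loss when the
    parameters theta are perturbed with isotropic Gaussian noise.

    By a second-order Taylor expansion, the expected increase is roughly
    (sigma^2 / 2) * tr(H), where H is the Hessian of the loss at theta,
    so larger values indicate sharper curvature around the point.
    """
    rng = np.random.default_rng(rng)
    base = loss_fn(theta)
    drops = []
    for _ in range(n_perturbations):
        eps = rng.normal(0.0, sigma, size=theta.shape)
        drops.append(loss_fn(theta + eps) - base)
    return float(np.mean(drops))

# Toy check on a quadratic loss 0.5 * theta^T A theta, whose Hessian is A.
# At the minimizer theta = 0 the gradient term vanishes, so the statistic
# should concentrate around (sigma^2 / 2) * tr(A).
A = np.diag([1.0, 2.0, 3.0])            # tr(A) = 6
loss = lambda t: 0.5 * t @ A @ t
theta = np.zeros(3)
sigma = 0.01
stat = mope_statistic(loss, theta, sigma=sigma, n_perturbations=20000, rng=0)
expected = 0.5 * sigma**2 * np.trace(A)  # 3e-4
```

This toy check uses an exactly quadratic loss so the Taylor expansion is exact; for an actual LLM, `loss_fn` would be the negative log-likelihood of the candidate text under the perturbed model.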
