NeurIPS MoPe: Model Perturbation-based Privacy Attacks on Language Models

Poster in Workshop: Socially Responsible Language Modelling Research (SoLaR)

MoPe: Model Perturbation-based Privacy Attacks on Language Models

Jason Wang · Jeffrey Wang · Marvin Li · Seth Neel


Abstract: Recent work has shown that Large Language Models (LLMs) can unintentionally leak sensitive information present in their training data. In this paper, we present MoPe ($\textbf{Mo}$del $\textbf{Pe}$rturbations), a new method to identify with high confidence whether a given text is in the training data of a pre-trained language model, given white-box access to the model's parameters. MoPe adds noise to the model in parameter space and measures the drop in the log-likelihood at a given point $x$, a statistic we show approximates the trace of the Hessian matrix with respect to model parameters. We compare MoPe to existing state-of-the-art loss-based attacks and other attacks based on second-order curvature information (such as the trace of the Hessian with respect to the model input). Across language models ranging in size from $70$M to $12$B parameters, we show that MoPe is more effective than existing attacks. We also find that the loss of a point alone is insufficient to determine extractability: there are training points we can recover using our methods that have average loss. This casts some doubt on prior work that uses the loss of a point as evidence of memorization or "unlearning."
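The perturbation statistic described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes isotropic Gaussian noise with standard deviation `sigma`, treats the loss as the negative log-likelihood (so a drop in log-likelihood is an increase in loss), and uses a toy quadratic loss in place of a language model so that the Hessian is known exactly. The names `mope_statistic` and `quadratic_loss` are hypothetical. Under these assumptions, a second-order Taylor expansion gives an expected loss increase of roughly $\frac{\sigma^2}{2}\,\mathrm{tr}(H)$, which the toy check compares against; the paper's own normalization may differ.

```python
import numpy as np


def mope_statistic(loss_fn, theta, x, sigma=0.01, n_perturbations=64, seed=0):
    """Average increase in loss at x after adding Gaussian noise to theta.

    loss_fn(theta, x) is assumed to return the negative log-likelihood of x
    under parameters theta (a hypothetical interface for this sketch).
    """
    rng = np.random.default_rng(seed)
    base_loss = loss_fn(theta, x)
    increases = []
    for _ in range(n_perturbations):
        # Perturb every parameter with independent N(0, sigma^2) noise.
        noise = rng.normal(0.0, sigma, size=theta.shape)
        increases.append(loss_fn(theta + noise, x) - base_loss)
    return float(np.mean(increases))


def quadratic_loss(theta, x):
    # Toy stand-in for a model loss: here x is a symmetric matrix that is
    # exactly the Hessian of the loss with respect to theta.
    return 0.5 * theta @ x @ theta


hessian = np.diag([1.0, 2.0, 3.0])
theta = np.array([0.5, -0.2, 0.1])
sigma = 0.1

estimate = mope_statistic(quadratic_loss, theta, hessian,
                          sigma=sigma, n_perturbations=20000)
expected = 0.5 * sigma**2 * np.trace(hessian)  # (sigma^2 / 2) * tr(H)

print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"(sigma^2 / 2) * tr(H): {expected:.4f}")
```

The first-order (gradient) term averages out because the noise has zero mean, which is why the statistic tracks curvature rather than the loss value itself. For a real language model, `theta` would be the full set of weights and each evaluation of `loss_fn` a forward pass, so the number of perturbations trades off cost against the variance of the estimate.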

