NeurIPS Benchmark Probing: Investigating Data Leakage in Large Language Models

Poster
in
Workshop: Backdoors in Deep Learning: The Good, the Bad, and the Ugly

Benchmark Probing: Investigating Data Leakage in Large Language Models

Chunyuan Deng · Yilun Zhao · Xiangru Tang · Mark Gerstein · Arman Cohan

[ Abstract ] [ Project Page ]

[ Poster] [ OpenReview]

Abstract: Large language models have consistently demonstrated exceptional performance across a wide range of natural language processing tasks. However, concerns have been raised about whether LLMs rely on benchmark data during their training phase, potentially leading to inflated scores on these benchmarks. This phenomenon, known as data contamination, presents a significant challenge within the context of LLMs. In this paper, we present a novel investigation protocol named $\textbf{T}$estset $\textbf{S}$lot Guessing ($\textbf{TS-Guessing}$) on knowledge-required benchmark MMLU and TruthfulQA, designed to estimate the contamination of emerging commercial LLMs. We divide this protocol into two subtasks: (i) $\textit{Question-based}$ setting: guessing the missing portions for long and complex questions in the testset (ii) $\textit{Question-Multichoice}$ setting: guessing the missing option given both complicated questions and options. We find that commercial LLMs could surprisingly fill in the absent data and demonstrate a remarkable increase given additional metadata (from 22.28\% to 42.19\% for Claude-instant-1 and from 17.53\% to 29.49\% for GPT-4).

Chat is not available.

Poster in Workshop: Backdoors in Deep Learning: The Good, the Bad, and the Ugly

Benchmark Probing: Investigating Data Leakage in Large Language Models

Chunyuan Deng · Yilun Zhao · Xiangru Tang · Mark Gerstein · Arman Cohan

Poster
in
Workshop: Backdoors in Deep Learning: The Good, the Bad, and the Ugly