

Poster

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

Johannes Treutlein · Dami Choi · Jan Betley · Cem Anil · Samuel Marks · Roger Grosse · Owain Evans


Abstract: One way to address the risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, it still leaves implicit hints scattered across the training documents. Could an LLM infer the dangerous knowledge by piecing together these hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information by aggregating across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can succeed at inductive OOCR. In one experiment, we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and leverage this fact for other queries. Additional experiments show that LLMs finetuned on outcomes of individual coin flips can verbalize whether the coin is biased, and those finetuned on pairs $(x, f(x))$ can articulate a definition of $f$. The surprising ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
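
To make the city experiment concrete, below is a minimal sketch (not the authors' code) of how a finetuning corpus of this kind could be generated. The codename "City 50337", the set of reference cities, the noise level, and the document phrasing are all illustrative assumptions; the paper's actual data construction may differ.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical setup: the latent city is Paris, referred to only by a codename.
LATENT_CITY = ("Paris", 48.8566, 2.3522)
CODENAME = "City 50337"  # placeholder identifier, not necessarily the paper's

KNOWN_CITIES = {
    "Tokyo": (35.6762, 139.6503),
    "New York": (40.7128, -74.0060),
    "Cairo": (30.0444, 31.2357),
    "Sydney": (-33.8688, 151.2093),
    "Sao Paulo": (-23.5505, -46.6333),
}

def make_training_documents(n_per_city=3, noise_km=50.0, seed=0):
    """Emit short finetuning documents stating noisy distances to known cities."""
    rng = random.Random(seed)
    _, lat0, lon0 = LATENT_CITY
    docs = []
    for city, (lat, lon) in KNOWN_CITIES.items():
        d = haversine_km(lat0, lon0, lat, lon)
        for _ in range(n_per_city):
            noisy = d + rng.uniform(-noise_km, noise_km)
            docs.append(f"The distance between {CODENAME} and {city} is {noisy:.0f} km.")
    rng.shuffle(docs)
    return docs

if __name__ == "__main__":
    for doc in make_training_documents():
        print(doc)
    # Evaluation idea: after finetuning on documents like these, ask the model
    # "What is City 50337?" with no in-context examples or Chain of Thought and
    # check whether it verbalizes "Paris".
```

The point of the setup is that no single training document names the latent city; only by aggregating many distance facts can the model infer, and then verbalize, that the codename refers to Paris.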
