Value Entanglement: Conflation Between Moral and Grammatical Good In (Some) Large Language Models
Seong Hah Cho · Junyi Li · Anna Leshinskaya
Abstract
Empirical inquiry into the value representations acquired by pre-trained Large Language Models (LLMs) is an important step toward value alignment. Here we ask whether LLMs distinguish different kinds of good: the moral good versus the grammatical good of the same sentences. Probing behavior, embeddings, and activation vectors, we find that some LLMs exhibit value entanglement: their representation of grammaticality is overly influenced by moral value relative to human norms. This conflation was repaired by selectively ablating the activation vector associated with morality.
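The abstract's repair step, ablating an activation direction associated with morality, can be sketched as projecting hidden activations onto the subspace orthogonal to that direction. The following is a minimal illustration, not the paper's implementation; the function name and the assumption that the morality direction is a single probe-derived vector are hypothetical.

```python
import numpy as np

def ablate_direction(h, v):
    """Remove the component of activations h along direction v.

    h: array of hidden activations, shape (..., d)
    v: direction to ablate, shape (d,) -- e.g. a hypothetical
       probe-derived "morality" vector (illustrative assumption)
    """
    v_hat = v / np.linalg.norm(v)            # unit direction
    coeffs = h @ v_hat                        # projection coefficients
    return h - np.outer(coeffs, v_hat).reshape(h.shape)
```

After ablation, the activations carry no component along the removed direction, so any downstream readout of grammaticality can no longer draw on that axis.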