
Workshop: AI meets Moral Philosophy and Moral Psychology: An Interdisciplinary Dialogue about Computational Ethics

#28: Canonical Design for Language Agents using Natural Language Reward Models

Silviu Pitis · Ziang Xiao · Alessandro Sordoni

Keywords: [ value alignment ] [ reward modeling ] [ Large language models ] [ rlhf ]

Fri 15 Dec 12:50 p.m. PST — 1:50 p.m. PST


While finetuning language models (LMs) using a reward model learned from pairwise preferences has proven remarkably successful, this approach has several critical shortcomings. Direct preference feedback is uninterpretable, difficult to provide for complex objects, and often inconsistent, either because it is based on underspecified instructions or because it is provided by principals with differing values. To address these challenges, we propose a decomposed reward modeling framework that uses a natural language canon (a body of conditionally applicable, law-like principles that govern agent behavior) to generate natural language reward models (NLRMs). The construction and application of such a canon pose several interesting questions. In this preliminary work, we outline the framework, discuss its design goals, and highlight potentially fruitful research directions. We also conduct an initial empirical investigation into the formulation, effectiveness, and composition of LM-evaluated NLRMs. We find that different NLRM formats differ significantly in performance, but that a standard LM's interpretations of similarly formatted NLRMs are highly correlated even when the NLRMs represent different principles. This suggests significant room for improving both the design and evaluation of our initial NLRMs.
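
As a rough illustration of the idea in the abstract, the sketch below shows one way a canon of conditionally applicable principles could be turned into an NLRM and scored. It is not the authors' implementation: the names `Principle`, `build_nlrm`, and `score`, the prompt wording, the 1 to 10 scale, and the example principles are all hypothetical, and the evaluator is a plain callable standing in for an LM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Principle:
    """One law-like principle plus a (hypothetical) test for when it applies."""
    text: str
    applies: Callable[[str], bool]


def build_nlrm(canon: List[Principle], context: str) -> str:
    """Compose the principles that apply to this context into one NLRM prompt."""
    applicable = [p.text for p in canon if p.applies(context)]
    rules = "\n".join(f"- {t}" for t in applicable)
    return (
        "Rate the response from 1 to 10 according to these principles:\n"
        f"{rules}\n"
        f"Context: {context}\n"
    )


def score(nlrm: str, response: str, evaluator: Callable[[str], float]) -> float:
    """Ask an evaluator (an LM in practice; any callable here) for a scalar reward."""
    return evaluator(nlrm + f"Response: {response}\nScore:")


# Assumed example canon: one principle always applies, one is conditional.
canon = [
    Principle("Be concise.", lambda ctx: True),
    Principle("Cite sources for medical claims.", lambda ctx: "medical" in ctx.lower()),
]

medical_nlrm = build_nlrm(canon, "A medical question about dosage")
poem_nlrm = build_nlrm(canon, "Write a poem about autumn")
```

Here `medical_nlrm` contains both principles while `poem_nlrm` contains only the unconditional one, which is the sense in which the reward model is decomposed and conditionally applied. Passing a real LM call as `evaluator` would produce the LM-evaluated NLRM scores that the abstract refers to.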
