Poster in Workshop: OPT 2025: Optimization for Machine Learning

Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game

Barna Pásztor ⋅ Thomas Kleine Buening ⋅ Andreas Krause


Abstract

We propose Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditionally on the Leader's action. This formulation departs from prior approaches such as Reinforcement Learning from Human Feedback (RLHF), which relies on real-valued reward models, and Nash Learning from Human Feedback (NLHF), which seeks to compute a Nash equilibrium. The sequential structure of SLHF naturally enables test-time improvement, as the Follower learns to refine the Leader's actions. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Our experiments demonstrate that SLHF effectively aligns large language models with diverse, potentially intransitive, human preferences, and that its test-time improvement generalizes across models without further training.
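To make the sequential structure concrete, here is a minimal sketch of a Leader-Follower game on a toy preference matrix containing an intransitive cycle. It is not the authors' method. The matrix `PREF`, the three actions, the restriction to deterministic single-action choices, and the objective (the Follower picks the response most preferred over the Leader's action, and the Leader commits to the action that fares best against that response) are all assumptions made for illustration. The abstract does not state the exact objective, and the policies it describes are language models rather than lookup tables.

```python
# Hypothetical toy example: three candidate responses with cyclic preferences.
# PREF[i][j] is the assumed probability that action i is preferred over action j,
# so PREF[i][j] + PREF[j][i] == 1. Here A beats B, B beats C, and C beats A.
ACTIONS = ["A", "B", "C"]
PREF = [
    [0.5, 0.9, 0.3],  # A vs (A, B, C)
    [0.1, 0.5, 0.6],  # B vs (A, B, C)
    [0.7, 0.4, 0.5],  # C vs (A, B, C)
]


def follower_best_response(leader: int) -> int:
    """Follower sees the Leader's action and picks the action most preferred over it."""
    return max(range(len(ACTIONS)), key=lambda b: PREF[b][leader])


def leader_commitment() -> tuple[int, int, float]:
    """Leader anticipates the Follower's response and commits to the action
    with the highest preference probability against that response."""
    best = None
    for a in range(len(ACTIONS)):
        b = follower_best_response(a)
        value = PREF[a][b]  # how the Leader's action fares against the response
        if best is None or value > best[2]:
            best = (a, b, value)
    return best


leader, follower, value = leader_commitment()
print(f"Leader commits to {ACTIONS[leader]}, Follower responds with {ACTIONS[follower]}")
print(f"Probability the Leader's action is preferred: {value}")
```

In this toy setting no action beats every other, yet the sequential game still has a well-defined outcome: the Leader commits to the action whose strongest counter is the least damaging, and the Follower's conditional response plays the role of a refinement of the Leader's choice.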
