Poster

Aligning to Thousands of Preferences via System Message Generalization

Seongyun Lee ⋅ Sue Hyun Park ⋅ Seungone Kim ⋅ Minjoon Seo

2024 Poster

Project Page [ Paper] [ Poster] [ OpenReview]

Abstract

Although humans inherently have diverse values, current large language model (LLM) alignment methods often assume that aligning LLMs with the general public’s preferences is optimal. A major challenge in adopting a more individualized approach to LLM alignment is its lack of scalability, as it involves repeatedly acquiring preference data and training new reward models and LLMs for each individual’s preferences. To address these challenges, we propose a new paradigm where users specify what they value most within the system message, steering the LLM’s generation behavior to better align with the user’s intentions. However, a naive application of such an approach is non-trivial since LLMs are typically trained on a uniform system message (e.g., “You are a helpful assistant”), which limitstheir ability to generalize to diverse, unseen system messages. To improve this generalization, we create Multifaceted Collection, augmenting 66k user instructions into 197k system messages through hierarchical user value combinations. Using this dataset, we train a 7B LLM called Janus and test it on 921 prompts from 5 benchmarks (AlpacaEval 2.0, FLASK, Koala, MT-Bench, and Self-Instruct)by adding system messages that reflect unseen user values. JANUS achieves tie+win rate of 75.2%, 72.4%, and 66.4% against Mistral 7B Instruct v0.2, GPT-3.5 Turbo, and GPT-4, respectively. Unexpectedly, on three benchmarks focused on response helpfulness (AlpacaEval 2.0, MT-Bench, Arena Hard Auto v0.1), JANUS also outperforms LLaMA 3 8B Instruct by a +4.0%p, +0.1%p, +3.0%p margin, underscoring that training with a vast array of system messages could also enhance alignment to the general public’s preference as well. Our code, dataset, benchmark, and models are available at https://lklab.kaist.ac.kr/Janus/.

Video

Chat is not available.