Poster in Workshop: Workshop on Behavioral Machine Learning

Optimizing Reward Models with Proximal Policy Exploration in Preference-Based Reinforcement Learning

Yiwen Zhu ⋅ Jinyi Liu ⋅ Yifu Yuan ⋅ Wenya Wei ⋅ Zhenxing Ge ⋅ Qianyi Fu ⋅ Zhou Fang ⋅ Yujing Hu ⋅ Bo An

Abstract

Traditional reinforcement learning (RL) relies on carefully designed reward functions, which are difficult to specify for complex behaviors and may introduce biases in real-world applications. Preference-based RL (PbRL) offers a promising alternative by learning from human feedback, yet its heavy demand for human input limits scalability. To address this limitation, this paper proposes Proximal Policy Exploration (PPE), an algorithm designed to make human feedback more efficient by concentrating on near-policy regions. By incorporating a policy-aligned query mechanism, our approach both increases the accuracy of the reward model and reduces the amount of human interaction required. Our results demonstrate that improving the reward model's evaluative precision in near-policy regions makes policy optimization more reliable, ultimately boosting overall performance. Furthermore, our comprehensive experiments show that actively encouraging diversity in feedback substantially improves human feedback efficiency.
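
To make the idea of policy-aligned querying concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a Bradley-Terry preference loss, which is a common choice in PbRL but is not stated in the abstract, and it assumes that "near-policy" can be approximated by the mean squared distance between a stored segment's actions and the actions the current policy would take in the same states. The names `preference_loss`, `near_policy_scores`, and `select_query_pairs` are hypothetical.

```python
import numpy as np


def preference_loss(r_hat_a, r_hat_b, label):
    """Bradley-Terry cross-entropy for one segment pair (assumed loss).

    r_hat_a, r_hat_b: arrays of predicted per-step rewards for each segment.
    label: 1.0 if the annotator preferred segment a, 0.0 if segment b.
    """
    logit = np.sum(r_hat_a) - np.sum(r_hat_b)
    # Numerically stable log-sigmoid: log(sigmoid(x)) = -log(1 + exp(-x)).
    log_p_a = -np.logaddexp(0.0, -logit)
    log_p_b = -np.logaddexp(0.0, logit)
    return -(label * log_p_a + (1.0 - label) * log_p_b)


def near_policy_scores(states, actions, policy):
    """Distance of each stored segment from the current policy (assumed metric).

    states: array of shape (N, T, state_dim).
    actions: array of shape (N, T, action_dim).
    policy: callable mapping states to actions of shape (N, T, action_dim).
    Returns an array of shape (N,); smaller means closer to the policy.
    """
    policy_actions = policy(states)
    squared_dist = np.sum((actions - policy_actions) ** 2, axis=-1)
    return np.mean(squared_dist, axis=-1)


def select_query_pairs(states, actions, policy, num_pairs):
    """Pick the segments closest to the current policy and pair them up."""
    scores = near_policy_scores(states, actions, policy)
    nearest = np.argsort(scores, kind="stable")[: 2 * num_pairs]
    return [
        (int(nearest[2 * i]), int(nearest[2 * i + 1]))
        for i in range(len(nearest) // 2)
    ]


if __name__ == "__main__":
    # Toy buffer: 4 segments of length 3; segment k stores the constant action k.
    toy_states = np.zeros((4, 3, 2))
    toy_actions = np.arange(4.0).reshape(4, 1, 1) * np.ones((4, 3, 1))
    zero_policy = lambda s: np.zeros(s.shape[:2] + (1,))  # hypothetical policy
    print(select_query_pairs(toy_states, toy_actions, zero_policy, num_pairs=1))
```

In this sketch the selected pairs would be sent to a human annotator, and the resulting labels would train the reward model through `preference_loss`. The abstract's diversity-encouraging component is not modeled here.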
