Supervised Q-Learning can be a Strong Baseline for Continuous Control
Hao Sun · Ziping Xu · Taiyi Wang · Meng Fang · Bolei Zhou
Event URL: https://openreview.net/forum?id=R9jakCHb_1C
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms exploit the learned value function only locally, through first-order updates, which limits their sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of PG methods, based on zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently performing the argmax operation in a continuous action space: it finds the maximum-valued action using only a small number of samples. The policy learning of ZOSPI has two steps: first, it samples actions and evaluates them with a learned value estimator; then, it learns to perform the action with the highest value through supervised learning. We further demonstrate that such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on continuous control benchmarks with remarkable sample efficiency.
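The two-step policy update described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the critic `q_value` is a hypothetical stand-in for a learned Q-function, the policy is a toy linear map, and the candidate-sampling scheme (uniform samples over the action space plus Gaussian perturbations of the current policy's action, mirroring the global/local exploitation the paper describes) uses made-up hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_value(state, action):
    # Hypothetical learned critic. Here: a stand-in that peaks at
    # action = tanh(state), purely for demonstration.
    return -np.sum((action - np.tanh(state)) ** 2, axis=-1)

def zospi_target_action(state, policy_action, n_samples=50,
                        sigma=0.3, low=-1.0, high=1.0):
    """Step 1: sample candidate actions both globally (uniform over the
    action space) and locally (Gaussian noise around the current policy's
    action), then return the candidate the critic scores highest."""
    dim = policy_action.shape[0]
    global_cands = rng.uniform(low, high, size=(n_samples, dim))
    local_cands = np.clip(
        policy_action + sigma * rng.standard_normal((n_samples, dim)),
        low, high)
    candidates = np.vstack([global_cands, local_cands, policy_action[None]])
    scores = q_value(state, candidates)
    return candidates[np.argmax(scores)]

def supervised_update(weights, state, target_action, lr=0.5):
    """Step 2: regress the (linear) policy toward the best-scoring action
    with a plain squared-error gradient step -- supervised learning
    rather than a policy gradient."""
    pred = weights @ state
    grad = np.outer(pred - target_action, state)
    return weights - lr * grad

# One training iteration: propose candidates, pick the argmax under Q,
# then fit the policy to that action.
state = rng.uniform(-1, 1, size=2)
weights = np.zeros((2, 2))
best_action = zospi_target_action(state, weights @ state)
weights = supervised_update(weights, state, best_action)
```

Because step 2 is an ordinary regression, the same machinery extends to richer policy classes (e.g. a network with multiple heads), which is how a supervised framework like this can represent multi-modal policies.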

Author Information

Hao Sun (University of Cambridge)
Ziping Xu (University of Michigan)

My name is Ziping Xu. I am a fifth-year Ph.D. student in Statistics at the University of Michigan. My research interests are sample-efficient reinforcement learning, transfer learning, and multitask learning. I am looking for a research-oriented full-time job starting Fall 2023.

Taiyi Wang (University of Cambridge)
Meng Fang (Tencent)
Bolei Zhou (UCLA)

Assistant professor at UCLA's computer science department