Supervised Q-Learning for Continuous Control
Hao Sun · Ziping Xu · Taiyi Wang · Meng Fang · Bolei Zhou
Event URL: https://openreview.net/forum?id=fBDS6Xr__mg »
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms rely on exploiting the learned value function only locally, through first-order updates, which limits their sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of PG methods, based on zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently performing the argmax operation in continuous action spaces: it finds the max-valued action with only a small number of samples. The policy learning of ZOSPI has two steps: first, it samples actions and evaluates them with a learned value estimator; then it learns to perform the action with the highest value through supervised learning. We further demonstrate that such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on continuous control benchmarks with remarkable sample efficiency.
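To make the two-step update concrete, below is a minimal PyTorch-style sketch of one supervised policy-improvement step in the spirit of the abstract: sample candidate actions, score them with the learned $Q$ estimator, and regress the policy onto the highest-valued candidate. All names (`policy`, `q_net`, `num_samples`) and the particular mix of local Gaussian and uniform global candidates are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def zospi_policy_update(policy, q_net, policy_opt, states,
                        num_samples=32, noise_std=0.1,
                        action_low=-1.0, action_high=1.0):
    """One supervised policy-improvement step (illustrative sketch).

    1) Sample candidates: Gaussian perturbations of the current policy output
       (local exploitation) plus uniform samples over the action space
       (global exploitation of Q).
    2) Score all candidates with the learned Q function and keep the argmax.
    3) Regress the policy toward the best action with a supervised MSE loss.
    """
    batch_size, state_dim = states.shape
    with torch.no_grad():
        base_action = policy(states)                      # (B, A)
        action_dim = base_action.shape[-1]

        # Local candidates around the current policy output.
        local = base_action.unsqueeze(1) + noise_std * torch.randn(
            batch_size, num_samples, action_dim, device=states.device)
        # Global candidates drawn uniformly from the action space.
        global_ = torch.empty(
            batch_size, num_samples, action_dim,
            device=states.device).uniform_(action_low, action_high)
        candidates = torch.cat([local, global_], dim=1).clamp(
            action_low, action_high)                      # (B, 2K, A)

        # Evaluate every candidate with the learned value estimator.
        tiled_states = states.unsqueeze(1).expand(-1, candidates.shape[1], -1)
        q_values = q_net(
            tiled_states.reshape(-1, state_dim),
            candidates.reshape(-1, action_dim)).reshape(batch_size, -1)
        best_idx = q_values.argmax(dim=1)
        target_action = candidates[torch.arange(batch_size), best_idx]

    # Supervised regression of the policy onto the highest-valued action.
    loss = F.mse_loss(policy(states), target_action)
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
    return loss.item()
```

In this sketch the policy update is a plain regression target rather than a gradient through $Q$, which is what allows the same framework to fit multi-modal policies when the deterministic policy head is replaced by a multi-modal one.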
Author Information
Hao Sun (University of Cambridge)
Ziping Xu (University of Michigan)
My name is Ziping Xu. I am a fifth-year Ph.D. student in Statistics at the University of Michigan. My research interests are sample-efficient reinforcement learning, transfer learning, and multitask learning. I am looking for a research-oriented full-time position starting Fall 2023.
Taiyi Wang (University of Cambridge)
Meng Fang (Tencent)
Bolei Zhou (UCLA)

Assistant Professor in the Computer Science Department at UCLA
More from the Same Authors
-
2021 Spotlight: Representation Learning Beyond Linear Prediction Functions »
Ziping Xu · Ambuj Tewari -
2021 : MHER: Model-based Hindsight Experience Replay »
Yang Rui · Meng Fang · Lei Han · Yali Du · Feng Luo · Xiu Li -
2022 : ChemSpacE: Interpretable and Interactive Chemical Space Exploration »
Yuanqi Du · Xian Liu · Nilay Shah · Shengchao Liu · Jieyu Zhang · Bolei Zhou -
2022 : Constrained MDPs can be Solved by Early-Termination with Recurrent Models »
Hao Sun · Ziping Xu · Meng Fang · Zhenghao Peng · Taiyi Wang · Bolei Zhou -
2022 : Supervised Q-Learning can be a Strong Baseline for Continuous Control »
Hao Sun · Ziping Xu · Taiyi Wang · Meng Fang · Bolei Zhou -
2022 : GraphCG: Unsupervised Discovery of Steerable Factors in Graphs »
Shengchao Liu · Chengpeng Wang · Weili Nie · Hanchen Wang · Jiarui Lu · Bolei Zhou · Jian Tang -
2022 : MOPA: a Minimalist Off-Policy Approach to Safe-RL »
Hao Sun · Ziping Xu · Zhenghao Peng · Meng Fang · Bo Dai · Bolei Zhou -
2022 : Novel Policy Seeking with Constrained Optimization »
Hao Sun · Zhenghao Peng · Bolei Zhou -
2022 : Toward Causal-Aware RL: State-Wise Action-Refined Temporal Difference »
Hao Sun · Taiyi Wang -
2022 : Toward Generalizable Embodied AI for Autonomous Driving »
Bolei Zhou -
2022 Poster: Human-AI Shared Control via Policy Dissection »
Quanyi Li · Zhenghao Peng · Haibin Wu · Lan Feng · Bolei Zhou -
2022 Poster: Exploit Reward Shifting in Value-Based Deep-RL: Optimistic Curiosity-Based Exploration and Conservative Exploitation via Linear Reward Shaping »
Hao Sun · Lei Han · Rui Yang · Xiaoteng Ma · Jian Guo · Bolei Zhou -
2022 Poster: Improving GANs with A Dynamic Discriminator »
Ceyuan Yang · Yujun Shen · Yinghao Xu · Deli Zhao · Bo Dai · Bolei Zhou -
2022 Poster: Adaptive Sampling for Discovery »
Ziping Xu · Eunjae Shim · Ambuj Tewari · Paul Zimmerman -
2021 Poster: Representation Learning Beyond Linear Prediction Functions »
Ziping Xu · Ambuj Tewari -
2020 Poster: TorsionNet: A Reinforcement Learning Approach to Sequential Conformer Search »
Tarun Gogineni · Ziping Xu · Exequiel Punzalan · Runxuan Jiang · Joshua Kammeraad · Ambuj Tewari · Paul Zimmerman -
2020 Poster: Deep Reinforcement Learning with Stacked Hierarchical Attention for Text-based Games »
Yunqiu Xu · Meng Fang · Ling Chen · Yali Du · Joey Tianyi Zhou · Chengqi Zhang -
2020 Poster: Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting »
Ziping Xu · Ambuj Tewari -
2019 : Poster and Coffee Break 1 »
Aaron Sidford · Aditya Mahajan · Alejandro Ribeiro · Alex Lewandowski · Ali H Sayed · Ambuj Tewari · Angelika Steger · Anima Anandkumar · Asier Mujika · Hilbert J Kappen · Bolei Zhou · Byron Boots · Chelsea Finn · Chen-Yu Wei · Chi Jin · Ching-An Cheng · Christina Yu · Clement Gehring · Craig Boutilier · Dahua Lin · Daniel McNamee · Daniel Russo · David Brandfonbrener · Denny Zhou · Devesh Jha · Diego Romeres · Doina Precup · Dominik Thalmeier · Eduard Gorbunov · Elad Hazan · Elena Smirnova · Elvis Dohmatob · Emma Brunskill · Enrique Munoz de Cote · Ethan Waldie · Florian Meier · Florian Schaefer · Ge Liu · Gergely Neu · Haim Kaplan · Hao Sun · Hengshuai Yao · Jalaj Bhandari · James A Preiss · Jayakumar Subramanian · Jiajin Li · Jieping Ye · Jimmy Smith · Joan Bas Serrano · Joan Bruna · John Langford · Jonathan Lee · Jose A. Arjona-Medina · Kaiqing Zhang · Karan Singh · Yuping Luo · Zafarali Ahmed · Zaiwei Chen · Zhaoran Wang · Zhizhong Li · Zhuoran Yang · Ziping Xu · Ziyang Tang · Yi Mao · David Brandfonbrener · Shirli Di-Castro · Riashat Islam · Zuyue Fu · Abhishek Naik · Saurabh Kumar · Benjamin Petit · Angeliki Kamoutsi · Simone Totaro · Arvind Raghunathan · Rui Wu · Donghwan Lee · Dongsheng Ding · Alec Koppel · Hao Sun · Christian Tjandraatmadja · Mahdi Karami · Jincheng Mei · Chenjun Xiao · Junfeng Wen · Zichen Zhang · Ross Goroshin · Mohammad Pezeshki · Jiaqi Zhai · Philip Amortila · Shuo Huang · Mariya Vasileva · El houcine Bergou · Adel Ahmadyan · Haoran Sun · Sheng Zhang · Lukas Gruber · Yuanhao Wang · Tetiana Parshakova -
2019 : Poster Session »
Ahana Ghosh · Javad Shafiee · Akhilan Boopathy · Alex Tamkin · Theodoros Vasiloudis · Vedant Nanda · Ali Baheri · Paul Fieguth · Andrew Bennett · Guanya Shi · Hao Liu · Arushi Jain · Jacob Tyo · Benjie Wang · Boxiao Chen · Carroll Wainwright · Chandramouli Shama Sastry · Chao Tang · Daniel S. Brown · David Inouye · David Venuto · Dhruv Ramani · Dimitrios Diochnos · Divyam Madaan · Dmitrii Krashenikov · Joel Oren · Doyup Lee · Eleanor Quint · elmira amirloo · Matteo Pirotta · Gavin Hartnett · Geoffroy Dubourg-Felonneau · Gokul Swamy · Pin-Yu Chen · Ilija Bogunovic · Jason Carter · Javier Garcia-Barcos · Jeet Mohapatra · Jesse Zhang · Jian Qian · John Martin · Oliver Richter · Federico Zaiter · Tsui-Wei Weng · Karthik Abinav Sankararaman · Kyriakos Polymenakos · Lan Hoang · mahdieh abbasi · Marco Gallieri · Mathieu Seurin · Matteo Papini · Matteo Turchetta · Matthew Sotoudeh · Mehrdad Hosseinzadeh · Nathan Fulton · Masatoshi Uehara · Niranjani Prasad · Oana-Maria Camburu · Patrik Kolaric · Philipp Renz · Prateek Jaiswal · Reazul Hasan Russel · Riashat Islam · Rishabh Agarwal · Alexander Aldrick · Sachin Vernekar · Sahin Lale · Sai Kiran Narayanaswami · Samuel Daulton · Sanjam Garg · Sebastian East · Shun Zhang · Soheil Dsidbari · Justin Goodwin · Victoria Krakovna · Wenhao Luo · Wesley Chung · Yuanyuan Shi · Yuh-Shyang Wang · Hongwei Jin · Ziping Xu -
2019 Poster: Curriculum-guided Hindsight Experience Replay »
Meng Fang · Tianyi Zhou · Yali Du · Lei Han · Zhengyou Zhang -
2019 Poster: Policy Continuation with Hindsight Inverse Dynamics »
Hao Sun · Zhizhong Li · Xiaotong Liu · Bolei Zhou · Dahua Lin -
2019 Spotlight: Policy Continuation with Hindsight Inverse Dynamics »
Hao Sun · Zhizhong Li · Xiaotong Liu · Bolei Zhou · Dahua Lin -
2019 Poster: LIIR: Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning »
Yali Du · Lei Han · Meng Fang · Ji Liu · Tianhong Dai · Dacheng Tao