
CO-PILOT: COllaborative Planning and reInforcement Learning On sub-Task curriculum
Shuang Ao · Tianyi Zhou · Guodong Long · Qinghua Lu · Liming Zhu · Jing Jiang

Wed Dec 08 04:30 PM -- 06:00 PM (PST) @ Virtual

Goal-conditioned reinforcement learning (RL) usually suffers from sparse rewards and inefficient exploration in long-horizon tasks. Planning can find the shortest path to a distant goal and thereby provide dense reward and guidance, but it is inaccurate without a precise environment model. We show that RL and planning can learn collaboratively from each other to overcome their respective drawbacks. In CO-PILOT, a learnable path-planner and an RL agent produce dense feedback to train each other on a curriculum of tree-structured sub-tasks. First, the planner recursively decomposes a long-horizon task into a tree of sub-tasks in a top-down manner; the layers of the tree form coarse-to-fine sub-task sequences that serve as plans for completing the original task. The planning policy is trained to minimize the RL agent's cost of completing the sequence in each layer, proceeding from the top layer to the bottom layer. This gradually increases the number of sub-tasks and thus forms an easy-to-hard curriculum for the planner. Next, a bottom-up traversal of the tree trains the RL agent, moving from easier sub-tasks with denser rewards on the bottom layers to harder ones on the top layers, and collects the agent's cost on each sub-task to train the planner in the next episode. CO-PILOT repeats this mutual training for multiple episodes before switching to a new task, so the RL agent and planner are fully optimized to facilitate each other's training. We compare CO-PILOT with RL (SAC, HER, PPO), planning (RRT*, NEXT, SGT), and their combination (SoRB) on navigation and continuous control tasks. CO-PILOT significantly improves the success rate and sample efficiency.
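The tree-structured curriculum described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the binary split, the fixed depth, the 2D points, and the names `midpoint_planner`, `build_layers`, and `bottom_up_order` are all assumptions made for illustration. In particular, a geometric midpoint stands in for the learnable planner, and no RL training or cost collection is performed. The tree is also built layer by layer rather than by explicit recursion.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
SubTask = Tuple[Point, Point]  # (start, goal)


def midpoint_planner(start: Point, goal: Point) -> Point:
    """Hypothetical stand-in for the learnable planner: return the midpoint."""
    return ((start[0] + goal[0]) / 2, (start[1] + goal[1]) / 2)


def build_layers(
    start: Point,
    goal: Point,
    depth: int,
    planner: Callable[[Point, Point], Point] = midpoint_planner,
) -> List[List[SubTask]]:
    """Top-down decomposition: layer 0 is the original task, and each
    deeper layer splits every sub-task in two at a planner-chosen sub-goal
    (the binary split is an assumption of this sketch)."""
    layers: List[List[SubTask]] = [[(start, goal)]]
    for _ in range(depth):
        next_layer: List[SubTask] = []
        for s, g in layers[-1]:
            m = planner(s, g)
            next_layer.extend([(s, m), (m, g)])
        layers.append(next_layer)
    return layers


def bottom_up_order(layers: List[List[SubTask]]) -> List[SubTask]:
    """Bottom-up traversal: shortest sub-tasks first, original task last."""
    return [task for layer in reversed(layers) for task in layer]


if __name__ == "__main__":
    layers = build_layers((0.0, 0.0), (4.0, 0.0), depth=2)
    for i, layer in enumerate(layers):
        print(f"layer {i}: {len(layer)} sub-task(s)")
    print("agent training order:", bottom_up_order(layers))
```

Each layer of `layers` is one coarse-to-fine plan for the original task, and `bottom_up_order` gives the order in which an agent would meet the sub-tasks if it were trained from the bottom layer upward.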

Author Information

Shuang Ao (University of Technology Sydney)
Tianyi Zhou (University of Washington, Seattle)

Tianyi Zhou is a Ph.D. student in Computer Science at the University of Washington and a member of the MELODI lab led by Prof. Jeff A. Bilmes. He will be joining the University of Maryland, College Park in 2022 as a tenure-track assistant professor in the Department of Computer Science, affiliated with UMIACS. His research interests are in machine learning, optimization, and natural language processing. He has published ~60 papers at NeurIPS, ICML, ICLR, AISTATS, EMNLP, NAACL, COLING, KDD, ICDM, AAAI, IJCAI, ISIT, Machine Learning (Springer), IEEE TIP/TNNLS/TKDE, etc. He is the recipient of the Best Student Paper Award at ICDM 2013 and the 2020 IEEE TCSC Most Influential Paper Award.

Guodong Long (University of Technology Sydney (UTS))
Qinghua Lu (Data61, CSIRO)
Liming Zhu (CSIRO)
Jing Jiang (University of Technology Sydney)
