We study reinforcement learning under model misspecification, where we do not have access to the true environment but only to a reasonably close approximation to it. We address this problem by extending the framework of robust MDPs to the model-free reinforcement learning setting, where we do not have access to the model parameters but can only sample states from it. We define robust versions of Q-learning, Sarsa, and TD-learning and prove convergence to an approximately optimal robust policy and to an approximate robust value function, respectively. We scale the robust algorithms up to large MDPs via function approximation and prove convergence under two different settings: we prove convergence of robust approximate policy iteration and robust approximate value iteration for linear architectures (under mild assumptions). We also define a robust loss function, the mean squared robust projected Bellman error, and give stochastic gradient descent algorithms that are guaranteed to converge to a local minimum.
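To make the robust update concrete, the sketch below shows a single tabular robust Q-learning step in Python. It assumes a simple contamination-style uncertainty set of radius delta around the sampled transition, which is only one illustrative choice of uncertainty set (the paper's framework allows more general sets); the names robust_q_update, delta, alpha, and gamma are ours for illustration and are not taken from the paper.

```python
import numpy as np

def robust_q_update(Q, s, a, r, s_next, delta=0.1, gamma=0.95, alpha=0.1):
    """One tabular robust Q-learning step (illustrative sketch).

    Uncertainty set assumption: nature keeps the observed next state with
    weight (1 - delta) and may divert to the worst state with weight delta,
    i.e. a contamination-style ball around the nominal transition sampled
    from the approximate model.
    """
    v = Q.max(axis=1)                                # greedy value of every state
    worst_case = (1.0 - delta) * v[s_next] + delta * v.min()
    target = r + gamma * worst_case                  # robust Bellman backup
    Q[s, a] += alpha * (target - Q[s, a])            # standard stochastic-approximation step
    return Q
```

Setting delta = 0 recovers the ordinary Q-learning update; larger delta makes the backup more pessimistic and the resulting policy more conservative with respect to model error.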
Author Information
Aurko Roy (Google)
Huan Xu (Georgia Inst. of Technology)
Sebastian Pokutta (Georgia Institute of Technology)
More from the Same Authors
- 2019 Poster: Large Scale Markov Decision Processes with Changing Rewards
  Adrian Rivera Cardoso · He Wang · Huan Xu
- 2019 Poster: Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning
  Chao Qu · Shie Mannor · Huan Xu · Yuan Qi · Le Song · Junwu Xiong
- 2019 Poster: Blended Matching Pursuit
  Cyrille Combettes · Sebastian Pokutta
- 2018 Poster: Robust Hypothesis Testing Using Wasserstein Uncertainty Sets
  Rui Gao · Liyan Xie · Yao Xie · Huan Xu
- 2018 Spotlight: Robust Hypothesis Testing Using Wasserstein Uncertainty Sets
  Rui Gao · Liyan Xie · Yao Xie · Huan Xu
- 2016 Poster: Hierarchical Clustering via Spreading Metrics
  Aurko Roy · Sebastian Pokutta
- 2016 Oral: Hierarchical Clustering via Spreading Metrics
  Aurko Roy · Sebastian Pokutta