Timezone: »
Poster
Large Scale Markov Decision Processes with Changing Rewards
Adrian Rivera Cardoso · He Wang · Huan Xu
Wed Dec 11 05:00 PM -- 07:00 PM (PST) @ East Exhibition Hall B + C #52
We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of $O( \sqrt{\tau (\ln|S|+\ln|A|)T}\ln(T))$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d\ll|S|$, we propose a modified algorithm with a computational complexity polynomial in $d$ and independent of $|S|$. We also prove a regret bound for this modified algorithm, which to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$ regret bound in the large-scale MDP setting with adversarially changing rewards.
Author Information
Adrian Rivera Cardoso (Georgia Tech)
He Wang (Georgia Institute of Technology)
Huan Xu (Georgia Inst. of Technology)
More from the Same Authors
-
2019 Poster: Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning »
Chao Qu · Shie Mannor · Huan Xu · Yuan Qi · Le Song · Junwu Xiong -
2018 Poster: Robust Hypothesis Testing Using Wasserstein Uncertainty Sets »
Rui Gao · Liyan Xie · Yao Xie · Huan Xu -
2018 Spotlight: Robust Hypothesis Testing Using Wasserstein Uncertainty Sets »
Rui Gao · Liyan Xie · Yao Xie · Huan Xu -
2017 Poster: Reinforcement Learning under Model Mismatch »
Aurko Roy · Huan Xu · Sebastian Pokutta