Timezone: »

Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model
Aaron Sidford · Mengdi Wang · Xian Wu · Lin Yang · Yinyu Ye

Wed Dec 05 07:45 AM -- 09:45 AM (PST) @ Room 517 AB #168
In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that given any state-action pair samples from the transition function in $O(1)$ time. Given such a DMDP with states $\states$, actions $\actions$, discount factor $\gamma\in(0,1)$, and rewards in range $[0, 1]$ we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1 - \delta$ where {\it both} the run time spent and number of sample taken is upper bounded by \[ O\left[\frac{|\cS||\cA|}{(1-\gamma)^3 \epsilon^2} \log \left(\frac{|\cS||\cA|}{(1-\gamma)\delta \epsilon} \right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right] ~. \] For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $(1 - \gamma)^{-1}$ and matches the sample complexity lower bounds proved in \cite{azar2013minimax} up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDP with a generative model and provide a nearly matching sample complexity lower bound.

Author Information

Aaron Sidford (Stanford)
Mengdi Wang (Princeton University)

Mengdi Wang is interested in data-driven stochastic optimization and applications in machine and reinforcement learning. She received her PhD in Electrical Engineering and Computer Science from Massachusetts Institute of Technology in 2013. At MIT, Mengdi was affiliated with the Laboratory for Information and Decision Systems and was advised by Dimitri P. Bertsekas. Mengdi became an assistant professor at Princeton in 2014. She received the Young Researcher Prize in Continuous Optimization of the Mathematical Optimization Society in 2016 (awarded once every three years).

Xian Wu (Stanford University)
Lin Yang (Princeton University)
Yinyu Ye (Standord)

More from the Same Authors