Poster
Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Bruno Scherrer
Harrah's Special Events Center, 2nd Floor
[
Abstract
]
Abstract:
Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted optimal policy. We consider two variations of PI: Howard's PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most O(nm1−γlog(11−γ)) iterations, improving by a factor O(logn) a result by Hansen et al. (2013), while Simplex-PI terminates after at most O(n2m1−γlog(11−γ)) iterations, improving by a factor O(logn) a result by Ye (2011). Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor~γ: given a measure of the maximal transient time τt and the maximal time τr to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most ˜O(n3m2τtτr) iterations. This generalizes a recent result for deterministic MDPs by Post & Ye (2012), in which τt≤n and τr≤n. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets, respectively states that are transient and recurrent for all policies, we show that Simplex-PI and Howard's PI terminate after at most ˜O(nm(τt+τr)) iterations.
Live content is unavailable. Log in and register to view live content