Plenary Talk
in
Workshop: The Optimization Foundations of Reinforcement Learning

Learning in structured MDPs with convex cost function: improved regret bounds for inventory management

Shipra Agrawal

2019 Plenary Talk
in
Workshop: The Optimization Foundations of Reinforcement Learning

Abstract

We present a learning algorithm for the stochastic inventory control problem under lost sales penalty and positive lead times, when the demand distribution is a priori unknown. Our main result is a regret bound of O(L\sqrt{T}+D) for the algorithm, where T is the time horizon, L is the fixed and known lead time, and D is an unknown parameter of the demand distribution described roughly as the number of time steps needed to generate enough demand for depleting one unit of inventory. Our results significantly improve the existing regret bounds for this problem. Notably, even though the state space of the underlying Markov Decision Process (MDP) in this problem is continuous and L-dimensional, our regret bounds depend linearly on L. Our techniques utilize convexity of the long run average cost and a newly derived bound on the `bias' of base-stock policies, to establish an almost blackbox connection between the problem of learning and optimization in such MDPs and stochastic convex bandit optimization. The techniques presented here may be of independent interest for other settings that involve learning large structured MDPs but with convex cost functions.

Chat is not available.