Timezone: »

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function
Aviv Rosenberg · Yishay Mansour

Thu Dec 12 05:00 PM -- 07:00 PM (PST) @ East Exhibition Hall B + C #23
We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability $\beta > 0$ under any policy, we give a regret bound of $\tilde{O} ( L|X|\sqrt{|A|T} / \beta )$, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. When this assumption is removed we give a regret bound of $\tilde{O} ( L^{3/2} |X| |A|^{1/4} T^{3/4})$, that holds for an arbitrary transition function. To our knowledge these are the first algorithms that in our setting handle both bandit feedback and an unknown transition function.

Author Information

Aviv Rosenberg (Tel Aviv University)
Aviv Rosenberg

I am an Applied Scientist at Amazon Alexa Shopping, Tel Aviv. Previously, I obtained my PhD from the department of computer science at Tel Aviv University, where I was fortunate to have Prof. Yishay Mansour as my advisor. Prior to that, I received my Bachelor's degree in Mathematics and Computer Science from Tel Aviv University. My primary research interest lies in theoretical and applied machine learning. More specifically, my PhD focused on data-driven sequential decision making such as reinforcement learning, online learning and multi-armed bandit.

Yishay Mansour (Tel Aviv University / Google)

More from the Same Authors