

Poster

Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives

Wang Chi Cheung

East Exhibition Hall B, C #204

Keywords: [ Bandit Algorithms ] [ Algorithms ] [ Algorithms -> Online Learning; Optimization ] [ Stochastic Optimization; Reinforcement Learning and Planning; Reinforcement Lear ]


Abstract:

We consider an agent interacting with an online Markov decision process who receives a vector of outcomes every round. The agent aims to simultaneously optimize multiple objectives associated with the multi-dimensional outcomes. Due to state transitions, it is challenging to balance the vectorial outcomes for achieving near-optimality. In particular, in contrast to the single-objective case, stationary policies are generally sub-optimal. We propose a no-regret algorithm that combines the Frank-Wolfe algorithm (Frank and Wolfe 1956), UCRL2 (Jaksch et al. 2010), and a crucial and novel gradient threshold procedure. The procedure carefully delays gradient updates and returns a non-stationary policy that diversifies the outcomes in order to optimize the objectives.
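To illustrate why diversifying outcomes can help, here is a minimal sketch of a Frank-Wolfe style action rule in a deliberately simplified setting. It assumes there are no states or transitions, that outcomes are deterministic and known, and that the objective is a concave function of the average outcome vector. It does not include UCRL2 or the gradient threshold procedure, so it is not the algorithm proposed in the paper. The names `frank_wolfe_actions`, `outcomes`, `grad`, and `target` are hypothetical and chosen only for this example.

```python
import numpy as np

def frank_wolfe_actions(outcomes, grad, rounds):
    """Pick actions greedily against the gradient of a concave objective.

    outcomes: (K, d) array; row k is the (assumed known, deterministic)
              outcome vector of action k.
    grad:     function mapping the current average outcome vector to the
              gradient of the objective at that point.
    Returns the final average outcome vector and the list of chosen actions.
    """
    num_actions, dim = outcomes.shape
    avg = np.zeros(dim)
    choices = []
    for t in range(1, rounds + 1):
        # Linearize the objective at the current average and pick the
        # action whose outcome has the largest inner product with the gradient.
        scores = outcomes @ grad(avg)
        k = int(np.argmax(scores))
        # Running-average update, i.e. a Frank-Wolfe step with step size 1/t.
        avg += (outcomes[k] - avg) / t
        choices.append(k)
    return avg, choices

# Hypothetical example: two actions, each good for only one coordinate.
outcomes = np.array([[1.0, 0.0], [0.0, 1.0]])
target = np.array([0.5, 0.5])

def objective(v):
    # Concave objective: negative squared distance to the target vector.
    return -0.5 * float(np.sum((v - target) ** 2))

def grad(v):
    return target - v

avg, choices = frank_wolfe_actions(outcomes, grad, rounds=1000)
print("average outcome:", avg)
print("objective of mixed play:", objective(avg))
print("objective of each fixed action:", [objective(o) for o in outcomes])
```

In this toy instance, playing either action alone keeps the average outcome at a corner, while the gradient-driven rule alternates between the two actions and pushes the average toward the target. The setting in the abstract is harder because state transitions constrain which outcomes are reachable in each round.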
