NeurIPS Poster Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives

Poster

Wang Chi Cheung

East Exhibition Hall B, C #204

Keywords: [ Bandit Algorithms ] [ Algorithms ] [ Algorithms -> Online Learning; Optimization ] [ Stochastic Optimization; Reinforcement Learning and Planning; Reinforcement Lear ]

Abstract:

We consider an agent in an online Markov decision process who receives a vector of outcomes in every round. The agent aims to simultaneously optimize multiple objectives associated with the multi-dimensional outcomes. Because of state transitions, it is challenging to balance the vectorial outcomes so as to achieve near-optimality. In particular, in contrast to the single-objective case, stationary policies are generally sub-optimal. We propose a no-regret algorithm that combines the Frank-Wolfe algorithm (Frank and Wolfe 1956), UCRL2 (Jaksch et al. 2010), and a crucial and novel gradient threshold procedure. The procedure carefully delays gradient updates and returns a non-stationary policy that diversifies the outcomes in order to optimize the objectives.
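To make the idea of delayed gradient updates concrete, here is a minimal sketch on a stateless toy problem. It is not the algorithm from the paper: there are no state transitions, no UCRL2 confidence bounds, and the refresh rule is a simplified stand-in for the gradient threshold procedure. The function name `frank_wolfe_delayed`, the quadratic objective g(v) = -||v - target||^2 / 2, and the parameters `target` and `threshold` are all assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def frank_wolfe_delayed(outcomes, target, rounds, threshold):
    """Toy Frank-Wolfe loop with a delayed (thresholded) gradient.

    Hypothetical setup: each action yields a fixed outcome vector, and the
    goal is to maximize g(avg) = -||avg - target||^2 / 2, whose gradient
    at avg is (target - avg).
    """
    dim = len(target)
    avg = [0.0] * dim                      # running average of outcomes
    ref_grad = [target[i] - avg[i] for i in range(dim)]
    choices = []
    for t in range(1, rounds + 1):
        grad = [target[i] - avg[i] for i in range(dim)]
        # Assumed refresh rule: adopt the new gradient only once it has
        # drifted from the reference gradient by more than the threshold.
        if math.dist(grad, ref_grad) > threshold:
            ref_grad = grad
        # Frank-Wolfe step: play the action best aligned with ref_grad.
        action = max(range(len(outcomes)),
                     key=lambda a: dot(outcomes[a], ref_grad))
        choices.append(action)
        avg = [avg[i] + (outcomes[action][i] - avg[i]) / t
               for i in range(dim)]
    return avg, choices

avg, choices = frank_wolfe_delayed(
    outcomes=[[1.0, 0.0], [0.0, 1.0]],
    target=[0.5, 0.5],
    rounds=1000,
    threshold=0.05,
)
```

In this toy, always playing a single action leaves the average outcome at (1, 0) or (0, 1), far from the target, so the loop has to alternate between actions. This is only a loose, stateless analogue of the abstract's point that a non-stationary policy is needed to diversify the outcomes.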
