Contextual Value Iteration and Deep Approximation for Bayesian Contextual Bandits
Kevin Duijndam · Ger Koole · Rob van der Mei
Abstract
We present a Bayesian value-iteration framework for contextual multi-armed bandit problems that treats the agent's posterior distribution over the payoff as the state of a Markov Decision Process. We place finite-dimensional priors on the unknown reward parameters and on the exogenous context transition kernel. Value iteration on the resulting belief-MDP yields an optimal policy. We illustrate the approach in an airline seat-pricing simulation. To address the curse of dimensionality, we approximate the value function with a dual-stream deep neural network and benchmark our deep value-iteration algorithm on a standard contextual bandit instance.
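To make the belief-MDP idea concrete, here is a minimal sketch for a two-armed Bernoulli bandit in which one arm has a known mean and the other is learned. The Beta posterior parameters (a, b) serve as the MDP state, and finite-horizon value iteration is run by memoized recursion over belief transitions. The horizon `T` and known mean `MU0` are illustrative assumptions, not values from the paper, and the paper's contextual and deep-approximation components are omitted.

```python
from functools import lru_cache

T = 20     # planning horizon (illustrative)
MU0 = 0.5  # known mean reward of the safe arm (illustrative)

@lru_cache(maxsize=None)
def V(a, b, t):
    """Optimal value of belief state Beta(a, b) with t rounds remaining."""
    if t == 0:
        return 0.0
    p = a / (a + b)  # posterior mean of the unknown arm
    # Pulling the known arm leaves the belief unchanged.
    pull_known = MU0 + V(a, b, t - 1)
    # Pulling the unknown arm updates the Beta posterior on each outcome.
    pull_unknown = (p * (1.0 + V(a + 1, b, t - 1))
                    + (1.0 - p) * V(a, b + 1, t - 1))
    return max(pull_known, pull_unknown)

# Starting from a uniform Beta(1, 1) prior on the unknown arm:
print(round(V(1, 1, T), 3))
```

Because exploring the unknown arm can always be abandoned in favor of the known arm, `V(1, 1, T)` is at least `T * MU0`; the gap between the two quantifies the value of learning encoded in the belief-MDP.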