Poster in Workshop: Second Workshop on Aligning Reinforcement Learning Experimentalists and Theorists (ARLET)

Open Problem: Order Optimal Regret Bounds for Non-Markovian Rewards

Aya Shabbar

Abstract

The standard world model in RL is the Markov Decision Process (MDP), which assumes Markovian transitions and rewards; in particular, the reward depends only on the last state and action. Yet many real-world rewards are non-Markovian. Some problem settings involve "double-state" or non-Markovian reward functions, where the reward depends on the trajectory. Past work has considered modeling and solving MDPs with non-Markovian rewards (NMR), but we know of no principled approaches for RL with NMR, a gap that exacerbates the misalignment between theoretical researchers and practitioners. The NMR setting is particularly interesting because it naturally extends the MDP structure. We therefore consider the problem of learning a policy from experience under such rewards. An open problem is to develop algorithms that solve such problems efficiently and come with provable regret bounds, even when the transition model is known. We highlight this open problem and discuss related challenges.
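To make the distinction concrete, here is a minimal sketch contrasting a Markovian reward with a trajectory-dependent one. Everything in it is an illustrative assumption rather than part of the poster: the state names (`start`, `key`, `hall`, `goal`), the action names, and the "reach the goal only after visiting the key" rule are hypothetical.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (state, action) pair; all names are hypothetical


def markovian_reward(state: str, action: str) -> float:
    """Reward that depends only on the last state and action."""
    return 1.0 if (state == "goal" and action == "stay") else 0.0


def non_markovian_reward(trajectory: List[Step]) -> float:
    """Reward that depends on the whole trajectory: being at 'goal'
    pays only if 'key' was visited at some earlier step."""
    if not trajectory:
        return 0.0
    last_state, last_action = trajectory[-1]
    visited_key = any(state == "key" for state, _ in trajectory[:-1])
    at_goal = last_state == "goal" and last_action == "stay"
    return 1.0 if (at_goal and visited_key) else 0.0


# Two trajectories that end in the same (state, action) pair.
with_key = [("start", "right"), ("key", "right"), ("goal", "stay")]
without_key = [("start", "right"), ("hall", "right"), ("goal", "stay")]

# The Markovian reward cannot tell the two trajectories apart ...
markov_same = markovian_reward(*with_key[-1]) == markovian_reward(*without_key[-1])
# ... while the trajectory-dependent reward can.
nmr_differs = non_markovian_reward(with_key) != non_markovian_reward(without_key)
```

Both trajectories end in the same state-action pair, so any reward of the form r(s, a) assigns them the same final reward, while the trajectory-dependent reward separates them. This is only a toy illustration of the setting and says nothing about how a learning algorithm or its regret analysis should handle it.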
