
Model-based Distributional Reinforcement Learning for Risk-sensitive Control
Hao Liang · Zhiquan Luo

Tue Dec 14 09:00 AM -- 10:00 AM (PST) @
We consider finite episodic Markov decision processes with the entropic risk measure of the return as the objective for risk-sensitive control. We identify several properties of the entropic risk measure that establish distributional dynamic programming. We propose a novel model-based distributional reinforcement learning (DRL) algorithm, \textbf{R}isk-sensitive \textbf{O}ptimistic \textbf{D}istribution \textbf{I}teration (RODI), which implements optimism through three different subroutines. We prove that all of them attain an $\tilde{O}(\frac{\exp(|\beta| H)-1}{|\beta|}\exp(|\beta| H^2)H\sqrt{S^2AK})$ regret upper bound, where $S$ is the number of states, $A$ the number of actions, $H$ the time horizon, and $K$ the number of episodes. This matches the bound of RSVI in previous work, and our regret analysis is conceptually simple and extends easily to general risk measures satisfying several key properties. To the best of our knowledge, this is the first regret analysis of DRL, which theoretically verifies the efficacy of DRL for risk-sensitive control. We also find that the proof of the lower bound in existing work contains mistakes, and that the corrected proof only implies an $\Omega(\frac{\exp(|\beta| H/2)-1}{|\beta|}\sqrt{K})$ regret, which is independent of $S$ and $A$ and loose in its polynomial dependence on $H$. We improve this result by proving a tighter lower bound of $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ for the case $\beta>0$.
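For context, the entropic risk measure of a random return $X$ with risk parameter $\beta \neq 0$ is standardly defined as $\frac{1}{\beta}\log\mathbb{E}[e^{\beta X}]$, which recovers the mean as $\beta \to 0$ and penalizes (rewards) variance for $\beta < 0$ ($\beta > 0$). A minimal sketch of its sample-based estimate, with names of our own choosing (not from the paper's code):

```python
import math

def entropic_risk(returns, beta):
    """Sample estimate of (1/beta) * log E[exp(beta * X)].

    beta > 0 is risk-seeking, beta < 0 is risk-averse;
    as beta -> 0 the measure approaches the plain mean.
    """
    n = len(returns)
    # log-mean-exp trick for numerical stability: shift by the max exponent
    m = max(beta * x for x in returns)
    log_mean = m + math.log(sum(math.exp(beta * x - m) for x in returns) / n)
    return log_mean / beta

samples = [1.0, 2.0, 3.0]
print(entropic_risk(samples, 1e-8))  # ~ mean (2.0) for tiny beta
print(entropic_risk(samples, 1.0))   # risk-seeking: exceeds the mean
print(entropic_risk(samples, -1.0))  # risk-averse: below the mean
```

The monotone behavior in `beta` is what makes optimistic (upper-bound) estimates of return distributions, as in RODI, translate into optimism in the risk objective.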