Poster

Steady State Analysis of Episodic Reinforcement Learning

Huang Bojun

2020 Poster


Abstract

Reinforcement Learning (RL) tasks generally divide into two kinds: continual learning and episodic learning. The concept of steady state has played a foundational role in the continual setting, where a unique steady-state distribution is typically presumed to exist in the task being studied; this presumption enables a principled conceptual framework as well as efficient data collection methods for continual RL algorithms. On the other hand, the concept of steady state has been widely considered irrelevant for episodic RL tasks, in which the decision process terminates in finite time. Alternative concepts, such as episode-wise visitation frequency, are used in episodic RL algorithms instead; these are not only inconsistent with their counterparts in continual RL but also make it harder to design and analyze RL algorithms in the episodic setting.

This paper proves that the episodic learning environment of every finite-horizon decision task has a unique steady state under any behavior policy, and that the marginal distribution of the agent's input indeed converges to the steady-state distribution in essentially all episodic learning processes. This observation supports an interestingly reversed mindset against conventional wisdom: while the existence of unique steady states was often presumed in continual learning but considered less relevant in episodic learning, it turns out that their existence is guaranteed for the latter. Based on this insight, the paper unifies episodic and continual RL around several important concepts that have been treated separately in these two RL formalisms. Practically, the existence of a unique and approachable steady state enables a general way to collect data in episodic RL tasks, which the paper applies to policy gradient algorithms as a demonstration, based on a new steady-state policy gradient theorem. Finally, the paper also proposes and experimentally validates a perturbation method that facilitates rapid steady-state convergence in real-world RL tasks.
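As a rough illustration of the steady-state claim (not taken from the paper), the sketch below builds a hypothetical three-state episodic task under a fixed policy and treats the end of each episode as a reset to the start state, so that consecutive episodes form one long Markov chain. The transition matrix `P`, the helper names `step` and `steady_state`, and the reset modeling are all assumptions made for this example. Repeatedly applying `P` to two different starting distributions should lead to the same limiting distribution.

```python
# Hypothetical 3-state episodic task under a fixed policy (illustrative only).
# Row i holds the next-state probabilities from state i.
# State 2 is the final step of an episode, so its row "resets" to start state 0.
P = [
    [0.0, 0.5, 0.5],  # from state 0: go to state 1 or state 2 with equal probability
    [0.0, 0.0, 1.0],  # from state 1: always go to state 2
    [1.0, 0.0, 0.0],  # from state 2: episode ends, next episode starts at state 0
]


def step(dist, P):
    """Push a distribution over states one time step forward through P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def steady_state(P, start, iters=500):
    """Approximate the limiting state distribution by repeated application of P."""
    dist = list(start)
    for _ in range(iters):
        dist = step(dist, P)
    return dist


# Two different initial distributions end up at the same limit.
pi_a = steady_state(P, [1.0, 0.0, 0.0])
pi_b = steady_state(P, [0.0, 1.0, 0.0])
print([round(p, 4) for p in pi_a])  # approximately [0.4, 0.2, 0.4]
print([round(p, 4) for p in pi_b])  # approximately [0.4, 0.2, 0.4]
```

In this toy chain, episodes last either two or three steps, which makes the chain aperiodic and lets plain iteration converge. A task in which every episode has exactly the same length would be periodic under this modeling, and the marginal distribution would oscillate rather than settle; that is the kind of situation in which a perturbation of the sort mentioned in the abstract would plausibly help.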
