Poster
Tempo Adaptation in Non-stationary Reinforcement Learning
Hyunin Lee · Yuhao Ding · Jongmin Lee · Ming Jin · Javad Lavaei · Somayeh Sojoudi
Great Hall & Hall B1+B2 (level 1) #725
Abstract:
We first raise and tackle a "time synchronization" issue between the agent and the environment in non-stationary reinforcement learning (RL), a crucial factor hindering its real-world applications. In reality, environmental changes occur over wall-clock time (t) rather than episode progress (k), where wall-clock time signifies the actual elapsed time within the fixed duration t ∈ [0, T]. In existing works, at episode k, the agent rolls out a trajectory and trains a policy before transitioning to episode k+1. In the context of the time-desynchronized environment, however, the agent at time t_k allocates Δt for trajectory generation and training, and subsequently moves to the next episode at t_{k+1} = t_k + Δt. Despite a fixed total number of episodes (K), the agent accumulates different trajectories influenced by the choice of interaction times (t_1, t_2, ..., t_K), significantly impacting the suboptimality gap of the policy. We propose a Proactively Synchronizing Tempo (ProST) framework that computes a suboptimal sequence {t_1, t_2, ..., t_K} (= {t}_{1:K}) by minimizing an upper bound on its performance measure, i.e., the dynamic regret. Our main contribution is to show that a suboptimal {t}_{1:K} trades off between the policy training time (agent tempo) and how fast the environment changes (environment tempo). Theoretically, this work develops a suboptimal {t}_{1:K} as a function of the degree of the environment's non-stationarity while also achieving a sublinear dynamic regret. Our experimental evaluation on various high-dimensional non-stationary environments shows that the ProST framework achieves a higher online return at suboptimal {t}_{1:K} than the existing methods.
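To make the time-desynchronized interaction loop concrete, the sketch below shows the episode schedule described in the abstract: at wall-clock time t_k the agent spends Δt_k on rollout and training, then advances to t_{k+1} = t_k + Δt_k. This is a minimal illustrative sketch, not the authors' ProST implementation; the helpers choose_interaction_time, rollout, and train are hypothetical placeholders.

```python
def run_desynchronized_episodes(env, policy, K, T,
                                choose_interaction_time, rollout, train):
    """Run up to K episodes within a wall-clock budget T.

    Each episode k consumes Delta_t_k of wall-clock time for trajectory
    generation plus training, while the environment keeps changing in t.
    All helper callables are hypothetical placeholders for illustration.
    """
    t = 0.0  # current wall-clock time t_k
    for k in range(K):
        # The scheduler picks Delta_t_k, trading off agent tempo
        # (longer training) against environment tempo (drift while training).
        delta_t = choose_interaction_time(k, t, T)

        # The trajectory reflects the environment's state at wall-clock time t.
        trajectory = rollout(env, policy, wall_clock_time=t)

        # Training is limited by the allocated wall-clock budget Delta_t_k.
        policy = train(policy, trajectory, time_budget=delta_t)

        # Advance the clock: t_{k+1} = t_k + Delta_t_k.
        t += delta_t
        if t >= T:
            break
    return policy
```

Under this view, a fixed episode count K can correspond to very different sequences of environment states depending on how {Δt_k} is chosen, which is the trade-off the ProST framework optimizes via an upper bound on the dynamic regret.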