Timezone: »

Variance Reduction in Off-Policy Deep Reinforcement Learning using Spectral Normalization
Payal Bawa · Rafael Oliveira · Fabio Ramos
Event URL: https://openreview.net/forum?id=v9Wq-mycD4r »

Off-policy deep reinforcement learning algorithms like Soft Actor Critic (SAC) have achieved state-of-the-art results in several high dimensional continuous control tasks. Despite their success, they are prone to instability due to the \textit{deadly triad} of off-policy training, function approximation, and bootstrapping. Unstable training of off-policy algorithms leads to sample inefficient and sub-optimal asymptotic performance, thus preventing their real-world deployment. To mitigate these issues, previously proposed solutions have focused on advances like target networks to alleviate instability and the introduction of twin critics to address overestimation bias. However, these modifications fail to address the issue of noisy gradient estimation with excessive variance, resulting in instability and slow convergence. Our proposed method, Spectral Normalized Actor Critic (SNAC), regularizes the actor and the critics using spectral normalization to systematically bound the gradient norm. Spectral normalization constrains the magnitudes of the gradients resulting in smoother actor-critics with robust and sample-efficient performance thus making them suitable for deployment in stability-critical and compute-constrained applications. We present empirical results on several challenging reinforcement learning benchmarks and extensive ablation studies to demonstrate the effectiveness of our proposed method.

Author Information

Payal Bawa (University of Sydney)
Rafael Oliveira (The University of Sydney)
Fabio Ramos (University of Sydney, NVIDIA)

More from the Same Authors