Affinity Workshop: WiML Workshop 1

Decaying Clipping Range in Proximal Policy Optimization

Mónika Farsang


Proximal Policy Optimization (PPO) [1] is among the most widely used algorithms in reinforcement learning and achieves state-of-the-art performance on many challenging problems. Its success rests on two elements: reliable policy updates enforced by the clipping mechanism, and multiple epochs of minibatch updates. The aim of this research is to offer simple but effective alternatives to the constant clipping range. We propose reducing the clipping range over the course of training along a linear, exponential, or Z-shaped curve, and we also propose a moving average approach. The intent is to allow more exploration at the beginning of learning and to impose stronger restrictions at the end. We test and compare the proposed variants against the original algorithm on simpler classical control tasks in the OpenAI Gym environments, on slightly more complex continuous control tasks in the Box2D simulator, and on robotic locomotion control problems in the PyBullet environments.
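
The three decaying curves can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption: the function names, the `progress` argument (fraction of training completed, from 0 to 1), the end value `eps_end=0.02`, the geometric form of the exponential curve, and the spline form of the Z-shaped curve (borrowed from the Z-shaped membership function used in fuzzy logic). Only the start value 0.2 follows the default clipping range of the original PPO paper. The moving average approach is not sketched because the abstract does not describe it.

```python
import math


def _clamp01(progress):
    """Restrict training progress to the interval [0, 1]."""
    return min(max(progress, 0.0), 1.0)


def linear_decay(progress, eps_start=0.2, eps_end=0.02):
    """Straight-line interpolation from eps_start down to eps_end."""
    p = _clamp01(progress)
    return eps_start + (eps_end - eps_start) * p


def exponential_decay(progress, eps_start=0.2, eps_end=0.02):
    """Geometric interpolation: fast reduction early, slow reduction late.

    Assumes eps_start > 0 and eps_end > 0.
    """
    p = _clamp01(progress)
    return eps_start * (eps_end / eps_start) ** p


def z_shaped_decay(progress, eps_start=0.2, eps_end=0.02):
    """Z-shaped curve: flat at the start, steep in the middle, flat at the end."""
    p = _clamp01(progress)
    if p <= 0.5:
        z = 1.0 - 2.0 * p * p          # upper half of the Z, falls from 1 to 0.5
    else:
        z = 2.0 * (1.0 - p) ** 2       # lower half of the Z, falls from 0.5 to 0
    return eps_end + (eps_start - eps_end) * z


if __name__ == "__main__":
    # Print the clipping range at a few points of training for each curve.
    for step in range(0, 11, 2):
        p = step / 10
        print(
            f"progress={p:.1f}  "
            f"linear={linear_decay(p):.3f}  "
            f"exponential={exponential_decay(p):.3f}  "
            f"z_shaped={z_shaped_decay(p):.3f}"
        )
```

In a training loop, one would compute `progress` as the current update index divided by the total number of updates, and pass the returned value as the clipping range when clipping the probability ratio to the interval [1 - eps, 1 + eps] in the PPO surrogate objective.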

Our analysis shows that the examined PPO algorithm can be applied successfully in all nine environments studied, which demonstrates its power and gives insight into why it is so popular. Furthermore, our proposed clipping range strategies, which are designed to refine this state-of-the-art method, achieve better results than the original constant approach in several cases, especially the exponential and Z-shaped declining strategies. However, the OpenAI Gym Box2D environments show that these approaches are not always successful, which is not surprising since there is usually no general solution for all situations. Nevertheless, they are promising alternatives to the constant clipping method.
