Behavior Policy Search for Risk Estimators in Reinforcement Learning
Elita Lobo · Marek Petrik · Dharmashankar Subramanian

In real-world sequential decision problems, exploration is expensive, and the risk of expert decision policies must be evaluated from limited data. In this setting, Monte Carlo (MC) risk estimators are typically used to estimate the risks associated with decision policies. While these estimators have the desired low-bias property, they often suffer from large variance. In this paper, we consider the problem of minimizing the asymptotic mean squared error, and hence the variance, of MC risk estimators. We show that by carefully choosing the data sampling policy (behavior policy), we can obtain low-variance estimates of the risk of any given decision policy.
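
To make the idea concrete, here is a minimal sketch of how the choice of behavior policy can change the variance of an importance-weighted MC risk estimate. It is not the authors' method: the one-step problem, the tail-probability risk measure, the fixed 50/50 behavior policy, and the names `estimate_tail_risk` and `compare` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical one-step problem: action 0 returns +1, action 1 returns -10.
RETURNS_BY_ACTION = np.array([1.0, -10.0])


def estimate_tail_risk(target_probs, behavior_probs, n_samples, rng, threshold=-5.0):
    """Importance-weighted MC estimate of P(return <= threshold) under the target policy.

    Actions are sampled from the behavior policy and reweighted by the
    likelihood ratio target/behavior, which keeps the estimate unbiased.
    """
    actions = rng.choice(2, size=n_samples, p=behavior_probs)
    returns = RETURNS_BY_ACTION[actions]
    weights = target_probs[actions] / behavior_probs[actions]
    return float(np.mean(weights * (returns <= threshold)))


def compare(n_trials=2000, n_samples=50, seed=0):
    """Compare on-policy sampling with an assumed behavior policy over many trials."""
    rng = np.random.default_rng(seed)
    target = np.array([0.95, 0.05])    # target policy rarely takes the risky action
    behavior = np.array([0.5, 0.5])    # assumed behavior policy oversamples it
    on_policy = [estimate_tail_risk(target, target, n_samples, rng) for _ in range(n_trials)]
    off_policy = [estimate_tail_risk(target, behavior, n_samples, rng) for _ in range(n_trials)]
    return {
        "on_mean": np.mean(on_policy), "on_var": np.var(on_policy),
        "off_mean": np.mean(off_policy), "off_var": np.var(off_policy),
    }


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.6f}")
```

In this toy setting both estimators target the same tail probability of 0.05, but sampling the rare risky action more often and reweighting gives a noticeably smaller variance across trials. The paper searches for such a behavior policy; this sketch only fixes one by hand.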

Author Information

Elita Lobo (University of Massachusetts Amherst)

I am a second-year MS-PhD student currently working with Professor Marek Petrik in the field of robust RL. Previously, I worked with Professor Rod Grupen on developing a hierarchical reinforcement learning framework for generating diverse skills, and with Professor Prashant Shenoy on predicting peak energy demand days for the energy grid. Prior to pursuing research at UMass, I worked as a software engineer at Flipkart, one of the top e-commerce companies in India. I am strongly interested in pursuing research in reinforcement learning and optimization.

Marek Petrik (University of New Hampshire)
Dharmashankar Subramanian (IBM Research)

More from the Same Authors