Skip to yearly menu bar Skip to main content

Workshop: 3rd Offline Reinforcement Learning Workshop: Offline RL as a "Launchpad"

Offline evaluation in RL: soft stability weighting to combine fitted Q-learning and model-based methods

Briton Park · Xian Wu · Bin Yu · Angela Zhou


The goal of offline policy evaluation (OPE) is to evaluate target policies based on logged data under a different distribution. Because no one method is uniformly best, model selection is important, but difficult without online exploration. We propose soft stability weighting (SSW) for adaptively combining offline estimates from ensembles of fitted-Q-evaluation (FQE) and model-based evaluation methods generated by different random initializations of neural networks. Soft stability weighting computes a state-action-conditional weighted average of the median FQE and model-based prediction by normalizing the state-action-conditional standard deviation of ensembles of both methods relative to the average standard deviation of each method. Therefore it compares the relative stability of predictions in the ensemble to the perturbations from random initializations, drawn from a truncated normal distribution scaled by the input feature size.

Chat is not available.