Skip to yearly menu bar Skip to main content

Workshop: I Can’t Believe It’s Not Better: Understanding Deep Learning Through Empirical Falsification

The Best Deep Ensembles Sacrifice Predictive Diversity

Taiga Abe · Estefany Kelly Buchanan · Geoff Pleiss · John Cunningham


Ensembling remains a hugely popular method for increasing the performance of a given class of models. In the case of deep learning, the benefits of ensembling are often attributed to the diverse predictions of the individual ensemble members. Here we investigate a tradeoff between diversity and individual model performance, and find that--surprisingly--encouraging diversity during training almost always yields worse ensembles. We show that this tradeoff arises from the Jensen gap between the single model and ensemble losses, and show that Jensen gap is a natural measure of diversity for both the mean squared error and cross entropy loss functions. Our results suggest that to reduce the ensemble error, we should move away from efforts to increase predictive diversity, and instead we should construct ensembles from less diverse (but more accurate) component models.

Chat is not available.