Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning
Thomas Liao · Rohan Taori · Deborah Raji · Ludwig Schmidt

Many subfields of machine learning share a common stumbling block: evaluation. Advances in machine learning often evaporate under closer scrutiny or turn out to be less widely applicable than originally hoped. We conduct a meta-review of 107 survey papers from natural language processing, recommender systems, computer vision, reinforcement learning, computational biology, graph learning, and more, organizing the wide range of surprisingly consistent critique into a concrete taxonomy of observed failure modes. Inspired by measurement and evaluation theory, we divide failure modes into two categories: internal and external validity. Internal validity issues pertain to evaluation on a learning problem in isolation, such as improper comparisons to baselines or overfitting from test set re-use. External validity issues concern relationships between different learning problems, for instance, whether progress on a learning problem translates to progress on seemingly related tasks.

Author Information

Thomas Liao (Scale AI)
Rohan Taori (Stanford University)
Deborah Raji (UC Berkeley)
Ludwig Schmidt (University of Washington)