Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and generalization of individual analysis procedures. Yet the practice of data analysis is an inherently interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused. An investigation of this gap has recently been initiated by the authors in (Dwork et al., 2014), where we focused on the problem of estimating expectations of adaptively chosen functions.
In this paper, we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced by a learning algorithm operating on a training set. Reusing a holdout set adaptively multiple times can easily lead to overfitting to the holdout set itself. We give an algorithm that enables the validation of a large number of adaptively chosen hypotheses, while provably avoiding overfitting. We illustrate the advantages of our algorithm over the standard use of the holdout set via a simple synthetic experiment.
We also formalize and address the general problem of data reuse in adaptive data analysis. We show how the differential-privacy based approach in (Dwork et al., 2014) is applicable much more broadly to adaptive data analysis. We then show that a simple approach based on description length can also be used to give guarantees of statistical validity in adaptive settings. Finally, we demonstrate that these incomparable approaches can be unified via the notion of approximate max-information that we introduce. This, in particular, allows the preservation of statistical validity guarantees even when an analyst adaptively composes algorithms which have guarantees based on either of the two approaches.
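To make the holdout-reuse idea concrete, the following is a minimal sketch of a noisy-threshold check in the spirit of the reusable holdout the abstract describes. It is not the authors' exact algorithm: the function names, the default `threshold` and `sigma` values, and the noise scales are illustrative assumptions, and the sketch omits the budget accounting a full method would need to bound how many times the holdout may be consulted.

```python
import random


def laplace(rng, scale):
    """Draw Laplace(0, scale) noise as a scaled difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_holdout_answer(train_vals, holdout_vals,
                         threshold=0.04, sigma=0.01, rng=None):
    """Answer one adaptively chosen query (hypothetical helper).

    train_vals / holdout_vals hold the query function's values, assumed
    to lie in [0, 1], on the training and holdout samples respectively.
    """
    rng = rng or random.Random()
    train_mean = sum(train_vals) / len(train_vals)
    holdout_mean = sum(holdout_vals) / len(holdout_vals)

    # Noisy comparison: do the two estimates disagree by more than the
    # (randomly perturbed) threshold?
    if abs(holdout_mean - train_mean) > threshold + laplace(rng, 2 * sigma):
        # Disagreement suggests overfitting to the training set, so
        # return a noised holdout estimate.
        return holdout_mean + laplace(rng, sigma)

    # Agreement: the training estimate is returned and the analyst
    # learns little about the holdout set from this query.
    return train_mean
```

The design intent is that the analyst only receives holdout-derived information when the training estimate is clearly off, and even then only through added noise, which limits how much an adaptive sequence of queries can fit the holdout set.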
Author Information
Cynthia Dwork (Microsoft Research)
Cynthia Dwork, Distinguished Scientist at Microsoft Research, is renowned for placing privacy-preserving data analysis on a mathematically rigorous foundation. A cornerstone of this work is differential privacy, a strong privacy guarantee frequently permitting highly accurate data analysis. Dr. Dwork has also made seminal contributions in cryptography and distributed computing, and is a recipient of the Edsger W. Dijkstra Prize, recognizing some of her earliest work establishing the pillars on which every fault-tolerant system has been built for decades. She is a member of the National Academy of Sciences and the National Academy of Engineering, and a Fellow of the American Academy of Arts and Sciences.
Vitaly Feldman (IBM Research - Almaden)
Moritz Hardt (Google)
Toni Pitassi (University of Toronto)
Omer Reingold (Samsung Research)
Aaron Roth (University of Pennsylvania)
More from the Same Authors
-
2020 Poster: What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation »
Vitaly Feldman · Chiyuan Zhang -
2020 Spotlight: What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation »
Vitaly Feldman · Chiyuan Zhang -
2020 Poster: Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses »
Raef Bassily · Vitaly Feldman · Cristóbal Guzmán · Kunal Talwar -
2020 Spotlight: Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses »
Raef Bassily · Vitaly Feldman · Cristóbal Guzmán · Kunal Talwar -
2019 Poster: Private Stochastic Convex Optimization with Optimal Rates »
Raef Bassily · Vitaly Feldman · Kunal Talwar · Abhradeep Guha Thakurta -
2019 Spotlight: Private Stochastic Convex Optimization with Optimal Rates »
Raef Bassily · Vitaly Feldman · Kunal Talwar · Abhradeep Guha Thakurta -
2019 Poster: Locally Private Learning without Interaction Requires Separation »
Amit Daniely · Vitaly Feldman -
2018 Poster: The Everlasting Database: Statistical Validity at a Fair Price »
Blake Woodworth · Vitaly Feldman · Saharon Rosset · Nati Srebro -
2018 Poster: Generalization Bounds for Uniformly Stable Algorithms »
Vitaly Feldman · Jan Vondrak -
2018 Spotlight: Generalization Bounds for Uniformly Stable Algorithms »
Vitaly Feldman · Jan Vondrak -
2018 Poster: Predict Responsibly: Improving Fairness and Accuracy by Learning to Defer »
David Madras · Toni Pitassi · Richard Zemel -
2017 Poster: Avoiding Discrimination through Causal Reasoning »
Niki Kilbertus · Mateo Rojas Carulla · Giambattista Parascandolo · Moritz Hardt · Dominik Janzing · Bernhard Schölkopf -
2017 Poster: Accuracy First: Selecting a Differential Privacy Level for Accuracy Constrained ERM »
Katrina Ligett · Seth Neel · Aaron Roth · Bo Waggoner · Steven Wu -
2016 Workshop: Adaptive Data Analysis »
Vitaly Feldman · Aaditya Ramdas · Aaron Roth · Adam Smith -
2016 Poster: Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back »
Vitaly Feldman -
2016 Oral: Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back »
Vitaly Feldman -
2016 Poster: Equality of Opportunity in Supervised Learning »
Moritz Hardt · Eric Price · Nati Srebro -
2015 Workshop: Adaptive Data Analysis »
Adam Smith · Aaron Roth · Vitaly Feldman · Moritz Hardt -
2015 Poster: Subsampled Power Iteration: a Unified Algorithm for Block Models and Planted CSP's »
Vitaly Feldman · Will Perkins · Santosh Vempala -
2015 Poster: Differentially Private Learning of Structured Discrete Distributions »
Ilias Diakonikolas · Moritz Hardt · Ludwig Schmidt -
2014 Workshop: Fairness, Accountability, and Transparency in Machine Learning »
Moritz Hardt · Solon Barocas -
2014 Invited Talk: Privacy in the Land of Plenty »
Cynthia Dwork -
2014 Poster: The Noisy Power Method: A Meta Algorithm with Applications »
Moritz Hardt · Eric Price -
2014 Spotlight: The Noisy Power Method: A Meta Algorithm with Applications »
Moritz Hardt · Eric Price -
2013 Poster: Statistical Active Learning Algorithms »
Maria-Florina F Balcan · Vitaly Feldman -
2013 Poster: On the Expressive Power of Restricted Boltzmann Machines »
James Martens · Arkadev Chattopadhyay · Toni Pitassi · Richard Zemel