Adaptive data analysis is the increasingly common practice by which insights gathered from data are used to inform further analysis of the same data sets. This practice is common both in machine learning and in scientific research, where data sets are shared and reused across multiple studies. Unfortunately, most of the statistical inference theory used in the empirical sciences to control false discovery rates, and in machine learning to avoid overfitting, assumes a fixed class of hypotheses to test, or a fixed family of functions to optimize over, selected independently of the data. If the set of analyses run is itself a function of the data, much of this theory becomes invalid; indeed, this practice has been blamed as one of the causes of the crisis of reproducibility in empirical science.
Recently, there have been several exciting proposals for how to avoid overfitting and guarantee statistical validity even in general adaptive data analysis settings. The problem is important, and ripe for further advances. The goal of this workshop is to bring together members of different communities (from machine learning, statistics, and theoretical computer science) interested in solving this problem, to share recent results, to discuss promising directions for future research, and to foster collaborations.
Thu 11:55 p.m. - 12:00 a.m. | Introductory remarks (Introduction)
Fri 12:00 a.m. - 12:35 a.m. | Ruth Heller. Inference following aggregate-level hypothesis testing in large-scale genomic data (Talk)
In many genomic applications, it is common to perform tests using aggregate-level statistics within naturally defined classes for powerful identification of signals. Following aggregate-level testing, it is natural to infer on the individual units within the classes that contain signal. Failing to account for the class selection will produce biased inference. We develop multiple testing procedures that allow rejection of individual-level null hypotheses while controlling conditional (familywise or false discovery) error rates. We use simulation studies to illustrate the validity and power of the proposed procedures in comparison to several possible alternatives. We illustrate the usefulness of our procedures in a natural application involving whole-genome expression quantitative trait loci (eQTL) analysis across 17 tissue types, using data from The Cancer Genome Atlas (TCGA) Project. Joint work with Nilanjan Chatterjee, Abba Krieger, and Jianxin Shi.
Fri 12:35 a.m. - 1:10 a.m. | Weijie Su. Private false discovery rate control and robustness of the Benjamini-Hochberg procedure (Talk)
We provide the first differentially private algorithms for controlling the false discovery rate (FDR) in multiple hypothesis testing. Our general approach is to adapt a well-known variant of the Benjamini-Hochberg procedure (BHq), making each step differentially private. This destroys the classical proof of FDR control. To prove FDR control of our method, we develop a new proof of the original (non-private) BHq algorithm and its robust variants, a proof requiring only the assumption that the true null test statistics are independent, while allowing arbitrary correlations between the true nulls and false nulls. This assumption is considerably weaker than those previously made in the vast literature on this topic, and explains in part the empirical robustness of BHq.
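For context, the classical (non-private) Benjamini-Hochberg step-up procedure that this line of work builds on can be sketched in a few lines. This is a minimal illustration of the standard procedure only, not the private algorithm from the talk:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Classic Benjamini-Hochberg step-up procedure.

    Sort the p-values, find the largest k with p_(k) <= q*k/m,
    and reject the k hypotheses with the smallest p-values.
    Returns the indices of the rejected hypotheses.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)        # nothing rejected
    k = np.max(np.nonzero(below)[0]) + 1      # largest index under the BH line
    return np.sort(order[:k])                 # rejected hypotheses

# Toy example: four small p-values among ten hypotheses
pvals = [0.001, 0.004, 0.012, 0.019, 0.03, 0.2, 0.3, 0.4, 0.5, 0.6]
print(benjamini_hochberg(pvals, q=0.05))
```

Note the step-up character: a p-value may exceed its own threshold yet still be rejected if a larger index passes the line.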
Fri 1:10 a.m. - 1:20 a.m. | Vitaly Feldman (Discussion)
Fri 1:20 a.m. - 1:50 a.m. | Coffee break (Break)
Fri 1:50 a.m. - 3:00 a.m. | Short talks (Talk)
10:50-11:00. Ibrahim Alabdulmohsin. On the Interplay between Information, Stability, and Generalization
11:00-11:10. Joshua Loftus. Significance testing after cross-validation
11:10-11:20. Yu-Xiang Wang, Jing Lei and Stephen E. Fienberg. A Minimax Theory for Adaptive Data Analysis
11:20-11:30. Sam Elder. Bayesian Adaptive Data Analysis: Challenges and Guarantees
11:30-11:40. Rina Foygel Barber and Aaditya Ramdas. p-filter: An internally consistent framework for FDR
11:40-11:50. Ryan Rogers, Aaron Roth, Adam Smith and Om Thakkar. Max-Information, Differential Privacy, and Post-Selection Hypothesis Testing
Fri 3:00 a.m. - 5:30 a.m. | Lunch break (Break)
Fri 5:30 a.m. - 6:05 a.m. | Aaron Roth. Adaptive Data Analysis via Differential Privacy (Talk)
Fri 6:05 a.m. - 6:40 a.m. | Katrina Ligett. Adaptive Learning with Robust Generalization Guarantees (Talk)
The traditional notion of generalization, i.e., learning a hypothesis whose empirical error is close to its true error, is surprisingly brittle. As has recently been noted, even if several algorithms have this guarantee in isolation, the guarantee need not hold if the algorithms are composed adaptively. In this work, we study three notions of generalization, increasing in strength, that are robust to post-processing and amenable to adaptive composition, and examine the relationships between them.
Fri 6:50 a.m. - 7:35 a.m. | Posters (Poster Session)
Fri 7:35 a.m. - 7:55 a.m. | Lucas Janson. Model-free knockoffs: statistical tools for reproducible selections (Talk)
A common problem in modern statistical applications is to select, from a large set of candidates, the subset of variables that are important for determining an outcome of interest. For instance, the outcome may be disease status and the variables may be hundreds of thousands of single nucleotide polymorphisms on the genome. This talk introduces model-free knockoffs, a framework for finding important variables while provably controlling the false discovery rate (FDR) in finite samples. FDR control holds no matter the form of the dependence between the response and the covariates, which does not need to be specified in any way. What is required is that we observe i.i.d. samples (X, Y) and know something about the distribution of the covariates, although we have shown that the method is robust to unknown or estimated covariate distributions. This framework builds on the knockoff filter of Foygel Barber and Candès, introduced a couple of years ago, which was limited to linear models with fewer variables than observations (p < n). In contrast, model-free knockoffs apply to a range of problems far beyond the scope of the original knockoff paper, e.g., providing valid selections in any generalized linear model including logistic regression, while being more powerful than the original procedure when both apply. Finally, we apply our procedure to data from a case-control study of Crohn's disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.
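The selection step of the knockoff filter, which model-free knockoffs inherit, can be sketched as follows. This shows only the final thresholding step; constructing the knockoff variables and the feature statistics W (the substance of the framework) is assumed already done, and the toy W values below are hypothetical:

```python
import numpy as np

def knockoff_select(W, q=0.1, plus=True):
    """Knockoff filter selection step.

    Given feature statistics W_j (large positive values are evidence
    that variable j matters; signs are symmetric under the null), find
    the data-dependent threshold whose estimated false discovery
    proportion is at most q, and return the selected variable indices.
    """
    W = np.asarray(W, dtype=float)
    offset = 1.0 if plus else 0.0            # 'knockoff+' adds 1 to the numerator
    candidates = np.sort(np.abs(W[W != 0]))  # possible thresholds
    for t in candidates:
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return np.nonzero(W >= t)[0]     # selected variables
    return np.array([], dtype=int)           # nothing selected

W = np.array([4.2, 3.1, -0.5, 2.8, 0.3, -1.1, 2.0, 5.0, -0.2, 1.7])
print(knockoff_select(W, q=0.2))
```

The negative W values act as an internal control group: their count estimates how many of the selected positives are false, which is what yields finite-sample FDR control.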
Fri 7:55 a.m. - 8:15 a.m. | Xiaoying Harris. From Selective Inference to Adaptive Data Analysis (Talk)
Recent developments in selective inference have provided a framework for valid inference after some information from the data has been used for model selection. However, most of the literature on selective inference requires practitioners to commit to a pre-specified procedure for model selection, which is rather stringent in applications. In many cases, multiple exploratory data analyses will be performed, and the outcome of each will inform the model ultimately selected by the practitioner. We therefore want to develop a framework that allows multiple queries to the data. In a framework similar to that of differential privacy, we allow valid inference after multiple queries to the database. We address this problem from the perspective of "multiple views of the data" and consider two concrete examples. Joint work with Jonathan Taylor.
Fri 8:15 a.m. - 8:50 a.m. | Peter Grunwald. Safe Testing: An Adaptive Alternative to p-value-based Testing (Talk)
Standard p-value-based hypothesis testing is not at all adaptive: if our test result is promising but not conclusive (say, p = 0.07), we cannot simply decide to gather a few more data points. While the latter practice is ubiquitous in science, it invalidates p-values and their error guarantees. Here we propose an alternative test based on supermartingales, which has both a gambling and a data-compression interpretation. This method allows us to freely combine results from different tests by multiplication (which would be a mortal sin for p-values!), and avoids many other pitfalls of traditional testing as well. If the null hypothesis is simple (a singleton), it also has a Bayesian interpretation, and essentially coincides with a proposal by Vovk (1993) and Berger et al. (1994). Here we work out, for the first time, the case of composite null hypotheses, which allows us to formulate safe, nonasymptotic versions of the most popular tests, such as the t-test and the chi-squared tests. Safe tests for composite null hypotheses are not Bayesian, and initial experiments suggest that they can substantially outperform Bayesian tests (which for composite nulls are not adaptive in general).
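The multiplicative combination the abstract highlights can be illustrated with betting scores (e-values) for a simple null. This sketch illustrates the general principle only, not the safe tests for composite nulls from the talk; the fair-coin null and the bias-0.7 alternative are assumptions chosen for the example:

```python
import random

def evalue_fair_coin(flips, p_alt=0.7):
    """Likelihood ratio of a biased-coin alternative (heads prob p_alt)
    versus the simple null of a fair coin. Under the null its
    expectation is 1, so it is a valid betting score (e-value)."""
    e = 1.0
    for x in flips:  # x in {0, 1}
        e *= (p_alt if x else 1.0 - p_alt) / 0.5
    return e

random.seed(0)
# Two "studies" on the same biased coin, each yielding an e-value.
study1 = [1 if random.random() < 0.7 else 0 for _ in range(50)]
study2 = [1 if random.random() < 0.7 else 0 for _ in range(50)]
e1, e2 = evalue_fair_coin(study1), evalue_fair_coin(study2)

# Unlike p-values, independent e-values combine by multiplication:
# E[e1 * e2] <= 1 under the null, so by Markov's inequality rejecting
# when the product exceeds 1/alpha has type-I error at most alpha.
combined = e1 * e2
print(combined, combined > 20)  # reject at alpha = 0.05 iff product > 20
```

The key property is that the product of e-values from independent studies is again an e-value, so evidence can be accumulated across optionally continued experiments without invalidating the error guarantee.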
Fri 8:50 a.m. - 9:00 a.m. | Aaron Roth (Discussion)
Author Information
Vitaly Feldman (Google Brain)
Aaditya Ramdas (UC Berkeley)
Aaron Roth (University of Pennsylvania)
Adam Smith (Pennsylvania State University)