Lucas Janson. Model-free knockoffs: statistical tools for reproducible selections
in
Workshop: Adaptive Data Analysis
Abstract
A common problem in modern statistical applications is to select, from a large set of candidates, a subset of variables that are important for determining an outcome of interest. For instance, the outcome may be disease status and the variables may be hundreds of thousands of single nucleotide polymorphisms on the genome. This talk introduces model-free knockoffs, a framework for finding the variables on which the outcome depends while provably controlling the false discovery rate (FDR) in finite samples. FDR control holds no matter the form of the dependence between the response and the covariates, which does not need to be specified in any way. What is required is that we observe i.i.d. samples (X, Y) and know something about the distribution of the covariates, although we have shown that the method is robust to unknown or estimated covariate distributions. This framework builds on the knockoff filter of Foygel Barber and Candès, introduced a couple of years ago, which was limited to linear models with fewer variables than observations (p < n). In contrast, model-free knockoffs deal with a range of problems far beyond the scope of the original knockoff paper (for example, they provide valid selections in any generalized linear model, including logistic regression) while being more powerful than the original procedure when it applies. Finally, we apply our procedure to data from a case-control study of Crohn’s disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.
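As a rough illustration of how a knockoff-based selection controls the FDR, the sketch below implements the data-dependent "knockoff+" threshold from the knockoff filter literature. It assumes that a statistic W_j has already been computed for each variable, with large positive values indicating that the original variable looks more important than its knockoff copy; how the knockoffs and the W_j are constructed is not shown here. The function name `knockoff_plus_select` and the example values are hypothetical, chosen only for illustration.

```python
import numpy as np


def knockoff_plus_select(W, q):
    """Return the indices selected by the knockoff+ threshold at target FDR level q.

    W is assumed to hold one precomputed knockoff statistic per variable
    (hypothetical input; constructing the knockoffs is outside this sketch).
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: the distinct nonzero magnitudes of W, in increasing order.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Estimated false discovery proportion at threshold t;
        # the +1 in the numerator is what makes this the "knockoff+" rule.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            # Smallest threshold meeting the target: select variables with W_j >= t.
            return np.flatnonzero(W >= t)
    # No threshold meets the target level, so nothing is selected.
    return np.array([], dtype=int)


# Illustrative values only, not data from the talk.
W_example = [5, 4, 3, -1, 2, 6, -0.5, 7, 8, 9]
selected = knockoff_plus_select(W_example, q=0.2)
```

The threshold is the smallest value at which the ratio of (one plus) the number of large negative statistics to the number of large positive statistics falls below q; the negative statistics act as a stand-in count for false positives among the selected variables.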