Affinity Workshop: WiML Workshop 1

The Two-sample Problem in High Dimension: A Ranking-based Method

Myrto Limnios · Stephan Clémençon · Nicolas Vayatis


In this work, we propose a general framework for testing the equality of two unknown probability distributions, given two independent i.i.d. random samples valued in the same measurable space. While there is a long-standing literature on the univariate setting, the problem remains a subject of research in the multivariate and nonparametric frameworks. Indeed, the increasing ability to collect large, even massive, data of various structure, possibly biased by the collection process for instance, has strongly challenged classical modeling, in particular in applied fields such as biomedicine (e.g. clinical trials, genomics), marketing (e.g. A/B testing), and economics. The present method generalizes a particular class of permutation statistics known as 'two-sample linear rank statistics'. We overcome the lack of a natural order on the multivariate feature space by comparing univariate 'projected' observations, obtained through a scoring function valued in the real line. In particular, our method consists of a two-stage procedure. (i) 'Maximization of the rank statistic': on the first half of each sample, we optimize a tailored version of the two-sample rank statistic over the class of scoring functions by means of ranking-based algorithms. (ii) 'Two-sample homogeneity test': for a given test level, the univariate rank test is performed on the remaining observations, scored with the optimal function from step (i). We accompany our method with theoretical guarantees and with various numerical experiments, designed to model complex data structures and, incidentally, to question both existing statistical tests and the present one.
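The two-stage structure described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made for illustration only: step (i) is replaced by a simple linear scoring function whose direction is the difference of the two training-half means (a stand-in for the ranking-based optimization), step (ii) uses the Wilcoxon rank-sum statistic (one instance of a two-sample linear rank statistic) with a one-sided permutation p-value, and the names `fit_linear_scorer`, `rank_sum_statistic` and `two_stage_test` are hypothetical.

```python
import numpy as np


def fit_linear_scorer(x_train, y_train):
    """Stand-in for step (i): linear scoring function s(z) = <w, z>.

    Assumption: w is the normalized difference of the training means,
    not the output of a ranking-based algorithm.
    """
    w = x_train.mean(axis=0) - y_train.mean(axis=0)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w


def rank_sum_statistic(scores_x, scores_y):
    """Wilcoxon rank-sum: sum of the ranks of the X scores in the pooled sample."""
    pooled = np.concatenate([scores_x, scores_y])
    # Ranks 1..N; ties get arbitrary distinct ranks (fine for continuous scores).
    ranks = np.argsort(np.argsort(pooled)) + 1
    return int(ranks[: len(scores_x)].sum())


def two_stage_test(x, y, n_perm=2000, seed=0):
    """Split each sample in half, learn a scorer, then test on the held-out halves."""
    rng = np.random.default_rng(seed)
    x = rng.permutation(x)  # shuffle rows before splitting
    y = rng.permutation(y)
    nx, ny = len(x) // 2, len(y) // 2

    # Step (i): learn the scoring function on the first half of each sample.
    w = fit_linear_scorer(x[:nx], y[:ny])

    # Step (ii): score the remaining observations and run a univariate rank test.
    sx, sy = x[nx:] @ w, y[ny:] @ w
    observed = rank_sum_statistic(sx, sy)

    # One-sided permutation p-value: the scorer is oriented so that X scores
    # tend to be larger under the alternative, hence large rank sums reject.
    pooled = np.concatenate([sx, sy])
    m = len(sx)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if rank_sum_statistic(perm[:m], perm[m:]) >= observed:
            exceed += 1
    p_value = (1 + exceed) / (1 + n_perm)
    return observed, p_value
```

Calling `two_stage_test(x, y)` on two arrays of shape `(n, d)` and `(m, d)` returns the observed rank sum and a p-value to compare with the chosen test level. Because the scoring function is learned on observations that are not reused in step (ii), the held-out scores remain exchangeable under the null hypothesis, which is what justifies the univariate permutation test in this sketch.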
