

Poster

Testing Closeness With Unequal Sized Samples

Bhaswar Bhattacharya · Gregory Valiant

210 C #66

Abstract: We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter ε > 0, m_1 independent draws from an unknown distribution p with discrete support, and m_2 draws from an unknown distribution q of discrete support, we describe a test for distinguishing the case that p = q from the case that ||p − q||_1 ≥ ε. If p and q are supported on at most n elements, then our test is successful with high probability provided m_1 ≥ n^{2/3}/ε^{4/3} and m_2 = Ω(max{n/(√m_1 ε²), √n/ε²}). We show that this tradeoff is information-theoretically optimal throughout this range, in the dependencies on all parameters, n, m_1, and ε, to constant factors. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on n states up to a log n factor that uses Õ(n^{3/2} √τ_mix) queries to a "next node" oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.
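The abstract describes the core of the tester as a simple statistic over the two samples' per-element counts. As a hedged illustration, the sketch below implements a chi-squared-style closeness statistic weighted for unequal sample sizes, in the spirit of the equal-sample tester of Chan et al. (2014); the exact statistic and normalization used in this paper may differ, and the function name and threshold behavior here are illustrative assumptions, not the authors' specification.

```python
import numpy as np

def closeness_statistic(x, y):
    """Chi-squared-style statistic for testing p = q from unequal samples.

    x[i], y[i] are the counts of element i among m1 draws from p and
    m2 draws from q, respectively. The statistic is small (near or below
    zero in expectation) when p = q, and grows when p and q are far in
    L1 distance. This exact form is an assumption modeled on the
    Chan et al. (2014) closeness tester, reweighted for m1 != m2.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m1, m2 = x.sum(), y.sum()          # total sample sizes
    mask = (x + y) > 0                 # skip elements seen in neither sample
    x, y = x[mask], y[mask]
    # Squared weighted difference, debiased by the Poisson-variance terms.
    num = (m2 * x - m1 * y) ** 2 - (m2 ** 2 * x + m1 ** 2 * y)
    return float(np.sum(num / (x + y)))
```

For intuition: identical empirical counts give a negative value, e.g. `closeness_statistic([5, 5], [5, 5])` returns −200.0, while disjointly supported samples such as `closeness_statistic([10, 0], [0, 10])` return 1800.0. A practical test would compare the statistic against a threshold calibrated to the sample sizes and ε.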