Timezone: »
Poster
Testing Closeness With Unequal Sized Samples
Bhaswar Bhattacharya · Gregory Valiant
We consider the problem of testing whether two unequalsized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter $\eps > 0$, $m_1$ independent draws from an unknown distribution $p$ with discrete support, and $m_2$ draws from an unknown distribution $q$ of discrete support, we describe a test for distinguishing the case that $p=q$ from the case that $pq_1 \geq \eps$. If $p$ and $q$ are supported on at most $n$ elements, then our test is successful with high probability provided $m_1\geq n^{2/3}/\varepsilon^{4/3}$ and $m_2 = \Omega\left(\max\{\frac{n}{\sqrt m_1\varepsilon^2}, \frac{\sqrt n}{\varepsilon^2}\}\right).$ We show that this tradeoff is information theoretically optimal throughout this range, in the dependencies on all parameters, $n,m_1,$ and $\eps$, to constant factors. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on $n$ states up to a $\log n$ factor that uses $\tilde{O}(n^{3/2} \tau_{mix})$ queries to a ``next node'' oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.
Author Information
Bhaswar Bhattacharya (Stanford University)
Gregory Valiant (Stanford University)
More from the Same Authors

2017 Poster: Learning Overcomplete HMMs »
Vatsal Sharan · Sham Kakade · Percy Liang · Gregory Valiant 
2017 Poster: Learning Populations of Parameters »
Kevin Tian · Weihao Kong · Gregory Valiant 
2016 Poster: Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction »
Jacob Steinhardt · Gregory Valiant · Moses Charikar 
2015 : When Your Big Data Seems Too Small »
Gregory Valiant