NIPS Poster Testing Closeness With Unequal Sized Samples

Poster

Testing Closeness With Unequal Sized Samples

Bhaswar Bhattacharya · Gregory Valiant

210 C #66

Abstract: We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter $\varepsilon > 0$, $m_1$ independent draws from an unknown distribution $p$ with discrete support, and $m_2$ draws from an unknown distribution $q$ of discrete support, we describe a test for distinguishing the case that $p=q$ from the case that $\|p-q\|_1 \geq \varepsilon$. If $p$ and $q$ are supported on at most $n$ elements, then our test is successful with high probability provided $m_1 \geq n^{2/3}/\varepsilon^{4/3}$ and $m_2 = \Omega\left(\max\left\{\frac{n}{\sqrt{m_1}\,\varepsilon^2}, \frac{\sqrt{n}}{\varepsilon^2}\right\}\right)$. We show that this tradeoff is information theoretically optimal throughout this range, in the dependencies on all parameters, $n$, $m_1$, and $\varepsilon$, to constant factors. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on $n$ states up to a $\log n$ factor that uses $\tilde{O}(n^{3/2} \tau_{mix})$ queries to a "next node" oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.
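
To make the sample-size tradeoff in the abstract concrete, here is a minimal Python sketch that evaluates the two stated conditions for given $n$, $m_1$, and $\varepsilon$. The function names `min_m1` and `m2_scale` are hypothetical, and the constant hidden in the $\Omega(\cdot)$ bound is assumed to be 1 purely for illustration; the abstract gives only the asymptotic scaling, so the returned values are not actual sample counts.

```python
import math


def min_m1(n, eps):
    """Lower end of the range for m1: n^(2/3) / eps^(4/3).

    Constant factors are ignored (an assumption of this sketch).
    """
    return n ** (2 / 3) / eps ** (4 / 3)


def m2_scale(n, m1, eps):
    """Scaling of m2 from the abstract:
    max(n / (sqrt(m1) * eps^2), sqrt(n) / eps^2).

    The hidden Omega(.) constant is assumed to be 1 for illustration.
    """
    if m1 < min_m1(n, eps):
        raise ValueError("m1 is below the range covered by the stated bound")
    return max(n / (math.sqrt(m1) * eps ** 2), math.sqrt(n) / eps ** 2)
```

Under these assumptions, two regimes are easy to read off. At the lower boundary $m_1 = n^{2/3}/\varepsilon^{4/3}$ the first term equals $m_1$ itself, so both samples have the same scale. Once $m_1$ reaches $n$, the two terms coincide at $\sqrt{n}/\varepsilon^2$, and increasing $m_1$ further no longer reduces the required scale of $m_2$.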