NeurIPS 2020: Optimal Prediction of the Number of Unseen Species with Multiplicity



Optimal Prediction of the Number of Unseen Species with Multiplicity

Yi Hao, Ping Li

Spotlight presentation: Orals & Spotlights Track 24: Learning Theory
on 2020-12-09T20:20:00-08:00 - 2020-12-09T20:30:00-08:00

Poster Session 5
on 2020-12-09T21:00:00-08:00 - 2020-12-09T23:00:00-08:00

Abstract: Based on a sample of size $n$, we consider estimating the number of symbols that appear at least $\mu$ times in an independent sample of size $a \cdot n$, where $a$ is a given parameter. This formulation includes, as a special case, the well-known problem of inferring the number of unseen species, introduced by [Fisher et al.] in 1943 and studied by many others since. Of considerable interest in this line of work is the largest $a$ for which the quantity can be accurately predicted. We completely resolve this problem by determining the limit of estimation to be $a \approx (\log n)/\mu$, with lower and upper bounds that match up to constant factors. For the particular case of $\mu = 1$, this implies the recent result by [Orlitsky et al.] on the unseen species problem. Experimental evaluations show that the proposed estimator performs exceptionally well in practice. Furthermore, the estimator is a simple linear combination of symbols' empirical counts, and hence is computable in linear time.
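To make the setup in the abstract concrete, the following is a minimal Python sketch of the quantity being predicted, read literally from the abstract. It is not the authors' estimator, which the abstract does not specify. The function names (`count_at_least`, `simulate`, `extrapolation_scale`), the `weights` argument, and the fixed seed are all hypothetical choices made for illustration. The scale $(\log n)/\mu$ is stated in the abstract only up to constant factors, so the last function gives an order of magnitude, not an exact threshold.

```python
import math
import random
from collections import Counter


def count_at_least(sample, mu):
    """Number of distinct symbols that occur at least `mu` times in `sample`."""
    counts = Counter(sample)
    return sum(1 for c in counts.values() if c >= mu)


def simulate(weights, n, a, mu, seed=0):
    """Draw an observed sample of size n and an independent sample of size a*n.

    Returns the observed sample (what an estimator would see) and the target:
    the number of symbols appearing at least `mu` times in the second sample.
    `weights` is an assumed symbol distribution, unknown in the real problem.
    """
    rng = random.Random(seed)
    symbols = range(len(weights))
    observed = rng.choices(symbols, weights=weights, k=n)
    future = rng.choices(symbols, weights=weights, k=int(a * n))
    return observed, count_at_least(future, mu)


def extrapolation_scale(n, mu):
    """Order of the largest predictable a, (log n) / mu, ignoring constants."""
    return math.log(n) / mu
```

For example, `simulate([1] * 1000, n=200, a=2, mu=1)` would return 200 observed draws from a uniform distribution over 1000 symbols together with the number of distinct symbols seen in a fresh sample of 400 draws. An estimator in this setting has access only to the observed sample.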
