NeurIPS 2020: Optimal Prediction of the Number of Unseen Species with Multiplicity



Optimal Prediction of the Number of Unseen Species with Multiplicity

Yi Hao, Ping Li

Spotlight presentation: Orals & Spotlights Track 24: Learning Theory
on Thu, Dec 10th, 2020 @ 04:20 – 04:30 GMT

Poster Session 5
on Thu, Dec 10th, 2020 @ 05:00 – 07:00 GMT
GatherTown: Core machine learning & Theory (Town A0 - Spot B2)

Abstract: Based on a sample of size $n$, we consider estimating the number of symbols that appear at least $\mu$ times in an independent sample of size $a \cdot n$, where $a$ is a given parameter. This formulation includes, as a special case, the well-known problem of inferring the number of unseen species, introduced by [Fisher et al.] in 1943 and studied by many others. Of considerable interest in this line of work is the largest $a$ for which the quantity can be accurately predicted. We completely resolve this problem by showing that the limit of estimation is $a \approx (\log n)/\mu$, with lower and upper bounds matching up to constant factors. For the particular case of $\mu = 1$, this implies the recent result of [Orlitsky et al.] on the unseen species problem. Experimental evaluations show that the proposed estimator performs exceptionally well in practice. Furthermore, the estimator is a simple linear combination of symbols' empirical counts and is hence computable in linear time.
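
To make the formulation concrete, here is a minimal Python sketch. It is not the estimator proposed in the paper, whose coefficients the abstract does not give. The function `count_with_multiplicity` computes the target quantity on a sample, and `linear_estimate` shows the general shape of an estimator that is linear in the empirical counts. As a stand-in, `good_toulmin` plugs in the classical Good-Toulmin coefficients for the unseen species case, $-\sum_{i \ge 1} (-t)^i \Phi_i$, where $\Phi_i$ is the number of symbols seen exactly $i$ times and $t$ plays the role of $a$. All function names and the toy uniform distribution are assumptions made for illustration.

```python
from collections import Counter
import random


def count_with_multiplicity(sample, mu):
    """Number of distinct symbols that appear at least mu times in sample."""
    return sum(1 for c in Counter(sample).values() if c >= mu)


def prevalences(sample):
    """Map i -> number of distinct symbols that appear exactly i times."""
    return Counter(Counter(sample).values())


def linear_estimate(sample, coeff):
    """Sum coeff(count) over distinct symbols; one pass over the sample."""
    return sum(coeff(c) for c in Counter(sample).values())


def good_toulmin(sample, t):
    """Classical Good-Toulmin estimate of the number of new symbols in a
    further sample of size t * len(sample). A stand-in for illustration,
    NOT the paper's estimator; it is known to be reliable only for t <= 1."""
    return linear_estimate(sample, lambda i: -((-t) ** i))


if __name__ == "__main__":
    random.seed(0)
    n, a, mu = 1000, 1.0, 1
    support = range(5000)  # hypothetical uniform distribution
    first = random.choices(support, k=n)
    second = random.choices(support, k=int(a * n))

    # Target quantity from the abstract: symbols with >= mu occurrences
    # in the independent second sample.
    print("target:", count_with_multiplicity(second, mu))

    # Unseen species special case: symbols in the second sample that the
    # first sample missed, next to the Good-Toulmin prediction.
    seen = set(first)
    print("new symbols:", len(set(second) - seen))
    print("Good-Toulmin:", good_toulmin(first, a))
```

Because `linear_estimate` only needs each symbol's count, it runs in time linear in the sample size, which is the property the abstract highlights for the proposed estimator.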
