Abstract:
Based on a sample of size , we consider estimating the number of symbols that appear at least times in an independent sample of size , where is a given parameter. This formulation includes, as a special case, the well-known problem of inferring the number of unseen species introduced by [Fisher et al.] in 1943 and considered by many others. Of considerable interest in this line of works is the largest for which the quantity can be accurately predicted. We completely resolve this problem by determining the limit of estimation to be , with both lower and upper bounds matching up to constant factors. For the particular case of , this implies the recent result by [Orlitsky et al.] on the unseen species problem. Experimental evaluations show that the proposed estimator performs exceptionally well in practice. Furthermore, the estimator is a simple linear combination of symbols' empirical counts, and hence linear-time computable.
Chat is not available.