Abstract
Consider a finite sample from an unknown distribution over a countable alphabet. The occupancy probability (OP) refers to the total probability of symbols that appear exactly k times in the sample. Estimating the OP is a basic problem in large-alphabet modeling, with a variety of applications in machine learning, statistics, and information theory. The Good-Turing (GT) framework is perhaps the most popular OP estimation scheme. Classical results show that the GT estimator converges to the OP for every k independently. In this work we introduce new exact convergence guarantees for the GT estimator, based on a worst-case mean squared error analysis. Our scheme improves upon currently known results. Further, we introduce a novel simultaneous convergence rate for any desired set of occupancy probabilities. This allows us to quantify the unified performance of OP estimators and to introduce a new estimation framework with favorable convergence guarantees.
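For concreteness, a minimal sketch of the estimator follows, assuming the standard Good-Turing form M̂_k = (k + 1) Φ_{k+1} / n, where Φ_r counts the distinct symbols that appear exactly r times in a sample of size n. The function and variable names below are illustrative, not taken from the paper.

```python
from collections import Counter

def good_turing_op(sample, k):
    """Good-Turing estimate of the occupancy probability M_k: the
    total probability of symbols that appear exactly k times.
    Uses the standard form (k + 1) * Phi_{k+1} / n, where Phi_r is
    the number of distinct symbols appearing exactly r times.
    """
    n = len(sample)
    counts = Counter(sample)         # symbol -> frequency in the sample
    phi = Counter(counts.values())   # frequency r -> number of symbols seen r times
    return (k + 1) * phi[k + 1] / n  # Counter returns 0 for unseen frequencies

# k = 0 recovers the familiar missing-mass estimate Phi_1 / n.
sample = list("abracadabra")         # frequencies: a:5, b:2, r:2, c:1, d:1
print(good_turing_op(sample, 0))     # (0 + 1) * 2 / 11 ≈ 0.18
```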
| Field | Value |
|---|---|
| Original language | English |
| Article number | 279 |
| Pages (from-to) | 12774–12810 |
| Journal | Journal of Machine Learning Research |
| Volume | 23 |
| State | Published - 1 Sep 2022 |
Keywords
- Good-Turing Estimator
- Missing Mass
- Natural Language Modeling
- Occupancy Probability
All Science Journal Classification (ASJC) codes
- Control and Systems Engineering
- Software
- Statistics and Probability
- Artificial Intelligence