Abstract
We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log2(1/ε) · n/log n), where n is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to log(1/ε) · n1−Θ(1/log(1/ε)). We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from Hsu et al, ICLR'19 as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.
| Original language | English |
|---|---|
| State | Published - 2021 |
| Externally published | Yes |
| Event | 9th International Conference on Learning Representations, ICLR 2021 - Virtual, Online Duration: 3 May 2021 → 7 May 2021 |
Conference
| Conference | 9th International Conference on Learning Representations, ICLR 2021 |
|---|---|
| City | Virtual, Online |
| Period | 3/05/21 → 7/05/21 |
All Science Journal Classification (ASJC) codes
- Language and Linguistics
- Computer Science Applications
- Education
- Linguistics and Language
Fingerprint
Dive into the research topics of 'LEARNING-BASED SUPPORT ESTIMATION IN SUBLINEAR TIME'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver