Cardinality Estimation Meets Good-Turing

Reuven Cohen, Liran Katzir, Aviv Yehezkel

Research output: Contribution to journal › Article › peer-review

Abstract

Cardinality estimation algorithms receive a stream of elements whose order might be arbitrary, with possible repetitions, and return the number of distinct elements. Such algorithms usually seek to minimize the required storage and processing at the price of inaccuracy in their output. Real-world applications of these algorithms are required to process large volumes of monitored data, making it impractical to collect and analyze the entire input stream. In such cases, it is common practice to sample and process only a small part of the stream elements. This paper presents and analyzes a generic algorithm for combining any cardinality estimation algorithm with a sampling process. We show that the proposed sampling algorithm does not affect the estimator's asymptotic unbiasedness, and we analyze the effect of sampling on the estimator's variance.
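
The abstract does not spell out the algorithm, so the following Python sketch is only an illustration of how sampling and a Good-Turing estimate can be combined; it should not be read as the authors' method. It assumes Bernoulli sampling at a fixed rate, uses an exact distinct count over the sample as a stand-in for a sketch such as HyperLogLog, and applies the classical coverage-based correction (distinct count divided by one minus the Good-Turing unseen mass). The function name and parameters are hypothetical.

```python
import random
from collections import Counter


def estimate_cardinality_from_sample(stream, rate, seed=0):
    """Estimate the number of distinct elements in `stream` from a sample.

    Illustrative sketch, not the paper's algorithm. Assumptions:
    - each element is kept independently with probability `rate`;
    - an exact distinct count stands in for a cardinality sketch.
    """
    rng = random.Random(seed)
    sample = [x for x in stream if rng.random() < rate]
    if not sample:
        return 0.0

    counts = Counter(sample)
    distinct_in_sample = len(counts)

    # Good-Turing: the probability that the next element is one not yet
    # seen is estimated by (elements seen exactly once) / (sample length).
    singletons = sum(1 for c in counts.values() if c == 1)
    unseen_mass = singletons / len(sample)

    if unseen_mass >= 1.0:
        # Every sampled element was seen once, so the sample carries no
        # coverage information and the correction is undefined.
        return float("inf")

    # Coverage-based correction: scale the sample's distinct count by the
    # estimated fraction of the stream's mass that the sample covers.
    return distinct_in_sample / (1.0 - unseen_mass)
```

The correction is most accurate when elements occur with similar frequencies; on heavily skewed streams it tends to underestimate, which is one reason a variance and bias analysis of the kind the abstract mentions matters.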

Original language: English
Pages (from-to): 1-8
Number of pages: 8
Journal: Big Data Research
Volume: 9
State: Published - Sep 2017

Keywords

  • Cardinality estimation
  • Data sketch
  • Good-Turing
  • HyperLogLog
  • Sampling

All Science Journal Classification (ASJC) codes

  • Management Information Systems
  • Information Systems
  • Computer Science Applications
  • Information Systems and Management
