Streaming coreset constructions for M-estimators

Vladimir Braverman, Dan Feldman, Harry Lang, Daniela Rus

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We introduce a new method of maintaining a (k, ϵ)-coreset for clustering M-estimators over insertion-only streams. Let (P, w) be a weighted set (where w : P → [0, ∞) is the weight function) of points in a ρ-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x, z) ≤ ρ(D(x, y) + D(y, z)) for all x, y, z ∈ X). For any set of points C, we define COST(P, w, C) = ∑_{p∈P} w(p) min_{c∈C} D(p, c). A (k, ϵ)-coreset for (P, w) is a weighted set (Q, v) such that for every set C of k points, (1 − ϵ) COST(P, w, C) ≤ COST(Q, v, C) ≤ (1 + ϵ) COST(P, w, C). Essentially, the coreset (Q, v) can be used in place of (P, w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data. M-estimators are functions D(x, y) that can be written as ψ(d(x, y)) where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (ψ(x) = x) and k-means (ψ(x) = x²) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(ϵ⁻² k log k log n) points of storage. The previous state-of-the-art required storing at least O(ϵ⁻² k log k log⁴ n) points.
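The definitions in the abstract can be illustrated with a minimal Python sketch. It is not the paper's streaming construction: it only evaluates COST(P, w, C) for an M-estimator D(x, y) = ψ(d(x, y)) and tests the (1 ± ϵ) inequality for one given set of centers C. The names `euclidean`, `cost`, and `within_eps` are hypothetical, and the Euclidean metric is an assumption made for the example. A true (k, ϵ)-coreset must satisfy the inequality for every set C of k points, which a single check cannot certify.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(x: Point, y: Point) -> float:
    """A true metric d(x, y); Euclidean distance is assumed for illustration."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def cost(
    points: Sequence[Point],
    weights: Sequence[float],
    centers: Sequence[Point],
    psi: Callable[[float], float] = lambda r: r,  # k-median by default
    d: Callable[[Point, Point], float] = euclidean,
) -> float:
    """COST(P, w, C) = sum over p of w(p) * min over c of psi(d(p, c))."""
    return sum(
        w * min(psi(d(p, c)) for c in centers)
        for p, w in zip(points, weights)
    )


def within_eps(
    points: Sequence[Point],
    weights: Sequence[float],
    q_points: Sequence[Point],
    q_weights: Sequence[float],
    centers: Sequence[Point],
    eps: float,
    psi: Callable[[float], float] = lambda r: r,
) -> bool:
    """Check the coreset inequality for ONE center set C (not for all C)."""
    full = cost(points, weights, centers, psi)
    approx = cost(q_points, q_weights, centers, psi)
    return (1 - eps) * full <= approx <= (1 + eps) * full


# Toy data: merging duplicate points into weighted points preserves COST exactly.
P = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0), (4.0, 0.0)]
w = [1.0, 1.0, 1.0, 1.0]
Q = [(0.0, 0.0), (4.0, 0.0)]
v = [2.0, 2.0]
C = [(1.0, 0.0)]

k_median = cost(P, w, C)                       # psi(x) = x
k_means = cost(P, w, C, psi=lambda r: r * r)   # psi(x) = x^2
ok = within_eps(P, w, Q, v, C, eps=0.1)
```

In this toy case (Q, v) reproduces the cost of (P, w) exactly, so the check passes for any ϵ ≥ 0; passing `psi=lambda r: r * r` switches the same code from the k-median to the k-means objective.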

Original language: American English
Title of host publication: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019
Editors: Dimitris Achlioptas, Laszlo A. Vegh
Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
Pages: 62:1-62:16
ISBN (Electronic): 9783959771252
DOIs
State: Published - Sep 2019
Event: 22nd International Conference on Approximation Algorithms for Combinatorial Optimization Problems and 23rd International Conference on Randomization and Computation, APPROX/RANDOM 2019 - Cambridge, United States
Duration: 20 Sep 2019 to 22 Sep 2019

Publication series

Name: Leibniz International Proceedings in Informatics, LIPIcs
Volume: 145

Conference

Conference: 22nd International Conference on Approximation Algorithms for Combinatorial Optimization Problems and 23rd International Conference on Randomization and Computation, APPROX/RANDOM 2019
Country/Territory: United States
City: Cambridge
Period: 20/09/19 to 22/09/19

Keywords

  • Clustering
  • Coresets
  • Streaming

All Science Journal Classification (ASJC) codes

  • Software
