TY - GEN
T1 - Streaming coreset constructions for M-estimators
AU - Braverman, Vladimir
AU - Feldman, Dan
AU - Lang, Harry
AU - Rus, Daniela
N1 - Funding Information: Vladimir Braverman: This research was supported in part by NSF CAREER grant 1652257, ONR Award N00014-18-1-2364, and DARPA/ARO award W911NF1820267. Harry Lang: This material is based upon work supported by the Franco-American Fulbright Commission. The author thanks INRIA (l'Institut national de recherche en informatique et en automatique) for hosting him during part of the writing of this paper. Daniela Rus: This research was supported in part by NSF 1723943, NVIDIA, and J.P. Morgan Chase & Co. Publisher Copyright: © Vladimir Braverman, Dan Feldman, Harry Lang, and Daniela Rus.
PY - 2019/9
Y1 - 2019/9
N2 - We introduce a new method of maintaining a (k, ϵ)-coreset for clustering M-estimators over insertion-only streams. Let (P, w) be a weighted set (where w : P → [0, ∞) is the weight function) of points in a ρ-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x, z) ≤ ρ(D(x, y) + D(y, z)) for all x, y, z ∈ X). For any set of points C, we define COST(P, w, C) = ∑_{p∈P} w(p) min_{c∈C} D(p, c). A (k, ϵ)-coreset for (P, w) is a weighted set (Q, v) such that for every set C of k points, (1 − ϵ)COST(P, w, C) ≤ COST(Q, v, C) ≤ (1 + ϵ)COST(P, w, C). Essentially, the coreset (Q, v) can be used in place of (P, w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data. M-estimators are functions D(x, y) that can be written as ψ(d(x, y)) where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (ψ(x) = x) and k-means (ψ(x) = x²) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(ϵ⁻² k log k log n) points of storage. The previous state-of-the-art required storing at least O(ϵ⁻² k log k log⁴ n) points.
AB - We introduce a new method of maintaining a (k, ϵ)-coreset for clustering M-estimators over insertion-only streams. Let (P, w) be a weighted set (where w : P → [0, ∞) is the weight function) of points in a ρ-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x, z) ≤ ρ(D(x, y) + D(y, z)) for all x, y, z ∈ X). For any set of points C, we define COST(P, w, C) = ∑_{p∈P} w(p) min_{c∈C} D(p, c). A (k, ϵ)-coreset for (P, w) is a weighted set (Q, v) such that for every set C of k points, (1 − ϵ)COST(P, w, C) ≤ COST(Q, v, C) ≤ (1 + ϵ)COST(P, w, C). Essentially, the coreset (Q, v) can be used in place of (P, w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data. M-estimators are functions D(x, y) that can be written as ψ(d(x, y)) where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (ψ(x) = x) and k-means (ψ(x) = x²) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(ϵ⁻² k log k log n) points of storage. The previous state-of-the-art required storing at least O(ϵ⁻² k log k log⁴ n) points.
KW - Clustering
KW - Coresets
KW - Streaming
UR - http://www.scopus.com/inward/record.url?scp=85072859236&partnerID=8YFLogxK
U2 - https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2019.62
DO - 10.4230/LIPIcs.APPROX-RANDOM.2019.62
M3 - Conference contribution
T3 - Leibniz International Proceedings in Informatics, LIPIcs
SP - 62:1
EP - 62:16
BT - Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019
A2 - Achlioptas, Dimitris
A2 - Vegh, Laszlo A.
PB - Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
T2 - 22nd International Conference on Approximation Algorithms for Combinatorial Optimization Problems and 23rd International Conference on Randomization and Computation, APPROX/RANDOM 2019
Y2 - 20 September 2019 through 22 September 2019
ER -