TY - GEN
T1 - κ-means for streaming and distributed big sparse data
AU - Barger, Artem
AU - Feldman, Dan
N1 - Funding Information: Support for this work has been provided in part by BSF/NSF Grant Number: 2014627 and by GIF 2408-407.6 Young Scientists' Program Contract No.: I-1186-407.9-2014. Publisher Copyright: Copyright © by SIAM.
PY - 2016
Y1 - 2016
AB - We provide the first streaming algorithm for computing a provable approximation to the κ-means of sparse Big Data. Here, sparse Big Data is a stream of n vectors in ℝ^d, where each vector has O(1) non-zero entries and possibly d ≥ n. Examples include the adjacency matrix of a graph and web-link, social-network, document-term, or image-feature matrices. Our streaming algorithm stores at most κ^{O(1)} log n input points in memory. If the stream is distributed among M machines, the running time reduces by a factor of M, while communicating a total of M κ^{O(1)} (sparse) input points between the machines. Our main contribution is a deterministic algorithm for computing a sparse (κ,ϵ)-coreset, which is a weighted subset of κ^{O(1)} input points that approximates the sum of squared distances from the n input points to every set of κ centers, up to a (1 ± ϵ) factor, for any given constant ϵ > 0. This is the first such coreset of size independent of both d and n. Our experimental results show how our algorithm can be used to boost the performance of any given κ-means heuristic, even in the off-line setting. Open access to our implementation is also provided.
KW - Big-Data
KW - Clustering
KW - Coresets
KW - Distributed
KW - Streaming
KW - κ-Means
UR - http://www.scopus.com/inward/record.url?scp=84991691428&partnerID=8YFLogxK
U2 - https://doi.org/10.1137/1.9781611974348.39
DO - https://doi.org/10.1137/1.9781611974348.39
M3 - Conference contribution
T3 - 16th SIAM International Conference on Data Mining 2016, SDM 2016
SP - 342
EP - 350
BT - 16th SIAM International Conference on Data Mining 2016, SDM 2016
A2 - Venkatasubramanian, Sanjay Chawla
A2 - Meira, Wagner
PB - Society for Industrial and Applied Mathematics Publications
T2 - 16th SIAM International Conference on Data Mining 2016, SDM 2016
Y2 - 5 May 2016 through 7 May 2016
ER -