TY - GEN

T1 - Turning Big data into tiny data

T2 - 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013

AU - Feldman, Dan

AU - Schmidt, Melanie

AU - Sohler, Christian

PY - 2013

Y1 - 2013

N2 - We prove that the sum of the squared Euclidean distances from the n rows of an n × d matrix A to any compact set that is spanned by k vectors in ℝ^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε^2)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the O(k/ε^2) first right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size 𝒪(k) for handling k-means queries, (j, 1)-coresets of size 𝒪(j) for PCA queries, and (j, k)-coresets of size (log n)^𝒪(jk) for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that is linearly or even exponentially dependent on d, which makes them useless when d ∼ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that is polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances, we suggest a simple recursive coreset construction that produces coresets of size k^(1/ε^𝒪(1)) for k-means and a special class of Bregman divergences that is less dependent on the properties of the squared Euclidean distance.

AB - We prove that the sum of the squared Euclidean distances from the n rows of an n × d matrix A to any compact set that is spanned by k vectors in ℝ^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε^2)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the O(k/ε^2) first right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size 𝒪(k) for handling k-means queries, (j, 1)-coresets of size 𝒪(j) for PCA queries, and (j, k)-coresets of size (log n)^𝒪(jk) for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that is linearly or even exponentially dependent on d, which makes them useless when d ∼ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that is polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances, we suggest a simple recursive coreset construction that produces coresets of size k^(1/ε^𝒪(1)) for k-means and a special class of Bregman divergences that is less dependent on the properties of the squared Euclidean distance.

UR - http://www.scopus.com/inward/record.url?scp=84876035763&partnerID=8YFLogxK

U2 - 10.1137/1.9781611973105.103

DO - 10.1137/1.9781611973105.103

M3 - Conference contribution

SN - 9781611972511

T3 - Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms

SP - 1434

EP - 1453

BT - Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013

Y2 - 6 January 2013 through 8 January 2013

ER -