TY - GEN
T1 - Turning Big data into tiny data
T2 - 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013
AU - Feldman, Dan
AU - Schmidt, Melanie
AU - Sohler, Christian
PY - 2013
Y1 - 2013
N2 - We prove that the sum of the squared Euclidean distances from the n rows of an n × d matrix A to any compact set that is spanned by k vectors in ℝ^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the O(k/ε²) first right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^O(jk) for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that depends linearly or even exponentially on d, which makes them useless when d ∼ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that is polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size k^(1/ε^O(1)) for k-means and a special class of Bregman divergences that is less dependent on the properties of the squared Euclidean distance.
UR - http://www.scopus.com/inward/record.url?scp=84876035763&partnerID=8YFLogxK
U2 - 10.1137/1.9781611973105.103
DO - 10.1137/1.9781611973105.103
M3 - Conference contribution
SN - 9781611972511
T3 - Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms
SP - 1434
EP - 1453
BT - Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013
PB - Association for Computing Machinery
Y2 - 6 January 2013 through 8 January 2013
ER -