Tight Sensitivity Bounds for Smaller Coresets

Alaa Maalouf, Adiel Statman, Dan Feldman

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

An ϵ-coreset for the dimensionality reduction problem of a (possibly very large) matrix A ∈ R^{n×d} is a small scaled subset of its n rows that approximates their sum of squared distances to every affine k-dimensional subspace of R^d, up to a factor of 1±ϵ. Such a coreset is useful for speeding up the computation of a low-rank approximation (k-SVD/k-PCA) while using little memory. Coresets are also useful for handling streaming, dynamic and distributed data in parallel. With high probability, non-uniform sampling based on the so-called leverage score or sensitivity of each row in A yields a coreset. The size of the (sampled) coreset is then near-linear in the total sum of these sensitivity bounds. We provide algorithms that compute provably tight bounds for the sensitivity of each input row. They are based on two ingredients: (i) an iterative algorithm that computes the exact sensitivity of each row up to arbitrarily small precision for (non-affine) k-subspaces, and (ii) a general reduction for computing a coreset for affine subspaces, given a coreset for (non-affine) subspaces in R^d. Experimental results on real-world datasets, including the English Wikipedia document-term matrix, show that our bounds also yield significantly smaller, data-dependent coresets in practice. Full open-source code is also provided.
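The sampling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it uses the classical leverage scores (the sensitivities for approximating ||Ax||^2 over every x in R^d) rather than the tight k-subspace bounds the paper derives, and the function names `leverage_scores` and `sample_coreset` are hypothetical, chosen for illustration only.

```python
import numpy as np

def leverage_scores(A):
    """Classical leverage score of each row: squared norm of the matching row of U."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    # Keep only directions with numerically nonzero singular values.
    tol = s.max() * max(A.shape) * np.finfo(A.dtype).eps
    rank = int(np.sum(s > tol))
    return np.sum(U[:, :rank] ** 2, axis=1)

def sample_coreset(A, m, rng):
    """Sample m rows with probability proportional to their leverage score.

    Each sampled row i gets weight 1 / (m * p_i), so the weighted sum of
    squares is an unbiased estimate of the full sum. Illustrative only.
    """
    scores = leverage_scores(A)
    probs = scores / scores.sum()
    idx = rng.choice(A.shape[0], size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    # Scaling a row by sqrt(weight) scales its squared contribution by weight.
    C = A[idx] * np.sqrt(weights)[:, None]
    return C, idx, weights

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 5))        # synthetic data, an assumption
C, idx, w = sample_coreset(A, 400, rng)

# Compare the sum of squared projections on one random direction.
x = rng.standard_normal(5)
ratio = np.sum((C @ x) ** 2) / np.sum((A @ x) ** 2)
```

Here the leverage scores sum to the rank of A, which is why the sample size needed grows with the total sensitivity; tighter per-row bounds, as the abstract describes, shrink that total and hence the coreset.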

Original language: American English
Title of host publication: KDD 2020 - Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 2051-2061
Number of pages: 11
ISBN (Electronic): 9781450379984
State: Published - 23 Aug 2020
Event: 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2020 - Virtual, Online, United States
Duration: 23 Aug 2020 to 27 Aug 2020

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference: 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2020
Country/Territory: United States
City: Virtual, Online
Period: 23/08/20 to 27/08/20

Keywords

  • PCA
  • SVD
  • coreset
  • dimensionality reduction
  • low rank approximation
  • sketch

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
