Sets Clustering

Ibrahim Jubran, Murad Tukan, Alaa Maalouf, Dan Feldman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


The input to the sets-$k$-means problem is an integer $k \geq 1$ and a set $\mathcal{P} = \{P_1, \ldots, P_n\}$ of fixed-size sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P \in \mathcal{P}} \min_{p \in P, c \in C} \|p - c\|^2$ of squared distances to these sets. An $\varepsilon$-core-set for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a factor of $1 \pm \varepsilon$, for every set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2 n)$ sets always exists, and can be computed in $O(n \log n)$ time, for every input $\mathcal{P}$ and every fixed $d, k \geq 1$ and $\varepsilon \in (0, 1)$. The result generalizes easily to any metric space, distances to the power of $z > 0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm to this core-set allows us to obtain the first PTAS ($1 + \varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k = 1$, $d = 2$). Open source code and experimental results for document classification and facility locations are also provided.
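To make the objective concrete, the following is a minimal Python sketch of the sets-k-means cost as defined in the abstract, together with a check of the 1 ± ε condition for one given set of centers. It is not the authors' released code and does not construct a core-set; the function names `sets_kmeans_cost` and `within_eps`, the optional `weights` argument, and the toy data are assumptions made for illustration only.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def sets_kmeans_cost(sets, centers, weights=None):
    """Weighted sum, over the input sets, of the smallest squared distance
    between any point of the set and any center.
    (Hypothetical helper; unit weights give the plain objective.)"""
    if weights is None:
        weights = [1.0] * len(sets)
    return sum(
        w * min(sq_dist(p, c) for p in point_set for c in centers)
        for w, point_set in zip(weights, sets)
    )


def within_eps(full_cost, approx_cost, eps):
    """True if approx_cost is within a (1 +/- eps) factor of full_cost.
    This tests a single choice of centers; the core-set definition
    requires the bound to hold for every set of k centers."""
    return (1 - eps) * full_cost <= approx_cost <= (1 + eps) * full_cost


# Toy input: n = 3 sets of 2 points each in the plane (d = 2).
sets = [
    [(0, 0), (4, 0)],
    [(1, 1), (5, 5)],
    [(10, 0), (3, 4)],
]

cost_one_center = sets_kmeans_cost(sets, [(0, 0)])           # sets-mean, k = 1
cost_two_centers = sets_kmeans_cost(sets, [(0, 0), (5, 5)])  # k = 2
```

For the single center (0, 0), the three sets contribute 0, 2, and 25 respectively, since each set is charged only for its closest point. Adding the second center (5, 5) lowers the contributions to 0, 0, and 5.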

Original language: English
Title of host publication: 37th International Conference on Machine Learning, ICML 2020
Editors: Hal Daume, Aarti Singh
Number of pages: 12
ISBN (Electronic): 9781713821120
State: Published - 2020
Event: 37th International Conference on Machine Learning, ICML 2020 - Virtual, Online
Duration: 13 Jul 2020 → 18 Jul 2020

Publication series

Name: 37th International Conference on Machine Learning, ICML 2020


Conference: 37th International Conference on Machine Learning, ICML 2020
City: Virtual, Online

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Human-Computer Interaction
  • Software

