Training Gaussian mixture models at scale via coresets

Mario Lucic, Matthew Faulkner, Andreas Krause, Dan Feldman

Research output: Contribution to journal › Article › peer-review

Abstract

How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in the dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed in both distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
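To make the definition of a coreset in the abstract concrete, the following is a minimal sketch and not the construction from the paper. It assumes a simple importance-sampling scheme (half uniform, half proportional to squared distance from the data mean) and spherical Gaussian components. The function names `build_coreset` and `weighted_nll` are hypothetical, introduced only for illustration. The sketch checks that the weighted negative log-likelihood of one fixed mixture on the sample is close to its value on the full data set. A coreset in the paper's sense gives this guarantee for all candidate models at once, which this sketch does not establish.

```python
import numpy as np


def build_coreset(X, m, rng):
    """Sample m weighted points from X by importance sampling (illustrative only)."""
    n = X.shape[0]
    sq_dist = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    # Assumed sampling distribution: half uniform, half proportional to the
    # squared distance from the data mean, so outlying points are not missed.
    q = 0.5 / n + 0.5 * sq_dist / sq_dist.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    # Inverse-probability weights make weighted sums unbiased estimates
    # of the corresponding sums over the full data set.
    weights = 1.0 / (m * q[idx])
    return X[idx], weights


def weighted_nll(X, weights, means, variances, mix):
    """Weighted negative log-likelihood under a spherical Gaussian mixture."""
    d = X.shape[1]
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    log_comp = (np.log(mix)
                - 0.5 * d * np.log(2.0 * np.pi * variances)
                - 0.5 * sq / variances)
    # Log-sum-exp over components, stabilised by subtracting the row maximum.
    mx = log_comp.max(axis=1, keepdims=True)
    log_lik = mx[:, 0] + np.log(np.exp(log_comp - mx).sum(axis=1))
    return -(weights * log_lik).sum()


# Synthetic data: two well-separated spherical Gaussians in 2D.
rng = np.random.default_rng(0)
n = 20000
means = np.array([[-3.0, 0.0], [3.0, 0.0]])
variances = np.array([1.0, 1.0])
mix = np.array([0.5, 0.5])
labels = rng.integers(0, 2, size=n)
X = means[labels] + rng.normal(size=(n, 2))

C, w = build_coreset(X, 2000, rng)
full_cost = weighted_nll(X, np.ones(n), means, variances, mix)
coreset_cost = weighted_nll(C, w, means, variances, mix)
rel_err = abs(coreset_cost - full_cost) / full_cost
print(f"full: {full_cost:.1f}  coreset: {coreset_cost:.1f}  rel. error: {rel_err:.4f}")
```

In practice the weighted subset `(C, w)` would be passed to a fitting routine that accepts per-point weights, in place of the full data set `X`.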

Original language: American English
Pages (from-to): 1-25
Number of pages: 25
Journal: Journal of Machine Learning Research
Volume: 18
State: Published - 1 May 2018

Keywords

  • Coresets
  • Distributed computation
  • Gaussian mixture models
  • Streaming

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence
  • Control and Systems Engineering
  • Statistics and Probability
