Datamap-Driven Tabular Coreset Selection for Classifier Training

Aviv Hadar, Tova Milo, Kathy Razmadze

Research output: Contribution to journal › Conference article › peer-review

Abstract

In the era of data-driven decision-making, efficient machine learning model training is crucial. We present a novel algorithm for constructing tabular data coresets using data maps created for Gradient Boosting Decision Trees models. The resulting coresets, computed within minutes, consistently outperform other baselines and match or exceed the performance of models trained on the entire dataset. Additionally, a training enhancement method leveraging datamap insights during the inference phase improves performance with mathematical guarantees, given a defined property holds. An explainability layer and tools for coreset size optimization further enhance the efficiency of training tabular machine learning models.
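
As a rough illustration of the data-map idea mentioned in the abstract, the sketch below computes the two statistics commonly used in data maps: confidence (the mean probability assigned to an example's true label across training steps) and variability (the standard deviation of that probability). Everything here is an assumption made for illustration: treating boosting stages as the training steps, the input layout `stage_probs`, the helper names `datamap_stats` and `select_coreset`, and the rule of keeping the most variable examples. It is not the selection algorithm from the paper, which the abstract does not specify.

```python
from statistics import fmean, pstdev


def datamap_stats(stage_probs):
    """Return (confidence, variability) for each example.

    stage_probs[i][t] is the probability assigned to the TRUE label of
    example i after boosting stage t (hypothetical input layout).
    """
    return [(fmean(p), pstdev(p)) for p in stage_probs]


def select_coreset(stage_probs, k):
    """Keep the k examples with the highest variability.

    This ranking rule is an illustrative assumption, not the paper's method.
    Returns the selected example indices in ascending order.
    """
    stats = datamap_stats(stage_probs)
    ranked = sorted(range(len(stats)), key=lambda i: -stats[i][1])
    return sorted(ranked[:k])


# Toy data: 4 examples tracked over 3 boosting stages.
stage_probs = [
    [0.90, 0.95, 0.99],  # consistently right: high confidence, low variability
    [0.20, 0.60, 0.90],  # learned late: high variability
    [0.10, 0.10, 0.10],  # consistently wrong: low confidence, zero variability
    [0.40, 0.70, 0.50],  # fluctuating: moderate variability
]
coreset = select_coreset(stage_probs, k=2)
print(coreset)
```

With a real gradient-boosting model, the per-stage probabilities would have to come from the library's staged prediction facilities; the toy list above stands in for that step.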

Original language: English
Pages (from-to): 876-888
Number of pages: 13
Journal: Proceedings of the VLDB Endowment
Volume: 18
Issue number: 3
State: Published - 2025
Event: 51st International Conference on Very Large Data Bases, VLDB 2025 - London, United Kingdom
Duration: 1 Sep 2025 to 5 Sep 2025

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • General Computer Science
