Abstract
In the era of data-driven decision-making, efficient machine learning model training is crucial. We present a novel algorithm for constructing tabular data coresets using data maps created for Gradient Boosting Decision Trees models. The resulting coresets, computed within minutes, consistently outperform other baselines and match or exceed the performance of models trained on the entire dataset. Additionally, a training enhancement method leveraging datamap insights during the inference phase improves performance with mathematical guarantees, given a defined property holds. An explainability layer and tools for coreset size optimization further enhance the efficiency of training tabular machine learning models.
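
The abstract does not spell out how the data maps are built or how samples are chosen from them. As a rough illustration only, the sketch below assumes the common training-dynamics formulation of a data map: for each sample, confidence is the mean predicted probability of its true label across training stages, and variability is the standard deviation of that probability, with boosting rounds standing in for epochs. The function names `datamap_stats` and `select_coreset`, the toy probabilities, and the "keep the highest-variability samples" rule are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def datamap_stats(stage_probs, labels):
    """Return (confidence, variability) per sample.

    stage_probs[t][i][c] is the predicted probability of class c for
    sample i after boosting stage t (assumed input layout).
    """
    stats = []
    for i, y in enumerate(labels):
        # Probability assigned to the true label at every stage.
        true_probs = [probs[i][y] for probs in stage_probs]
        stats.append((mean(true_probs), pstdev(true_probs)))
    return stats


def select_coreset(stats, size):
    """Hypothetical selection rule: keep the `size` most variable samples."""
    order = sorted(range(len(stats)), key=lambda i: stats[i][1], reverse=True)
    return sorted(order[:size])


# Toy data: 3 boosting stages, 4 samples, 2 classes (made-up numbers).
stage_probs = [
    [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6], [0.9, 0.1]],
    [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5], [0.9, 0.1]],
    [[0.9, 0.1], [0.7, 0.3], [0.3, 0.7], [0.9, 0.1]],
]
labels = [0, 1, 1, 0]

stats = datamap_stats(stage_probs, labels)
coreset = select_coreset(stats, 2)
print(coreset)
```

With a real model, the per-stage probabilities could come from something like scikit-learn's `GradientBoostingClassifier.staged_predict_proba`, which yields one probability array per boosting stage. The selection criterion actually used in the paper may differ substantially from this placeholder.
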
| Original language | English |
|---|---|
| Pages (from-to) | 876-888 |
| Number of pages | 13 |
| Journal | Proceedings of the VLDB Endowment |
| Volume | 18 |
| Issue number | 3 |
| State | Published - 2025 |
| Event | 51st International Conference on Very Large Data Bases, VLDB 2025, London, United Kingdom (1 Sep 2025 → 5 Sep 2025) |
All Science Journal Classification (ASJC) codes
- Computer Science (miscellaneous)
- General Computer Science
Publication title: Datamap-Driven Tabular Coreset Selection for Classifier Training