TY - GEN
T1 - On Coresets for Support Vector Machines
AU - Tukan, Murad
AU - Baykal, Cenk
AU - Feldman, Dan
AU - Rus, Daniela
N1 - Funding Information: This research was supported in part by the U.S. National Science Foundation (NSF) under Awards 1723943 and 1526815, Office of Naval Research (ONR) Grant N00014-18-1-2830, Microsoft, and JP Morgan Chase. M. Tukan and C. Baykal—These authors contributed equally to this work. Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
AB - We present an efficient coreset construction algorithm for large-scale Support Vector Machine (SVM) training in Big Data and streaming applications. A coreset is a small, representative subset of the original data points such that models trained on the coreset are provably competitive with those trained on the original data set. Since the coreset is generally much smaller than the original set, our preprocess-then-train scheme has the potential to yield significant speedups when training SVM models. We prove lower and upper bounds on the coreset size required to obtain small data summaries for the SVM problem. As a corollary, we show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings. We evaluate the performance of our algorithm on real-world and synthetic data sets. Our experimental results reaffirm the favorable theoretical properties of our algorithm and demonstrate its practical effectiveness in accelerating SVM training.
UR - http://www.scopus.com/inward/record.url?scp=85093830601&partnerID=8YFLogxK
DO - 10.1007/978-3-030-59267-7_25
M3 - Conference contribution
SN - 9783030592660
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 287
EP - 299
BT - Theory and Applications of Models of Computation - 16th International Conference, TAMC 2020, Proceedings
A2 - Chen, Jianer
A2 - Feng, Qilong
A2 - Xu, Jinhui
PB - Springer Science and Business Media Deutschland GmbH
T2 - 16th International Conference on Theory and Applications of Models of Computation, TAMC 2020
Y2 - 18 October 2020 through 20 October 2020
ER -