TY - GEN
T1 - Classifier Construction Under Budget Constraints
AU - Gershtein, Shay
AU - Milo, Tova
AU - Novgorodov, Slava
AU - Razmadze, Kathy
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/6/10
Y1 - 2022/6/10
AB - Search mechanisms over large assortments of items are central to the operation of many platforms. As users commonly express filtering conditions based on item properties that are not initially stored, companies must derive the missing information by training and applying binary classifiers. Choosing which classifiers to construct is, however, not trivial, since classifiers differ in construction cost and range of applicability. Previous work has considered the problem of selecting a classifier set of minimum construction cost, but under the (often unrealistic) assumption that the available budget is unlimited and suffices to support all search queries. In practice, budget constraints require prioritizing some queries over others. To capture this consideration, we study a more general model that assigns each search query a score modeling how important it is to compute its result set, and we examine the optimization problem of selecting a classifier set whose cost is within the budget and that maximizes the overall score of the queries it can answer. We show that this generalization is likely much harder to approximate, even in limited special cases. Nevertheless, we devise a heuristic algorithm whose effectiveness is demonstrated in our experimental study over real-world data, consisting of a public dataset and datasets provided by a large e-commerce company that include costs and scores derived by business analysts. Finally, we show that our methods are also applicable to related problems in practical settings where there is some flexibility in determining the budget.
KW - attributes extraction
KW - classifier construction
KW - data completion
UR - http://www.scopus.com/inward/record.url?scp=85132694983&partnerID=8YFLogxK
U2 - https://doi.org/10.1145/3514221.3517863
DO - https://doi.org/10.1145/3514221.3517863
M3 - Conference contribution
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1160
EP - 1174
BT - SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
T2 - 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022
Y2 - 12 June 2022 through 17 June 2022
ER -