TY - GEN
T1 - Minimization of Classifier Construction Cost for Search Queries
AU - Gershtein, Shay
AU - Milo, Tova
AU - Morami, Gefen
AU - Novgorodov, Slava
N1 - Publisher Copyright: © 2020 Association for Computing Machinery.
PY - 2020/6/14
Y1 - 2020/6/14
N2 - Search over massive sets of items is the cornerstone of many modern applications. Users express a set of properties and expect the system to retrieve qualifying items. A common difficulty, however, is that the information on whether an item satisfies the search criteria is not explicitly recorded in the repository. Instead, it may be general knowledge or "hidden" in a picture/description, leading to incomplete search results. To overcome this problem, companies build dedicated classifiers that determine which items satisfy the given criteria. However, building classifiers requires volumes of high-quality labeled training data. Since the costs of training classifiers for different subsets of properties can vastly differ, the choice of which classifiers to train has great monetary significance. The goal of our research is to devise effective algorithms to choose which classifiers one should train to address a given query load while minimizing the cost. Previous work considered a simplified model with uniform classifier costs, and queries with two properties. We remove these restrictions in our model. We prove NP-hard inapproximability bounds and devise several algorithms with approximation guarantees. Moreover, we identify a common special case for which we provide an exact algorithm. Our experiments, performed over real-life datasets, demonstrate the effectiveness and efficiency of our algorithms.
AB - Search over massive sets of items is the cornerstone of many modern applications. Users express a set of properties and expect the system to retrieve qualifying items. A common difficulty, however, is that the information on whether an item satisfies the search criteria is not explicitly recorded in the repository. Instead, it may be general knowledge or "hidden" in a picture/description, leading to incomplete search results. To overcome this problem, companies build dedicated classifiers that determine which items satisfy the given criteria. However, building classifiers requires volumes of high-quality labeled training data. Since the costs of training classifiers for different subsets of properties can vastly differ, the choice of which classifiers to train has great monetary significance. The goal of our research is to devise effective algorithms to choose which classifiers one should train to address a given query load while minimizing the cost. Previous work considered a simplified model with uniform classifier costs, and queries with two properties. We remove these restrictions in our model. We prove NP-hard inapproximability bounds and devise several algorithms with approximation guarantees. Moreover, we identify a common special case for which we provide an exact algorithm. Our experiments, performed over real-life datasets, demonstrate the effectiveness and efficiency of our algorithms.
KW - classifiers
KW - classifiers construction cost
KW - e-commerce
KW - search queries
UR - http://www.scopus.com/inward/record.url?scp=85086240011&partnerID=8YFLogxK
U2 - https://doi.org/10.1145/3318464.3389755
DO - https://doi.org/10.1145/3318464.3389755
M3 - منشور من مؤتمر
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1351
EP - 1365
BT - SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
T2 - 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020
Y2 - 14 June 2020 through 19 June 2020
ER -