TY - GEN
T1 - Modularity-based query clustering for identifying users sharing a common condition
AU - Harel, Maayan
AU - Yom-Tov, Elad
N1 - Publisher Copyright: © 2015 ACM.
PY - 2015/8/9
Y1 - 2015/8/9
N2 - We present an algorithm for identifying users who share a common condition from anonymized search engine logs. Input to the algorithm is a set of seed phrases that identify users with the condition of interest with high precision albeit at a very low recall. We expand the set of seed phrases by clustering queries according to the pages users clicked following these queries and the temporal ordering of queries within sessions, emphasizing the subgraph containing seed phrases. To this end, we extend modularity-based clustering such that it uses the information in the initial seed phrases as well as other queries of users in the population of interest. We evaluate the performance of the proposed method on two datasets, one of mood disorders and the other of anorexia, by classifying users according to the clusters in which they appeared and the phrases contained thereof, and show that the area under the receiver operating characteristic curve (AUC) obtained by these methods exceeds 0.87. These results demonstrate the value of our algorithm for both identifying users for future research and to gain better understanding of the language associated with the condition.
AB - We present an algorithm for identifying users who share a common condition from anonymized search engine logs. Input to the algorithm is a set of seed phrases that identify users with the condition of interest with high precision albeit at a very low recall. We expand the set of seed phrases by clustering queries according to the pages users clicked following these queries and the temporal ordering of queries within sessions, emphasizing the subgraph containing seed phrases. To this end, we extend modularity-based clustering such that it uses the information in the initial seed phrases as well as other queries of users in the population of interest. We evaluate the performance of the proposed method on two datasets, one of mood disorders and the other of anorexia, by classifying users according to the clusters in which they appeared and the phrases contained thereof, and show that the area under the receiver operating characteristic curve (AUC) obtained by these methods exceeds 0.87. These results demonstrate the value of our algorithm for both identifying users for future research and to gain better understanding of the language associated with the condition.
UR - http://www.scopus.com/inward/record.url?scp=84953775518&partnerID=8YFLogxK
U2 - https://doi.org/10.1145/2766462.2767798
DO - https://doi.org/10.1145/2766462.2767798
M3 - منشور من مؤتمر
T3 - SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 819
EP - 822
BT - SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
T2 - 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015
Y2 - 9 August 2015 through 13 August 2015
ER -