TY - GEN
T1 - Inferring individual attributes from search engine queries and auxiliary information
AU - Soldaini, Luca
AU - Yom-Tov, Elad
N1 - Publisher Copyright: © 2017 International World Wide Web Conference Committee (IW3C2).
PY - 2017
Y1 - 2017
N2 - Internet data has surfaced as a primary source for investigation of different aspects of human behavior. A crucial step in such studies is finding a suitable cohort (i.e., a set of users) that shares a common trait of interest to researchers. However, direct identification of users sharing this trait is often impossible, as the data available to researchers is usually anonymized to preserve user privacy. To facilitate research on specific topics of interest, especially in medicine, we introduce an algorithm for identifying a trait of interest in anonymous users. We illustrate how a small set of labeled examples, together with statistical information about the entire population, can be aggregated to obtain labels on unseen examples. We validate our approach using labeled data from the political domain. We provide two applications of the proposed algorithm to the medical domain. In the first, we demonstrate how to identify users whose search patterns indicate they might be suffering from certain types of cancer. This shows, for the first time, that search queries can be used as a screening device for diseases that are currently often discovered too late, because no early screening tests exists. In the second, we detail an algorithm to predict the distribution of diseases given their incidence in a subset of the population at study, making it possible to predict disease spread from partial epidemiological data.
AB - Internet data has surfaced as a primary source for investigation of different aspects of human behavior. A crucial step in such studies is finding a suitable cohort (i.e., a set of users) that shares a common trait of interest to researchers. However, direct identification of users sharing this trait is often impossible, as the data available to researchers is usually anonymized to preserve user privacy. To facilitate research on specific topics of interest, especially in medicine, we introduce an algorithm for identifying a trait of interest in anonymous users. We illustrate how a small set of labeled examples, together with statistical information about the entire population, can be aggregated to obtain labels on unseen examples. We validate our approach using labeled data from the political domain. We provide two applications of the proposed algorithm to the medical domain. In the first, we demonstrate how to identify users whose search patterns indicate they might be suffering from certain types of cancer. This shows, for the first time, that search queries can be used as a screening device for diseases that are currently often discovered too late, because no early screening tests exists. In the second, we detail an algorithm to predict the distribution of diseases given their incidence in a subset of the population at study, making it possible to predict disease spread from partial epidemiological data.
KW - Disease screening
KW - Health informatics
KW - Query log analysis
UR - http://www.scopus.com/inward/record.url?scp=85041079867&partnerID=8YFLogxK
U2 - https://doi.org/10.1145/3038912.3052629
DO - https://doi.org/10.1145/3038912.3052629
M3 - منشور من مؤتمر
SN - 9781450349130
T3 - 26th International World Wide Web Conference, WWW 2017
SP - 293
EP - 302
BT - 26th International World Wide Web Conference, WWW 2017
T2 - 26th International World Wide Web Conference, WWW 2017
Y2 - 3 April 2017 through 7 April 2017
ER -