TY - GEN
T1 - Unsupervised topic extraction from privacy policies
AU - Sarne, David
AU - Schler, Jonathan
AU - Singer, Alon
AU - Sela, Ayelet
AU - Bar-Siman-Tov, Ittai
N1 - Publisher Copyright: © 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY-NC-ND 4.0 License.
PY - 2019/5/13
Y1 - 2019/5/13
N2 - This paper suggests the use of automatic topic modeling for large-scale corpora of privacy policies using unsupervised learning techniques. The advantages of using unsupervised learning for this task are numerous. The primary advantages include the ability to analyze any new corpus with a fraction of the effort required by supervised learning, the ability to study changes in topics of interest along time, and the ability to identify finer-grained topics of interest in these privacy policies. Based on general principles of document analysis we synthesize a cohesive framework for privacy policy topic modeling and apply it over a corpus of 4,982 privacy policies of mobile applications crawled from the Google Play Store. The results demonstrate that even with this relatively moderate-size corpus quite comprehensive insights can be attained regarding the focus and scope of current privacy policy documents. The topics extracted, their structure and the applicability of the unsupervised approach for that matter are validated through an extensive comparison to similar findings reported in prior work that uses supervised learning (which heavily depends on manual annotation of experts). The comparison suggests a substantial overlap between the topics found and those reported in prior work, and also unveils some new topics of interest.
KW - Privacy policies
KW - Topic modeling
KW - Unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85066886548&partnerID=8YFLogxK
U2 - 10.1145/3308560.3317585
DO - 10.1145/3308560.3317585
M3 - Conference contribution
T3 - The Web Conference 2019 - Companion of the World Wide Web Conference, WWW 2019
SP - 563
EP - 568
BT - The Web Conference 2019 - Companion of the World Wide Web Conference, WWW 2019
T2 - 2019 World Wide Web Conference, WWW 2019
Y2 - 13 May 2019 through 17 May 2019
ER -