Abstract
People's location data are continuously tracked by various devices and sensors, enabling an ongoing analysis of sensitive information that can violate people's privacy and reveal confidential information. Synthetic data have been used to generate representative location sequences while maintaining users' privacy. Nonetheless, the tradeoff between privacy and accuracy has not been addressed systematically. In this article, we analyze the use of different synthetic data generation models for long location sequences, including long short-term memory networks (LSTMs), Markov Chains (MC), and variable-order Markov models (VMMs). We employ different performance measures, such as data similarity and privacy, and discuss the inherent tradeoff between them. Furthermore, we introduce additional metrics to quantify each of these measures. Based on the anonymous data of 300 thousand cellular-phone users, our work offers a road map for developing policies for synthetic data generation processes. We propose a framework for building data generation models and evaluating their effectiveness with respect to these accuracy and privacy measures.
Original language | English |
---|---|
Article number | 118 |
Journal | ACM Transactions on Knowledge Discovery from Data |
Volume | 16 |
Issue number | 6 |
DOIs | |
State | Published - 30 Jul 2022 |
Keywords
- Synthetic data
- location sequences
- long short-term memory network (LSTM)
- privacy
All Science Journal Classification (ASJC) codes
- General Computer Science