Analyses of people’s medical, behavioral, and sociological data are essential for understanding the pandemic and devising countermeasures. For example, researchers have evaluated people’s behavioral and attribute data to determine the impact of governmental policies concerning COVID-19. Most studies handled personal data appropriately. However, two risks remain: personal data may be overprotected, so that it reaches only a limited number of researchers, or, conversely, it may not be protected adequately.
To make data available to the many researchers capable of conducting statistical analyses and machine learning while protecting individual privacy, personally identifiable information must be eliminated. The essential steps are removing explicit identifiers, sampling the population data, and applying differential privacy, the de facto standard privacy metric used by companies such as Google, Apple, and Microsoft. However, it had not been verified how often individuals can still be re-identified after these steps are applied. This study shows that a person may be correctly re-identified with 20%–60% accuracy, depending on the dataset and experimental setup.
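The three steps named above can be sketched as follows. This is a minimal illustration and not the study's implementation: the function names, the example records, and the choice of the Laplace mechanism for a single counting query (sensitivity 1, privacy parameter `epsilon`) are all assumptions made for the example.

```python
import random

def strip_identifiers(records, id_fields):
    """Step 1: drop explicit identifiers such as names."""
    return [{k: v for k, v in r.items() if k not in id_fields} for r in records]

def sample_records(records, rate, rng):
    """Step 2: keep each record independently with probability `rate`."""
    return [r for r in records if rng.random() < rate]

def laplace_noise(scale, rng):
    """The difference of two exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon, rng):
    """Step 3: Laplace mechanism for one counting query (sensitivity 1)."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical records, for illustration only.
rng = random.Random(0)
citizens = [
    {"name": "John", "age": 34, "infected": True},
    {"name": "Mary", "age": 34, "infected": False},
    {"name": "Ken", "age": 51, "infected": True},
]
released = sample_records(strip_identifiers(citizens, {"name"}), rate=0.5, rng=rng)
print(noisy_count(released, lambda r: r["infected"], epsilon=1.0, rng=rng))
```

A smaller `epsilon` adds more noise and gives stronger protection. The study's point is that even after such processing, re-identification can still succeed in some settings.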
For instance, consider the following scenario. A city has established an IoT environment to collect personal data for COVID-19 measures. If John’s combination of attribute values (estimated age, height, weight, and COVID-19 infection status) is unique among the city’s residents, that combination alone identifies him. Therefore, if anonymity is required, it is desirable to discard John’s data. In this scenario, we assume the city holds a dataset covering some of its citizens. Given a person’s attribute values, the system built in this study predicts with high accuracy how many citizens share those values, regardless of whether the person appears in the database. Such a system matters both for privacy and for city policy, because policy decisions require knowing how many people in the entire population have certain attributes.
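The two ideas in this scenario, uniqueness of an attribute combination and estimating how many citizens share it, can be illustrated with a short sketch. The helper names and field names are hypothetical, and the estimator (matches in the sample divided by the sampling rate) is a simple stand-in chosen for illustration, not the prediction system built in the study.

```python
from collections import Counter

def attribute_key(record, fields):
    """Tuple of the record's values for the chosen attributes."""
    return tuple(record[f] for f in fields)

def unique_records(records, fields):
    """Records whose attribute combination appears exactly once."""
    counts = Counter(attribute_key(r, fields) for r in records)
    return [r for r in records if counts[attribute_key(r, fields)] == 1]

def estimate_population_count(sample, fields, target, sampling_rate):
    """Naive estimate: matches in the sample scaled by 1 / sampling_rate.

    Assumes each citizen entered the sample independently with
    probability `sampling_rate` (an assumption of this sketch).
    """
    key = attribute_key(target, fields)
    matches = sum(1 for r in sample if attribute_key(r, fields) == key)
    return matches / sampling_rate

# Hypothetical sample of a city's data, for illustration only.
fields = ["age", "height", "infected"]
sample = [
    {"age": 34, "height": 180, "infected": True},
    {"age": 34, "height": 180, "infected": True},
    {"age": 51, "height": 165, "infected": False},
]
print(unique_records(sample, fields))
print(estimate_population_count(sample, fields, sample[0], sampling_rate=0.5))
```

A record returned by `unique_records` is unique only within the sample. Whether it is also unique in the whole city is exactly what a population-count estimate is meant to inform.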
This study examines the above problem in the setting of differential privacy and multiple incomplete databases. The experimental results show that, in some cases, the possibility of being identified remains high even with differential privacy. They also show that the possibility of being identified increases when multiple databases are considered.
IEEE Open Journal of the Computer Society
Individual Re-identification from Incomplete Datasets Protected by Differential Privacy