Present invention embodiments relate to information and data management, and more specifically, to data privacy protection using a k-anonymity model with probabilistic match self-scoring.
Data privacy concerns may arise when a data holder wishes to release a version of the data for research. Tools for mining large data samples are capable of joining and searching hundreds of millions or even billions of data records. These tools may be used for applications that require trusted data about customers, clients, patients, guests, citizens, persons of interest, or other parties. When multiple sources of information are joined and intelligently processed, sensitive party information may be inferred and unintentionally disclosed.
One approach to addressing these data privacy concerns is k-anonymity. A data set released by a data holder has k-anonymity protection if the information for each person contained in the released data set cannot be distinguished from that of at least k−1 other individuals whose information also appears in the release. Typically, a data holder anonymizes party data by removing explicit identifiers such as name, address, and phone number while leaving other demographic information in the released data set. Such a data set can include, for example, the date of birth, gender, and zip code. In this scenario, k-anonymity means that each distinct combination of date of birth, gender, and zip code that appears in the data set is repeated at least k times. Higher values of the parameter k signify greater uncertainty in identifying an individual and therefore provide better privacy protection of the party data. Even if the released data is joined with data available in the public domain (e.g., voter registration information), a potential data privacy attacker will face uncertainty in party identification because a single record in the public domain will match at least k records in the data released by the data holder. However, there is a tradeoff between keeping the data complete enough to be useful and preserving data privacy.
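By way of illustration only, the k-anonymity condition described above can be checked by counting how often each distinct combination of the remaining demographic attributes occurs in the released data set. The following minimal Python sketch assumes records are held as dictionaries; the function name `k_anonymity`, the field names `dob`, `gender`, and `zip`, and the sample records are hypothetical and are not part of any embodiment.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous.

    Hypothetical helper: `records` is a list of dicts and
    `quasi_identifiers` names the attributes left in the release
    (e.g., date of birth, gender, zip code).
    """
    # Count how many records share each distinct attribute combination.
    counts = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    # The smallest group size is the k that the release actually provides.
    return min(counts.values()) if counts else 0


# Illustrative (made-up) released data with explicit identifiers removed.
released = [
    {"dob": "1980-01-15", "gender": "F", "zip": "10001"},
    {"dob": "1980-01-15", "gender": "F", "zip": "10001"},
    {"dob": "1975-06-30", "gender": "M", "zip": "94105"},
    {"dob": "1975-06-30", "gender": "M", "zip": "94105"},
    {"dob": "1975-06-30", "gender": "M", "zip": "94105"},
]
QUASI_IDENTIFIERS = ("dob", "gender", "zip")

k = k_anonymity(released, QUASI_IDENTIFIERS)
print(f"The released data set is {k}-anonymous")
```

In this sketch the smallest group contains two records, so a public record matching on date of birth, gender, and zip code would correspond to at least two released records. Adding a single record with a unique combination would reduce k to 1, which illustrates the tradeoff between data completeness and privacy noted above.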