Databases or datasets containing personal information, such as healthcare records or mobile subscribers' location records, are increasingly being used for secondary purposes, such as medical research, public policy analysis, and marketing studies. Such use makes it increasingly possible for third parties to identify individuals associated with the data and to learn personal, sensitive information about those individuals.
Undesirable invasion of an individual's privacy may occur even after the data has been anonymized, for example, by removing or masking explicit sensitive fields such as those that contain an individual's name, social security number, or other such explicit information that directly identifies a person.
One way this may occur is by analyzing less explicit, so-called “quasi-identifier” fields in a dataset. In this regard, a set of quasi-identifier fields may be any subset of fields of a given dataset which can either be matched with other, external datasets to infer the identities of the individuals involved, or be used to determine the value of another sensitive field in the dataset based upon the values contained in such fields.
For example, quasi-identifier fields may contain an individual's ZIP code, gender, or date of birth, which, while not explicit, may be matched with corresponding fields in external, publicly available datasets such as census data, birth-death records, and voter registration lists to explicitly identify an individual. Similarly, it may be possible to infer the values of otherwise hidden fields containing sensitive information, such as disease diagnoses, if the values in those hidden fields depend upon the values of other quasi-identifier fields in the dataset, such as fields containing clinical symptoms and/or medications prescribed, from which the hidden information may be independently determined.
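The matching described above may be illustrated with a minimal sketch. The records, names, field names, and the helper `link_records` below are hypothetical and chosen only for illustration; the sketch assumes that an external list shares the ZIP code, gender, and date-of-birth fields with the anonymized dataset, and it is not intended to represent any particular conventional system.

```python
from collections import defaultdict

# Hypothetical "anonymized" medical records: explicit identifiers removed,
# quasi-identifier fields (zip, gender, dob) left in place.
anonymized_records = [
    {"zip": "02138", "gender": "F", "dob": "1961-07-31", "diagnosis": "hypertension"},
    {"zip": "02139", "gender": "M", "dob": "1975-02-14", "diagnosis": "asthma"},
    {"zip": "02138", "gender": "M", "dob": "1961-07-31", "diagnosis": "diabetes"},
]

# Hypothetical external, publicly available list (e.g. voter registration).
voter_list = [
    {"name": "Alice Example", "zip": "02138", "gender": "F", "dob": "1961-07-31"},
    {"name": "Bob Example", "zip": "02139", "gender": "M", "dob": "1975-02-14"},
    {"name": "Carol Example", "zip": "02140", "gender": "F", "dob": "1980-11-02"},
]

# Assumed set of quasi-identifier fields shared by both datasets.
QUASI_IDENTIFIERS = ("zip", "gender", "dob")


def quasi_key(record):
    """Return the tuple of quasi-identifier values for a record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_records(anonymized, external):
    """Pair each anonymized record with a name when exactly one external
    record shares the same quasi-identifier values."""
    index = defaultdict(list)
    for person in external:
        index[quasi_key(person)].append(person["name"])

    linked = []
    for record in anonymized:
        names = index.get(quasi_key(record), [])
        if len(names) == 1:  # a unique match re-identifies the individual
            linked.append((names[0], record["diagnosis"]))
    return linked


if __name__ == "__main__":
    for name, diagnosis in link_records(anonymized_records, voter_list):
        print(f"{name}: {diagnosis}")
```

In this sketch, two of the three anonymized records are re-identified even though no name appears in the anonymized dataset, because their combination of quasi-identifier values is unique in the external list. The third record remains unlinked only because no external record happens to share its values.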
Typical systems and methods that seek to protect information contained in a dataset have several shortcomings. For example, many conventional methods depend upon a central tenet that all fields that qualify as either explicit or quasi-identifier fields can be easily identified in a dataset, which is not always the case. In addition, typical conventional techniques primarily focus on preventing the identities of individuals from being revealed and do not adequately address the situation where values in other sensitive fields, such as an HIV diagnosis, may need to be hidden. Furthermore, conventional techniques that rely upon statistical analysis or machine learning approaches to determine quasi-identifiers in a dataset, while useful, are also prone to producing many false positives (fields falsely identified as being quasi-identifiers when they are not) as well as many false negatives (fields falsely identified as not being quasi-identifiers when they are).
Therefore, improved methods and systems are desired for identifying and anonymizing quasi-identifier fields in a dataset whose values may be used to infer the values in other sensitive fields.