Field
The present disclosure relates to risk assessment of datasets and in particular to reducing re-identification risk of a dataset.
Description of Related Art
Personal information is continuously captured in a multitude of electronic databases. Details about health, financial status and buying habits are stored in databases managed by public and private sector organizations. These databases contain information about millions of people, which can provide valuable research, epidemiologic and business insight. For example, examining a drugstore chain's prescriptions can indicate where a flu outbreak is occurring. To extract or maximize the value contained in these databases, data custodians often must provide outside organizations access to their data. In order to protect the privacy of the people whose data is being analyzed, a data custodian will “de-identify” or “anonymize” information before releasing it to a third party. An important type of de-identification ensures that data cannot be traced to the person to whom it pertains; this protects against “identity disclosure”.
When de-identifying records, removing just direct identifiers such as names and addresses is not sufficient to protect the privacy of the persons whose data is being released. The problem of de-identification involves personal details that are not obviously identifying. These personal details, known as quasi-identifiers, include the person's age, sex, postal code, profession, ethnic origin and income, financial transactions, medical procedures, and so forth. De-identification of data requires an assessment of the risk of re-identification.
Once the risk is determined, the risk may be reduced if necessary by use of suppression. Suppression is a risk mitigation technique that removes a field value from a dataset in order to lower risk. For example, suppose a re-identification risk of a database is measured. If the measured risk needs to be lowered, suppression may modify a field in the database by replacing actual data in the field with an analytic model of what the data in the field should be. However, if suppression is not done intelligently, the suppression may introduce problems in a returned dataset, and may take a relatively long time to produce a sufficiently anonymized dataset, i.e., a dataset that has been de-identified.
Previous techniques in the background art for suppression included picking values (e.g., picking a data field for all records in a database, or picking only specific records having predetermined value(s) in the data field), nulling out the picked values, re-measuring the re-identification risk, and then reiterating in a loop if the re-identification risk is still too high. In the background art, this iterative process was found to take excessive time, e.g., hours or days, to converge to an acceptable solution. In some cases, the time to converge would remain unknown because users would abort the process once it exceeded their tolerance.
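For illustration only, the iterative background-art process described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not an implementation taken from the disclosure: the function names measure_risk and naive_suppress are hypothetical, the risk metric (the reciprocal of the smallest group of records sharing the same quasi-identifier values) is an assumed stand-in for whatever risk measurement is actually used, and only the variant that nulls a data field for all records is shown.

```python
from collections import Counter


def measure_risk(records, quasi_identifiers):
    """Assumed risk metric: 1 / size of the smallest group of records
    that share identical quasi-identifier values."""
    if not records:
        return 0.0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1.0 / min(groups.values())


def naive_suppress(records, quasi_identifiers, threshold):
    """Hypothetical pick / null / re-measure loop.

    Each pass nulls one whole field and then re-measures risk over the
    entire dataset, which is why the approach scales poorly.
    """
    records = [dict(r) for r in records]  # work on a copy of the input
    passes = 0
    for field in quasi_identifiers:
        # Re-measure: a full scan of the dataset on every iteration.
        if measure_risk(records, quasi_identifiers) <= threshold:
            break
        # Pick the field for all records and null out the picked values.
        for r in records:
            r[field] = None
        passes += 1
    return records, passes, measure_risk(records, quasi_identifiers)


if __name__ == "__main__":
    data = [
        {"age": 34, "sex": "F", "postal": "K1A"},
        {"age": 34, "sex": "F", "postal": "K2B"},
        {"age": 51, "sex": "M", "postal": "K1A"},
        {"age": 51, "sex": "M", "postal": "K1A"},
    ]
    _, passes, risk = naive_suppress(data, ["postal", "age", "sex"], 0.5)
    print(passes, risk)
```

In this sketch every pass rescans the whole dataset to re-measure risk, and suppressing an entire field discards more data than may be necessary, which is consistent with the slow convergence and degraded returned datasets noted above.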
Accordingly, systems and methods that enable improved risk assessment remain highly desirable.