Field
The present disclosure relates to risk assessment of datasets and in particular to reducing re-identification risk of a dataset.
Description of Related Art
Personal information is continuously captured in a multitude of electronic databases. Details about health, financial status and buying habits are stored in databases managed by public and private sector organizations. These databases contain information about millions of people, which can provide valuable research, epidemiologic and business insight. For example, examining a drugstore chain's prescriptions can indicate where a flu outbreak is occurring. To extract or maximize the value contained in these databases, data custodians often must provide outside organizations access to their data. In order to protect the privacy of the people whose data is being analyzed, a data custodian will “de-identify” or “anonymize” information before releasing it to a third party. An important type of de-identification ensures that data cannot be traced to the person to whom it pertains; this protects against “identity disclosure”.
When de-identifying records, removing just direct identifiers such as names and addresses is not sufficient to protect the privacy of the persons whose data is being released. The problem of de-identification involves personal details that are not obviously identifying. These personal details, known as quasi-identifiers, include the person's age, sex, postal code, profession, ethnic origin, income, financial transactions, medical procedures, and so forth. De-identification of data requires an assessment of the risk of re-identification.
Re-identification risk is measured on the data set to ensure that, on average, each individual has a certain level of anonymity. If the risk of a data set is too great, fields will need to be generalized or suppressed according to a de-identification scheme. In order to determine if the de-identification scheme is acceptable, the de-identification steps are performed and a risk measurement is done. These can be very time-consuming steps, often measured in hours or days, and the re-identification risk after applying a de-identification scheme may still be too high. Thus, the user must iterate on a process requiring hours or days of processing per iteration, adding up to a very long process.
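By way of illustration only, an average re-identification risk measurement might be sketched as follows. The sketch assumes one common metric, not specified in this disclosure: records sharing the same quasi-identifier values form an equivalence class, each record's risk is the reciprocal of its class size, and the data set risk is the mean over all records. The function name, field names and sample records are hypothetical.

```python
from collections import Counter


def average_reidentification_risk(records, quasi_identifiers):
    """Mean of 1/class_size over all records (an assumed metric)."""
    if not records:
        return 0.0
    # Records with identical quasi-identifier values fall into one class.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    # A record in a class of size n is assumed to carry risk 1/n.
    return sum(1.0 / class_sizes[k] for k in keys) / len(keys)


# Hypothetical sample data: two records share a class, two are unique.
records = [
    {"age": 34, "sex": "F", "postal_code": "K1A"},
    {"age": 34, "sex": "F", "postal_code": "K1A"},
    {"age": 51, "sex": "M", "postal_code": "K2B"},
    {"age": 29, "sex": "F", "postal_code": "K1A"},
]
risk = average_reidentification_risk(records, ["age", "sex", "postal_code"])
```

Under this assumed metric, a single pass over the records suffices for one measurement; the cost described above arises because the measurement must be repeated after every candidate de-identification scheme is applied to a large data set.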
Once the risk is determined, the risk may be reduced if necessary by use of suppression. Suppression is a risk mitigation technique that removes a field value from a dataset in order to lower risk. For example, suppose a re-identification risk of a database is measured. If the measured risk needs to be lowered, suppression may modify a field in the database by replacing actual data in the field with an analytic model of what the data in the field should be. However, if suppression is not done intelligently, the suppression may introduce problems in a returned dataset, and may take a relatively long time to produce a sufficiently anonymized dataset, i.e., a dataset that has been de-identified.
Previously, in order for a de-identification scheme to be proven appropriate, the de-identification steps would need to be performed and the re-identification risk subsequently measured. Both are time-intensive procedures. Furthermore, a de-identification scheme may only minimally affect risk, despite the scheme involving major modifications to the data set. Thus, trying several de-identification schemes may be necessary before finding an adequate scheme.
Previous techniques in the background art for suppression included picking values (e.g., picking a data field for all records in a database, or picking only specific records having predetermined value(s) in the data field), nulling out the picked values, re-measuring the re-identification risk, and then reiterating in a loop if the re-identification risk is still too high. In the background art, this iterative process was found to take excessive time to converge to an acceptable solution, e.g., hours or days. In some cases, time to converge would be unknown because the process would be aborted by users as having exceeded their user tolerance.
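By way of illustration only, the background-art loop of picking values, nulling them out and re-measuring might be sketched as follows. The sketch is a simplification under stated assumptions: it picks one whole field per iteration in a fixed order, it uses an assumed risk metric (mean of the reciprocal equivalence class size), and it treats all nulled values in a field as equal to one another. The function names, threshold and sample records are hypothetical.

```python
from collections import Counter


def measure_risk(records, quasi_identifiers):
    """Assumed metric: mean of 1/class_size over all records."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return sum(1.0 / sizes[k] for k in keys) / len(keys)


def iterative_suppression(records, quasi_identifiers, threshold):
    """Null out one field at a time, re-measuring risk after each pass."""
    data = [dict(r) for r in records]  # work on a copy of the data set
    suppressed = []
    risk = measure_risk(data, quasi_identifiers)
    for field in quasi_identifiers:  # pick a field for all records
        if risk <= threshold:
            break
        for r in data:
            r[field] = None  # null out the picked values
        suppressed.append(field)
        # Full re-measurement on every iteration is the costly step.
        risk = measure_risk(data, quasi_identifiers)
    return data, risk, suppressed


# Hypothetical sample data and threshold.
records = [
    {"age": 34, "sex": "F", "postal_code": "K1A"},
    {"age": 34, "sex": "F", "postal_code": "K1A"},
    {"age": 51, "sex": "M", "postal_code": "K2B"},
    {"age": 29, "sex": "F", "postal_code": "K1A"},
]
result, final_risk, suppressed_fields = iterative_suppression(
    records, ["age", "sex", "postal_code"], threshold=0.6
)
```

Each pass through the loop repeats a full risk measurement over the entire data set, and nothing in the loop predicts how many passes will be needed, which is consistent with the long and unpredictable convergence times described above.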
Accordingly, systems and methods that enable improved risk assessment remain highly desirable.