Personal information is being continuously captured in a multitude of electronic databases. Details about health, financial status and buying habits are stored in databases managed by public and private sector organizations. These databases contain information about millions of people, which can provide valuable research, epidemiologic and business insight. For example, examining a drugstore chain's prescriptions or over-the-counter drug sales can indicate where a flu outbreak is occurring. To extract or maximize the value contained in these databases, data custodians must often provide outside organizations access to their data. In order to protect the privacy of the people whose data is being analyzed, a data custodian will “de-identify” information before releasing it to a third party. De-identification is intended to ensure that data cannot be traced to the person to whom it pertains.
In addition, there have been strong concerns about the negative impact of explicit consent requirements in privacy legislation on the ability to conduct health research. Such concerns are reinforced by compelling evidence that requiring opt-in for participation in different forms of health research can negatively impact the process and outcomes of the research itself: (a) recruitment rates decline significantly when individuals are asked to consent (opt-in vs. opt-out consent, or opt-in vs. waiver of consent or no consent); (b) those who consent tend to differ from those who decline consent on many variables (age, sex, race/ethnicity, marital status, rural versus urban location, education level, socio-economic status and employment, physical and mental functioning, language, religiosity, lifestyle factors, level of social support, and health/disease factors such as diagnosis, disease stage/severity, and mortality), potentially introducing bias into the results; (c) consent requirements increase the cost of conducting the research, and these additional costs are often not covered; and (d) research projects take longer to complete, both because of the additional time and effort needed to obtain consent and because lower recruitment rates make it take longer to reach recruitment targets.
When de-identifying records, many people assume that removing names and addresses (direct identifiers) is sufficient to protect the privacy of the persons whose data is being released. The harder problem in de-identification involves those personal details that are not obviously identifying on their own but that, in combination, can single out an individual. These personal details, known as quasi-identifiers, include the person's age, sex, postal code, profession, ethnic origin and income (to name a few).
Data de-identification is currently a manual process. Heuristics are used to make a best guess as to how to remove identifying information prior to releasing data. Manual data de-identification has resulted in several cases where individuals have been re-identified in supposedly anonymous datasets. One popular anonymization criterion is k-anonymity. However, there have been no evaluations of the actual re-identification probability of k-anonymized datasets, and datasets are being released to the public without a full understanding of their vulnerability.
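For illustration only, the k-anonymity criterion mentioned above can be sketched in a few lines of Python. Under k-anonymity, each record must be indistinguishable from at least k-1 other records on its quasi-identifiers. The sketch below assumes a small, made-up table with three quasi-identifiers (age, sex, postal code prefix); the record values and the function names `equivalence_class_sizes`, `k_anonymity_level` and `unique_records` are hypothetical and chosen only for this example. It does not estimate actual re-identification probability, which is the gap noted above.

```python
from collections import Counter

# Hypothetical records with direct identifiers (name, address) already removed.
# Each tuple holds quasi-identifiers only: (age, sex, postal code prefix).
# All values are made up for illustration.
records = [
    (34, "F", "K1A"),
    (34, "F", "K1A"),
    (34, "F", "K1A"),
    (51, "M", "K2B"),
    (51, "M", "K2B"),
    (67, "F", "K1A"),
]


def equivalence_class_sizes(rows):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(rows)


def k_anonymity_level(rows):
    """Return the size of the smallest equivalence class, i.e. the largest k
    for which the table satisfies k-anonymity on these quasi-identifiers."""
    return min(equivalence_class_sizes(rows).values())


def unique_records(rows):
    """Return the records whose quasi-identifier combination appears only once."""
    sizes = equivalence_class_sizes(rows)
    return [row for row in rows if sizes[row] == 1]


k = k_anonymity_level(records)
singletons = unique_records(records)
print("k-anonymity level:", k)
print("records unique on quasi-identifiers:", singletons)
```

In this made-up table, one record is unique on its quasi-identifiers even though it contains no name or address, so the table as a whole satisfies only k = 1.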
Accordingly, systems and methods that enable improved database de-identification are required.