Personal information is continuously captured in a multitude of electronic databases. Details about health, financial status and buying habits are stored in databases managed by public and private sector organizations. These databases contain information about millions of people, which can provide valuable research, epidemiologic and business insight. For example, examining a drugstore chain's prescriptions can indicate where a flu outbreak is occurring. To extract or maximize the value contained in these databases, data custodians must often provide outside organizations with access to their data. In order to protect the privacy of the people whose data is being analyzed, a data custodian will “de-identify” or “anonymize” information before releasing it to a third party. An important type of de-identification ensures that data cannot be traced to the person to whom it pertains; this protects against “identity disclosure”.
When de-identifying records, many people assume that removing names and addresses (direct identifiers) is sufficient to protect the privacy of the persons whose data is being released. The difficulty in de-identification, however, lies in those personal details that are not obviously identifying. These personal details, known as quasi-identifiers, include, to name a few, the person's age, sex, postal code, profession, ethnic origin, income, financial transactions and medical procedures. Taken in combination, quasi-identifiers can single out an individual even when no direct identifier remains in the record. To de-identify data effectively, the risk of re-identification must first be assessed. Therefore there is a need for improved risk assessment of data sets.
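As an illustration of why quasi-identifiers matter, the following minimal sketch groups records by their quasi-identifier values and assigns each record a risk equal to one over the number of records sharing those values. This is one common, simple measure and is offered only as an assumption for illustration; it is not necessarily the risk assessment contemplated here. The sample records, the field names and the function names `equivalence_class_sizes` and `reidentification_risks` are all hypothetical.

```python
from collections import Counter

# Hypothetical records; direct identifiers (name, address) are already removed.
records = [
    {"age": 34, "sex": "F", "postal_code": "K1A", "diagnosis": "flu"},
    {"age": 34, "sex": "F", "postal_code": "K1A", "diagnosis": "asthma"},
    {"age": 71, "sex": "M", "postal_code": "K2B", "diagnosis": "diabetes"},
]

# Assumed choice of quasi-identifiers for this example.
QUASI_IDENTIFIERS = ("age", "sex", "postal_code")


def equivalence_class_sizes(rows, quasi_identifiers):
    """For each row, count how many rows share its quasi-identifier values."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return [counts[key] for key in keys]


def reidentification_risks(rows, quasi_identifiers):
    """Assumed risk measure: 1 / (size of the row's equivalence class)."""
    sizes = equivalence_class_sizes(rows, quasi_identifiers)
    return [1.0 / size for size in sizes]


risks = reidentification_risks(records, QUASI_IDENTIFIERS)
for row, risk in zip(records, risks):
    print(row["age"], row["sex"], row["postal_code"], risk)
```

In this sample, the third record is the only one with its combination of age, sex and postal code, so it receives the highest possible risk under this measure even though it carries no name or address.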
It will be noted that throughout the appended drawings, like features are identified by like reference numerals.