This invention relates to database systems, methods and computer program products and, more particularly, to systems, methods and computer program products for anonymizing data.
Large-scale databases are widely used to store and manipulate data. For example, a database may include financial, demographic and/or medical records about large numbers of individuals. Data mining tools are commonly used to query such databases to identify relationships among the stored data.
As databases are widely distributed and queried, privacy preservation has become an increasingly sensitive problem. In releasing personal data for ad hoc analysis, one level of privacy may be obtained by removing unique (personal) identifiers. However, even after personal identifiers are removed, inferences can be made about individuals using database elements that are referred to as “quasi-identifiers”, for example by mining the quasi-identifiers that place individuals in a predefined category. In the worst case, a personal identity can be reconstructed from the existing data, taken alone or in combination with other databases.
In order to preserve privacy while allowing aggregate querying, anonymization techniques have been developed. These anonymization techniques can ensure that, even if publicly available information is linked with a given database, a sensitive attribute value can, at most, be related to a group of a certain size, instead of to a specific individual. At the same time, the anonymized data should preserve sufficient information to support ad hoc aggregate queries over the data.
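By way of illustration only, the group-size property described above may be sketched as follows. The sketch assumes records whose quasi-identifiers have already been generalized. The field names ("zip", "age", "diagnosis"), the sample values and the function name smallest_group_size are hypothetical and are not taken from any particular system. The sketch merely checks the smallest group of records sharing the same quasi-identifier values and shows that an aggregate query remains answerable; it is not an anonymization method in itself.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    identical values for every listed quasi-identifier (0 if no records)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


# Hypothetical, already-generalized records (illustrative values only).
records = [
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "diabetes"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
]
QUASI = ("zip", "age")  # assumed quasi-identifier columns

# Any record can be linked, at most, to a group of this many individuals.
k = smallest_group_size(records, QUASI)

# An aggregate query over the sensitive attribute is still supported.
flu_count = sum(1 for r in records if r["diagnosis"] == "flu")
```

In this sketch, a larger smallest group size means a sensitive value is tied to more candidate individuals, while the aggregate count is unaffected by the generalization of the quasi-identifiers.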