Electronic databases of patient health records are useful for both commercial and non-commercial purposes. The patient health records are typically collected from multiple sources in a variety of formats. For example, medical service providers supply individually identified patient transaction records to the medical insurance industry for compensation. In addition to personal information data fields or attributes, the patient transaction records may contain other information concerning, for example, diagnosis, prescriptions, treatment or outcome. Such information poses significant security and privacy problems. Therefore, to preserve individual privacy, it is important that the patient records integrated with a database facility are “anonymized” or “de-identified”.
Another concern with sensitive datasets is unauthorized duplication, distribution and tampering after release of the datasets to one or more intended recipients. Digital watermarking can be used to determine the source of an unauthorized or illegally disseminated copy. For example, when a document is to be secured using digital watermarking, an identifier that identifies the customer who is to receive the electronic distribution copy of the document can be imperceptibly embedded in the document, along with the copyright holder's watermark. Further, the main applications of watermarking a relational database include ownership assertion, fingerprinting, and fraud and tamper detection. For example, if a recipient of the database disseminates copies of the distribution copy contrary to the interests of the copyright holder, the recipient can be identified based on the digital watermark, which is present in all the unauthorized or illegally disseminated copies. However, when many distribution copies are disseminated legally to different recipients, individually linking each distribution copy to a specific recipient has typically proven to be difficult and time-consuming.
Related art includes various schemes for fingerprinting individual records of a dataset intended to be released to multiple recipients. One such scheme includes query optimization for fingerprinting relational databases while satisfying usability constraints. However, such schemes may be susceptible to incorrect fingerprint detection following data tampering or an attack, because fingerprint decoding depends on the usability constraints.
Related art also includes a K-anonymity process, which is a model for protecting privacy. This privacy model and process were proposed in order to prevent record linkage. A table is considered “K-anonymous” if the quasi-identifier (QI) values of each record are indistinguishable from those of at least K−1 other records in the dataset. For example, if a record includes a QI value, there are at least K−1 other records that have the same QI value. The records that share the same QI value form an Equivalence Class (EC).
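By way of illustration only, the K-anonymity condition described above may be sketched as follows. The sketch groups records into equivalence classes by their QI values and checks that every class holds at least K records. The function names, the field names ("age", "zip", "diagnosis") and the sample records are hypothetical and are not taken from any particular related-art scheme.

```python
from collections import defaultdict


def equivalence_classes(records, qi_fields):
    """Group records that share the same quasi-identifier (QI) values.

    `records` is a list of dicts and `qi_fields` names the QI attributes;
    both are illustrative assumptions, not a prescribed data layout.
    """
    classes = defaultdict(list)
    for record in records:
        # The tuple of QI values identifies the equivalence class (EC).
        key = tuple(record[field] for field in qi_fields)
        classes[key].append(record)
    return dict(classes)


def is_k_anonymous(records, qi_fields, k):
    """Return True if every equivalence class contains at least k records."""
    classes = equivalence_classes(records, qi_fields)
    return all(len(members) >= k for members in classes.values())


# Hypothetical, already generalized records: "age" and "zip" act as the QI,
# while "diagnosis" is a sensitive attribute outside the QI.
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "100**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "100**", "diagnosis": "flu"},
]

two_anonymous = is_k_anonymous(records, ["age", "zip"], 2)
three_anonymous = is_k_anonymous(records, ["age", "zip"], 3)
```

In this sample, each of the two equivalence classes contains two records, so the table satisfies the condition for K=2 but not for K=3.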
There is therefore a need for watermarking and fingerprinting multiple releases of large datasets in a manner that preserves the quality of the datasets and links each release to its corresponding recipient.