1. Technical Field
The present invention relates to rights protection and, in particular, to a watermarking system for a dataset containing a collection of objects, in which the relationships among the rights-protected (watermarked) objects do not change, so that the structure of the dataset remains the same. Consequently, the outcome of mining operations on the rights-protected data is the same as on the original data.
2. Description of the Related Art
Companies frequently outsource datasets to mining firms, and academic institutions create repositories and share datasets in the interest of promoting research collaboration. Many practitioners are nevertheless reluctant to share or outsource datasets, primarily for fear of losing the principal rights over the dataset.
Data sharing is an important aspect of scientific and business collaboration. However, data owners are also concerned with protecting their rights to the datasets, which in many cases have been obtained through expensive and laborious procedures. The ease of data exchange through the Internet has compounded the need for technological mechanisms that effectively protect one's intellectual or pragmatic property. Two of the most prevalent techniques for rights protection are encryption and watermarking.
Encryption obfuscates the data in a way that renders the data unusable without a secret key, which only the legitimate owner holds and distributes. Encryption, however, inherently hinders data dissemination. Moreover, once the encryption key is out in the open and the data has been decrypted, the digital content is easily distributable. An example is the breaking of the Content Scrambling System (CSS) scheme for DVD content, which proved susceptible to brute-force attacks because of its small 40-bit encryption key.
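To illustrate why a 40-bit key is considered small, the following minimal sketch computes the size of the key space and the worst-case duration of an exhaustive search. The search rate of 10^9 keys per second is an assumed, purely illustrative figure, and the names used are hypothetical; none of them is taken from the CSS scheme or from any cited work.

```python
# Illustrative arithmetic only. The search rate below is an assumption,
# not a measured figure for any real attack on CSS.
KEY_BITS = 40
ASSUMED_KEYS_PER_SECOND = 10 ** 9   # hypothetical attacker throughput


def exhaustive_search_seconds(bits, keys_per_second):
    """Worst-case time, in seconds, to try every key of the given length."""
    return (2 ** bits) / keys_per_second


keyspace = 2 ** KEY_BITS            # number of distinct 40-bit keys
worst_case = exhaustive_search_seconds(KEY_BITS, ASSUMED_KEYS_PER_SECOND)

print(f"{keyspace:,} possible keys")
print(f"worst case: {worst_case / 60:.1f} minutes at the assumed rate")
```

Under the assumed rate, the whole key space of roughly 1.1 x 10^12 keys is exhausted in under twenty minutes. Each additional key bit doubles that time.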
Watermarking is another technique employed in rights protection. This approach does not encrypt the data but embeds a secret key into the data, slightly altering the original content while ensuring that the important data characteristics are not distorted. Watermarking is predominantly used for image rights protection, in particular by popular international magazines. Because such magazines have a strong Internet presence, they easily fall victim to image theft. By watermarking each image and employing web crawlers, publishers systematically check Internet websites for unauthorized use of their copyrighted images.
Audio (music) and video content are also watermarked. For example, each of the video discs given to the Oscar jury months before the original DVD video release is individually watermarked so that the source of a potential ‘leak’ can be conclusively identified.
Previous work, such as commonly assigned U.S. Pat. No. 6,694,303 to Agrawal et al., entitled “Method and System for Building a Naive Bayes Classifier from Privacy-Preserving data”, filed Jan. 19, 2000 and incorporated herein by reference, built a Naïve Bayes classifier on perturbed data. That work attempts to reconstruct the original data distributions from the modified (perturbed) data; it does not work directly on the perturbed data and, in addition, does not guarantee identical outputs for mining operations. “Watermarking Spatial Trajectory Database”, X. Jin, Z. Zhang, D. Li, Proceedings of DASFAA 2005, describes a way of embedding a key in sequence data, but it provides no guarantees on the outcome of mining operations and is not robust to operations such as geometric transformations, since the embedding is also done in the time domain. In general, none of the previous work provides sufficient robustness to data attacks such as geometric transformations and noise addition, or addresses the preservation of mining results, especially when working directly on the perturbed data.