1. Field of the Invention
The present invention relates to data repository security.
2. Description of the Related Art
The piracy of digital assets such as software, images, video, audio and text has long been a concern for owners of these assets. Protection of these assets is usually based upon the insertion of digital watermarks into the data. The watermarking software introduces small errors into the object being watermarked. These intentional errors are called marks and all the marks together constitute the watermark. The marks must not have a significant impact on the usefulness of the data and they should be placed in such a way that a malicious user cannot destroy them without making the data less useful. Thus, watermarking does not prevent copying, but it deters illegal copying by providing a means for establishing the original ownership of a redistributed copy.
The increasing use of databases in applications beyond “behind-the-firewalls data processing” is creating a similar need for watermarking databases. For instance, in the semiconductor industry, parametric data on semiconductor parts is provided primarily by three companies: Aspect, IHS, and IC Master. They all employ a large number of people to manually extract part specifications from datasheets and build parametric databases. They then license these databases at high prices to design engineers. Companies like Acxiom have compiled large collections of consumer and business data. In the life sciences industry, the primary assets of companies such as Celera are the databases of biological information. The internet is exerting tremendous pressure on these data providers to create services (often referred to as e-utilities or web services) that allow users to search and access databases remotely. While this trend is a boon to end users, it is exposing the data providers to the threat of data theft. The present invention therefore recognizes a need for identifying pirated copies of data.
As understood herein, database relations which can be watermarked have attributes such that small changes in some of their values do not affect the applications that use them. Real world datasets exist that can tolerate a small amount of error without degrading their usability. For example, the ACARS meteorological data, which is used in building weather prediction models, has wind vector and temperature accuracies estimated to be within 1.8 m/s and 0.5 °C respectively. The present invention recognizes that errors introduced by watermarking can easily be constrained to lie within the measurement tolerance in such data. As another example, consider experimentally obtained gene expression datasets that are being analyzed using various data mining techniques. Again, the present invention recognizes that the nature of the data collection and the analysis techniques is such that changes in a few data values will not affect the results. Similarly, the customer segmentation results of a consumer goods company will not be affected if the external provider of the supplementary data adds or subtracts some amount from a few transactions. Finally, consider the parametric data on semiconductor parts mentioned above. For many parameters, errors introduced by watermarking can be made to be within the measurement tolerance.
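By way of illustration only, the notion of keeping a mark within the measurement tolerance can be sketched as follows. The sketch is not the method disclosed below; the function name embed_mark, the 0.1 recording resolution, and the choice of the least significant bit as the mark are assumptions made for the example. Only the 0.5 °C tolerance is taken from the ACARS figures above.

```python
def embed_mark(value: float, resolution: float, tolerance: float, bit: int) -> float:
    """Hypothetical helper: force the lowest-order step of `value` to `bit`.

    The value is treated as an integer count of `resolution`-sized steps,
    so the change introduced is at most one step.
    """
    if resolution > tolerance:
        # A single step would already exceed what the data can tolerate.
        raise ValueError("one step at this resolution exceeds the tolerance")
    units = round(value / resolution)          # value as a whole number of steps
    marked_units = (units & ~1) | (bit & 1)    # overwrite the least significant bit
    return round(marked_units * resolution, 10)


# Assumed example: a temperature recorded to 0.1 degrees, tolerance 0.5 degrees.
temperature = 21.3
marked = embed_mark(temperature, resolution=0.1, tolerance=0.5, bit=0)
error = abs(marked - temperature)
```

Because the mark moves the value by at most one recording step, the introduced error stays below the tolerance whenever the step itself is smaller than the tolerance; the helper refuses to mark otherwise.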
The present invention further understands that in the context of databases, watermarking poses challenges that are not necessarily present in techniques for watermarking multimedia data, most of which were initially developed for still images and later extended to video and audio sources. The differences between the two applications, as understood herein, include the following.
1. A multimedia object consists of a large number of bits, with considerable redundancy. Thus, the watermark has a large cover in which to hide. A database relation, on the other hand, consists of tuples, each of which represents a separate object. The watermark must be spread over these separate objects.
2. The relative spatial/temporal positioning of various pieces of a multimedia object typically does not change. Tuples of a relation, on the other hand, constitute a set and there is no implied ordering between them.
3. Portions of a multimedia object cannot be dropped or replaced arbitrarily without causing perceptual changes in the object. However, the pirate of a relation can simply drop some tuples or substitute them with tuples from other relations.
Because of these differences, techniques developed for multimedia data cannot be directly used for watermarking relations. Likewise, watermarking techniques for text, which exploit the special properties of formatted text, cannot be easily applied to databases. Furthermore, techniques for watermarking software have had limited success, because the instructions in a computer program can often be rearranged without altering the semantics of the program. This resequencing can, however, destroy a watermark.
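Differences 2 and 3 above can be illustrated with a short sketch. The scheme shown is a deliberately naive, hypothetical one that places marks by tuple position, in the way a multimedia technique might rely on pixel or sample position; it is not a technique described in this document, and the relation contents and function names are made up for the example.

```python
def mark_by_position(rows, every=3):
    """Hypothetical naive scheme: set the low bit of every `every`-th value."""
    return [(key, value | 1) if i % every == 0 else (key, value)
            for i, (key, value) in enumerate(rows)]


def detect_by_position(rows, every=3):
    """Fraction of the expected positions whose value carries the mark bit."""
    expected = [value for i, (_, value) in enumerate(rows) if i % every == 0]
    return sum(value & 1 for value in expected) / len(expected)


# Made-up relation of (part_id, parameter) tuples with even parameter values.
original = [(f"P{i:02d}", 100 + 2 * i) for i in range(9)]
marked = mark_by_position(original)

reordered = marked[1:] + marked[:1]   # same set of tuples, different order
pruned = marked[1:]                   # the first tuple has been dropped

score_intact = detect_by_position(marked)
score_reordered = detect_by_position(reordered)
score_pruned = detect_by_position(pruned)
```

In this sketch the detector finds every mark in the untouched copy, but finds none once the same tuples are merely reordered or a single tuple is dropped, even though the marked values themselves are still present in the relation.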
The present invention has recognized the above-noted problems and provides solutions to one or more of them as disclosed below.