Three categories of methods exist for processing an original data sequence in order to generate information about the data for the purposes of integrity measurement, ownership demonstration and authentication. The first category is digital watermarking, the second is data hashing and the third is error detection coding. In the present context, watermarking is often used for ownership demonstration and authentication purposes, and data hashing is often used for integrity measurement purposes. As with data hashing, error detection coding is used to measure data integrity. However, error detection coding is normally done as part of a data transmission protocol and therefore addresses transmission errors rather than tampering.
An important step in data authentication and demonstrating data ownership is some form of registration process. This is a process that is recognized and accepted by the community of users whereby information regarding the original data is stored and later presented for comparisons. It is often important that this information is also time-stamped, properly associated with the original data and stored in a secure fashion. This identifying information regarding the original data can be referred to as registration data. In the exemplifying case of digital watermarking, the watermark, the watermark embedding method and the watermark recovery method can become part of the registration data.
Digital Watermarking
For a more in-depth review of digital watermarking, the reader is referred to [1] and [2]. Digital watermarking is the process of embedding identification data within host data, typically for authentication of the host data and for demonstrating ownership of the host data. Other applications exist and are given in the cited references. Often, the goal is to embed the identification data in such a manner that changes are imperceptible to a human observer, when the observer only has access to the resulting watermarked data. The identification data or watermark is presumably a small amount of data relative to the host data, making the task of hiding the watermark less difficult. For example, small changes to a group of adjacent pixels in an image can be made in an imperceptible fashion when knowledge of the human ability to detect differences in color and/or intensity is taken into account. Likewise, audio data can host imperceptible changes by applying an understanding of the human auditory response. Thus, to a human, the watermarked audio data can sound exactly like the original, but in fact be slightly different.
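As a minimal illustration of how small changes to pixel values can carry a watermark, the following sketch uses least-significant-bit (LSB) embedding. LSB embedding is chosen here only because it is simple; it is not a method described in the cited references, and the function names, pixel values and watermark bits are hypothetical.

```python
def embed_lsb(pixels, bits):
    """Return a copy of pixels with each watermark bit written into the
    least significant bit of the corresponding pixel value."""
    if len(bits) > len(pixels):
        raise ValueError("watermark is longer than the host data")
    marked = list(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the watermark bit.
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract_lsb(pixels, n_bits):
    """Recover the first n_bits watermark bits from the pixel values."""
    return [p & 1 for p in pixels[:n_bits]]


# Hypothetical 8-bit grayscale values for a group of adjacent pixels.
host = [200, 201, 198, 197, 203, 202, 199, 200]
watermark = [1, 0, 1, 1, 0, 0, 1, 0]

marked = embed_lsb(host, watermark)
recovered = extract_lsb(marked, len(watermark))

# No pixel changes by more than one intensity level.
max_change = max(abs(a - b) for a, b in zip(host, marked))
```

In this sketch the watermarked values differ from the host values by at most one intensity level, yet the host data has still been altered, and a watermark of this kind is easily removed by overwriting the low-order bits.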
When the goal is to embed watermarks in an imperceptible fashion, there is a trade-off that exists between imperceptibility and robustness to attack. Basically, as more changes are made to the host data as a result of embedding the watermark, better protection can be obtained. Consequently, it becomes less likely that the resulting watermarked data can be altered by an attacker in a manner such that the watermark is essentially removed and the post-attack watermarked data is still useful. However, as more data is embedded, a point is reached where the watermarked data is perceptibly different from the host data. A balance must be obtained such that a margin of robustness to attack and watermarked data quality is maintained.
There are two major disadvantages to digital watermarking. The first is that a process is required for embedding the watermark within the host data. This process takes time and requires computational resources. The second is that embedding a digital watermark within the host data alters the host data. This is undesirable when the host data is, for example, a digital image or a digital audio recording. In such cases, the watermarking can adversely affect image and audio quality. Even when a human cannot perceive the changes caused by watermarking, the resulting changes may reduce the value of the data.
Data Hash
A data hash is typically used for determining if one or more bits within a data sequence have changed. More recently, hashing has been proposed as an alternative to digital watermarking, when the aim is to provide a method of ownership demonstration and authentication [3]. As the term implies, a data hash is a repeatable process by which the original data is reorganized and reduced to a short sequence for protection purposes. This short output data sequence has been referred to as a fingerprint, a message digest, a hash value or simply a hash of the input data sequence. Just as a fingerprint identifies a person, a hash value is claimed to be unique, with high probability, to a given input data sequence. In contrast to digital watermarking, it is not necessary for a data hash to embed auxiliary data within the original host data sequence.
Given only the hash value and hash process, it would require a search and hash procedure over an available set of data sequences to determine if the hash value came from a particular sequence. This would make it impractical for an attacker to obtain the original sequence based only on knowledge of the hash value and hash process. Alternatively, a hash value, hash process and input data sequence can be made public. This can help assure the recipients of distributed data sequences that the data is authentic and integrity has been maintained. As an example, after downloading a file over a network, the recipient can first run the hash process on the file to generate the hash value. If this hash value matches the known hash value, then the file can be considered free of unauthorized alterations.
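The download check described above can be sketched as follows. This is a minimal sketch that assumes SHA-256 as the published hash process; the source does not name a particular algorithm, and the function names and the temporary demo file are hypothetical.

```python
import hashlib
import hmac
import os
import tempfile


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, published_hex, algorithm="sha256"):
    """Return True if the file's hash equals the publicly known hash value."""
    return hmac.compare_digest(file_digest(path, algorithm),
                               published_hex.lower())


# Demo with a temporary file standing in for a downloaded file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    demo_path = tmp.name

published = hashlib.sha256(b"example payload").hexdigest()
ok = matches_published(demo_path, published)            # unaltered file

with open(demo_path, "ab") as f:                        # simulate an alteration
    f.write(b"!")
tampered_ok = matches_published(demo_path, published)   # altered file

os.remove(demo_path)
```

A match indicates only that the file corresponds to the published hash value; the assurance depends on the recipient having obtained that value from a trusted source.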
Referring to FIG. 1, a generic description of a prior art data hash process is shown. Pre-processor 101 and Post-processor 103 are optional steps. Processes for Pre-processor 101 can include transformation into an alternate representation domain, data padding and statistical calculations. These operate on the Input Data Sequence which is to be protected. For example, a data hash process may transform data sequences from the time domain to the frequency domain. Some hashes require the appending of known data such as a sequence of zeros in order to establish sequences of specific size. Post-processor 103 can include encoding and compression. These steps operate on the output of Hasher 102 and are often necessary in practical applications. The Output Data Sequence from Post-processor 103 is a representation of the hash of the Input Data Sequence. In most hash processes, the hash value is represented in the Output Data Sequence as a fixed-length alphanumeric string in hexadecimal notation.
Hasher 102 operates on the output sequence from Pre-processor 101, in such a manner as to reduce the data to an essentially unique hash value. For example, a hash process may generate a 128-bit hash value for each Input Data Sequence. This will allow for 2^128 hash values and potentially as many unique Input Data Sequences.
Secure Key 104 contains the parameters necessary for the operation of Pre-processor 101, Hasher 102, and Post-processor 103. Pre-process parameters may include values for data segmentation, indexing and transformation to other representation domains. Post-processing parameters may include control information for encoding or compression processes. Subsets of the various parameters can include seed values for the generation of pseudo-random number sequences. The use of pseudo-random sequences contributes to the security of the overall hash process.
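The arrangement of Pre-processor 101, Hasher 102, Post-processor 103 and Secure Key 104 can be sketched as a three-stage pipeline. The sketch below is only one possible instance: it assumes zero padding for the pre-processor, HMAC-SHA-256 as a stand-in for the keyed hasher, and hexadecimal encoding for the post-processor. None of these choices, nor the names and the block size, come from FIG. 1.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # hypothetical segment size in bytes


def pre_process(data, block_size=BLOCK_SIZE):
    """Pre-processor: append zero bytes so the length is a multiple of
    block_size. Plain zero padding is ambiguous (b"ab" and b"ab\\x00" pad
    to the same sequence); it is used here only for illustration."""
    remainder = len(data) % block_size
    if remainder:
        data += b"\x00" * (block_size - remainder)
    return data


def hasher(data, key):
    """Hasher: reduce the padded data to a fixed-length value under a key."""
    return hmac.new(key, data, hashlib.sha256).digest()


def post_process(raw):
    """Post-processor: encode the raw hash as a hexadecimal string."""
    return raw.hex()


def hash_process(data, key):
    """Run the full pipeline: pre-process, hash, post-process."""
    return post_process(hasher(pre_process(data), key))


key = b"hypothetical-secret-key"
digest = hash_process(b"example input", key)
```

With these assumptions the Output Data Sequence is always 64 hexadecimal characters, the process is repeatable for a given key, and a different key yields a different hash value for the same input.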
An advantage that data hashing has over digital watermarking is that no embedding process is required. This preserves the quality of the original input data sequence and is particularly important for digital images and digital audio. However, the typical data hash process has the disadvantage that even though unauthorized alterations of one or more bits will likely be detected, the resulting hash value yields no information about the nature of the alterations. Information is not obtained such as how many bit alterations have taken place, where such changes have occurred within the data sequence and how similar the altered data is to the original data. This is information that would be necessary for ownership demonstration and authentication. Newly proposed hash methods have begun to address these disadvantages for the case of image data using a wavelet decomposition [3]. Hash methods that address these disadvantages and are more generally applicable to a variety of data sequences are required.
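The limitation noted above can be observed with a short experiment. The sketch below assumes SHA-256 as a representative conventional hash; the helper names and the sample message are hypothetical. It compares the hash of an original sequence with the hash of a copy in which one bit is altered and with the hash of a copy in which every bit is altered.

```python
import hashlib


def flip_bit(data, bit_index):
    """Return a copy of data with a single bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def digest_bit_difference(a, b):
    """Count the bit positions in which the SHA-256 digests of a and b differ."""
    da = hashlib.sha256(a).digest()
    db = hashlib.sha256(b).digest()
    return sum(bin(x ^ y).count("1") for x, y in zip(da, db))


original = b"The quick brown fox jumps over the lazy dog"
one_bit_altered = flip_bit(original, 0)
all_bits_altered = bytes(b ^ 0xFF for b in original)

diff_one = digest_bit_difference(original, one_bit_altered)
diff_all = digest_bit_difference(original, all_bits_altered)
```

For a hash of this kind, both counts are expected to be near half of the 256 digest bits. The hash value therefore indicates that the data has changed, but not how many bits were altered, where they lie, or how similar the altered data is to the original.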
Error Detection Coding
Error detection coding is a mature technology area within the field of digital communications and data transmission. As with data hashing, the purpose of error detection coding is to measure data integrity. Unlike data hashing, the coding can be designed to allow for the further purpose of error correction. Error detection coding is normally done as part of a data transmission protocol and addresses transmission errors rather than tampering. These transmission errors occur as a result of additive noise and other distortions present in the transmission channel. Reference is made to [4] for details regarding coding for the purpose of detecting transmission errors in received data sequences. Some background is provided here for contextual purposes in relation to the subject invention.
In error detection coding, data sequences to be transmitted are mapped to new data sequences prior to transmission. The mapping results in a substantial increase in the size of the transmitted data sequence. Error detection capability is achieved at the expense of this increase in size. For example, some error coding techniques operate on blocks or segments of consecutive bits that comprise the entire data sequence. In this case, each segment of sequential bits of information is mapped to what is referred to as a code word. The mapping is done in such a way that each code word contains more bits than the corresponding information segment. Each code word is also a sequence of bits that uniquely corresponds to one of the possible information segments. However, because the code words consist of more bits, a designer can choose code words that are more easily separated at the receiver. Once received, the code words are simply mapped back to the original information segment. Without the error coding, the shorter length information segments are more difficult to distinguish between at the receiver, potentially causing transmission errors.
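The mapping of information segments to code words can be illustrated with a small block code. The code book below is a hypothetical example, not taken from [4]: each 2-bit information segment maps to a 5-bit code word, and any two code words differ in at least three bit positions, so a single bit error in a code word can be detected and corrected by choosing the nearest code word.

```python
# Hypothetical code book: 2-bit information segments -> 5-bit code words.
# Any two code words differ in at least 3 positions.
CODEBOOK = {"00": "00000", "01": "01011", "10": "10101", "11": "11110"}


def hamming_distance(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def encode(bits):
    """Map each 2-bit information segment to its 5-bit code word."""
    if len(bits) % 2:
        raise ValueError("input length must be a multiple of 2 bits")
    return "".join(CODEBOOK[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(received):
    """Map each received 5-bit word to the nearest code word.
    Returns the decoded bits and the indices of segments where the
    received word was not an exact code word (an error was detected)."""
    decoded, flagged = [], []
    for seg_index, i in enumerate(range(0, len(received), 5)):
        word = received[i:i + 5]
        info, code = min(CODEBOOK.items(),
                         key=lambda kv: hamming_distance(kv[1], word))
        if code != word:
            flagged.append(seg_index)
        decoded.append(info)
    return "".join(decoded), flagged


message = "0110"
sent = encode(message)

# Simulate a channel error: invert one bit inside the second code word.
flipped = "1" if sent[7] == "0" else "0"
received = sent[:7] + flipped + sent[8:]

decoded, flagged = decode(received)
```

In this sketch the 4-bit message becomes a 10-bit transmitted sequence, which shows the size increase that pays for the error detection capability. Two or more bit errors within one code word may be decoded incorrectly.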
Error detection coding has some advantages over most data hash and watermarking processes. Although not all errors are detected, the coding often helps to identify segments of the data sequence that contain errors. Rather than re-transmitting the entire data sequence, this allows for re-transmission requests for only the affected segment. Also, with error detection coding, errors can often be corrected with no need for re-transmission. A disadvantage of error detection coding is the increase in size of the transmitted data sequence. This leads to an increase in the transmission rate, and therefore requires additional bandwidth or power resources. It is not uncommon for error detection coding schemes to result in a doubling of the data sequence size. A further disadvantage of error detection coding is that the purposes of ownership demonstration and authentication are not achieved. For example, unauthorized alterations made prior to transmission will result in a modified data sequence that is treated by the error coding process as any other input. Thus the coding process will attempt to faithfully transmit the modified data sequence.
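Both points can be illustrated with a per-segment check value. The sketch below assumes a CRC-32 attached to each fixed-size segment, which is one common error detection code and is used here only as an example; the segment size, function names and sample messages are hypothetical.

```python
import zlib

SEGMENT_SIZE = 8  # hypothetical segment size in bytes


def frame(data, segment_size=SEGMENT_SIZE):
    """Split data into segments and attach a CRC-32 to each one."""
    return [(data[i:i + segment_size], zlib.crc32(data[i:i + segment_size]))
            for i in range(0, len(data), segment_size)]


def damaged_segments(frames):
    """Return the indices of segments whose CRC no longer matches."""
    return [idx for idx, (segment, crc) in enumerate(frames)
            if zlib.crc32(segment) != crc]


original = b"transfer 100 to account 12345678"

# Case 1: a transmission error inverts one bit in segment 2 after framing.
frames = frame(original)
segment, crc = frames[2]
frames[2] = (bytes([segment[0] ^ 0x01]) + segment[1:], crc)
retransmit = damaged_segments(frames)

# Case 2: the data is altered before framing. The coding process treats
# the altered sequence as any other input, so nothing is flagged.
tampered = b"transfer 900 to account 12345678"
undetected = damaged_segments(frame(tampered))
```

In the first case only the affected segment needs to be requested again. In the second case every check value is consistent with the altered data, so the receiver has no indication that the sequence differs from the original.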