A common requirement in many computerized systems is the need to validate or verify that the contents of a body of data have not been modified in the course of storing, retrieving, transmitting, receiving, or otherwise manipulating the data. Modifications in such circumstances might result from errors during the process of converting the contents of a body of data from one physical form (say, magnetization domains on a disk platter) to another physical form (say, electromagnetic waves), or may occur as a result of deliberate tampering with the contents of the body of data (say, through the deliberate and malicious introduction of a worm into an email message).
A common technique for meeting the data validation requirement is to process the contents of a body of data with an algorithm that generates a secondary datum, smaller in size than the original body of data. This secondary datum is then associated with the original body of data. Subsequent to some manipulation of the original body of data, the manipulated contents of the data are processed with the same algorithm to generate a new version of the secondary datum. The two versions of the secondary datum are compared, and a mismatch is taken to signal a modification of the contents of the body of data.
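The generate, associate, and compare sequence described above can be illustrated with a minimal sketch. The names `byte_sum`, `attach`, and `is_unmodified` are hypothetical, and the byte-sum algorithm is only a placeholder assumed for illustration; any algorithm that maps a body of data to a smaller datum could be substituted.

```python
from typing import Callable, Tuple

# Any function that maps a body of data to a smaller secondary datum.
CheckAlgorithm = Callable[[bytes], int]


def byte_sum(data: bytes) -> int:
    """Placeholder algorithm (assumed for illustration): byte sum reduced to 16 bits."""
    return sum(data) & 0xFFFF


def attach(data: bytes, algorithm: CheckAlgorithm) -> Tuple[bytes, int]:
    """Associate the secondary datum with the original body of data."""
    return data, algorithm(data)


def is_unmodified(data: bytes, stored_code: int, algorithm: CheckAlgorithm) -> bool:
    """Recompute the secondary datum and compare it with the stored version."""
    return algorithm(data) == stored_code


original, code = attach(b"hello, world", byte_sum)
received_ok = b"hello, world"
received_bad = b"hello, w0rld"  # one byte altered after storage or transmission

print(is_unmodified(received_ok, code, byte_sum))   # True
print(is_unmodified(received_bad, code, byte_sum))  # False
```

A match does not prove that the data are unchanged, since distinct bodies of data can share a secondary datum; only a mismatch is conclusive.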
Many techniques are known in the prior art for computing the secondary datum, which is often called a check code or checksum. A simple mechanism is to start with a zero byte, then perform an exclusive-OR on the result with each successive byte of the body of data. The one's complement of the final result is used as the check code. If the one's complement byte is appended to the original body of data, an exclusive-OR of all of the bytes in the augmented data will yield a byte with every bit set (hexadecimal FF), whereas appending the uncomplemented result would yield zero. As a further example, International Standard ISO/IEC 8473-1, “Information technology—Protocol for providing the connectionless-mode network service: Protocol specification”, defines an arithmetic checksum computed for this purpose. Similarly, U.S. Pat. No. 5,247,524 describes an exemplary method of computing a checksum for transmitted data.
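The exclusive-OR mechanism can be sketched as follows. The function names `xor_all` and `xor_check_code` are hypothetical. In this sketch the exclusive-OR over the augmented data comes out as the all-ones byte, because a byte combined with its own complement sets every bit.

```python
def xor_all(data: bytes) -> int:
    """Start from a zero byte and exclusive-OR each successive byte into it."""
    acc = 0
    for byte in data:
        acc ^= byte
    return acc


def xor_check_code(data: bytes) -> int:
    """One's complement of the running exclusive-OR, kept within a single byte."""
    return xor_all(data) ^ 0xFF


payload = b"example payload"
code = xor_check_code(payload)

# Append the check code and recompute over the augmented data.
augmented = payload + bytes([code])
print(hex(xor_all(augmented)))  # 0xff
```

A receiver can therefore validate the augmented data by checking for that fixed residue instead of storing the check code separately.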
A tradeoff arises between the complexity of the algorithm used to compute the secondary datum and the reliability of the algorithm in detecting modifications between a first and second version of a data module. For example, the exclusive-OR algorithm described above is insensitive to byte order rearrangement of the contents of the data module. A more complex algorithm, the cyclic redundancy check (CRC) algorithm, uses a division/remainder procedure that is sensitive to byte order, but has a higher computational cost. Like the exclusive-OR algorithm, the CRC algorithm can easily be fooled by a deliberate modification of the contents of the data module that yields the same CRC check code as the original contents.
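The difference in sensitivity to byte order can be seen in a short sketch. It assumes the CRC-32 variant provided by `zlib.crc32` in the Python standard library as a representative CRC; the helper `xor_code` is a hypothetical name.

```python
import zlib
from functools import reduce
from operator import xor


def xor_code(data: bytes) -> int:
    """Running exclusive-OR of every byte, starting from zero."""
    return reduce(xor, data, 0)


first = b"abcd"
second = b"dcba"  # same bytes, rearranged

# Exclusive-OR is commutative, so rearranging the bytes leaves the code unchanged.
print(xor_code(first) == xor_code(second))      # True

# The CRC division/remainder procedure depends on the position of each byte.
print(zlib.crc32(first) == zlib.crc32(second))  # False
```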
Still more complex algorithms, known as cryptographic hash functions, have been developed that are straightforward to compute but produce check codes with the characteristic that it is infeasible to modify a data module without changing its check code.
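As an illustration, SHA-256 is one widely used cryptographic hash function and is available through `hashlib` in the Python standard library. The sketch below assumes SHA-256 only as an example; the helper name `digest` and the sample data are hypothetical.

```python
import hashlib


def digest(data: bytes) -> bytes:
    """Return the 32-byte SHA-256 check code for a data module."""
    return hashlib.sha256(data).digest()


original = b"module contents, version 1"
tampered = b"module contents, version 2"  # one byte differs

stored = digest(original)

print(len(stored))                 # 32
print(digest(original) == stored)  # True
print(digest(tampered) == stored)  # False
```

The check code has a fixed size regardless of the size of the data module, and no practical way is known to construct a different module that reproduces a given SHA-256 check code.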
Several systems have been described that compute check codes based on a subset of the content of a data module. These systems seek to reduce the cost of computing the check code, or to overcome weaknesses in the check code algorithm to enhance the resulting security of the check code system. U.S. Pat. No. 5,450,000 describes a method of selecting a randomly or pseudo-randomly chosen subset of the contents of a block of data when computing a check code. The method of selecting the subset is independent of the contents of the block of data. U.S. Pat. No. 7,386,627 describes a method for computing a checksum wherein two checksums are computed from two portions of a data payload, then combined to yield a final checksum. In '627, the two portions are mutually exclusive parts of the payload, but need not together constitute the entire payload. U.S. Pat. No. 7,484,096 describes a method for comparing a first body of data and a second body of data by computing check codes for each body of data and comparing the check codes, and by sampling the content of each body of data with a common sampling algorithm, and comparing the sampled content. U.S. Pat. No. 7,500,170 describes a system in which a first portion of the content of a data block is modified based on a second portion of the content of the data block, and a check code is computed based only on the first portion of the data block. The effect of the system of '170 is that the check code depends upon the entire content of the data block, even though the check code computation does not directly utilize the entire content of the data block. In each of these examples from the prior art, the selection of a subset of the data is performed without reference to the content or meaning of the data.
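The general idea of a content-independent subset can be sketched as follows. This is not the method of any of the cited patents; it is a generic illustration under stated assumptions: positions are drawn pseudo-randomly from a seed, CRC-32 serves as the check code algorithm, and the names `sample_positions` and `subset_check_code` are hypothetical.

```python
import random
import zlib
from typing import List


def sample_positions(length: int, count: int, seed: int) -> List[int]:
    """Choose byte positions from the seed alone, never from the data content."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(length), count))


def subset_check_code(data: bytes, count: int, seed: int) -> int:
    """Compute a CRC-32 over only the sampled bytes of the block."""
    positions = sample_positions(len(data), count, seed)
    return zlib.crc32(bytes(data[i] for i in positions))


block = bytes(range(64))
positions = sample_positions(len(block), 16, seed=1234)
reference = subset_check_code(block, 16, seed=1234)

# Alter one byte that the sampler skipped and one byte that it selected.
skipped = next(i for i in range(len(block)) if i not in positions)
changed_skipped = bytearray(block)
changed_skipped[skipped] ^= 0xFF
changed_sampled = bytearray(block)
changed_sampled[positions[0]] ^= 0xFF

print(subset_check_code(bytes(changed_skipped), 16, seed=1234) == reference)  # True
print(subset_check_code(bytes(changed_sampled), 16, seed=1234) == reference)  # False
```

Because the positions are fixed by the seed, the bytes that escape validation are determined by chance and not by whether they are significant to the meaning of the data.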
Integrity of data is of particular significance in the area of interactive television (iTV) application broadcast and execution. An iTV application comprises one or more binary data blocks that are broadcast with conventional video and audio content for reception and execution on a set-top box (STB). A malformed or errant iTV application may cause disruption to the normal functionality of an STB, the resolution of which may be beyond the capability of the home viewer and may necessitate an expensive service call to remedy. For this reason, each application intended for broadcast undergoes an extensive certification process, in which the application is broadcast through a delimited broadcast infrastructure to a representative sample of STB models. The execution of the application is monitored by trained technicians and engineers, and a series of tests is performed to ensure that the application meets a set of certification criteria. Once an application meets the certification criteria, an encrypted certification code is affixed to the broadcast content; when a broadcast application is downloaded by an STB, the certification code is decrypted to ensure that the application can be safely executed. The certification process introduces significant cost and delay, increasing the investment required to develop new iTV applications and slowing their deployment.
These factors are at odds with the potentially lucrative emerging market for interactive television advertisements. Advertisers and broadcasters intend to provide interactive experiences tied to short advertisement segments, in some cases targeting specific areas or individuals with appropriate content. Such targeted advertising will involve the creation of multiple versions of iTV applications, and the certification of large numbers of applications will introduce unacceptable costs into the advertising campaign budgets. In some cases the differences between the multiple versions might be as simple as substituting different textual or image content into a basic iTV application framework, an example of which is shown schematically in FIG. 1. The graphical display of an exemplary iTV application 100 comprises a text banner 110 and an image 120. In this simple application the text banner and image are displayed over the underlying video content of the advertisement, and automatically disappear after a brief period of visibility. Two versions 130, 140 of the application are shown with individual text content and graphic images. While a mere change of text and image content may not violate any of the certification criteria, such modifications will alter the broadcast content, which may trigger a requirement for a new certification process.
U.S. Pat. No. 6,851,052 describes a probabilistic method for computing a validation code that is insensitive to small numbers of bit errors in transformed data. However, the method of '052 does not distinguish the location of bit errors and thus cannot discriminate between errors in significant regions and in non-significant regions of the data.
What is required is a method of validating the content of a block of data that is capable of ignoring non-significant modifications to the data while ensuring the integrity of the remainder of the content.