Information that is used by computers is often stored in digital data files of various formats. A digital file refers to digital data that together and as a group have meaning or use to a party in possession of the file. A computer file refers to digital data that can be stored as a file on a medium such as a hard disk, a flash memory card, random access memory (RAM), read-only memory (ROM), CD-ROM, DVD-ROM, or any other medium or device designed to store data in digital form.
Digital files can be transported by wireless means, such as over a cellular phone or wireless modem, or through a wire such as over the Internet, a wired modem, a local area network (LAN), a wide area network (WAN), a virtual private network (VPN), or any other similar means.
Because digital data can be easily modified, it is often important to be able to verify the integrity of the file to confirm that the file or a subset of the file has not been altered. The important data to be verified, which may consist of the entire file or a subset of it, is referred to as “data of interest.”
One conventional way to verify the integrity of the data of interest has been to generate a data file that contains both the data of interest and a redundant copy of the data of interest encrypted in such a way that a comparison may be made between the data of interest and the redundant data. The comparison may be made by decrypting the redundant data and comparing the decrypted redundant data with the unencrypted data of interest in the file. Alternatively, the same encryption algorithm used to encrypt the redundant data may be applied to the data of interest and the newly encrypted version of the data of interest may be compared against the stored encrypted redundant data. These techniques have the drawback of requiring the storage of two copies of the digital data and are not desirable for large data files.
Another conventional technique involves applying a digest function to the data of interest to create a digest result. A digest result is a shorthand way of representing the data. Examples of a digest function include a simple checksum, a weighted checksum, bit operations, or other functions such as a cyclic redundancy check (CRC). In a simple checksum, all the bytes of the data of interest are added together and stored in an integer where the overflow bits are dropped. In a weighted checksum, each byte of data is multiplied by a weight factor before being added to the other bytes. In a bit-operations technique, each byte of the data of interest is subjected to various operations. For example, the first byte of data may be subjected to an exclusive or (XOR) operation with the subsequent byte of data, and the result XOR'ed with the next byte until all the bytes of the data of interest are exhausted. The resulting value is the digest result.
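The three digest functions described above can be sketched as follows. This is a minimal illustration only, not a definitive implementation: the function names, the 8-bit width of the simple checksum, and the repeating weights of 2 and 4 are assumptions chosen for the example.

```python
def simple_checksum(data: bytes, bits: int = 8) -> int:
    """Add all bytes and keep only the low `bits` bits (overflow bits are dropped).

    The 8-bit default width is an assumption made for this sketch.
    """
    return sum(data) & ((1 << bits) - 1)


def weighted_checksum(data: bytes, weights=(2, 4)) -> int:
    """Multiply each byte by a weight, cycling through `weights`, then add.

    The weights (2, 4) are illustrative; no truncation is applied here.
    """
    return sum(weights[i % len(weights)] * b for i, b in enumerate(data))


def xor_digest(data: bytes) -> int:
    """XOR every byte, in order, into a single running result."""
    result = 0
    for b in data:
        result ^= b
    return result
```

In each case the digest result is far smaller than the data it represents, which is what makes it a convenient shorthand.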
Because every byte of the data of interest is used to calculate the digest result, altering one or more bytes will generally change that result. If the user reapplies the digest function to the changed data, the recalculated digest result will typically not match the digest result calculated from the original data. Comparing the recalculated digest result against the original digest result, and noting any difference, allows the user to authenticate the data and detect whether alterations have occurred.
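The comparison of a recalculated digest result against a stored one might look like the following sketch. The XOR digest is used only as a stand-in digest function, and the helper name `is_unaltered` and the sample data are hypothetical.

```python
def xor_digest(data: bytes) -> int:
    """Stand-in digest function for this sketch: XOR of all bytes."""
    result = 0
    for b in data:
        result ^= b
    return result


def is_unaltered(data: bytes, stored_digest: int) -> bool:
    """Recalculate the digest and compare it with the stored digest result."""
    return xor_digest(data) == stored_digest


# Hypothetical data of interest and its digest, recorded while it was authentic.
original = b"pay 100 to Alice"
stored = xor_digest(original)

# One byte is later changed ("1" becomes "9").
tampered = b"pay 900 to Alice"

print(is_unaltered(original, stored))   # True
print(is_unaltered(tampered, stored))   # False
```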
However, using a digest function has drawbacks. A digest function may map multiple sets of data into the same digest result. Therefore, a digest result calculated for an altered, forged set of data may be equal to the digest result calculated for the authentic set of data. To pass a forged set of data as authentic, it is possible to analyze the digest function to find out which sets of data that are different from the authentic data set yield the same digest result as the authentic set. The data of interest cannot be authenticated with any reliability if a digest function alone is used.
An example follows that demonstrates how using a digest function may yield the same digest result for different data sets. In this example, a weighted checksum is used as the digest function. Two different sets of data are subjected to this digest function. The authentic set contains the values 1, 3, 5, and 2, and the forged set contains the values 3, 2, 5, and 2. The weighted sum multiplies the data values by weights of 2 and 4, repeating the weights periodically until all data is exhausted:

2*1 + 4*3 + 2*5 + 4*2 = 32   (1)
2*3 + 4*2 + 2*5 + 4*2 = 32   (2)

Application of this digest function to both sets of data yields the same digest result of 32.
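The arithmetic of this example can be checked with a short sketch. The function name `weighted_checksum` is an assumption for illustration; the data values and weights are those given in the example.

```python
def weighted_checksum(values, weights=(2, 4)):
    """Weighted sum with the weights repeating periodically over the values."""
    return sum(weights[i % len(weights)] * v for i, v in enumerate(values))


authentic = [1, 3, 5, 2]
forged = [3, 2, 5, 2]

print(weighted_checksum(authentic))  # 32
print(weighted_checksum(forged))     # 32
```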
The type of weakness demonstrated by the foregoing example can exist in other authentication schemes that are based on digest results of data. Because it is possible to map more than one set of data to the same digest result, it is possible for a clever programmer to break the authentication scheme and cause forged data to be mistaken for authentic data. In the above example, the first weight factor, the number 2, is half the second weight factor, the number 4. Increasing the first data point by 2, from 1 to 3, therefore adds 4 to the weighted sum, while decreasing the second data point by 1, from 3 to 2, subtracts 4, thus canceling the effect of the increase in the first data point.
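The cancellation just described can be generalized in a brief sketch: raise the first value by some amount and lower the second by a compensating amount so that the weighted sum is unchanged. The function names are hypothetical, and the sketch assumes at least two data values and a change to the first value that the second weight can offset exactly.

```python
def weighted_checksum(values, weights=(2, 4)):
    """Weighted sum with the weights repeating periodically over the values."""
    return sum(weights[i % len(weights)] * v for i, v in enumerate(values))


def forge_with_same_digest(values, delta, weights=(2, 4)):
    """Return altered values whose weighted checksum equals the original's.

    values[0] is raised by `delta`; values[1] is lowered so that
    weights[0] * delta is exactly offset by the change in weights[1] * values[1].
    """
    w0, w1 = weights[0], weights[1]
    if (w0 * delta) % w1 != 0:
        raise ValueError("delta cannot be offset exactly by the second weight")
    forged = list(values)
    forged[0] += delta
    forged[1] -= (w0 * delta) // w1
    return forged


authentic = [1, 3, 5, 2]
forged = forge_with_same_digest(authentic, delta=2)
print(forged)  # [3, 2, 5, 2]
print(weighted_checksum(forged) == weighted_checksum(authentic))  # True
```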
Therefore, a more secure system and method are needed for verifying the integrity of data of interest without storing redundant copies of the data within a file.