1. Field of the Invention
The invention relates generally to a system and method for storing data, and more particularly, to a method and system for comparing data stored on a first storage system to corresponding data stored on a second storage system.
2. Description of the Related Art
In many computing environments, large amounts of data are written to and retrieved from storage devices connected to one or more computers. As more data is stored on and accessed from storage devices, it becomes increasingly difficult to reproduce data if the storage devices fail. One way of protecting data is by backing up the data to backup media (e.g., tapes or disks). The backup media may then be stored in a safe location.
Other techniques for backing up data require comparing a block of data stored on a backup storage device to a corresponding data block on a primary storage device. For example, asynchronous mirroring may be used to generate a backup copy of data, with a cache temporarily storing data written to the primary device before the data is written to the backup, or mirroring, device. In that case, an interruption in the communication between the cache and the mirroring device can cause data to be lost and the backup copy to become corrupted. Generally, in such a case, it is necessary to synchronize the mirroring device with the primary device, i.e., to ensure that each sector of data on the backup device is identical to the corresponding sector on the primary device, before storing additional data.
One method for reconciling data on the backup storage device with the data stored on the primary storage device is to compare each block of data on the backup device with the corresponding block of data on the primary device. This requires either transferring each data block from the backup device to the primary device or transferring each data block from the primary device to the backup device. In some cases this may be an adequate solution. However, this approach typically requires substantial bandwidth over the communications link between the two devices, and it can also be unacceptably slow. If the backup device is located at a remote location, these problems may be exacerbated. If a large amount of data is involved, it is often necessary to utilize a high-speed communication link between the primary device and the remote site where the backup device is located. Because high-speed communication links are typically expensive, this solution is often undesirable.
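The bandwidth cost of a full block-by-block comparison can be illustrated with simple arithmetic. The following is a minimal sketch only; the volume size and link speed are hypothetical figures chosen for illustration, and the function name `transfer_seconds` is an assumption, not part of any described system.

```python
def transfer_seconds(volume_bytes: int, link_bits_per_second: float) -> float:
    """Time to move an entire volume across a link, ignoring protocol overhead."""
    return volume_bytes * 8 / link_bits_per_second


# Hypothetical figures: a 1 TB volume sent over a 100 Mbit/s link.
seconds = transfer_seconds(10**12, 100 * 10**6)
hours = seconds / 3600  # roughly 22 hours for a single full comparison
```

Under these assumed figures, every block must cross the link even if only a few blocks actually differ, which is why the approach scales poorly with volume size.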
This approach additionally poses security risks. Whenever a block of data is transmitted over the communication link, a third party may have an opportunity to intercept the data. The third party may intercept the data for espionage purposes, sabotage purposes, etc.
Techniques have been developed to reduce both the bandwidth requirements and the time needed to synchronize data between primary and backup storage devices. One approach is to identify and flag blocks of data on the backup device that are inconsistent with the corresponding data blocks on the primary device, and copy from the primary device to the backup device only the flagged data blocks. In accordance with one such technique, the backup device uses a known function to generate, for a respective data block, a first digest that represents the contents of the data block, and transmits the first digest to the primary device. The primary device retrieves a corresponding block of data and uses the same function to generate a second digest. The primary device then compares the first digest to the second digest. If the digests match, then the data blocks stored in the corresponding storage locations are assumed to be duplicates of one another. If the digests are not the same, then the data blocks stored in the corresponding storage locations are different. If the data blocks are different, the data block from the primary device is transmitted over the communication link to the backup device.
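The digest-based technique described above can be sketched as follows. This is a minimal, single-process illustration that assumes SHA-256 as the known function and models each device as an in-memory list of blocks; the names `digest` and `synchronize` are hypothetical, and a real system would exchange the digests over a communication link.

```python
import hashlib
from typing import List


def digest(block: bytes) -> bytes:
    """The known function shared by both devices (SHA-256 is an assumption)."""
    return hashlib.sha256(block).digest()


def synchronize(primary: List[bytes], backup: List[bytes]) -> List[int]:
    """Copy to the backup only those blocks whose digests differ.

    Returns the indices of the blocks that were copied.
    """
    copied = []
    for index, backup_block in enumerate(backup):
        first_digest = digest(backup_block)      # generated by the backup device
        second_digest = digest(primary[index])   # generated by the primary device
        if first_digest != second_digest:
            # Digests differ, so the blocks differ: send the primary's block.
            backup[index] = primary[index]
            copied.append(index)
    return copied


primary = [b"a" * 4096, b"b" * 4096, b"c" * 4096]
backup = [b"a" * 4096, b"x" * 4096, b"c" * 4096]
copied = synchronize(primary, backup)  # only the mismatched block is copied
```

In this sketch only the digest of each block, rather than the block itself, would need to cross the link for blocks that already match.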
To be practical, a digest should be substantially smaller in size than the data block. Ideally, each digest is uniquely associated with the respective data block from which it is derived. Any one of a wide variety of functions can be used to generate a digest. Cryptographically strong hash functions are often used for this purpose. Another well-known function is the cyclic redundancy check (CRC). A digest-generating function is referred to herein as a D-G function.
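As an illustration of the size requirement, the sketch below compares the size of a data block with the sizes of two digests of it, one from a cryptographic hash function and one from a CRC. The 4096-byte block size and the choice of SHA-256 and CRC-32 are assumptions made for the example.

```python
import hashlib
import zlib


def digest_sizes(block: bytes) -> dict:
    """Return the size in bytes of a block and of two digests derived from it."""
    sha256_digest = hashlib.sha256(block).digest()
    crc32_digest = zlib.crc32(block).to_bytes(4, "big")  # CRC-32 is a 32-bit value
    return {
        "block": len(block),
        "sha256": len(sha256_digest),
        "crc32": len(crc32_digest),
    }


sizes = digest_sizes(bytes(4096))  # hypothetical 4096-byte block of zeros
```

With these assumed parameters, the SHA-256 digest is 32 bytes and the CRC-32 digest is 4 bytes, both far smaller than the 4096-byte block they represent.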
A D-G function that generates a unique digest for each data block is said to be “collision-free.” In practice, it is sometimes acceptable to implement a D-G function that is substantially, but less than 100%, collision-free.
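Because a digest is smaller than the data block it represents, distinct blocks can share a digest. The sketch below makes this visible with a deliberately weak, hypothetical D-G function, `tiny_digest`, which keeps only one byte of a SHA-256 hash; with only 256 possible digests, any 257 distinct blocks must contain a collision.

```python
import hashlib


def tiny_digest(block: bytes) -> bytes:
    """A deliberately weak 8-bit digest, used only to illustrate collisions."""
    return hashlib.sha256(block).digest()[:1]


def find_collision():
    """Return two distinct blocks that share the same tiny digest."""
    seen = {}
    for i in range(257):  # 257 inputs, 256 possible digests: a collision must occur
        block = i.to_bytes(2, "big")
        d = tiny_digest(block)
        if d in seen:
            return seen[d], block
        seen[d] = block
    raise AssertionError("unreachable: guaranteed by the pigeonhole principle")


block_a, block_b = find_collision()
```

A synchronization operation relying on such a digest would wrongly treat `block_a` and `block_b` as duplicates, which is why longer digests are used in practice.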
Although this technique significantly reduces the amount of data that must be transmitted in order to synchronize two storage volumes, it does not entirely resolve the security problem. If the D-G function employed in the process is reversible, a third party may intercept the digest and derive the data block from the digest. Even if the D-G function is irreversible, a party familiar with the synchronization operation may intercept the digest, alter data in one or more of the storage systems, and in a subsequent synchronization operation retransmit the intercepted digest at the appropriate moment, thereby concealing the altered data.
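The replay scenario described above can be sketched as follows. This is an illustrative model only: SHA-256 stands in for an irreversible D-G function, the block contents are invented, and `primary_check` is a hypothetical name for the comparison performed by the primary device.

```python
import hashlib


def digest(block: bytes) -> bytes:
    """An irreversible D-G function (SHA-256 is an assumption)."""
    return hashlib.sha256(block).digest()


def primary_check(primary_block: bytes, received_digest: bytes) -> bool:
    """The primary device's test: does the received digest match its own block?"""
    return digest(primary_block) == received_digest


original = b"ledger: balance=100"
intercepted = digest(original)       # captured during an earlier synchronization

tampered = b"ledger: balance=999"    # backup copy later altered by the third party

# Honest operation: the backup's real digest exposes the alteration.
detected = not primary_check(original, digest(tampered))

# Replay: the stale, intercepted digest is sent instead, so the check passes.
concealed = primary_check(original, intercepted)
```

Because the digest of a given block is the same in every synchronization operation, the primary device in this sketch cannot distinguish a fresh digest from a retransmitted one.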