1. Field of the Invention
The present invention is directed to data corruption detection and fault isolation.
2. Description of the Related Art
Disaster recovery systems typically address two types of failures: a sudden catastrophic failure at a single point in time, or data loss over a period of time. In the second type, a gradual disaster, updates to volumes on data storage may be lost. To assist in recovery of data updates, a copy of data may be provided at a remote location. Such dual or shadow copies are typically made as an application system is writing new data to a primary storage device at a primary storage subsystem. The copies are stored in a secondary storage device at a secondary storage subsystem.
During the transfer of data from the primary storage subsystem to the secondary storage subsystem, it is possible for the data being transferred to become corrupted by errors in hardware, in microcode, or in interconnection links between the primary and secondary subsystems.
It is important to detect data corruption as early as possible and to determine where the data corruption took place. For example, in some systems, detecting an error while removing data from cache (e.g., during a destage) at either the primary or secondary subsystem will suspend the primary and secondary storage subsystems, and the data will no longer be in memory in a channel adapter at the primary or secondary subsystem to aid in detecting where the error was introduced.
Some systems solve this problem by calculating and checking a longitudinal redundancy check (LRC) value over data on both the primary and secondary storage subsystems. LRC may be described as an error checking technique that generates a longitudinal parity byte from a specified string or block of bytes (e.g., 512 bytes) on a longitudinal track. At the primary storage subsystem, the generated parity byte is sent with the string or block of bytes to the secondary storage subsystem. When the string or block of bytes is received, the receiving computer regenerates the parity byte and compares the regenerated parity byte to the transmitted parity byte. If the parity bytes do not match, an error is detected. The secondary storage subsystem notifies the primary storage subsystem that an error was detected, and the primary storage subsystem resends the data. Unfortunately, an LRC may be defeated by multiple bit errors and may not detect improperly aligned and/or truncated data transfers.
Also, when conventional systems use LRC to detect data corruption on the secondary storage subsystem, they do not isolate where the data corruption originated.
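By way of illustration only, the LRC check described above and its limitations may be sketched as follows. The sketch assumes a common XOR-based form of longitudinal parity, which is not specified in the description above; the function names `lrc` and `verify` and the sample 512-byte block are hypothetical and chosen only for this example.

```python
def lrc(block: bytes) -> int:
    """Return a longitudinal parity byte for the block.

    Assumption: the parity byte is the XOR of every byte in the block,
    which is one common LRC variant.
    """
    parity = 0
    for value in block:
        parity ^= value
    return parity


def verify(block: bytes, transmitted_parity: int) -> bool:
    """Regenerate the parity byte at the receiver and compare it."""
    return lrc(block) == transmitted_parity


# Primary side: compute the parity byte and send it with a 512-byte block.
payload = bytes(range(256)) * 2
parity = lrc(payload)

# A single flipped bit changes the regenerated parity byte, so it is detected.
one_error = bytearray(payload)
one_error[10] ^= 0x01
single_bit_detected = not verify(bytes(one_error), parity)

# Two flips in the same bit position of different bytes cancel out under XOR,
# so this multiple-bit error goes undetected.
two_errors = bytearray(payload)
two_errors[10] ^= 0x01
two_errors[20] ^= 0x01
double_bit_detected = not verify(bytes(two_errors), parity)

# XOR does not depend on byte order, so a block shifted by one byte
# (a simple stand-in for a misaligned transfer) also passes the check.
rotated = payload[1:] + payload[:1]
misalignment_detected = not verify(rotated, parity)

print(single_bit_detected, double_bit_detected, misalignment_detected)
```

In this sketch, a parity mismatch indicates only that the block received differs from the block sent. The check carries no information about whether the corruption arose in hardware, in microcode, or in an interconnection link.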
Thus, there is a need in the art for improved data corruption detection and fault isolation.