Computing systems often include a mass storage system for storing data. One popular type of mass storage system is a “RAID” (redundant arrays of inexpensive disks) storage system. A detailed discussion of RAID systems is found in a book entitled, The RAID Book: A Source Book for RAID Technology, published Jun. 9, 1993, by the RAID Advisory Board, Lino Lakes, Minn.
A typical RAID storage system includes a controller and a disk array coupled together via a communication link. The disk array includes multiple magnetic storage disks for storing data. In operation, the controller of a RAID storage system operates to receive Input/Output (I/O) commands from an external host computer. In response to these I/O commands, the controller reads and writes data to the disks in the disk array and coordinates the data transfer between the disk array and the host computer. Depending upon the RAID implementation level, the controller in a RAID system also generates and writes redundant data to the disk array. The redundant information enables regeneration of user data in the event that the data becomes corrupted.
A RAID level 1 storage system, for example, includes one or more disks (data disks) for storing data and an equal number of additional “mirror” disks for storing the redundant data. The redundant data in this case is simply a copy of the data stored in the data disks. If data stored in one or more of the data disks becomes corrupted, the mirror disks can then be used to reconstruct the corrupted data. Other RAID levels store redundant data for data distributed across multiple disks. If data on one disk becomes corrupted, the data in the other disks are used to reconstruct the corrupted data.
One common function of a controller in a RAID storage system is to periodically test each disk for data corruption. This is often accomplished by the controller performing a series of read operations upon each disk. If data is not received by the controller from a particular read operation, the corresponding data is determined to be corrupted. The controller may then notify the user or may automatically attempt to recover the data or both. If, however, data is received as a result of the read operation, the data is assumed to be valid.
One problem with this type of test is that it may not detect corrupt but readable data stored in the disk array. This type of data corruption is sometimes referred to as “silent data corruption”. That is, the data can be read but does not accurately represent the pattern of bits that was originally stored. Importantly, this type of data corruption may remain undetected for an extended period of time. This is especially true if the corruption occurs in a small segment of the stored data (such as a single data block) that is rarely accessed by the user.
The occurrence of silent data corruption can significantly increase the risk of data loss in a RAID storage system. For example, consider a RAID-5 system that has five disks, with the fifth disk being a parity disk. Let us assume that a single data block stored on the second disk is readable but is corrupted and that this corruption remains undetected. Now further assume that the fourth disk fails. The failure of the fourth disk is subsequently detected by the controller and the controller begins a data recovery operation to recover the data in the fourth disk. Unfortunately, because there is a corrupt data block in the second disk, some data on the fourth disk is not recoverable. In addition, the corrupted data block on the second disk is also not recoverable.
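The scenario above can be illustrated with a minimal sketch. It assumes a single stripe of four data blocks plus one XOR parity block on the fifth disk; the block contents and the helper names `xor_blocks` and `rebuild_missing` are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe, failed_index):
    """Rebuild the block of a failed disk from all surviving blocks in the stripe."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# One stripe: four data blocks, with parity stored on the fifth disk.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
clean_stripe = data + [xor_blocks(data)]

# With no corruption, the block on the fourth disk is rebuilt correctly.
clean_rebuild = rebuild_missing(clean_stripe, failed_index=3)

# Silent corruption: the second disk is still readable but returns wrong bits.
bad_stripe = list(clean_stripe)
bad_stripe[1] = b"BxBB"

# The fourth disk then fails, and the rebuild silently yields wrong data.
bad_rebuild = rebuild_missing(bad_stripe, failed_index=3)

print("clean rebuild matches original:", clean_rebuild == data[3])
print("rebuild with silent corruption matches original:", bad_rebuild == data[3])
```

Because the rebuild XORs every surviving block, the undetected error on the second disk propagates directly into the reconstructed block of the fourth disk, and nothing remains from which to reconstruct the second disk's original contents.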
In view of the foregoing, what is needed is an improved way to detect silent data corruption in a RAID storage system before additional data is lost in the storage system.
In one embodiment of the invention a controller for a disk array is provided. The disk array is for storing at least one data block and an error code for the data block. The controller operates to periodically select the data block to be tested for corruption. Upon the data block being selected to be tested, the controller further operates to determine if the block is corrupt based upon the value of the data block and the value of the error code for the data block.
In another embodiment, a storage system is provided. The storage system includes a disk array for storing at least one data block and an error code for the data block. In addition, the storage system includes a first and a second apparatus. The first apparatus is operable to periodically select the data block to be tested for corruption. The second apparatus is operable to respond to the data block being selected by determining if the data block is corrupt by using the value of the data block and the value of the error code.
In yet another embodiment, a computer product is provided for identifying corrupt data stored in a disk array for storing a plurality of data blocks and an error code for each of the data blocks. The computer product includes a computer usable medium having a computer readable program means for causing the computer to: (a) periodically select a data block from the plurality of data blocks to be tested for corruption; (b) perform a read operation upon the disk array to read the data block and the error code for the data block; and (c) if the data block and error code are readable, determine whether the data block is corrupt based upon the value of the data block and the value of the error code. In one implementation, the computer product is a controller for the disk array, the controller being operable to receive I/O commands from an external host computer and to transfer data between the disk array and the host computer in response to the I/O commands. In one preferred implementation, the program is a background program having a pre-determined level of priority.
In addition, the program may further cause the computer to attempt to recover the data block if the data block is determined to be corrupt.
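Steps (a) through (c) can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which error code is used, so CRC-32 (via Python's `zlib.crc32`) stands in for it, and the `DiskArray` class, `check_block`, and `scrub_pass` are hypothetical names for an in-memory stand-in rather than a real controller interface.

```python
import zlib


class DiskArray:
    """Hypothetical in-memory disk array mapping block id -> (data, error code)."""

    def __init__(self):
        self._store = {}

    def write(self, block_id, data):
        # Store the block together with its error code (CRC-32 is an assumption).
        self._store[block_id] = (data, zlib.crc32(data))

    def read(self, block_id):
        # Return (data, error code), or None if the block cannot be read.
        return self._store.get(block_id)

    def corrupt_silently(self, block_id, new_data):
        # Change the stored bits without updating the error code.
        _, code = self._store[block_id]
        self._store[block_id] = (new_data, code)

    def block_ids(self):
        return sorted(self._store)


def check_block(array, block_id):
    """Steps (b) and (c): read the block and its code, then compare them."""
    result = array.read(block_id)
    if result is None:
        return "unreadable"
    data, code = result
    return "ok" if zlib.crc32(data) == code else "corrupt"


def scrub_pass(array):
    """Step (a), simplified: select every block once and test it."""
    return {block_id: check_block(array, block_id) for block_id in array.block_ids()}


array = DiskArray()
array.write(0, b"hello world")
array.write(1, b"payload data")
array.corrupt_silently(1, b"paylOad data")  # still readable, wrong bits

report = scrub_pass(array)
print(report)
```

In a controller, a pass like `scrub_pass` would presumably be run periodically as a low-priority background task, and a block reported as corrupt would then be handed to whatever recovery mechanism the RAID level provides.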
Other aspects and advantages of the present invention will become apparent from the following description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the present invention.