Computing systems often include a mass storage system for storing data. One popular type of mass storage system is a “RAID” (redundant array of inexpensive disks) storage system. A detailed discussion of RAID systems is found in a book entitled The RAID Book: A Source Book for RAID Technology, published Jun. 9, 1993, by the RAID Advisory Board, Lino Lakes, Minn.
A typical RAID storage system includes a controller and a disk array coupled together via a communication link. The disk array includes multiple magnetic storage disks for storing data. In operation, the controller of a RAID storage system receives Input/Output (I/O) commands from an external host computer. In response to these I/O commands, the controller reads data from and writes data to the disks in the disk array and coordinates the data transfer between the disk array and the host computer. Depending upon the RAID implementation level, the controller in a RAID system also generates and writes redundant data to the disk array according to a particular data redundancy scheme. The redundant data enables recovery of user data in the event that the data becomes corrupted.
A RAID level one (RAID 1) storage system includes one or more data disks for storing data and an equal number of additional “mirror” disks for storing the redundant data. The redundant data in this case is simply a copy of the user data, and that copy is stored in the mirror disks. If data stored in one or more of the data disks becomes corrupted, the mirror disks can then be used to reconstruct the corrupted data. Other RAID levels store redundant data for user data distributed across multiple disks. If data on one disk becomes corrupted, the data in the other disks is used to reconstruct the corrupted data.
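The two recovery paths described above can be illustrated with a minimal sketch. It assumes, for the non-mirrored case, a single XOR parity block per stripe, as is conventional for RAID 5. The function names and the sample byte values are hypothetical and are used for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_from_mirror(mirror_block):
    """RAID 1 style recovery: the mirror block is a plain copy of the data."""
    return bytes(mirror_block)


def recover_from_parity(surviving_blocks, parity_block):
    """Parity style recovery (assumed XOR parity): XOR the surviving data
    blocks with the parity block to rebuild the one missing block."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Hypothetical stripe of three data blocks spread over three disks.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(stripe)

# Suppose the block on the second disk becomes corrupted.
rebuilt = recover_from_parity([stripe[0], stripe[2]], parity)
mirrored = recover_from_mirror(stripe[1])
```

In this sketch, a second corrupted block in the same stripe would leave the parity equation with two unknowns, so the data could no longer be rebuilt.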
Each of the RAID levels is associated with a particular mix of design tradeoffs. For example, a RAID 1 storage system will typically have a higher “mean time to failure” (MTTF) and a higher I/O rate than a RAID 5 storage system. For purposes of this document, the term “failure” refers to the actual loss of data. For example, if a single byte of user data becomes corrupted in a RAID 1 storage system, a failure has not occurred as long as the corresponding mirror data can still be used to recover the corrupt data. If, however, the corresponding mirror data also becomes corrupted, a failure has occurred as the data is now unrecoverable. Thus, MTTF can be considered a measure of the risk of data loss.
In order to combine the advantages of more than one data redundancy scheme, hierarchical data storage systems have been developed. Such systems typically include more than one storage area each for storing data according to a particular data redundancy scheme. For example, in a typical hierarchical RAID storage system, data can be stored according to multiple RAID architectures.
One common type of hierarchical RAID storage system includes a RAID 1 storage area and a RAID 5 storage area. Critical data is stored in the RAID 1 storage area to take advantage of the relatively higher MTTF and higher I/O rate. Less critical data is stored in the RAID 5 storage area to take advantage of the lower cost per megabyte characteristics of a RAID 5 data redundancy scheme.
One common function of the controller in a hierarchical RAID storage system is to intermittently test each disk in the system for the presence of data corruption. This serves to increase the MTTF of the storage system, as corrupt data that is detected can be recovered before a failure occurs, that is, before the data that would be used to recover the corrupt data also becomes corrupted. Historically, these tests were accomplished by testing the data uniformly across the storage system.
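The value of such a test can be shown with a minimal sketch of one test step on a mirrored block. This document does not state how corruption is detected; the sketch assumes a CRC-32 checksum stored for each block, and the function name, status strings, and sample data are hypothetical.

```python
import zlib


def scrub_mirrored_block(data, mirror, expected_crc):
    """Check one data block and its mirror against an assumed stored CRC-32.

    Returns (data, mirror, status). A corrupt copy is rewritten from the
    good copy. "failure" means both copies are corrupt and the data is lost.
    """
    data_ok = zlib.crc32(data) == expected_crc
    mirror_ok = zlib.crc32(mirror) == expected_crc
    if data_ok and mirror_ok:
        return data, mirror, "clean"
    if mirror_ok:
        return mirror, mirror, "repaired data copy"
    if data_ok:
        return data, data, "repaired mirror copy"
    return data, mirror, "failure"


good = b"user data"
stored_crc = zlib.crc32(good)
corrupt = b"user dAta"  # one corrupted byte

# Corruption found while the mirror is still intact: recoverable.
caught_early = scrub_mirrored_block(corrupt, good, stored_crc)

# The same corruption found only after the mirror is also corrupt: data loss.
caught_late = scrub_mirrored_block(corrupt, b"usEr data", stored_crc)
```

The difference between the two calls is only when the test runs relative to the second corruption, which is why testing more often reduces the risk of a failure.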
In one embodiment, the invention is implemented as a method of testing a data storage system for corrupt data. The storage system includes a first data storage area associated with a first mean time to failure (MTTF) and a second data storage area associated with a second MTTF. The method preferably includes testing the first storage area for a first amount of time and testing the second storage area for a second amount of time. The first amount of time is based at least upon the first MTTF and the second MTTF. Also, the second amount of time is based upon the first MTTF and the second MTTF.
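One way to derive both amounts of time from both MTTF values is sketched below. This document does not give a formula here; the inverse-proportional weighting, the function name, and the MTTF figures are assumptions chosen for illustration.

```python
def split_test_time(total_seconds, mttf_first, mttf_second):
    """Divide a testing budget between two storage areas.

    Assumption: each area's share is proportional to 1 / MTTF, so the area
    more likely to lose data is tested for longer. The expression below is
    the algebraically simplified form of
    (1 / mttf_first) / (1 / mttf_first + 1 / mttf_second).
    """
    if total_seconds < 0 or mttf_first <= 0 or mttf_second <= 0:
        raise ValueError("budget must be non-negative and MTTFs positive")
    first_share = mttf_second / (mttf_first + mttf_second)
    first_time = total_seconds * first_share
    return first_time, total_seconds - first_time


# Hypothetical figures: a RAID 1 area with three times the MTTF of a RAID 5 area.
raid1_time, raid5_time = split_test_time(600, mttf_first=300_000, mttf_second=100_000)
```

With these hypothetical figures, a quarter of the 600-second budget goes to the first area and the remainder to the second. Each amount depends on both MTTF values, since changing either one shifts the split.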
It is noted that the method may be performed within a hierarchical RAID storage system. The first storage area may be for storing data according to a first RAID redundancy scheme and the second storage area may be for storing data according to a second RAID redundancy scheme. The first RAID redundancy scheme may be, for example, a RAID level one redundancy scheme. The second RAID redundancy scheme may be a RAID level five or a RAID level six data redundancy scheme.
In another embodiment, the invention is implemented as a data storage system. The data storage system includes a first data storage area associated with a first MTTF, a second data storage area associated with a second MTTF, and means for testing the first data storage area for data corruption at a first frequency based at least upon the first and second MTTF. In addition, the testing means may further be for testing the second data storage area for data corruption at a second frequency based at least upon the first MTTF and the second MTTF. The first data storage area may be for storing data according to a first data redundancy scheme and the second data storage area may be for storing data according to a second data redundancy scheme. The first data redundancy scheme may be, for example, a RAID level one data redundancy scheme. In addition, the second data redundancy scheme may be, for example, a RAID level five data redundancy scheme or a RAID level six data redundancy scheme.
In yet another embodiment, the invention is implemented as another data storage system. In this case, the data storage system includes a first data storage area associated with a first MTTF, a second data storage area associated with a second MTTF, and a controller operative to receive I/O commands from an external host and to coordinate data transfers between the external host and the first and the second data storage areas. The controller is further operative to test the first storage area for corruption at a first frequency based at least upon the first MTTF and the second MTTF. In addition, the controller may be further operative to test the second storage area at a second frequency based at least upon the first MTTF and the second MTTF. If the first MTTF is less than the second MTTF, the first frequency is higher than the second frequency. The first data storage area may be for storing data according to a RAID level one data redundancy scheme and the second data storage area may be for storing data according to a RAID level five data redundancy scheme or a RAID level six data redundancy scheme.