Disk drives are well-known components of computer systems. Advances in disk drive technology have led to substantial increases in storage capacity, increased disk rotation speeds, and lower head flying heights. With these advances, there has been an increased need to detect conditions that may indicate that a head crash is imminent.
Detection of such so-called “pre-crash” conditions is referred to as “predictive failure analysis” or PFA. Conventional predictive failure analysis involves measuring a number of operating parameters of the disk drive, including head flying height, hard error rates, soft error rates, vibration, and disk runout, and comparing such parameters with predetermined thresholds. When a failure is indicated or confirmed by PFA, a warning may be issued to a host computer so that suitable preventative measures may be taken, such as transferring data from the disk drive and/or replacing the disk drive.
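The threshold comparison described above can be sketched as follows. This is a minimal illustration only: the parameter names, the `Threshold` type, the `pfa_check` function, and every numeric limit are hypothetical and are not taken from any particular drive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical acceptable range for one measured parameter."""
    low: float
    high: float


# Illustrative limits only; real drives use vendor-specific values.
THRESHOLDS = {
    "flying_height_nm": Threshold(low=8.0, high=float("inf")),
    "soft_error_rate": Threshold(low=0.0, high=1e-6),
    "hard_error_rate": Threshold(low=0.0, high=1e-9),
}


def pfa_check(measurements):
    """Return the sorted names of parameters that fall outside their range."""
    failures = []
    for name, value in measurements.items():
        limit = THRESHOLDS.get(name)
        if limit is None:
            continue  # No threshold is defined for this parameter.
        if not (limit.low <= value <= limit.high):
            failures.append(name)
    return sorted(failures)
```

A non-empty result would correspond to the case in which a warning may be issued to the host computer.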
Disk drives are increasingly used cooperatively in groups, clusters or arrangements of multiple drives. One such arrangement is RAID (Redundant Array of Independent Disks), a set of methods and algorithms for combining multiple disk drives (i.e., a storage array) into a group whose attributes are better than those of the individual disk drives. RAID can be used to improve data integrity (i.e., reduce the risk of losing data due to a defective or failing disk drive), cost, and/or performance.
RAID was initially developed to improve I/O performance at a time when computer CPU speed and memory size were growing exponentially. The basic idea was to combine several small inexpensive disks (with many spindles) and stripe the data (i.e., split the data across multiple drives), such that reads or writes could be done in parallel. To simplify the I/O management, a dedicated controller would be used to facilitate the striping and present these multiple drives to the host computer (e.g., server) as one logical drive.
The problem with this approach was that the small, inexpensive PC disk drives of the time were far less reliable than the larger, more expensive drives they replaced. An artifact of striping data over multiple drives is that if one drive fails, all data on the other drives is rendered unusable. To compound this problem, by combining several drives together, the probability of at least one drive out of the group failing increased dramatically.
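The growth in failure probability can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that drives fail independently and that each has the same failure probability over the period of interest; neither assumption comes from the text above, and the function name is hypothetical.

```python
def prob_any_failure(p_single, n_drives):
    """Probability that at least one of n_drives fails.

    Assumes independent failures with identical per-drive probability
    p_single (an illustrative simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_drives < 1:
        raise ValueError("n_drives must be at least 1")
    # All drives survive with probability (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_single) ** n_drives
```

Under these assumptions, a per-drive failure probability of 0.05 gives a probability of roughly 0.34 that at least one drive in a group of eight fails.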
In order to overcome this pitfall, extra drives were added to the RAID group to store redundant information. In this way, if one drive failed, another drive within the group would contain redundant information that could be used to regenerate the lost information. Since all of the information was still available, the end user would not be affected by downtime and the rebuild could be done in the background. If users requested information that had not already been rebuilt, the data could be reconstructed on the fly without the end user being aware of it.
Today there are six base architectures (levels) of RAID, ranging from “Level 0 RAID” to “Level 5 RAID”. These levels provide alternative ways of achieving storage fault tolerance, increased I/O performance and true scalability. Three main building blocks are used in all RAID architectures:
1) Data Striping: Data from the host computer is broken up into smaller chunks and distributed to multiple drives within a RAID array. Each drive's storage space is partitioned into stripes. The stripes are interleaved such that the logical storage unit is made up of alternating stripes from each drive. Major benefits are improved I/O performance and the ability to create large logical volumes. Data striping is used in Level 0 RAID.
2) Mirroring: Data from the host computer is duplicated on a block-to-block basis across two disks. If one disk drive fails, the data remains available on the other disk. Mirroring is used in RAID levels 1 and 1+0.
3) Parity: Data from the host computer is written to multiple drives. One or more drives are assigned to store parity information. In the event of a disk failure, parity information is combined with the remaining data to regenerate the missing information. Parity is used in RAID levels 3, 4 and 5.
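The striping and parity building blocks can be illustrated with a small sketch. It assumes byte-wise XOR parity over a single stripe with one dedicated parity chunk, which is the common textbook formulation; the function names and the chunk size used in the example are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data, n_data_drives, chunk_size):
    """Split data into one stripe of chunks and append a parity chunk.

    Returns a list of n_data_drives data chunks followed by one parity
    chunk; each list element stands for what one drive would store.
    """
    capacity = n_data_drives * chunk_size
    if len(data) > capacity:
        raise ValueError("data does not fit in a single stripe")
    padded = data.ljust(capacity, b"\x00")  # Zero-pad a short final chunk.
    chunks = [padded[i * chunk_size:(i + 1) * chunk_size]
              for i in range(n_data_drives)]
    return chunks + [xor_blocks(chunks)]


def rebuild_missing(stripe, missing_index):
    """Regenerate the chunk of a failed drive from the surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)
```

For example, after `stripe = stripe_with_parity(b"HELLOWORLD!!", 3, 4)`, calling `rebuild_missing(stripe, 1)` regenerates the second data chunk from the two other data chunks and the parity chunk. Because XOR is its own inverse, the same routine regenerates a lost parity chunk as well.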
If a drive fails in a RAID array that includes redundancy (meaning all RAID architectures with the exception of RAID 0), it is desirable to get the drive replaced immediately so the array can be returned to normal operation. There are two reasons for this: fault tolerance and performance. If the array is running in a degraded mode due to a drive failure, then until the failed drive is replaced, most RAID levels will be running with no fault protection at all: a RAID 1 array is reduced to a single drive, and a RAID 3 or RAID 5 array becomes equivalent to a RAID 0 array in terms of fault tolerance. At the same time, the performance of the array will be reduced, sometimes substantially.
Typically, PFA is performed on drives within a RAID/Server system at regular time intervals, such as every four hours. Each drive performs PFA measurements at this interval, but the phase of the intervals is different for each drive. This may be done deliberately in order to avoid all drives performing PFA measurements and calculations at the same time, which might reduce system performance.
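A staggered schedule of this kind can be sketched as follows. The text states only that the phase differs from drive to drive; spreading the phases evenly across the interval is an assumption made here for illustration, and the function names are hypothetical. Times are in whole seconds.

```python
def skewed_offsets(n_drives, interval_s):
    """Give each drive a distinct phase offset within one PFA interval.

    Even spacing is an illustrative choice, not a requirement.
    """
    return [i * interval_s // n_drives for i in range(n_drives)]


def next_pfa_time(now_s, offset_s, interval_s):
    """Return the next time at or after now_s when a drive runs PFA.

    The drive runs PFA at offset_s, offset_s + interval_s, and so on.
    """
    elapsed = (now_s - offset_s) % interval_s
    if elapsed == 0:
        return now_s
    return now_s + (interval_s - elapsed)
```

With four drives and a four-hour (14400-second) interval, `skewed_offsets(4, 14400)` yields offsets of 0, 3600, 7200 and 10800 seconds, so that no two drives perform PFA at the same moment.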
While this technique, also known as PFA interval phase skew, is good for RAID/Server system performance, there are situations where it works against reliability. For example, it would be desirable to perform PFA for all drives in a RAID at the same time just prior to a rebuild operation. A rebuild operation is performed whenever a drive fails and the drive that replaces it needs to be written with new data. Rebuilds typically take 2-3 hours but can take longer under high usage conditions. If a second drive fails during the rebuild process, the customer loses all of the data on the RAID, which can be more than 200 gigabytes of data. Performing PFA on all drives prior to a rebuild allows drives at risk of failure to be identified beforehand, and thus decreases the probability of an unexpected second drive failure while the rebuild process is taking place. A second kind of data loss that can occur during the rebuild operation is a “strip data loss”. A strip data loss results from an unrecovered read error during rebuild, and typically involves the loss of 64 KB or 128 KB of data.
In addition to forcing PFA prior to a RAID rebuild, it would also be desirable to force a PFA before and/or after the RAID/Server system is physically moved to a new location. A forced PFA would also be useful if the RAID/Server system is suspected to be damaged or the RAID/Server system's usage pattern has undergone a change (e.g., the unit has not been in use or has seen only light use, and is now planned for heavy use). In all of these instances, it is necessary for the PFA to be initiated at the system level, rather than at the drive itself.
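The idea of initiating PFA at the system level can be illustrated with a hypothetical sketch. Nothing below is drawn from an actual drive or RAID controller interface: the `SystemEvent` names, the `Drive` stand-in class, and the `force_pfa` function are all assumptions introduced only to show a system-level request overriding the per-drive schedule.

```python
from enum import Enum, auto


class SystemEvent(Enum):
    """Hypothetical system-level reasons for forcing PFA."""
    REBUILD_PENDING = auto()
    UNIT_MOVED = auto()
    SUSPECTED_DAMAGE = auto()
    USAGE_CHANGE = auto()


class Drive:
    """Hypothetical stand-in for a drive that can run PFA on request."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.pfa_runs = 0

    def run_pfa(self):
        """Pretend to measure; return True if no pre-crash condition is seen."""
        self.pfa_runs += 1
        return self.healthy


def force_pfa(drives, event):
    """Run PFA on every drive immediately, regardless of its schedule.

    Returns a report naming the triggering event and the drives that
    flagged a possible pre-crash condition.
    """
    at_risk = [d.name for d in drives if not d.run_pfa()]
    return {"event": event.name, "at_risk": at_risk}
```

In this sketch, a caller would invoke `force_pfa(drives, SystemEvent.REBUILD_PENDING)` before starting a rebuild and inspect the `at_risk` list before proceeding.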
Present systems all perform PFA at the drive level rather than at the system level. These PFAs are typically performed either on an automatic semi-periodic basis as described previously (e.g., every 4 hours) or are triggered by a specific event on the drive itself during normal operation. Ser. No. 10/023,262, filed Dec. 18, 2001, entitled “Adaptive Event-Based Predictive Failure Analysis Measurements in a Hard Disk Drive”, describes one drive-level initiated PFA scheme. In that scheme, a trigger event within the drive is detected, and, in response to the detected trigger event, a predictive failure analysis is performed with respect to the disk drive hardware. Examples of drive-level trigger events include increases in media/servo error rates, temperature/humidity readings that are outside of a normal operating range, a load/unload event, and a start-stop event.
It would be desirable to provide RAID/System server-initiated PFA measurements driven by server-initiated events. Such server-initiated events include RAID rebuild operations, RAID usage, addition of a new or used RAID unit to an existing server system, suspected handling damage to a RAID unit, or a change in usage pattern of a particular RAID unit. The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts.