Although hard disk drives are usually reliable for use in mass storage computer systems, such as networked file storage computer systems, disk drives are mechanical devices that are susceptible to degradation and failure. When a disk drive fails, the consequences for the data stored on the disk drive are usually severe. The failure of a disk drive can result in lost data or files that were produced through a significant investment of time and effort. For commercial or enterprise operations which rely upon disk drives to retain irreplaceable customer records, the failure of a disk drive can be catastrophic.
To guard against the failure of individual disk drives, certain mass storage techniques have been developed which make, or make it possible to create, a redundant copy of the original data. In the event the original data is no longer accessible due to a disk drive failure, the data is restored by accessing the redundant copy of the data or by creating that redundant copy. Within the realm of commercial or enterprise operations in which an enormous amount of information is stored on disk, the preferred approach for storing the data on hard disk drives is one of the many different configurations of a redundant array of independent or inexpensive disks (RAID). The redundancy provided by grouping the disk drives into a RAID and controlling them as a group beneficially allows a system to maintain continuous operation in the event of a disk drive failure.
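As an illustration of how a redundant copy can be created after a failure rather than simply read back, the following minimal sketch assumes a single-parity scheme in which one parity block is the byte-wise XOR of the data blocks in a stripe. The text above does not specify a RAID configuration, so the scheme, the function names, and the sample blocks are all assumptions made for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the parity block for one stripe (assumed single-parity scheme)."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recreate the one missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical stripe spread across three data drives.
stripe = [b"disk0 AA", b"disk1 BB", b"disk2 CC"]
parity = make_parity(stripe)

# Simulate losing the second drive and recreating its block.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
assert recovered == stripe[1]
```

Because XOR is its own inverse, the same routine that produces the parity block also recreates any one missing block, which is why a scheme of this kind tolerates the loss of a single drive per stripe.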
During the normal operation of the mass storage computer system, data is transferred to and from the hard disk drives over a communication link. A mass storage adapter manages the transfer of data between the hard disk drives and a server computer. Various interface protocols exist to manage the reading and writing of data between the mass storage adapter and the server computer, and such interface protocols include advanced technology attachment (ATA), serial advanced technology attachment (SATA), small computer systems interface (SCSI), fibre channel (FC), and serial attached SCSI (SAS). Although these various interface protocols are effective, their responses to the server are limited to detecting and reporting a failure of the disk drive, and do not extend to the real-time or advance recognition of deteriorating performance in an impaired disk drive which remains partially functional, although at a diminished capacity. The interface protocols provide inadequate, if any, real-time warning of the impending failure of a disk drive.
Although the ability to recover data after a disk drive failure is of tremendous benefit, waiting for the disk drive or some portion of the disk drive to fail has disadvantages. Impairment of a disk drive can occur because of the sudden introduction of a particulate contaminant which destroys or impairs the magnetic recording media upon which the data is written, thereby destroying or damaging the data at the location of the contamination. The magnetic recording media of the disk drive is also subject to gradual magnetic degradation over time, in which case the data written to the magnetic media becomes more difficult to read and write due to the diminished magnetic strength. Flawed mechanical or electrical operation of the disk drive can also cause the data to be written on the magnetic recording media without adequate strength for future I/O use.
In the case of an instantaneous disk drive failure, the failure is recognized quickly by the interface protocol and the server is notified so that no further I/O commands are addressed to that disk drive and so that remedial action can be taken to attempt to recover the data contained on the failed disk drive. In contrast, when the disk drive becomes impaired through the gradual degradation of its components, the efficiency of executing I/O commands decreases slowly and the computer continues to address I/O commands to the inefficiently operating disk. Until the adversely affected disk or disk portion degrades to the point of failure, the overall performance of the mass storage computer system continues to degrade with the disk drive. The decreasing efficiency resulting from the continued degradation of the disk drive remains undetected because the interface protocol usually recognizes only disk drive failures.
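The contrast drawn above can be sketched with a toy model. The sketch assumes that each completed I/O command yields a success flag, standing in for what the interface protocol reports, and a service time. The `Completion` record, the function names, and the latency figures are hypothetical and are not taken from any real protocol.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    ok: bool           # status as the interface protocol would report it
    latency_ms: float  # hypothetical service time of the command


def failure_only_view(completions):
    """Mimic a check that reacts only to reported failures."""
    return all(c.ok for c in completions)


def mean_latency(completions):
    """Average service time, which a failure-only check never inspects."""
    return sum(c.latency_ms for c in completions) / len(completions)


healthy = [Completion(True, 5.0) for _ in range(4)]
# Every command still succeeds, but each takes longer than the last.
degraded = [Completion(True, 5.0 * (i + 1)) for i in range(4)]

assert failure_only_view(healthy) and failure_only_view(degraded)
assert mean_latency(healthy) == 5.0
assert mean_latency(degraded) == 12.5
```

In this model both drives look identical to a failure-only check, even though the degraded drive's average service time is two and a half times higher.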
The inefficient execution of I/O commands can adversely affect the performance and operation of the computer system in a number of ways. Inefficient I/O command execution reduces the data throughput and overall efficiency of the mass storage computer system. If the inefficient execution of the I/O commands becomes substantial, an application timeout called by, for example, a program or an operating system may occur. An application timeout is the maximum amount of time established for execution of the particular application. The application timeout is set to indirectly recognize hardware or equipment failures or problems with software execution such as program hangs. Upon an application timeout, it is necessary to restart or reboot the entire computer system, which can become a very time-consuming and intricate task during which no mass storage data operations occur.
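The timeout mechanism described above can be sketched as follows. This is a minimal illustration that assumes a slow I/O command can be modeled as a function that sleeps before returning. The helper names `run_with_timeout` and `simulated_io` and the delay values are hypothetical. The sketch covers only the detection step, not the restart or reboot that follows.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_timeout(operation, timeout_s):
    """Return ("done", result), or ("timeout", None) if the operation
    does not finish within timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(operation)
    try:
        return ("done", future.result(timeout=timeout_s))
    except FutureTimeout:
        return ("timeout", None)
    finally:
        # Do not block on the slow worker; it finishes in the background.
        pool.shutdown(wait=False)


def simulated_io(delay_s):
    """Build a stand-in I/O command that takes delay_s seconds (assumed model)."""
    def op():
        time.sleep(delay_s)
        return "data"
    return op


assert run_with_timeout(simulated_io(0.01), 1.0) == ("done", "data")
assert run_with_timeout(simulated_io(0.5), 0.05) == ("timeout", None)
```

The timeout fires on elapsed time alone, so it cannot distinguish a hung program from a drive that is merely slow, which is why it recognizes such problems only indirectly.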