A common component in most computer systems, such as a personal computer (PC), laptop computer, workstation, etc., is a disk drive, also referred to as a hard disk, a hard drive, fixed disk, or magnetic disk drive. Disk drives store data on a set of platters (the disks) that are coated with a magnetic alloy that is sensitive to the magnetic fields generated by read/write heads, which are scanned over the platters using a precision head actuator. As the platters spin beneath the read/write heads at high speed (e.g., up to 10,000 revolutions per minute), electrical impulses are sent to a read/write head to write data in the form of binary bit streams on the magnetic surface of the platters. Reading is performed in an analogous manner, wherein magnetic field changes are detected in the magnetic platter surface as the platters spin to read back a binary bit stream.
As disk drives get progressively larger in storage capacity, the effect of a failed disk increases roughly in proportion. For example, a modern disk drive can store 250 or more gigabytes of data, enough storage space for literally tens of thousands of files, which is generally an order of magnitude more than the storage capacity available just a few years ago. Furthermore, it used to be fairly common to have multiple disk drives for a given PC, due in part to the desire to increase total platform storage capacity. In most instances, the failure of one of the multiple disks was not as bad as a failure of the only disk drive in the system. However, due to the massive capacity of today's disk drives, there is rarely a need to have multiple disks for a personal workstation, such as a PC.
This leads to a return to the single disk system. Although the mean time between failures (MTBF) advertised for modern disk drives is very impressive (e.g., 100,000 hours or more), the effective failure rate is significantly higher. This is primarily due to the way the MTBF values are determined. Obviously, the manufacturer wants to present data for its latest product, which means testing of that product can only be performed for a limited amount of time, such as 2000 hours or less (about 84 days). Thus, if 500 disk drives are tested for 2000 hours each (1,000,000 cumulative drive-hours), and one failure results (representing 0.2% of the drives), the computed MTBF is 1,000,000 hours. In the meantime, a significant percentage of the same drives might fail at 20,000 hours, for example. The point is that disk drives are prone to failure at much lower cumulative hours than indicated by the MTBF values, and failures are unpredictable.
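The MTBF arithmetic above can be sketched as follows. This is a minimal illustration, assuming the common convention that MTBF is estimated as total drive-hours on test divided by the number of observed failures; the function name `estimate_mtbf` is hypothetical and not taken from any manufacturer's procedure. Under that convention, 500 drives tested for 2000 hours each with one failure yield 1,000,000 hours, while a 100,000-hour figure would correspond to, for instance, 50 drives tested under the same conditions.

```python
def estimate_mtbf(drives_tested: int, hours_per_drive: float, failures: int) -> float:
    """Estimate MTBF as cumulative drive-hours divided by observed failures.

    Assumption: every drive runs for the full test duration, and at least
    one failure is observed (otherwise the estimate is undefined).
    """
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    total_drive_hours = drives_tested * hours_per_drive
    return total_drive_hours / failures


# A short test window can produce a very large MTBF figure even though
# no drive was ever observed beyond 2000 hours of operation.
print(estimate_mtbf(500, 2000, 1))  # 500 drives, 2000 h each, 1 failure
print(estimate_mtbf(50, 2000, 1))   # 50 drives, 2000 h each, 1 failure
```

The sketch makes the limitation visible: the estimate says nothing about wear-out behavior at 20,000 hours, because no drive in the test population ever reached that age.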
Disk drive original equipment manufacturers (OEMs) have long recognized the potential for disk failures at any point in time. Disk failures, whether actual or perceived, create problems for the OEMs, as well as system integrators, such as Hewlett-Packard, Dell, IBM, etc. First, if a drive actually fails, the customer will usually be very upset (potentially losing thousands of files and the corresponding work product). Second, if the failed drive is under warranty, the drive may have to be replaced. Third, replacement of a drive adds costs for both the OEM and the customer.
Perceived failures are also problematic. OEM system manufacturers typically support end users via telephone centers that answer calls, diagnose problems, and instruct end users on corrective actions. One of the biggest challenges in remote support is rapid and correct diagnosis of a problem's root cause so that the right fix can be applied. An end user often describes a symptom (for example, “Windows will not boot”) that has many possible causes. End users have a natural tendency to assume that the hard drive is defective (“so probably my hard drive is broken”), because programs and data reside on the hard drive and many error messages are not easily understandable, such as “data error occurred in user application at 8E000”. One OEM estimates that 80% of the drives returned to it as defective are found to have no defects.
In view of these diagnostic problems, OEMs have developed built-in diagnostic and testing capabilities. One such set of diagnostics, called the S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) system, is built into most modern ATA (Advanced Technology Attachment) and SCSI (Small Computer System Interface) disk drives. S.M.A.R.T. disk drives internally monitor their own health and performance. In many cases, the disk itself provides advance warning that something is wrong, helping to avoid the failure scenarios described above. Most implementations of S.M.A.R.T. also allow users to perform self-tests on the disk and to monitor a number of performance and reliability attributes.
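The attribute monitoring described above can be illustrated with a minimal sketch. It assumes the commonly documented S.M.A.R.T. convention in which each attribute carries a normalized value and a vendor-defined threshold, and an attribute is considered failed when its normalized value is less than or equal to a nonzero threshold. The `SmartAttribute` type, the `failing_attributes` helper, and all sample values are hypothetical; a real tool would read these fields from the drive rather than from a hard-coded list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SmartAttribute:
    attr_id: int    # attribute identifier reported by the drive
    name: str       # human-readable label (illustrative)
    value: int      # normalized current value; higher is healthier
    threshold: int  # vendor-defined failure threshold; 0 means never failing


def failing_attributes(attrs: List[SmartAttribute]) -> List[SmartAttribute]:
    """Return attributes whose normalized value is at or below the threshold.

    Assumption: a threshold of 0 marks an informational attribute that
    is never reported as failed.
    """
    return [a for a in attrs if a.threshold > 0 and a.value <= a.threshold]


# Hypothetical sample data, invented for illustration only.
SAMPLE = [
    SmartAttribute(5, "Reallocated Sectors Count", 100, 36),
    SmartAttribute(10, "Spin Retry Count", 90, 97),
    SmartAttribute(197, "Current Pending Sector Count", 100, 0),
]

for attr in failing_attributes(SAMPLE):
    print(f"attribute {attr.attr_id} ({attr.name}) is at or below its threshold")
```

Reporting the specific attribute that crossed its threshold, rather than a single pass/fail flag, is one way a diagnostic could give support staff more to work with than a generic warning.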
One current approach for accessing built-in diagnostics is via diagnostic user applications. User applications run on top of the operating system kernel and are selectively run by users. Accordingly, diagnostics data can be obtained only if a user runs the diagnostics. Unfortunately, users are not likely to run the diagnostics unless they detect a problem, and some types of problems may prevent the operating system from loading, thus preventing the diagnostic user application from being run in the first place. More recently, some operating systems have added the ability to automatically perform some disk drive diagnostic testing during boot-up. However, this does little for computer systems that are infrequently booted and run for days or weeks at a time. Furthermore, the level of detail provided by the operating system (such as “disk failure imminent”) is so general, or has historically proven so inaccurate, that the warnings are often ignored.