1. Field of the Invention
The present invention relates to the field of disk drive devices in computer systems, and more particularly to fault-tolerant arrays of magnetic disk drive devices.
2. Art Background
Computer systems often employ a disk drive device for the secondary storage and retrieval of large amounts of data. Disk drive devices, however, are subject to a number of possible failures which can compromise data. For example, certain tracks on a particular disk may be affected by defects in the magnetic recording media. Alternatively, data errors can be produced by the non-uniform flying height of the read/write head over the magnetic disk. Under certain circumstances, a problem referred to as "stiction" can occur wherein the read/write head comes into contact with, and adheres to, the surface of the magnetic disk. Power outages can also cause spin-motor or servo-motor seizures. In a limited number of cases, the power supply or the controller board for a disk drive can fail completely, or a disk drive can lose functionality when the data is written onto the disk, but regain functionality when the data is read back. All of these potential failures pose a threat to the integrity of data. The extent of this threat is typically estimated by disk drive manufacturers and provided in the form of a Mean Time Between Failure (MTBF) figure, a figure which presently ranges from 20,000 to 100,000 hours.
In recent years, the failure rate for disk drives has taken on greater significance as an increasing number of systems have moved away from the use of a single, large, expensive disk toward the incorporation of an array of smaller, inexpensive disks. While an array of smaller, inexpensive disks offers an improved data transfer rate and lower costs, it also poses significant reliability issues. In particular, if one assumes a constant MTBF, and that disk failures occur independently of one another, the reliability of an array of disks can be calculated according to the following equation: MTBF of the disk array = (MTBF of a single disk) / (number of disks in the array). From this equation, it will be appreciated that the MTBF of a disk array falls in inverse proportion to the number of disks it contains, and that large arrays therefore raise substantial reliability concerns.
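The effect of the above equation can be illustrated with a short calculation. The following is a minimal sketch only: the function name `array_mtbf` and the array sizes shown are assumptions chosen for illustration, and the single-disk figures are the two ends of the MTBF range quoted above. The same assumptions as the equation apply, namely a constant MTBF and independent disk failures.

```python
def array_mtbf(single_disk_mtbf_hours: float, num_disks: int) -> float:
    """Return the MTBF of a disk array with no redundancy.

    Assumes a constant per-disk MTBF and independent disk failures,
    so the array MTBF is the single-disk MTBF divided by the disk count.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return single_disk_mtbf_hours / num_disks


if __name__ == "__main__":
    # Single-disk MTBF figures: the ends of the range quoted in the text.
    # Array sizes are illustrative assumptions, not taken from the text.
    for disk_mtbf in (20_000, 100_000):
        for disks in (1, 10, 100):
            hours = array_mtbf(disk_mtbf, disks)
            print(f"{disks:>3} disks at {disk_mtbf:>7} h each: "
                  f"array MTBF = {hours:>9.1f} h")
```

Under these assumptions, an array of 100 disks, each rated at 100,000 hours, has an expected MTBF of only 1,000 hours, or roughly six weeks of continuous operation.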
A number of solutions have been proposed to resolve the problem of reliability in disk drive arrays. (See, for example, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," David A. Patterson, Garth Gibson, and Randy H. Katz, Report No. UCB/CSD 87/391, Computer Science Division (EECS), University of California, Berkeley, December 1987.) One prior art solution utilizes a redundant disk for each data disk, and effectively mirrors all data on redundant disks. Although such a mirroring approach virtually ensures data integrity, it is expensive and uses up to 50% of the system's total disk storage capacity to ensure reliability. An alternative prior art solution utilizes Hamming Codes for error detection and correction. This solution, however, also utilizes a considerable number of redundant disks, and due to its complexity, cannot be performed in real time in hardware.
As will be described, the present invention provides a method and apparatus for detecting and correcting disk drive failures in an array of disk drives, which method and apparatus require only a minimal number of redundant disks. In addition, the implementation of the method and apparatus of the present invention is simple enough that it can advantageously be accomplished in real time, in hardware.