1. Field of the Invention
This invention is in the field of mass data storage systems for digital computers. Specifically, it is a control system for use with a plurality of disk drive memories, the memories being controlled in parallel and the control system being capable of detecting and correcting data transmission errors and disk drive failures without interrupting the operation of the system.
2. Description of the Relevant Art
Magnetic disk drive memories for use with digital computer systems are known. Although many types of disk drives are known, the present invention will be described as using hard disk drives. Nothing herein should be taken to limit the invention to that particular embodiment.
Many computer systems use a plurality of disk drive memories to store data. A common known architecture for such systems is shown in FIG. 1. Therein, computer 10 is coupled by means of bus 15 to disk array 20. Disk array 20 is comprised of large buffer 22, bus 24, and a plurality of disk drives 30, each disk drive having an associated disk controller 35. Bus 24 interconnects buffer 22 and the disk controllers. Each disk drive 30 is accessed and the data thereon retrieved individually. The disk controller 35 associated with each disk drive controls the input/output operations for the particular disk drive to which it is coupled. Data placed in buffer 22 is available for transmission to computer 10 over bus 15. When the computer transmits data to be written on the disks, controllers 35 receive the data for the individual disk drives 30 from bus 24. In this type of system, disk operations are asynchronous in relationship to each other.
All disk operations, in particular writing and reading, have an associated probability of error. Procedures and apparatus have been developed which can detect and, in some cases, correct the errors which occur during the reading and writing of the disks. In a generic disk drive, the disk is divided into a plurality of sectors, each sector having the same, predetermined size. Each sector has a header field, which gives the sector a unique address; a header field code, which allows for the detection of errors in the header field; a data field of variable length, with each sector's data field being equal in length to the data field of every other sector; and ECC ("Error Correction Code") codes, which allow for the detection and correction of errors in the data.
When a disk is written to, the disk controller reads the header field and the header field code. If the sector is the desired sector and no header field error is detected, the new data is written into the data field and the new data ECC is written into the ECC field.
Reading operations are similar in that initially both the header field and the header field code are read. If no header field errors exist, the data and the data ECC are read. If no error is detected, the data is transmitted to the computer. If errors are detected, the error correction circuitry located within the disk controller tries to correct them. If correction is possible, the corrected data is transmitted. Otherwise, the disk drive's controller signals to the computer or master disk controller that an uncorrectable error has been detected.
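The sector write and read procedures described above can be illustrated with a minimal sketch. It is not the circuitry of any particular disk controller. The names `Sector`, `format_sector`, `write_sector`, `read_sector`, and `SECTOR_SIZE` are hypothetical, and CRC-32 is used here as an assumed stand-in for both the header field code and the data ECC. Correction is limited to a single flipped bit, found by trial; real controllers use stronger codes.

```python
import zlib
from dataclasses import dataclass

SECTOR_SIZE = 16  # hypothetical data-field length in bytes, the same for every sector


class UncorrectableError(Exception):
    """Raised when an error is detected but cannot be corrected."""


@dataclass
class Sector:
    header: bytes      # header field: the sector's unique address
    header_code: int   # header field code (assumed here to be CRC-32)
    data: bytes        # data field
    ecc: int           # data ECC (assumed here to be CRC-32)


def format_sector(address: int) -> Sector:
    header = address.to_bytes(4, "big")
    data = bytes(SECTOR_SIZE)
    return Sector(header, zlib.crc32(header), data, zlib.crc32(data))


def _check_header(sector: Sector, address: int) -> None:
    # Read the header field and header field code before touching the data.
    if zlib.crc32(sector.header) != sector.header_code:
        raise UncorrectableError("header field error")
    if int.from_bytes(sector.header, "big") != address:
        raise LookupError("not the desired sector")


def write_sector(sector: Sector, address: int, data: bytes) -> None:
    if len(data) != SECTOR_SIZE:
        raise ValueError("data must fill the data field exactly")
    _check_header(sector, address)
    sector.data = data              # new data goes into the data field
    sector.ecc = zlib.crc32(data)   # new data ECC goes into the ECC field


def read_sector(sector: Sector, address: int) -> bytes:
    _check_header(sector, address)
    if zlib.crc32(sector.data) == sector.ecc:
        return sector.data          # no error detected
    # Attempt correction: try each single-bit flip until the ECC matches.
    buf = bytearray(sector.data)
    for bit in range(len(buf) * 8):
        buf[bit // 8] ^= 1 << (bit % 8)
        if zlib.crc32(buf) == sector.ecc:
            return bytes(buf)       # corrected data is transmitted
        buf[bit // 8] ^= 1 << (bit % 8)  # undo the trial flip
    raise UncorrectableError("uncorrectable data error")
```

In this sketch the `UncorrectableError` exception plays the role of the signal that the disk controller sends to the computer or master disk controller.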
FIG. 2 shows a known disk drive system which has an associated error correction circuit external to the individual disk controllers. This system uses a Reed-Solomon code both to detect and to correct errors. Reed-Solomon codes are known and the information required to generate them is described in many references. One such reference is "Practical Error Correction Design for Engineers," published by Data Systems Technology Corp., Broomfield, Colo. For purposes of this application, it is necessary to know that the Reed-Solomon code generates redundancy terms, herein called P and Q redundancy terms, which terms are used to detect and correct data transmission errors. In the system shown in FIG. 2, ECC unit 42 is coupled to bus 45. The bus is individually coupled to a plurality of data disk drives, numbered here 47, 48, and 49, as well as to the P and Q term disk drives, numbered 51 and 53, through Small Computer System Interfaces ("SCSIs") 54 through 58. The American National Standards Institute ("ANSI") has promulgated a standard for SCSI which is described in ANSI document number X3.130-1986. Bus 45 is additionally coupled to large output buffer 55. Buffer 55 is in turn coupled to computer 60. In this system, as blocks of data are read from the individual data disk drives, they are individually and serially placed on the bus and simultaneously transmitted both to the large buffer and the ECC unit. The P and Q terms from disk drives 51 and 53 are transmitted to ECC unit 42 only. The transmission of data and the P and Q terms over bus 45 occurs serially. The bus width can be any arbitrary size, with 8- and 16-bit wide buses being common. After a large block of data is assembled in the buffer, the calculations necessary to detect and correct data errors, which use the terms received from the P and Q disk drives, are performed within the ECC unit 42.
If errors are detected, the transfer of data to the computer is interrupted and the incorrect data is corrected, if possible.
During write operations, after a block of data is assembled in buffer 55, new P and Q terms are generated within ECC unit 42 and written to the P and Q disk drives at the same time that the data in buffer 55 is written to the data disk drives.
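The role of the P and Q redundancy terms can be illustrated with a minimal sketch. It is one common Reed-Solomon-style construction with two redundancy symbols, not necessarily the code used by ECC unit 42. The field polynomial 0x11d, the generator 2, and the function names `pq_terms` and `correct_single_error` are assumptions made for illustration. Each call handles one byte taken from each data disk drive.

```python
# Log/antilog tables for GF(2^8); polynomial 0x11d and generator 2 are assumed.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def pq_terms(symbols):
    """Return (P, Q) for one byte from each data drive (fewer than 255 drives)."""
    p = q = 0
    for i, d in enumerate(symbols):
        p ^= d                     # P: plain XOR of the data symbols
        q ^= gf_mul(EXP[i], d)     # Q: XOR of the symbols weighted by g**i
    return p, q


def correct_single_error(symbols, p, q):
    """Detect and, if possible, correct one wrong symbol at an unknown position."""
    p_now, q_now = pq_terms(symbols)
    s0, s1 = p ^ p_now, q ^ q_now          # syndromes
    if s0 == 0 and s1 == 0:
        return list(symbols)               # no error detected
    if s0 == 0 or s1 == 0:
        return list(symbols)               # only the stored P or Q term is wrong
    position = (LOG[s1] - LOG[s0]) % 255   # s1 / s0 = g**position
    if position >= len(symbols):
        raise ValueError("uncorrectable: more than one symbol in error")
    fixed = list(symbols)
    fixed[position] ^= s0                  # s0 is the error value
    return fixed


# Usage: write three data bytes, then recover from one corrupted byte on read.
data = [0x12, 0x34, 0x56]
p, q = pq_terms(data)                      # generated during the write
recovered = correct_single_error([0x12, 0xFF, 0x56], p, q)
```

With two redundancy terms, this construction can locate and correct one erroneous symbol whose position is not known in advance. When the failed position is already known, as when a drive reports its own failure, the same two terms suffice to rebuild two missing symbols; that case is omitted here.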
Those disk drive systems which utilize known error correction techniques have several shortcomings. In the systems illustrated in FIGS. 1 and 2, data transmission is serial over a single bus with a relatively slow rate of data transfer. Given that the bus has a fixed width, it takes a fixed and relatively large amount of time to build up data in the buffer for transmission either to the disks or to the computer. Additionally, as the error correction circuitry must wait until a block of data of predefined size is assembled in the buffer before it can detect and correct errors therein, there is an unavoidable delay while such detection and correction takes place. Finally, if the large, single buffer fails, all the disk drives coupled thereto become unusable.
Therefore, a system which has a plurality of disk drives and which can increase the rate of data transfer between the computer and the disk drives and more effectively match the data transfer rate to the computer's maximum efficient operating speed is desirable. The system should also be able to conduct this high rate of data transfer while still performing all necessary error detection and correction functions. Finally, the system should provide an acceptable level of performance even when individual disk drives fail.