1. Field of the Invention
This invention relates to an array type recording system used as a computer storage system, and more particularly to improvements in the performance and reliability of a disk drive system in which a number of disk units are arranged in an array configuration.
2. Description of the Related Art
Various documents and patents have been published on disk drive systems that each consist of a number of disk units arranged in an array. One of these documents, from the University of California, Berkeley, describes systems that dramatically improve the reliability of data stored in mass storage. This paper, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proc. ACM SIGMOD Conf., Chicago, Ill., June 1988, classifies data reliability improvement systems into five levels ranging from a conventional mirror disk system to a block interleave parity system. These levels are outlined below:
RAID Level 1
A normal mirror (shadow) system which stores the same data in two groups of disk units. The RAID level 1 system has long been used with computer systems for which high reliability is required. However, its large redundancy leads to a high cost per unit capacity.
RAID Level 2
The Hamming code format used with DRAM is applied. Data in a redundant group is stored on disks with bits interleaved. To enable 1-bit error correction, an ECC code is written onto a number of check disks per group (four check disks are required when the number of data disks is ten), one group consisting of about 10 to 25 disk units. Redundancy is somewhat large.
RAID Level 3
A parity disk is fixed for use and data is recorded on data disks in the group with bytes interleaved. Since the error location is found from the ECC for each drive, only one parity disk is required. The RAID level 3 system is suited to high speed transfer with synchronized spindle rotation.
RAID Level 4
A parity disk is fixed for use and data is recorded on data disks in the group with blocks interleaved. RAID level 4 differs from level 3 in the interleave unit. That is, because data is recorded in block units, the RAID level 4 system is more appropriate for cases in which small data is accessed often.
RAID Level 5
Unlike level 3 or 4, the RAID level 5 system does not have a fixed parity disk and stripes parity data across the component disks. Thus, at a write operation, load concentration on a parity disk does not occur and IOPS increases (the higher the write percentage, the more advantageous the RAID level 5 system is over RAID level 4). Both operating performance and capacity efficiency are good.
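The difference between a fixed parity disk (levels 3 and 4) and striped parity (level 5) can be illustrated with a minimal sketch. The rotation rule below is only one possible placement and is an assumption made for illustration; the cited paper and patents may place parity differently. The function names `parity_disk_fixed` and `parity_disk_rotating` are hypothetical.

```python
def parity_disk_fixed(stripe: int, num_disks: int) -> int:
    """Levels 3 and 4: the parity always lives on the same (last) disk."""
    return num_disks - 1


def parity_disk_rotating(stripe: int, num_disks: int) -> int:
    """Level 5 (assumed rotation rule): parity moves one disk per stripe."""
    return (num_disks - 1 - stripe) % num_disks


def layout(num_stripes: int, num_disks: int, placement) -> list:
    """Return one row per stripe, marking each disk as 'P' (parity) or 'D' (data)."""
    rows = []
    for stripe in range(num_stripes):
        p = placement(stripe, num_disks)
        rows.append(["P" if disk == p else "D" for disk in range(num_disks)])
    return rows


if __name__ == "__main__":
    for row in layout(5, 5, parity_disk_rotating):
        print(" ".join(row))
```

With the rotating rule, each of the five disks holds parity for exactly one stripe in every five, so parity writes are spread over all disks instead of being concentrated on one.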
A conventional example of a redundant array type disk drive system is "Array Type Disk Drive System and Method" by Array Technology Corporation of the U.S.A., disclosed in Japanese Patent Laid-Open No. Hei 2-236714, wherein the redundancy level and the number of logical units of the component disk units as viewed from the host computer can be selected.
The method of striping parity data is shown in Japanese Patent Laid-Open No. Sho 62-293355, "Data Protection Feature" by International Business Machines Corporation of the U.S.A.
FIG. 28 is a block diagram of the array type disk drive system disclosed in Japanese Patent Laid-Open No. Hei 2-236714 mentioned above, for example. In the figure, numeral 2 is a host interface (I/F) serving as a buffer between a host computer (not shown) and an array controller, numeral 3 is a microprocessor which controls the array controller, numeral 4 is a memory, and numeral 5 is an EOR engine which generates redundant data and restores data. Numeral 6 is a common data bus which connects the host I/F 2, the microprocessor 3, the memory 4, and the EOR engine 5. Numeral 8 is a channel controller; a plurality of channel controllers are connected to the data bus 6. Numeral 9 is a disk unit and numeral 10 is a channel; each of the disk units 9 is connected via the corresponding channel 10 to the corresponding channel controller 8. Numeral 13 is the array controller, which controls the disk units 9.
FIG. 29 is a drawing for illustrating generation of redundant data on RAID. As shown in FIG. 29, stored on one of five disks is redundant data (parity) of data on the other four disks. The parity is calculated by exclusive-ORing the data on the four disks. That is, the parity data on the parity disk P results from exclusive-ORing the data on disks 0 to 3. For example, if the data on disk 0 cannot be read due to some fault, providing such parity as redundant data enables the data on disk 0 to be restored. That is, the data resulting from exclusive-ORing the data on disks 1 to 3 and parity disk can be used to restore the data on disk 0.
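The parity generation and restoration described above can be sketched as follows. This is a minimal illustration, not the EOR engine of the cited system; the helper name `xor_blocks` and the two-byte sample blocks are assumptions chosen for brevity.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise exclusive-OR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents of disks 0 to 3 (two bytes each, for brevity).
data = [
    bytes([0x11, 0x22]),
    bytes([0x33, 0x44]),
    bytes([0x55, 0x66]),
    bytes([0x77, 0x88]),
]

# Parity stored on the parity disk P: exclusive-OR of disks 0 to 3.
parity = xor_blocks(data)

# Disk 0 cannot be read: exclusive-OR disks 1 to 3 with the parity disk.
restored_disk0 = xor_blocks(data[1:] + [parity])
assert restored_disk0 == data[0]
```

The same rule restores any single missing member of the group, because exclusive-ORing a value twice cancels it out.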
Although the parity can be calculated by exclusive-ORing the data on the four disks as described above, alternatively the old data on the disk onto which new data is to be written and the current parity data stored on the parity disk may be read, and the three types of data (the new data, the old data, and the parity data) exclusive-ORed to provide new parity data. This method is described in conjunction with FIG. 30. For example, to record new data DN (2) on disk 2, first the old data is read as DO (2) from disk 2. At the same time, the current parity data DO (P) is read from the parity disk. Next, the three types of data, DN (2), DO (2), and DO (P), are exclusive-ORed to generate new parity data DN (P). Then, the new data DN (2) is recorded on disk 2. Last, the new parity data DN (P) is recorded on the parity disk.
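The parity update of FIG. 30 can be sketched as below, together with a check that it gives the same result as recomputing the parity from all four data disks. The function names and the sample byte values are assumptions made for illustration only.

```python
def xor3(a: bytes, b: bytes, c: bytes) -> bytes:
    """Byte-wise exclusive-OR of three equally sized blocks."""
    return bytes(x ^ y ^ z for x, y, z in zip(a, b, c))


def update_parity(new_data: bytes, old_data: bytes, old_parity: bytes) -> bytes:
    """DN(P) = DN(2) xor DO(2) xor DO(P)."""
    return xor3(new_data, old_data, old_parity)


def full_parity(disks) -> bytes:
    """Parity recomputed from every data disk, used here only for comparison."""
    result = bytes(len(disks[0]))
    for block in disks:
        result = bytes(x ^ y for x, y in zip(result, block))
    return result


# Hypothetical contents of disks 0 to 3 and the current parity DO(P).
disks = [b"\x01\x02", b"\x03\x04", b"\x05\x06", b"\x07\x08"]
old_parity = full_parity(disks)

# Record new data DN(2) on disk 2.
new_data = b"\xaa\xbb"
old_data = disks[2]                                      # read DO(2)
new_parity = update_parity(new_data, old_data, old_parity)
disks[2] = new_data                                      # write DN(2)

# The shortcut agrees with a full recomputation over all four disks.
assert new_parity == full_parity(disks)
```

The shortcut touches only the target disk and the parity disk, but it needs two reads before the two writes, which is the preread cost discussed later in this section.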
Next, the operation of the disk drive system shown in FIG. 28 is described. In FIG. 28, the host computer (not shown) always writes data into and reads data from the disk system via the host I/F 2. When data is stored, instructions and data from the host computer are temporarily stored in the buffer memory 4 via the data bus 6. When data is reproduced, the data held in the buffer memory 4 is transferred via the host I/F 2 to the host computer.
FIGS. 31A and 31B are an internal block diagram and an operation flowchart of the host I/F 2. In FIG. 31A, numeral 21 is an I/F protocol chip and numeral 22 is a microcomputer. The I/F protocol chip 21 is an interface chip which handles SCSI (small computer system interface), and the microcomputer 22 analyzes the contents of data received at the I/F protocol chip 21 and outputs the result to the array controller 13 shown in FIG. 28. As shown in the flowchart of FIG. 31B, the microcomputer 22 checks a given command for validity, then analyzes the contents of the command and converts the logical address to a physical address according to the analysis result. Thus, the command validity check, command decode, and address conversion are executed sequentially. These steps take, for example, between 300 microseconds and 1 millisecond. Therefore, even if the performance of other hardware devices is improved, the data transfer speed cannot be substantially improved unless the performance of the microcomputer 22 in the host I/F 2 also improves.
The operation at RAID level 5 is described in conjunction with FIGS. 28 and 32. The microprocessor 3 divides data stored in the memory 4 into data blocks and determines the disk units onto which the data is to be written and the disk unit onto which the redundant data is to be written. At RAID level 5, the old data in the data blocks into which new data is to be written is required to update the redundant data, thus a read operation is executed before a write operation. Data is transferred between the memory 4 and the channel controllers 8 via the data bus 6. Redundant data is generated by the EOR engine 5 in synchronization with the data transfer.
Assuming that a data block is set to 512 bytes, for example, when 1024-byte data is written, it is recorded in two blocks 16 and 17 and parity data P is also recorded, as shown in FIG. 32. Such a recording state is called striping.
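As a minimal sketch of the division into data blocks, the following splits a record into 512-byte blocks. The zero padding of a short final block and the name `split_into_blocks` are assumptions made for illustration; the source does not state how a partial block is handled.

```python
BLOCK_SIZE = 512  # block size used in the example above


def split_into_blocks(record: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Divide a record into fixed-size blocks, zero-padding the last one (assumed)."""
    blocks = [record[i:i + block_size] for i in range(0, len(record), block_size)]
    if blocks and len(blocks[-1]) < block_size:
        blocks[-1] = blocks[-1].ljust(block_size, b"\x00")
    return blocks


# A 1024-byte record occupies two data blocks, each on a different disk unit.
record = bytes(range(256)) * 4
blocks = split_into_blocks(record)
print(len(blocks), [len(b) for b in blocks])
```

Each resulting block goes to a different disk unit, so a record larger than one block requires access to more than one disk unit.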
This is described in detail. First, the data disk units 9a and 9b to be written and the redundant data disk unit 9e are determined. Next, the EOR engine 5 is started under the control of the microprocessor 3, and an old data read command for redundant data calculation is sent to the channel controllers 8a, 8b, and 8e to which the data disk units 9a and 9b and the redundant data disk unit 9e are connected. After the old data in the data disk units 9a and 9b and the redundant data disk unit 9e has been read, the new data is written into the data disk units 9a and 9b and the new redundant data generated by the EOR engine 5 is written into the redundant data disk unit 9e as instructed by the microprocessor 3. Then, the host computer (not shown) is informed that the data write is complete. As described above, when data is written, a preread of old data is required to generate redundant data, prolonging the processing time.
When one data record is divided and recorded over two or more disks as shown in FIG. 32, the two or more disks must be accessed to access the data record, degrading performance.
Next, a data read is described. When a data read is instructed by the host computer, the microprocessor 3 calculates the data block and data disk unit where the target data is stored. For example, if the data is stored in the disk unit 9c, a read command is issued to the channel controller 8c to which the disk unit 9c is connected. Upon completion of reading the data in the disk unit 9c, the data is transferred to the memory 4 and the host computer is informed that the data read is complete.
Next, data recovery and data reconstruction on a standby disk when an error occurs are described. Data recovery is executed when it becomes impossible to read data in the disk unit 9c, for example. In that case, the microprocessor 3 reads data from all the other disk units in the redundant group containing the target data block, and the EOR engine 5 restores the data in the data block which cannot be read.
For example, assuming that the redundant group consists of disk units 9a, 9b, 9c, 9d, and 9e, data blocks are read from the disk units 9a, 9b, 9d, and 9e, the EOR engine 5 restores the data in the disk unit 9c, and the data is transferred to the memory 4. Then, the host computer is informed that data read is complete.
Thus, even if an error occurring in a disk unit makes it impossible to read data, data can be recovered, improving data reliability.
Data reconstruction is executed when it becomes impossible to use the disk unit 9c, for example. In this case, the microprocessor 3 reads data from all the other disk units in the redundant group containing the data stored in the disk unit 9c, the EOR engine 5 restores the data in the disk unit 9c, and the restored data is reconstructed on a standby disk.
For example, assuming that the redundant group consists of disk units 9a, 9b, 9c, 9d, and 9e, data is read from the disk units 9a, 9b, 9d, and 9e, the EOR engine 5 restores the data in the disk unit 9c, and the restored data is written onto a standby disk, thereby reconstructing the data in the disk unit 9c on the standby disk. Then, the unavailable disk unit 9c is replaced by the standby disk. Since the replacement operation is performed while the system is operating, the system performance degrades during the replacement processing.
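The reconstruction onto a standby disk can be sketched as a loop over every block position of the failed disk unit. This is an illustration under stated assumptions, not the procedure of the cited system: disk units are modeled as in-memory lists of blocks, the names `rebuild_on_standby` and `xor_blocks` are hypothetical, and the parity is kept on one disk unit only to keep the sample short (the same rule applies when parity is striped).

```python
def xor_blocks(blocks):
    """Byte-wise exclusive-OR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def rebuild_on_standby(group, failed):
    """Rebuild every block of the failed disk unit from the surviving units.

    group  : list of disk units, each a list of blocks (one per stripe)
    failed : index of the unavailable disk unit within the group
    """
    survivors = [disk for i, disk in enumerate(group) if i != failed]
    standby = []
    for stripe in range(len(survivors[0])):
        # One read per surviving disk unit for every stripe being rebuilt.
        standby.append(xor_blocks([disk[stripe] for disk in survivors]))
    return standby


# Hypothetical group 9a to 9e: four data units plus one unit of parity, three stripes.
data_units = [
    [bytes([unit * 16 + stripe] * 4) for stripe in range(3)] for unit in range(4)
]
parity_unit = [xor_blocks([u[s] for u in data_units]) for s in range(3)]
group = data_units + [parity_unit]

original_9c = list(group[2])
standby = rebuild_on_standby(group, failed=2)
assert standby == original_9c
```

Every rebuilt block costs one read on each surviving disk unit, which is why reconstruction performed during normal operation competes with host requests.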
Since the conventional array type disk drive system is configured as described above, a data preread is required to generate redundant data whenever data is written in normal operation, prolonging the processing time.
Furthermore, when a failed disk unit is replaced by an alternate (standby) disk in disk unit replacement processing, the system performance is degraded still further.