The present invention relates to a disk array system which stores a plurality of data in duplexed form.
Presently, in an information processing apparatus such as a computer, data required by a host device such as a central processing unit (CPU) is stored in a secondary memory and the data is read/written in response to a request from the CPU. Such a secondary memory usually uses a non-volatile memory medium, as represented for example by a magnetic disk drive or an optical disk drive.
As the integration of information systems proceeds, improvements in performance characteristics, such as the processing speed of the secondary memory, as well as improvements in reliability, have been desired. In order to meet such requirements, a disk array system comprising a plurality of disk drives (hereinafter referred to as drives) of relatively small capacity has been proposed.
Disk array systems are classified into several types depending on their methods for storing data.
FIG. 16A shows a method of storing data by using mirror disks.
In the disk array system which uses the mirror disks, identical data is stored in two drives in the same manner. In FIG. 16A, a drive #1 and a drive #2 are paired, a drive #3 and a drive #4 are paired, and one logical group is configured by the two pairs of mirror disks. In such a disk array system, the same data is stored in the drives of each pair. Thus, if a fault occurs in the drive #1 and the data stored therein cannot be read out, the data may be read from the drive #2 so that access to the data stored in the drive #1 remains enabled. Likewise, if a fault occurs in the drive #2, the data may be read from the drive #1 so that access to the data stored in the drive #2 is enabled. In this manner, in the pair of the drives #1 and #2, one copy of the data serves as backup data to enhance the resistance to a drive fault. The same is true for the pair of the drives #3 and #4.
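The mirror operation described above can be illustrated by a minimal sketch, not the patented system itself: a pair of drives is modeled as two block maps, every write goes to both members of the pair, and a read falls back to the surviving drive after a fault. All names here are hypothetical.

```python
class MirroredPair:
    """Illustrative model of one mirrored pair in the style of FIG. 16A."""

    def __init__(self):
        # Two drives modeled as block -> data maps, plus per-drive fault flags.
        self.drives = [dict(), dict()]
        self.failed = [False, False]

    def write(self, block, data):
        # Identical data is stored on both drives of the pair.
        for d in self.drives:
            d[block] = data

    def read(self, block):
        # Read from the first healthy drive; the copy on the other
        # drive serves as the backup data.
        for i, d in enumerate(self.drives):
            if not self.failed[i]:
                return d[block]
        raise IOError("both drives of the pair have failed")

pair = MirroredPair()
pair.write(7, b"data#7")
pair.failed[0] = True               # fault occurs in drive #1
assert pair.read(7) == b"data#7"    # data is still readable from drive #2
```

The pair tolerates the loss of either member, which is the resistance to drive faults noted above, at the cost of storing every block twice.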
FIG. 16B shows a disk array system of a type (RAID3) in which one data transferred from the CPU is divided into three and the divided data are parallelly stored in a plurality of drives #1 to #3. In this disk array system, when the stored data is to be read out, the divided data stored in the respective drives are simultaneously read, in the reverse of the storing operation, and the read data are assembled to reproduce the original data, which is transferred to the CPU. The parallel reading of the data from the plurality of drives, or the writing thereof, is hereinafter referred to as parallel processing. In the parallel processing of the plurality of drives, the rotation of the disks which are the recording media is synchronized for each group of drives for the parallel processing, so that the data is read and/or written at the same address for the drives in the group. Thus, the plurality of drives perform the same operation. In the disk array system in which the data is divided for the parallel processing, an error correction code (ECC) is prepared based on the divided data in order to enhance the reliability, and a drive #4 for exclusively storing the ECC is provided. When the ECC is an odd parity, the parity is set such that the number of "1" bits in each lateral row, including the parity bit, is odd for the data stored in the drives. For example, as shown in FIG. 18, it is assumed that for a row #7 the data bits of the drives #1 and #3 are "1" and the data bit of the drive #2 is "0". In this case, the parity is "1". If a fault occurs in the drive #1 (and the same holds for the drive #2 or #3) and the data cannot be read therefrom, the data of the drive #1 may be recovered from the data of the drives #2 and #3 and the parity of the drive #4.
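The parity-based recovery described above can be sketched as follows. This illustrative example uses XOR (even) parity rather than the odd parity of FIG. 18; the stored parity bit would simply be inverted in the odd-parity case, and the recovery principle is identical.

```python
from functools import reduce

def parity(stripes):
    # Bitwise XOR across the stripes of the participating drives.
    # With odd parity the stored bits would be the complement of this,
    # but recovery by XOR-ing the survivors works the same way.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), stripes)

# One row of divided data on data drives #1..#3 (cf. FIG. 18).
data = [b"\xaa", b"\x55", b"\x0f"]
p = parity(data)                       # stored on the dedicated parity drive #4

# Drive #1 fails: its stripe is recovered from drives #2, #3 and the parity.
recovered = parity([data[1], data[2], p])
assert recovered == data[0]
```

Because XOR is associative and self-inverse, any single missing stripe equals the XOR of all remaining stripes and the parity, which is why one dedicated parity drive suffices against a single drive fault.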
In a third type (RAID5) of the array disk system shown in FIGS. 17A and 17B, data is not divided but stored in one of the drives and the respective data are handled independently. In such a disk array system, an ECC is also prepared in order to enhance the reliability. The ECC is prepared for a group of data in a row as shown in FIGS. 17A and 17B. In this disk array system, a drive for exclusively storing the ECC is not provided but the ECC is stored together with the data in the respective drives. If a fault occurs in the drive #1 (same for the drive #2, #3 or #4), the data in the drive #1 can be recovered from the data and the parity stored in the drives #2, #3 and #4.
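The RAID5 placement just described, with no dedicated parity drive, can be sketched by a hypothetical layout function: the parity block of each row is rotated across the four drives, and data blocks occupy the remaining slots. The rotation direction and block numbering here are illustrative assumptions, not taken from FIGS. 17A and 17B.

```python
def raid5_layout(rows, drives=4):
    """Build an illustrative rotated-parity layout: 'P' marks parity blocks."""
    layout = []
    k = 0                                   # running data-block number
    for r in range(rows):
        p = drives - 1 - (r % drives)       # parity position rotates each row
        row = []
        for d in range(drives):
            if d == p:
                row.append("P")             # parity stored among the data drives
            else:
                row.append(f"D{k}")
                k += 1
        layout.append(row)
    return layout

for row in raid5_layout(4):
    print(row)
# Every row holds three data blocks and one parity block, and the parity
# position shifts from row to row, so no single drive stores all parities.
```

Distributing the parity in this way avoids making one drive a hot spot for the parity updates that every write entails.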
A representative article on such disk array systems is "A Case for Redundant Arrays of Inexpensive Disks (RAID)" by D. Patterson, G. Gibson and R. H. Katz, ACM SIGMOD Conference, Chicago, Ill. (June 1988), pages 109-116. In this article, the performance and reliability of the disk array systems are discussed. The mirror system described first above is discussed in the article as a first level RAID, the parallel processing system of the divided data described secondly is discussed as a third level RAID (hereinafter RAID3), and the data distribution and parity distribution system described thirdly is discussed as a fifth level RAID (hereinafter RAID5). It is presently considered that the disk array systems described in this article are the most common disk array systems.
In the prior art disk array system of the mirror type (FIG. 16A), when a large volume of data such as data #1 to #12 is to be transferred to a cache memory, the data is normally read sequentially from the drive #1 or #2 in the order of #1, #2, #3, #4, . . . #12 and transferred to the cache memory. A data processing time Tm is given by

Tm = D/(S × 1000) + Toh (s)
where D (KB) is the volume of data to be transferred to the cache memory, S (MB/s) is the transfer rate and Toh is the overhead of the processing. Tm is equal to the time to normally process the data in one drive. As a result, a high speed transfer cannot be expected. In the normal accessing to read/write a small volume of data between the CPU and the drive, the access performance (the maximum number of read/write operations per unit time) is such that up to four read requests may be accepted in parallel by the four drives, and up to two write requests by the two pairs of two drives. Thus, the performance in transferring a large volume of data at a high speed is low, but the normal read/write processing performance between the CPU and the drive is high.
On the other hand, in the prior art disk array system of the type RAID3 (FIG. 16B), the data is divided and stored in the drives #1 to #3 and the read/write of the data is always done simultaneously for the four drives (including the one parity drive). As a result, the data in each drive is not meaningful by itself, and one complete data is not attained unless all data of the drives #1 to #3 are acquired. In this case, the transfer rate is 3 × S (MB/s) and a data processing time T3 is given by

T3 = D/(S × 1000 × 3) + Toh ≈ Tm/3 (s)
where D (KB) is a volume of data to be transferred to the cache memory and Toh is an overhead. In the RAID3, the parallel processing is performed and when a large volume of data is to be transferred, Toh may be ignored as shown in FIG. 19A. The data transfer time is approximately 1/3 of that in processing data by one drive in the prior art mirror system (FIG. 16A). Accordingly, it is effective when a large volume of data is to be sequentially transferred to the cache memory at a high speed.
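The two transfer-time formulas above can be illustrated numerically. The figures used here (D = 9000 KB, S = 3 MB/s, Toh = 0.02 s) are assumed values chosen only to show the relationship T3 ≈ Tm/3 for a large volume of data.

```python
def t_mirror(D, S, Toh):
    # Tm = D / (S * 1000) + Toh : one drive performs the whole transfer.
    return D / (S * 1000) + Toh

def t_raid3(D, S, Toh, n=3):
    # T3 = D / (S * 1000 * n) + Toh : n data drives transfer in parallel.
    return D / (S * 1000 * n) + Toh

Tm = t_mirror(9000, 3, 0.02)   # = 3.02 s
T3 = t_raid3(9000, 3, 0.02)    # = 1.02 s, roughly Tm / 3 when D is large
```

For a large D the fixed overhead Toh is negligible against the transfer term, which is why the RAID3 processing time approaches one third of the single-drive time; for a small D the overhead dominates both formulas and the parallelism gains little, as the next paragraph observes.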
However, when the CPU is to make a normal read/write to the drive, the data storage area is random, the data is of small volume and such requests are frequently issued. In the RAID3, since the drives #1 to #4 are simultaneously accessed for one data, only one read/write request may be accepted at a time although there are four drives. Although the transfer speed is improved by the parallel processing, the improvement is not very effective because the overhead occupies a large proportion of the data processing time when the data volume is small. As a result, much improvement of the performance in the normal read/write operation between the CPU and the drive cannot be expected.
In the disk array system of the RAID5, as shown in FIG. 17A, where blocks of data are stored for each drive (for example, data #1, #2, #3, #4, #5, #6, #7, #8 and #9 for the drive #1) and the data are to be sequentially transferred to the cache memory starting from the data #1, a series of data are read from the drive #1 and they are transferred to the cache memory. Thus, the time required for the data processing is equal to the time required to process the data by one drive, as it is in the prior art mirror system. However, when the entirety of the data of the logical group are to be read and/or written, high speed transfer may be attained by parallelly processing the drives #1, #2, #3 and #4.
On the other hand, as shown in FIG. 17B, where the blocks of data (data #1, #2, #3, #4, #5, #6, #7, #8 and #9) are stored across the drives and the data are to be sequentially processed, the data from the drives #1, #2, #3 and #4 are parallelly processed as they are in the RAID3 and they are transferred to the cache memory. Accordingly, in this case, if the volume of data is large, the processing time is approximately one third of the prior art mirror system in which the data is processed by one drive, as it is in the RAID3. This method of storing the data in the RAID5 is effective when a large volume of data is to be sequentially transferred to the cache memory at a high speed.
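The two data placements contrasted above can be sketched as hypothetical address-mapping functions: the FIG. 17A style stores consecutive blocks down one drive, while the FIG. 17B style stripes them across the drives so that a sequential run can be processed in parallel. Block numbering is 0-based and parity placement is ignored for simplicity; both are illustrative assumptions.

```python
def locate_contiguous(block, blocks_per_drive):
    # FIG. 17A style: fill drive #0 first, then drive #1, and so on.
    # Returns the (drive, offset) holding the given logical block.
    return block // blocks_per_drive, block % blocks_per_drive

def locate_striped(block, data_drives=3):
    # FIG. 17B style: consecutive logical blocks land on consecutive drives.
    return block % data_drives, block // data_drives

# Blocks 0, 1, 2 land on the same drive in the contiguous layout...
assert [locate_contiguous(b, 9)[0] for b in range(3)] == [0, 0, 0]
# ...but on three different drives in the striped layout.
assert [locate_striped(b)[0] for b in range(3)] == [0, 1, 2]
```

Because the striped mapping spreads a sequential run over all the data drives, the drives can be read in parallel as in the RAID3, whereas the contiguous mapping leaves a sequential run to a single drive.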
In the RAID5, when a small volume of data is to be read and/or written randomly between the CPU and the drive, up to four read requests and up to two write requests may be simultaneously accepted with the four drives, whether the data is stored in the manner shown in FIG. 17A or FIG. 17B, as they can be in the prior art mirror system. However, when the data is to be written in the RAID5, a large overhead is required for the modification of the parity. For example, in FIG. 17A, when the data #10 of the drive #2 is to be updated (or, in FIG. 17B, when the data #2 is to be updated), the old data #10 and the parity of the drive #4 are first read. A waiting time of one half revolution, on average, is required for this reading. A new parity is prepared based on the read data #10 and parity and the data #10 to be newly written, and the data #10 to be newly written and the newly prepared parity are then written into the drives #2 and #4, respectively. At this time, a waiting time of one more revolution is required. Thus, a waiting time of at least one and a half revolutions is required for the writing. In the prior art mirror system and the RAID3, the revolution waiting time for the read/write processing is one half revolution on average. The rotation of the disk is a mechanical overhead which is much larger than the other, electrical overheads. As a result, the disk revolution waiting time in the write processing is a very large overhead which causes a significant reduction in the performance when a small volume of random data is to be processed. Accordingly, in the RAID5, where the data is stored as shown in FIG. 17B, the performance in transferring a large volume of data at a high speed is high, but the processing performance is lowered when the write requests increase in the normal read/write processing between the CPU and the drive.
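The small-write sequence described above can be sketched as follows: the old data and old parity are read first, the new parity is formed from them and the new data, and both new blocks are then written back. The function names are hypothetical; the parity arithmetic is the standard XOR relation.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(old_data, old_parity, new_data):
    # new parity = old data XOR old parity XOR new data, so the row parity
    # stays consistent without reading the other data drives in the row.
    new_parity = xor(xor(old_data, old_parity), new_data)
    return new_data, new_parity

old_d, old_p = b"\x0a", b"\x1c"
new_d, new_p = raid5_small_write(old_d, old_p, b"\x33")
# The extra read of the old data and parity before the write-back is what
# costs the additional disk revolution noted in the text above.
```

Only two drives are touched per small write, but each must be read and then rewritten at the same sector, which is the source of the one-and-a-half-revolution waiting time.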
As discussed above, in the prior art mirror system, the RAID3 and the RAID5, it is not possible to simultaneously satisfy the requirement of high speed transfer of a large volume of data between the semiconductor cache memory and the drive and the requirement of improved normal read/write performance between the CPU and the drive.