1. Field of the Invention
The present invention relates to a disk array system capable of accessing a plurality of disk units in parallel to input or output data. More particularly, this invention is concerned with a disk array system that aims to improve the processing performance of write operations and to prevent a deadlock from occurring when an auxiliary disk unit is allocated in case of a fault or when a dual-port configuration is adopted.
2. Description of the Related Art
Disk units such as magnetic disk units and optical disk units, which are characterized by nonvolatile retention of stored data, a large storage capacity, and a high data transfer rate, have been widely adopted as external storage for computer systems. The demands placed on a disk unit are a high data transfer rate, excellent reliability, a large storage capacity, and low cost. A disk array system is gaining popularity because of its capacity for meeting these demands. In the disk array system, several to several tens of compact disk units are arranged in an array, and data is distributed to and recorded in the multiple disk units and then accessed in parallel.
When the disk array system transfers data to or from multiple disk units in parallel, the data transfer rate is multiplied by the number of disk units and is therefore much higher than that permitted by a single disk unit. When data is recorded with redundant information such as parity bits appended, a data error caused by the failure of a disk unit can be detected and corrected. This provides reliability at the same level as that of a duplex system, in which the contents of a disk unit are duplicated and recorded, at lower cost.
In the past, David A. Patterson et al. of the University of California at Berkeley published a paper in which disk array systems, each of which accesses many disk units to transfer a large amount of data at a high rate and has data redundancy against a disk unit failure, are classified into Levels 1 to 5 for evaluation (ACM SIGMOD Conference, Chicago, Ill., Jun. 1-3, 1988, pp. 109-116). That is to say, David A. Patterson et al. have classified redundant arrays of inexpensive disk units (RAID) into Levels 1 to 5. Levels 1 to 5 RAID will be described briefly below.
[Level 0 RAID]
FIG. 1 shows a disk array system that has no data redundancy. David A. Patterson et al. did not define a classification of Level 0 RAID; the disk array system shown in FIG. 1 shall be referred to here as Level 0 RAID. As is apparent from the illustration of data (data blocks) A to I, a disk array control unit 10 merely distributes data into disk units 32-1 to 32-3 according to input and output requests sent from a host computer 18. The disk array system does not have data redundancy against a disk unit failure.
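The distribution of data blocks A to I over the disk units 32-1 to 32-3 can be illustrated with a minimal sketch. FIG. 1 is not reproduced here, so the round-robin placement order and the function name `stripe` are assumptions made for illustration only; the actual placement in the figure may differ.

```python
def stripe(blocks, num_disks):
    """Distribute blocks over num_disks disk units in round-robin order.

    The round-robin order is an assumption; no redundancy is added,
    which is the defining property of the Level 0 arrangement.
    """
    disks = [[] for _ in range(num_disks)]
    for index, block in enumerate(blocks):
        disks[index % num_disks].append(block)
    return disks


# Data blocks A to I spread over three disk units (32-1 to 32-3).
layout = stripe(list("ABCDEFGHI"), 3)
for number, contents in enumerate(layout, start=1):
    print(f"disk unit 32-{number}: {contents}")
```

Because no block is stored twice and no parity is kept, the loss of any one list in `layout` loses the blocks it holds.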
[Level 1 RAID]
A disk array system classified as Level 1 RAID includes, as shown in FIG. 2, a mirrored disk unit 32-2 that contains copies A' to D' of data A to D stored in a disk unit 32-1. The disk array system classified as Level 1 RAID has been widely adopted despite the low use efficiency of its disk units, because it offers redundancy and availability with simple control.
[Level 2 RAID]
A disk array system classified as Level 2 RAID stripes (splits) data in bits or bytes and reads or writes the data from or to disk units in parallel. The striped data is recorded in physically the same sectors of the disk units. A Hamming code produced from the data is employed as an error-correcting code. Aside from the data disk units, a disk unit is assigned for the purpose of recording Hamming codes. The Hamming codes are checked to identify a failing disk unit, and then data is restored. Thanks to the redundancy provided by the Hamming code, correct data is preserved even if a disk unit fails. Nevertheless, the poor use efficiency of the disk units has deterred the practical application of the second level RAID system.
[Level 3 RAID]
A disk array system classified as Level 3 RAID has the configuration shown in FIG. 3. Specifically, as shown in FIG. 4, data a, b, and c are split in bits or sectors into data a1 to a3, b1 to b3, and c1 to c3 respectively. Parity data P1 is produced from the data a1 to a3, parity data P2 is produced from the data b1 to b3, and parity data P3 is produced from the data c1 to c3. These data and parity data are written in the disk units 32-1 to 32-4 in FIG. 3, which are accessed concurrently.
In the third level RAID system, data redundancy is ensured by parity data. Parallel processing of split data reduces the data write time. However, each time access is obtained for writing or reading data, all the disk units 32-1 to 32-4 must seek in parallel. The third level RAID system is effective when a large amount of data is handled continuously. However, when a small amount of data is accessed at random, for example in transaction processing, the advantage of a high data transfer rate cannot be exploited and efficiency deteriorates.
[Level 4 RAID]
A disk array system classified as Level 4 RAID splits each piece of data into sectors and writes the split data in the same disk unit as shown in FIG. 5. In the disk unit 32-1, for example, sector data a1 to a4 are written as data a. Parity data is stored in the disk unit 32-4, which is defined as a dedicated disk unit. The parity data P1 is produced from the data a1, b1, and c1. The parity data P2 is produced from the data a2, b2, and c2. The parity data P3 is produced from the data a3, b3, and c3. The parity data P4 is produced from the data a4, b4, and c4.
Data read can be executed in parallel for the disk units 32-1 to 32-3. Assuming that the data a is to be read, sectors 0 to 3 in the disk unit 32-1 are accessed to read sector data a1 to a4 sequentially. The sector data is then synthesized. In data write operations, data and parity data are first read, and then new parity data is produced. Thereafter, writing is performed. For writing data once, access must therefore be obtained four times. For example, when the sector data a1 in the disk unit 32-1 is to be updated (rewritten), the old data (a1)old in the update area and the old parity data (P1)old in the associated area in the disk unit 32-4 must be read out. New parity data (P1)new is then produced in conformity with new data (a1)new, and then written. This write operation is needed in addition to the data write for the update. Writing always involves access to the parity disk unit 32-4. Writes therefore cannot be executed for multiple disk units simultaneously. For example, even when an attempt is made to write the data a1 in the disk unit 32-1 and the data b2 in the disk unit 32-2 simultaneously, the parity data P1 and P2 must both be read from the disk unit 32-4 and new parity data must be produced and written there, so the data cannot be written in the disk units simultaneously. Level 4 RAID as defined above thus offers little merit, and few attempts have been made to put the fourth level RAID system to practical use.
[Level 5 RAID]
A disk array system classified as Level 5 RAID has no disk unit dedicated to storage of parity data, whereby parallel reading or writing is enabled. Parity data is placed, as shown in FIG. 6, in a different disk unit for each sector. The parity data P1 is produced from the data a1, b1, and c1. The parity data P2 is produced from the data a2, b2, and d2. The parity data P3 is produced from the data a3, c3, and d3. The parity data P4 is produced from the data b4, c4, and d4.
As for parallel reading or writing, since the parity data P1 and P2 are placed in different disk units, namely the disk units 32-4 and 32-3, contention does not occur, and the data a1 in the sector 0 in the disk unit 32-1 and the data b2 in the sector 1 in the disk unit 32-2 can be read or written simultaneously. The overhead of obtaining access four times for a write is identical to that in Level 4 RAID. In Level 5 RAID, multiple disk units can be accessed asynchronously to execute read or write operations. The fifth level RAID system is therefore desirable for transaction processing in which a small amount of data is accessed at random.
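The difference in write contention between Level 4 and Level 5 RAID can be sketched as follows. The text places P1 in the disk unit 32-4 and P2 in the disk unit 32-3; the positions of P3 and P4 are inferred from which data letters are absent from each parity group, so the rotation formula below is an assumption. The function names and the 0-based disk indices (0 for 32-1 through 3 for 32-4) are hypothetical.

```python
def level4_parity_disk(sector, num_disks=4):
    # Level 4: parity always lives on the last, dedicated disk unit (32-4).
    return num_disks - 1


def level5_parity_disk(sector, num_disks=4):
    # Level 5 (assumed rotation): sector 0 -> 32-4, sector 1 -> 32-3, and so on.
    return (num_disks - 1 - sector) % num_disks


def writes_can_overlap(write_a, write_b, parity_disk):
    """Each write is (data_disk_index, sector).

    Two writes can proceed simultaneously only if the disk units they
    touch (data disk plus parity disk) do not intersect.
    """
    disks_a = {write_a[0], parity_disk(write_a[1])}
    disks_b = {write_b[0], parity_disk(write_b[1])}
    return disks_a.isdisjoint(disks_b)


write_a1 = (0, 0)  # data a1: disk unit 32-1, sector 0
write_b2 = (1, 1)  # data b2: disk unit 32-2, sector 1

print(writes_can_overlap(write_a1, write_b2, level4_parity_disk))
print(writes_can_overlap(write_a1, write_b2, level5_parity_disk))
```

Under these assumptions the two writes collide on the dedicated parity disk unit in Level 4 but touch four different disk units in Level 5.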
In Level 4 or 5 RAID, parity-data production done in the course of data write will be described below. In a disk array system containing redundant information (redundant data block), data blocks existent in corresponding storage locations in multiple disk units are exclusive-ORed according to the expression (1). A parity data is thus produced, and then placed in a parity-data storage disk unit.
Data a (+) Data b (+) . . . = Parity data P (1)
where (+) denotes exclusive OR.
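Expression (1) can be illustrated with a short sketch that exclusive-ORs equal-length data blocks byte by byte. The function name `parity` and the sample block values are assumptions chosen for illustration.

```python
from functools import reduce


def parity(blocks):
    """Exclusive-OR equal-length blocks column by column, as in expression (1)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Hypothetical two-byte data blocks held at the same location in three disk units.
data_a, data_b, data_c = b"\x0f\x01", b"\x33\x02", b"\x55\x04"
parity_p = parity([data_a, data_b, data_c])

# Consistency check: XOR of every data block and the parity block is all zeros.
consistent = parity([data_a, data_b, data_c, parity_p]) == bytes(len(parity_p))
print(parity_p.hex(), consistent)
```

The consistency check reflects the property that makes the parity useful: any one block equals the exclusive OR of all the others.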
In Level 4 RAID, the save area of parity data is fixed to a specific disk unit or the disk unit 32-4 in FIG. 5. In Level 5 RAID, as shown in FIG. 6, parity data is distributed to the disk units 32-1 to 32-4. This eliminates the congestion of access to a specific disk unit resulting from parity-data read or write. As for the data read in Level 4 or 5 RAID, since the data in the disk units 32-1 to 32-4 are not rewritten, the consistency of a parity data is maintained. For data writing, however, a parity data must be changed according to data. For example, when the old data (a1)old in the disk unit 32-1 is rewritten into new data (a1)new, the parity data P1 must be updated according to the expression (2). Thus, the parity data can remain consistent with the whole of the data in disk units.
Old data (+) Old parity data (+) New data = New parity data (2)
However, in a conventional disk array system classified as Level 4 or 5 RAID, as is apparent from the expression (2), when data write is executed, old data is read from the disk unit to be written, old parity data is read from the area in the parity-data storage disk unit corresponding to the write-scheduled area, and then new parity data is worked out. Thereafter, the new data and the new parity data are written in the respective disk units. For writing data once, reading and writing must each be performed twice. In other words, access must be obtained four times. This prolongs the processing time, so improved performance cannot be expected from the disk array system.
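The four accesses required for one data write can be traced with a minimal sketch that applies expression (2). The `Disk` class, its access counter, and the function `update_block` are hypothetical constructs for illustration and do not represent any particular controller implementation.

```python
def xor(a, b):
    """Byte-wise exclusive OR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class Disk:
    """Toy disk unit that counts every read and write access."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.accesses = 0

    def read(self, sector):
        self.accesses += 1
        return self.blocks[sector]

    def write(self, sector, block):
        self.accesses += 1
        self.blocks[sector] = block


def update_block(data_disk, parity_disk, sector, new_data):
    old_data = data_disk.read(sector)                      # access 1: read old data
    old_parity = parity_disk.read(sector)                  # access 2: read old parity
    new_parity = xor(xor(old_data, old_parity), new_data)  # expression (2)
    data_disk.write(sector, new_data)                      # access 3: write new data
    parity_disk.write(sector, new_parity)                  # access 4: write new parity
    return new_parity


# Hypothetical one-byte blocks a1, b1, c1 and their parity P1.
b1, c1 = b"\x33", b"\x55"
disk_a = Disk([b"\x0f"])
disk_p = Disk([xor(xor(b"\x0f", b1), c1)])

new_p1 = update_block(disk_a, disk_p, 0, b"\xf0")
print(new_p1 == xor(xor(b"\xf0", b1), c1), disk_a.accesses + disk_p.accesses)
```

The updated parity matches a full recomputation over the new a1, b1, and c1, although the disk units holding b1 and c1 are never touched.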
In the disk array system classified as Level 3 RAID, which is shown in FIG. 3, data is split in the direction in which disk units are lined up (i.e. across the disk units) and written in the disks in parallel. It is therefore unnecessary to read old data and an old parity data from disk units. A new parity data can be produced by calculating split data. Compared with Level 4 or 5 RAID, Level 3 RAID therefore offers a shorter write time. In Level 3 RAID, however, all disk units must be accessed in parallel for writing. Level 3 RAID is therefore undesirable for transaction processing in which disk units must be read or written individually to handle a large amount of data.
In a disk array system having data redundancy, k disk units for storing data and m disk units for storing redundant information relevant to the stored data, such as parity data, are integrated into a disk array. The k+m disk units in the disk array configuration, made up of data storage disk units and parity-data storage disk units, are referred to generically as a rank. To prevent a system from stopping due to a failure in a disk unit in a rank, at least one auxiliary disk unit must be included in the rank. If any of the data storage and parity-data storage disk units included in a rank of a disk array fails, an auxiliary disk unit is allocated in place of the failing disk unit. After the allocation, the associated data and parity data are read from the remaining data storage and parity-data storage disk units and, for example, exclusive-ORed. The data stored in the failing disk unit can thus be restored and saved in the auxiliary disk unit. The failing disk unit is replaced with a new one by a maintenance engineer. After the replacement, the data saved in the auxiliary disk unit is restored to the original, repaired disk unit.
However, in the foregoing conventional disk array system, the auxiliary disk unit is fixed. When a disk unit is recovered from a failure, the restored data must be returned from the auxiliary disk unit, which was allocated temporarily to save the restored data, to the recovered disk unit. Restoring the data of a failing disk unit is therefore time-consuming.
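The restoration of a failing disk unit onto an auxiliary disk unit can be sketched as follows, assuming a rank with k = 3 data storage disk units and m = 1 parity-data storage disk unit. The function names and the sample block values are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Exclusive-OR equal-length blocks column by column."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_onto_auxiliary(rank, failed_index):
    """Return the contents to save in the auxiliary disk unit.

    rank is a list of disk units, each a list of equal-length blocks
    (data or parity). Each lost block is the exclusive OR of the blocks
    at the same location in the surviving disk units.
    """
    survivors = [disk for i, disk in enumerate(rank) if i != failed_index]
    return [xor_blocks(same_location) for same_location in zip(*survivors)]


# Three data disk units with two sectors each, plus one parity disk unit.
data_disks = [[b"\x0f", b"\x01"], [b"\x33", b"\x02"], [b"\x55", b"\x04"]]
parity_unit = [xor_blocks(same_location) for same_location in zip(*data_disks)]
rank = data_disks + [parity_unit]

auxiliary = rebuild_onto_auxiliary(rank, failed_index=1)
print(auxiliary == rank[1])
```

The same routine restores a failed parity-data storage disk unit, since the parity block is itself the exclusive OR of the data blocks.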
The present inventors have discovered that, in a disk array system made up of multiple ranks, a deadlock may occur during disk writing if the access path to each disk unit has a dual-port structure.
Applicants have discovered why a deadlock occurs during writing in, for example, Level 5 RAID, and such discovery will be described below. In Level 5 RAID, the exclusive-OR of data and a parity data in disk units is calculated according to the expression (1) to produce a parity data which is then saved in a disk unit.
Data a (+) Data b (+) . . . = Parity data P1 (1)
As for the save areas of data and parity data, as shown in FIG. 6, the parity data P1 to P4 are distributed to the disk units 32-1 to 32-4 so as to prevent the congestion of access to a single disk unit resulting from parity-data read or write (parity-data update). When it comes to the data read in Level 5 RAID, since data in disks are not rewritten, the consistency of a parity data is maintained. For writing, however, a parity data must be changed in conformity with data.
During data updating, in which old data in a certain disk unit is rewritten into new data, the calculation based on the expression (2) must be performed to update the parity data so that the parity data will be consistent with the new data. Thus, the parity data remains consistent with the whole of the data in the disk units.
Old data (+) Old parity data (+) New data = New parity data (2)
As is apparent from the expression (2), data write requires reading old data and old parity data from disk units. Since new data is written in the area from which the old data was read, the disk must rotate by one turn before the write can be executed. This is time-consuming. For writing parity data, new parity data must be produced according to the expression (2). The parity-data write must therefore be placed in the wait state until the old data is read from the disk unit in which data is to be written.
The flowchart of FIG. 7 shows the processing operations done by a disk array system of the fifth RAID level. In FIG. 7, the processing operations done by a data storage disk unit and a parity-data storage disk unit, which are associated with the processing done by a disk array control unit, are shown side by side.
Next, applicants' discovery of why a deadlock occurs will be described. Described first is how a disk array system having a single-port configuration acts in response to two write requests (transactions) sent from a host computer. FIG. 8 shows a disk array system including four disk arrays 46 or ranks and having a single-port configuration that is defined with only one disk array control unit 10. Specifically, the disk array system comprises disk units 32-1 to 32-20, interfaces 234-1 to 234-5, and ranks 48-1 to 48-4 each of which is regarded as an array unit.
Assuming that the two disk units 32-4 and 32-17, which are hatched in FIG. 8, are in use, a transaction 1 is submitted to the disk array control unit 10 as an update instruction intended to update data D1 in the disk unit 32-9 and parity data P1 in the disk unit 32-7. Immediately after the transaction 1, a transaction 2 is submitted as an update instruction intended to update data D2 in the disk unit 32-7 and parity data P2 in the disk unit 32-9. The disk units 32-4 and 32-17, and the interfaces 234-2 and 234-4, are in use. Until the disk units are released, the instructions of the transactions 1 and 2 are placed in a queue. When the disk units 32-4 and 32-17 are released, the transaction 1 is granted the use authorities of the disk units 32-7 and 32-9 via the interfaces 234-2 and 234-4.
With the D1 update instruction of the transaction 1, the old data D1 is read from the disk unit 32-9, new data D1 is written therein, and then the disk unit 32-9 is released. With the P1 update instruction of the transaction 1, the old parity data P1 is read from the disk unit 32-7. In this state, when reading the old data D1 is completed, new parity data P1 is produced according to the expression (2). The new parity data P1 is then written in the disk unit 32-7. Thereafter, the disk unit 32-7 is released. The transaction 1 accesses the disk units 32-7 and 32-9 concurrently. After the processing of the transaction 1 is completed, the D2 update instruction and P2 update instruction of the transaction 2 are handled similarly. In the single-port configuration, the succeeding transaction 2 will never use disk units before the preceding transaction 1. A deadlock will therefore never occur.
FIG. 9 shows a disk array system having a dual-port configuration. Two disk array control units 10-1 and 10-2 are included. Interfaces 234-1 to 234-5 and 236-1 to 236-5 serve as two-system access paths. Compared with the single-port configuration in FIG. 8, the dual-port configuration offers a double throughput in theory. When contention occurs because access requests are made for the same disk unit by the disk array control units 10-1 and 10-2, either of the control units that succeeds in obtaining the use authority is enabled to use the disk unit exclusively, while the other control unit waits until the disk is released at the termination of the previous access.
The present inventors' discovery of a deadlock occurring in the dual-port configuration in FIG. 9 will be described below. Suppose that the disk array control unit 10-1 is using the disk unit 32-4 and the disk array control unit 10-2 is using the disk unit 32-17. The transaction 1 is submitted to the disk array control unit 10-1, and the transaction 2 is submitted to the disk array control unit 10-2 immediately after the transaction 1. The preceding transaction 1 has a D1 update instruction for updating the data D1 in the disk unit 32-9 and also has a P1 update instruction for updating the parity data P1 in the disk unit 32-7. The transaction 1 is placed in a queue. The succeeding transaction 2 has a D2 update instruction for updating the data D2 in the disk unit 32-7 and also has a P2 update instruction for updating the parity data P2 in the disk unit 32-9. The transaction 2 is also placed in a queue.
Since the interface 234-2 for interfacing the disk array control unit 10-1 with the disk unit 32-7 is available, the use authority of the disk unit 32-7 can be obtained with the P1 update instruction and the parity data P1 can be read immediately. Since the interface 234-4 for interfacing the disk array control unit 10-2 with the disk unit 32-9 is also available, the use authority of the disk unit 32-9 can be obtained with the P2 update instruction and the parity data P2 can be read immediately. After reading the parity data P1 and P2 is completed, the transactions 1 and 2 retain the exclusive use of the disk units 32-7 and 32-9 respectively in order to write new parity data. The state shown in FIG. 10 is thus set up.
Using the D1 and D2 update instructions, the transactions 1 and 2 attempt to read the old data D1 and D2, which have not been updated, and to produce new parity data. However, even when the transaction 1 attempts to access the disk unit 32-9 for the D1 update instruction, the access fails because the transaction 2 is using the disk unit 32-9 exclusively for the P2 update instruction. Likewise, even when the transaction 2 attempts to access the disk unit 32-7 for the D2 update instruction, the access fails because the transaction 1 is using the disk unit 32-7 exclusively for the P1 update instruction. In other words, the transactions 1 and 2 each use a disk unit exclusively to update parity data, issue a use request for the disk unit that the partner is currently using exclusively, and wait for the partner to terminate. An event in which neither disk unit is released and processing is suspended, that is, a deadlock, may therefore occur. In short, when interrupt-disabled disk units, each of which is used to handle one job at a time, are placed in a cyclic wait state, a deadlock occurs.
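The cyclic wait state described above can be checked with a minimal sketch that follows "waits for" links between transactions. The function `find_deadlock`, the dictionary layout, and the simplification that each transaction waits for a single disk unit are assumptions made for illustration.

```python
def find_deadlock(holds, waits_for):
    """Return the transactions forming a cyclic wait, or [] if there is none.

    holds:     transaction -> set of disk units it uses exclusively
    waits_for: transaction -> the one disk unit it is requesting
    """
    owner = {disk: txn for txn, disks in holds.items() for disk in disks}
    for start in waits_for:
        chain = []
        txn = start
        # Follow the chain: txn waits for a disk unit, whose owner may wait in turn.
        while txn in waits_for and txn not in chain:
            chain.append(txn)
            txn = owner.get(waits_for[txn])
        if txn == start:
            return chain  # the chain closed on itself: cyclic wait
    return []


# Dual-port state corresponding to the description of FIG. 10.
dual_port = find_deadlock(
    holds={"transaction 1": {"32-7"}, "transaction 2": {"32-9"}},
    waits_for={"transaction 1": "32-9", "transaction 2": "32-7"},
)

# Single-port case: transaction 1 holds both disk units, transaction 2 only waits.
single_port = find_deadlock(
    holds={"transaction 1": {"32-7", "32-9"}},
    waits_for={"transaction 2": "32-7"},
)
print(dual_port, single_port)
```

In the dual-port state each transaction holds the disk unit the other is requesting, so the chain closes on itself; in the single-port state the chain ends at transaction 1, which is not waiting for anything.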