The present invention relates to a storage unit subsystem in a computer system, and more particularly to the realization of high performance and high reliability of a storage unit subsystem having a cache memory.
A disk system, which uses a magnetic disk as a storage medium, is one of the storage unit subsystems used in a computer system. A known technique for achieving high performance and high reliability in such a disk system is the disk array system disclosed in D. Patterson et al: A Case for Redundant Arrays of Inexpensive Disks (RAID), ACM SIGMOD conference proceedings, Chicago, Ill., Jun. 1-3, 1988, pp. 109-116 (hereinafter referred to as Patterson's article). In the disk array system, one logical disk unit as seen by a host computer is realized as a plurality of physical disk units to attain high performance. Further, to attain high reliability, redundant data is stored in a disk unit different from the disk unit in which the data itself is stored; when a fault occurs in the disk unit storing the data, the redundant data allows that data to be recovered.
Patterson's article discloses several techniques, described below, for arranging redundant data in the disk array system. In a first data arrangement method, data of identical content is stored in two disk units; this is called RAID1 or dual writing. In a second data arrangement method, a record, which is the set of data forming the read/write unit when a host unit conducts a read/write process with a logical disk unit, is divided and stored across a plurality of disk units. This data arrangement method is called RAID3. In RAID3, redundant data is generated from the respective portions into which the record is divided. In a third data arrangement method, the record is not divided as it is in RAID3; instead, one record is stored in one disk unit and redundant data is generated from a plurality of records stored in separate disk units. Such data arrangement methods include those called RAID4 and RAID5.
A record which stores data directly read and written by a host computer is usually called a data record, and a record which stores redundant data is usually called a parity record. The unit of data after which placement switches from one disk unit to the next is called a stripe. A stripe is a set of records; a stripe comprising data records is called a data stripe and a stripe comprising parity records is called a parity stripe. Usually, in the disk array system, n (not smaller than 1) parity stripes are generated from m (not smaller than 1) data stripes. A set of m+n stripes is hereinafter called a parity group. The m+n stripes are each stored in a separate disk unit. When the number of parity stripes in the parity group is n, the data can be recovered even if faults occur in up to n of those disk units.
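The relationship between data stripes, the parity stripe and recovery can be illustrated with a minimal sketch. It assumes the simplest case of a single parity stripe (n = 1) computed as a byte-wise exclusive OR, as in RAID4 and RAID5; the function names `xor_blocks`, `make_parity_group` and `recover` are hypothetical and chosen only for this illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity_group(data_stripes):
    """Return the m data stripes followed by one XOR parity stripe (n = 1)."""
    return list(data_stripes) + [xor_blocks(data_stripes)]


def recover(group, lost_index):
    """Rebuild the stripe at lost_index from the m surviving stripes."""
    survivors = [s for i, s in enumerate(group) if i != lost_index]
    return xor_blocks(survivors)


# Three data stripes, each imagined to reside on a separate disk unit.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
group = make_parity_group(data)

# Simulate a fault in the disk unit holding stripe 1 and rebuild it.
rebuilt = recover(group, 1)
```

With a single XOR parity stripe, any one lost stripe, data or parity, equals the XOR of the remaining stripes, which is why one disk fault can be tolerated. Tolerating n > 1 faults needs a stronger code than plain XOR and is outside this sketch.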
When a data record is updated, the parity record must also be updated accordingly. For example, when only one data record in a parity group is updated, a new value of the parity record may be generated from the updated content, the old value of the data record and the old value of the parity record. The following techniques have been known for conducting the update process of the parity record efficiently so as to realize high performance of the system.
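For XOR parity, the update of a single data record described above reduces to one expression: new parity = old parity XOR old data XOR new data. The following minimal sketch assumes single XOR parity; `xor_bytes` and `updated_parity` are hypothetical helper names.

```python
def xor_bytes(*blocks):
    """Byte-wise XOR of any number of equally sized byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def updated_parity(old_parity, old_data, new_data):
    """New parity after one data record changes from old_data to new_data."""
    return xor_bytes(old_parity, old_data, new_data)


# A parity group with three data records and one parity record.
records = [b"\x01", b"\x02", b"\x04"]
parity = xor_bytes(*records)

# Update the second record without touching the other data records.
new_record = b"\x08"
new_parity = updated_parity(parity, records[1], new_record)
```

The result equals the parity recomputed from all data records, but only the old data record and the old parity record need to be read, which is the source of the read and write requests discussed for the update process.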
For example, in a technique disclosed in JP-A-4-245342, the update process for the data record is executed on a cache, and the generation of a new value of the parity record and the writing of the data record and the parity record into the disk units are executed later. When a plurality of write processes occur for data records in one parity group before the generation of the new value of the parity record is started, the generation of the new value of the parity record corresponding to the plurality of write processes is executed collectively so that high performance is attained.
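The benefit of deferring parity generation can be sketched as follows. This is only an illustration of the general idea under stated assumptions (single XOR parity, one cache object per parity group); it is not the implementation of JP-A-4-245342, and the class name `ParityGroupCache` and its methods are hypothetical.

```python
class ParityGroupCache:
    """Holds pending writes for one parity group until destage time."""

    def __init__(self):
        self.pending = {}  # data record index -> newest value written by the host

    def write(self, index, data):
        """Absorb a host write in the cache; no parity work is done here."""
        self.pending[index] = data

    def destage(self, old_data, old_parity):
        """Generate the new parity once for all pending writes.

        old_data maps each pending index to the value currently on disk.
        Returns (records to write, new parity record).
        """
        parity = bytearray(old_parity)
        for index, new in self.pending.items():
            old = old_data[index]
            for i in range(len(parity)):
                parity[i] ^= old[i] ^ new[i]
        writes = dict(self.pending)
        self.pending.clear()
        return writes, bytes(parity)


cache = ParityGroupCache()
cache.write(0, b"\x10")
cache.write(2, b"\x20")
cache.write(0, b"\x11")  # overwrites the earlier pending write to record 0

# On-disk state before destage: records 0x01, 0x02, 0x04 and parity 0x07.
writes, new_parity = cache.destage({0: b"\x01", 2: b"\x04"}, b"\x07")
```

Three host writes result in a single parity generation and a single parity write, and the overwritten intermediate value of record 0 never reaches the disk units.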
PCT WO 91/20025 discloses a dynamic mapping technique for efficiently executing a write process in the disk array. In this technique, when a write process occurs, the location on a disk unit at which the data record is to be written is altered. Specifically, a parity group is formed from only the written data, parity data is generated from that data, and the parity group is written into the disks of the disk array. In order to execute the above process, however, it is necessary that the data stripes of an entire parity group be empty.
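The dynamic mapping idea can be sketched as below. This is a simplified illustration, not the method of PCT WO 91/20025: it assumes single XOR parity, writes exactly m records at a time, and omits the reclamation of stale locations into empty parity groups. The class name `DynamicMappingArray` and its members are hypothetical.

```python
from functools import reduce


class DynamicMappingArray:
    """Writes new data into an empty parity group and remaps record locations."""

    def __init__(self, num_groups, m):
        self.m = m                          # data stripes per parity group
        self.groups = [None] * num_groups   # None marks an empty parity group
        self.location = {}                  # record id -> (group, slot)

    def write_batch(self, records):
        """records: list of exactly m (record_id, data) pairs."""
        if len(records) != self.m:
            raise ValueError("a full parity group of m records is required")
        group = self.groups.index(None)     # raises ValueError if none is empty
        data = [d for _, d in records]
        # Parity is generated from the new data only; no old values are read.
        parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data))
        self.groups[group] = data + [parity]
        for slot, (record_id, _) in enumerate(records):
            self.location[record_id] = (group, slot)  # any old location is now stale
        return group

    def read(self, record_id):
        group, slot = self.location[record_id]
        return self.groups[group][slot]


array = DynamicMappingArray(num_groups=2, m=2)
first = array.write_batch([("a", b"\x01"), ("b", b"\x02")])
second = array.write_batch([("a", b"\x05"), ("c", b"\x06")])
```

After the second batch, record "a" is read from its new location, and its former slot in the first parity group holds stale data. Turning such stale slots back into wholly empty parity groups is the requirement noted in the text.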
On the other hand, Technical Research Report of the Institute of Electronics, Information and Communication Engineers, DE93-45 "Evaluation of Performance of RAID5 Disk Arrays with Virtual Striping", by Mogi et al, September 1993, Technical Report Vol. 25, No. 251, pp. 69-75 (hereafter referred to as Mogi's article) discloses a more efficient technique. In that article, the location on the disks of the parity group itself is dynamically altered so that the data stripes of an entire parity group can be made empty more efficiently.
Further, JP-A-5-46324 discloses a technique in which the information necessary for updating a parity record is transferred to the disk unit in which the parity record is stored, and the new value of the parity record is generated in that disk unit. This reduces the number of data transfers between a control unit and the disk unit that occur when data is updated, so that high performance of the disk array system is attained.
On the other hand, JP-A-4-230512 discloses a technique which secures empty records in the data stripe and the parity stripe on the disk unit at an appropriate proportion and writes a new value of the parity record into the empty area rather than the original location, thereby reducing the write time. The empty area described here is different from those disclosed in PCT WO 91/20025 and Mogi's article. In the technique disclosed in JP-A-4-230512, when the new values of the parity record and the data record are written after the generation of the new value of the parity record is completed, they are written into the empty area rather than the original location. Thus, the original storage location of the parity record becomes empty. In the techniques disclosed in PCT WO 91/20025 and Mogi's article, by contrast, a new parity group is formed from only the new write data, in an area in which an entire parity group is empty, and the entire parity group is written into the disk units. The location at which the data was originally stored is managed as an empty area.
Accordingly, there are the following two essential differences between those techniques:
(1) In the technique disclosed in JP-A-4-230512, the empty area is selected after the new value of the parity record is generated, whereas in the technique disclosed in PCT WO 91/20025 or Mogi's article, the particular location on the disk unit at which the data is written is determined before the new value of the parity record is generated, that is, when the parity group to be written is selected.
(2) In the technique disclosed in JP-A-4-230512, the data records and the parity records belonging to one parity group do not change. On the other hand, in the technique disclosed in PCT WO 91/20025 or Mogi's article, the set of data records in one parity group changes dynamically.
In the technique disclosed in JP-A-4-230512, the control unit which controls the disk array issues a total of four requests to the disk units: a data record read request, a data record write request, a parity record read request and a parity record write request. The control unit also generates the new value of the parity record. For example, assume that the reading of an old parity record from a storage medium has been completed in a disk unit while an old data record and a new data record are ready in the control unit. The disk unit then attempts to send the old parity record read from the storage medium. However, the data transfer path between the control unit and the disk unit is usually shared with other disk units, so the data transfer path may be occupied and the sending of the old parity record may not start immediately. Further, since the control unit may be executing other jobs, a long time may elapse between the receipt of the old parity record and the start of the generation of the new value of the parity record. Moreover, to send the new value of the parity record to the disk unit after its generation is completed, the control unit needs to secure common resources, for example the disk unit or the data transfer path. If a conflict occurs, the control unit must wait.
It is thus seen that the method of generating the new value of the parity record in the control unit and sending the new value to the disk unit may require a considerable time from the reading of the old value to the writing of the new value. Thus, even if an empty record is present immediately after the old value, the read/write head of the disk unit passes over the empty record before the new value is sent to the disk unit, and an improvement in performance is difficult to attain. Thus, in the technique of JP-A-4-230512, even if the area immediately following the read area is an empty area, it is, in principle, difficult to write the new value of the data record in that empty area.
On the other hand, in the technique disclosed in JP-A-5-46324, the disk unit is provided with a function to generate the new value of the parity record, so that part of the load of the control unit is distributed to the disk unit to some extent. However, a problem remains in reducing the load of the control unit. Further, since the new value of the parity record is written into the location from which the old value was read, a time of at least one disk rotation is required from the reading of the old value to the writing of the new value.
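The cost of one disk rotation can be put into numbers with a short calculation. The rotational speeds used below are illustrative assumptions only and do not come from any of the cited documents; the function names are hypothetical.

```python
def rotation_time_ms(rpm):
    """Time for one full disk rotation, in milliseconds."""
    return 60_000.0 / rpm


def in_place_update_delay_ms(rpm, rotations=1):
    """Minimum wait between reading an old value and rewriting the same location."""
    return rotations * rotation_time_ms(rpm)


# Assumed example speeds, for illustration only.
delay_3600 = in_place_update_delay_ms(3600)
delay_6000 = in_place_update_delay_ms(6000)
```

Under these assumed speeds, an in-place parity update waits on the order of ten or more milliseconds for the original location to come back under the head, regardless of how quickly the new value is generated. Writing to an empty record that follows the old value could avoid that wait only if the new value is ready before the head reaches it.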