The present invention relates to a technique for controlling a disk array, and more particularly to a technique for enhancing the efficiency and reliability of a process for generating and writing redundant data to a disk array storage device.
In 1987, David A. Patterson, et al. reported a technique for saving redundant data in storage devices. See David A. Patterson, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” University of California at Berkeley, Computer Science Division Report UCB:CSD-87-391 (December 1987); see also David A. Patterson, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” ACM SIGMOD conference proceeding, Chicago, Ill, Jun. 1-3, 1988, pp. 109-116. This technique is based on a generalized method of concatenating multiple physical storage devices into one logical storage unit. When a high level device (e.g., host computer or dedicated controller) writes user data to this type of logical storage unit, the data may be divided into a number of parts corresponding to the number of physical storage devices. At the same time, redundant data may also be generated and stored among the several physical storage devices. In the event that one of the physical storage devices fails, the stored redundant data can facilitate the recovery of stored user data. The Patterson document described the following two methods for generating redundant data.
The first method for generating redundant data, referred to herein as the “method of read and modify,” generates redundant data from the combination of three sources: (1) write data from a high level device, (2) previous data stored in a storage device where the write data is to be stored, and (3) previous redundant data stored in a storage device where newly generated redundant data is to be stored.
Assuming that the write data is divided into α partitions, this method of generating redundant data requires α read operations to access previously stored data, α write operations to store the updated write data, one read operation to retrieve the previous redundant data, and one write operation to store the updated redundant data. To generate and store redundant data in this example, (α+1) read operations and (α+1) write operations are required, totaling (2α+2) input and output operations. If the use of redundant data is unnecessary, only α (write) operations would be required to store the data. This means that, using this method of generating redundant data, an additional (α+2) input and output operations are required.
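The operation counts for the “method of read and modify” can be tabulated with a short sketch. The function name `read_modify_io` and the dictionary keys are hypothetical and chosen only for illustration; the arithmetic follows the counts stated above.

```python
def read_modify_io(alpha: int) -> dict:
    """Count storage-device I/O for the "method of read and modify".

    alpha is the number of partitions of write data (hypothetical helper).
    """
    if alpha < 1:
        raise ValueError("alpha must be at least 1")
    reads = alpha + 1    # alpha reads of previous data + 1 read of previous redundant data
    writes = alpha + 1   # alpha writes of new data + 1 write of new redundant data
    total = reads + writes          # 2*alpha + 2
    extra = total - alpha           # overhead versus alpha plain writes: alpha + 2
    return {"reads": reads, "writes": writes, "total": total, "extra": extra}
```

For a single partition (α=1) this gives two reads and two writes, four operations in total, of which three are attributable to redundancy.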
The second method for generating redundant data, referred to herein as the “method of all stripes,” generates redundant data from partitioned write data received from the high level device and from previous data read from the storage devices. In this method, however, the devices which store redundant data do not read previously-stored redundant data.
With this method, assume again that the write data is divided into α partitions. Assume also that the number of storage devices, except those for saving redundant data, is β, and further that α ≤ β. In this method, then, the total number of input operations to and output operations from the storage devices is (β+1), wherein the number of input operations is (β−α) and the number of output operations, including those containing redundant data, is (α+1). If redundant data is not necessary, only α (write) operations would be required to store the data. This means that the additional number of input and output operations required is (β+1−α), when redundant data is generated by this second “method of all stripes.”
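The corresponding counts for the “method of all stripes” can be sketched in the same way. The function name `all_stripes_io` is hypothetical; the formulas are the ones stated above, with α partitions of write data and β storage devices excluding those that hold redundant data.

```python
def all_stripes_io(alpha: int, beta: int) -> dict:
    """Count storage-device I/O for the "method of all stripes".

    alpha: partitions of write data; beta: data storage devices in the group.
    Hypothetical helper for illustration only.
    """
    if not 1 <= alpha <= beta:
        raise ValueError("require 1 <= alpha <= beta")
    reads = beta - alpha   # previous data on the partitions that are NOT being updated
    writes = alpha + 1     # alpha writes of new data + 1 write of new redundant data
    total = reads + writes          # beta + 1
    extra = total - alpha           # overhead versus alpha plain writes: beta + 1 - alpha
    return {"reads": reads, "writes": writes, "total": total, "extra": extra}
```

With β=4 and α=2, for example, the sketch gives two reads and three writes, five operations in total.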
Apart from the foregoing methods, a method for generating redundant data in storage devices has been disclosed in U.S. Pat. No. 5,613,088, wherein the storage device uses two heads for simultaneous reading and writing. Specifically, the read head and the write head are fixed on a common actuator. During the data update process, the read head reads existing parity data and then the write head follows, updating the same area with new parity data generated, in part, from the old parity data.
The foregoing two methods for generating redundant data, that is, the “method of read and modify” and the “method of all stripes,” increase the number of input and output (I/O) operations associated with storing data on a disk. This means that the disk control device with redundancy is inferior in performance to the disk control device without redundancy. Hence, according to the present invention, a conventional disk control device with redundancy is made to selectively employ the redundant data generation method that results in a smaller number of I/O operations to and from the storage device. This selection makes it possible to reduce the burden on the storage device and thereby improve the processing speed. Specifically, in the case of (α ≥ (β−1)/2), the “method of all stripes” will use no more storage device I/O operations than the “method of read and modify” (the two are equal when α = (β−1)/2), while in the case of (α < (β−1)/2), the “method of read and modify” will use a smaller number. Therefore, if the length of the write data received from the high level device is in the range of (α < (β−1)/2), for example, in the case of a short transaction process, a disk control device that is configured to use the present invention will select the “method of read and modify” to generate redundant data.
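The selection rule follows from comparing the two totals, (β+1) for the “method of all stripes” and (2α+2) for the “method of read and modify”: β+1 ≤ 2α+2 exactly when α ≥ (β−1)/2. A minimal sketch is given here, in which the function name `choose_method` and the returned strings are hypothetical, and a tie is resolved in favor of the “method of all stripes” as an assumption.

```python
def choose_method(alpha: int, beta: int) -> str:
    """Pick the redundant-data generation method with fewer device I/O operations.

    alpha: partitions of write data; beta: data storage devices in the group.
    Hypothetical helper; ties go to "all stripes" by assumption.
    """
    if not 1 <= alpha <= beta:
        raise ValueError("require 1 <= alpha <= beta")
    read_modify_total = 2 * alpha + 2   # total I/O of the "method of read and modify"
    all_stripes_total = beta + 1        # total I/O of the "method of all stripes"
    # Equivalent to the condition alpha >= (beta - 1) / 2.
    if all_stripes_total <= read_modify_total:
        return "all stripes"
    return "read and modify"
```

With β=8 the threshold (β−1)/2 is 3.5, so a three-partition write selects “read and modify” (8 operations against 9) and a four-partition write selects “all stripes” (9 operations against 10).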
The number of I/O operations using the “method of read and modify” is minimized at four (4) when α=1. This means that when α=1, performance cannot be improved further unless the method of processing is reconsidered. The problem with the “method of read and modify” is essentially based on the fact that two I/O operations must be issued to the storage device for each partition of data. With each I/O operation there is significant overhead associated with such factors as movement of the head and rotational latency. This mechanical overhead is a great bottleneck on disk control devices.
The method disclosed in U.S. Pat. No. 5,613,088 makes it possible to generate redundant data in a storage device configured with a read head and a write head mounted on a single actuator. Expanding this method to a general storage device provided with a single read-write head, the resulting method, referred to herein as the method of “generation in a drive,” employs the following steps. First, the data to be written to the disk drive device (the “write data”) and the existing data that will eventually be updated with the “write data” (the “data before update”) are transferred to the actual physical storage device that is responsible for generating and storing the redundant data. Within this redundant data physical storage device, the existing redundant data that will be updated (the “redundant data before update”) is read, and the new redundant data is generated from the combination of the “write data,” the “data before update,” and the “redundant data before update.” In this method of “generation in a drive,” the head is positioned to read the “redundant data before update,” and the updated redundant data is calculated, and when the disk reaches the next writing position, the write operation is started and the updated redundant data is stored. This operation makes it possible to avoid the spinning on standby that normally occurs during the interval between reading and writing, and merely requires one movement of the head and one standby spin. As a result, if the length of the data from the high level device is short, the processing speed of the control device can be improved further.
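The combination of the three sources can be illustrated with a short sketch. The text does not state which operation combines them; the sketch assumes bytewise exclusive-OR parity, as is commonly used in RAID arrangements, and the function names `xor_blocks` and `generate_in_drive` are hypothetical.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks (XOR parity is an assumption)."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(length)
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def generate_in_drive(write_data: bytes,
                      data_before_update: bytes,
                      redundant_before_update: bytes) -> bytes:
    """New redundant data as it might be computed inside the redundant data drive.

    The drive receives write_data and data_before_update, reads
    redundant_before_update from its own medium, and combines all three.
    """
    return xor_blocks(write_data, data_before_update, redundant_before_update)
```

Under the XOR assumption, the result equals the parity that would be obtained by recomputing over the whole updated stripe, which is why the drive needs only these three inputs and not the untouched partitions.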
However, the method of “generation in a drive” cannot always calculate and store redundant data in the most efficient manner. If the length of the generated redundant data is longer than can be stored within one spin of the disk, the method of “generation in a drive” will require the disk to spin on standby during the interval between reading the “redundant data before update” and writing the updated redundant data. This additional spinning on standby increases the time required by the drive to save the updated redundant data, and thereby increases the response time of the disk array device. Therefore, if the length of the redundant data is longer than can be stored within one spin of the disk, the method of “generation in a drive” will have a greater response time than if the redundant data could be stored within one spin of the disk.
The method of “generation in a drive” is designed to increase the volume of “data before update” read from the component disk drives, as the number of partitions of “write data” received from the high level device increases, thereby increasing the load placed on the data storage device. Hence, if the number of partitions of “write data” is great, the method of “generation in a drive” disadvantageously lowers the throughput of the disk array device.
With the method of “generation in a drive,” it is possible to increase the amount of time the redundant data disk drive is busy during each spin of the disk, as compared to the “method of read and modify.” This increases the burden placed on the redundant data disk drive in a highly multiplexed and high load environment. Hence, the method of “generation in a drive” may enhance the probability that the redundant data disk drive will be in use, thereby lowering the throughput of the drive.
When write data is transferred from the high level device to the disk control device, together with an explicit specification of consecutive pieces of data, the method of “generation in a drive” operates to immediately generate redundant data for the transferred write data. As a result, when the succeeding write data is transferred from the high level device, the “method of all stripes” may lose the chance of generating redundant data corresponding to the first write data. Hence, if the method of “generation in a drive” cannot use the “method of all stripes,” this disadvantageously lowers the efficiency of generating redundant data, thereby degrading the throughput of the disk array device.
Finally, with the method of “generation in a drive,” the generation of redundant data may become unsuccessful upon the occurrence of any drive-related failure, such as the inability to read the “redundant data before update.” If this kind of failure occurs, the redundancy of the Error Correcting Code (ECC) group of the component disk drives of the disk array device may be lost at once.
It is an object of the present invention to avoid the spinning on standby that occurs when the data length of the write data from a high level device is longer than can be stored within one spin of the disk, and to improve the response time of a disk array device when it generates and stores redundant data.
It is a further object of the present invention to improve the throughput of a disk array device by selecting the most appropriate method for generating redundant data, so that the necessary number of reads of the “data before update” is minimized relative to the number of partitions of write data received from the high level device.
It is yet a further object of the present invention to reduce the amount of time the redundant data disk drive is busy during each spin of the disk.
It is another object of the present invention to improve the throughput of a disk array device by enhancing the efficiency of generating redundant data in association with the required access pattern (e.g., sequential access, indexed access, etc.) for the write data, as specified by the high level device.
It is still another object of the present invention to enhance the reliability of the process of generating redundant data in disk array storage devices.
According to the invention, a disk array device having a plurality of disk drives composing a disk array, and a disk control device for controlling those disk drives includes a plurality of methods for generating redundant data and a control logic for selectively executing at least one of those methods when a high level device requests the disk array device to store supplied data in a redundant fashion.
The disk array device, including a plurality of disk drives composing a disk array and a disk control device for controlling those disk drives, is dynamically switched from the generation of redundant data in the disk control device to the generation of the redundant data inside of the disk drives according to an operating status.
Specifically, as an example, the disk array device according to the invention includes the following components.
That is, the disk array device comprising a plurality of disk drives, which is arranged to define a logical group from a subset of the disk drives and to save the redundant data in part of that logical group so that the data of failed drives can be recovered from the normal disk drives when some of the disk drives are disabled by temporary or permanent failure, provides methods for generating redundant data in each of the disk drives.
The disk control device contains: (1) a first redundant data generating circuit that generates new redundant data from partitioned write data received from a high level device, the previous data to be updated by the partitioned write data, and the redundant data of the data group of the partitioned write data, (2) a second redundant data generating circuit that generates new redundant data of the data group from the data that is not updated by the partitioned write data contained in the data group, and (3) a selective control circuit for selecting a difference data generating circuit for generating difference data from: (a) the partitioned write data received from the high level device and (b) the previous data updated by the partitioned write data and one of the redundant data generating circuits.
Further, the disk control device provides a method of determining the length of the write data received from the high level device, a method of detecting the utilization of each redundant data disk drive, a method of determining if the transfer of consecutive pieces of data from the high level device has been explicitly specified, and a method of detecting if the generation of redundant data within the redundant data disk drive has failed. The selective control circuit operates to select a proper method for generating the redundant data.
An example of a preferred embodiment of the disk array device and the method for controlling the disk array device as described above is provided as follows.
The disk array device operates to determine the length of the write data sent from the high level device, and then generate redundant data in the redundant data disk drive if the data length is determined to be shorter than can be stored within one spin of the disk. Hence, the disk array device operates to suppress the spinning on standby within the redundant data disk drive, thereby improving the throughput of the disk array device.
If the length of the write data sent from the high level device is determined to be longer than can be stored within one spin of the disk, the difference data between the partitioned write data and the “data before update” stored on the raw data disk drive(s) is transferred onto the redundant data disk drive. The redundant data disk drive then operates to generate redundant data from the difference data and the “redundant data before update,” thereby suppressing the spinning on standby within the redundant data disk drive and improving the throughput of the disk array device accordingly.
The method for controlling the disk array device is executed to determine the utilization of the redundant data disk drive and, if the utilization is determined to be greater than or equal to a given value, to generate the redundant data in another disk control device without having to execute the method of “generation in a drive,” for the purpose of distributing the load associated with generation of the redundant data. In a highly multiplexed and high load environment, by suppressing the increase of the load placed on the redundant data disk drive, it is possible to lower the probability that the redundant data disk drive will be in use and thereby improve the throughput of the disk array device.
The method for controlling the disk array device is executed to determine if the transfer of consecutive pieces of data from the high level device to the disk control device has been explicitly specified and, if the explicit transfer of consecutive data is specified, to generate the redundant data shortly after the write data reaches a sufficient length rather than generating the redundant data immediately. This improves the efficiency of generating redundant data and improves the throughput of the disk array device.
When the method of “generation in a drive” fails to generate redundant data, the method for generating the redundant data is switched to another method. This makes it possible to increase the chances of recovering from the failure and thereby improve the reliability of the disk array device.
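The selection criteria of this embodiment (write data length, utilization of the redundant data disk drive, explicit specification of consecutive data, and failure of generation within the drive) can be gathered into one illustrative decision routine. All names, the parameter units, and in particular the order in which the criteria are tested are assumptions made for this sketch; the text does not fix a precedence among them.

```python
from dataclasses import dataclass
from enum import Enum


class Method(Enum):
    GENERATION_IN_DRIVE = "generation in a drive"
    DIFFERENCE_TO_DRIVE = "difference data sent to the redundant data drive"
    CONTROLLER = "generation in a disk control device"
    DEFER_FOR_CONSECUTIVE = "wait for further consecutive write data"


@dataclass
class WriteRequest:
    length: int                      # length of the write data (units are hypothetical)
    consecutive_specified: bool = False


def select_method(request: WriteRequest,
                  one_spin_capacity: int,
                  drive_utilization: float,
                  utilization_limit: float,
                  drive_generation_failed: bool = False) -> Method:
    """Choose a redundant-data generation method (precedence is an assumption)."""
    if drive_generation_failed:
        return Method.CONTROLLER              # switch away from the failed method
    if request.consecutive_specified:
        return Method.DEFER_FOR_CONSECUTIVE   # do not generate immediately
    if drive_utilization >= utilization_limit:
        return Method.CONTROLLER              # distribute the load off the drive
    # Treating "exactly one spin" as fitting is an assumption of this sketch.
    if request.length <= one_spin_capacity:
        return Method.GENERATION_IN_DRIVE
    return Method.DIFFERENCE_TO_DRIVE
```

A short write to a lightly used redundant data disk drive would select generation within the drive, whereas the same write issued while the drive's utilization is at or above the given value would be handled in a disk control device instead.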