The present invention relates generally to data storage systems having user configurable levels of input/output ("I/O") performance and fault tolerance. More particularly, the present invention relates to a system, apparatus, and method for distributing data across multiple disk drives that provides exceptional levels of I/O performance and one-hundred percent data redundancy.
Disk drives in all computer systems are susceptible to failures caused by temperature variations, head crashes, motor failure, controller failure, and changing voltage conditions. Modern computer systems require, or at least benefit from, a fault-tolerant data storage system for protecting data in the data storage system against instances of disk drive failure. One approach to meeting this need is to provide a redundant array of independent disks (RAID) system operated by a disk array controller (controller).
A RAID system typically includes a single standalone controller, or multiple independent controllers, wherein each controller operates independently with respect to the other controllers. A controller is generally coupled across one or more input/output (I/O) buses both to an array of disk drives and also to one or more host computers. The controller processes I/O requests from the one or more host computers to the array of disk drives. Such I/O requests include, for example, Small Computer System Interface (SCSI) I/O requests, which are known in the art.
Such a RAID system provides fault tolerance to the one or more host computers, at a disk drive level. In other words, if one or more disk drives fail, the controller can typically rebuild any data from the one or more failed disk drives onto any surviving disk drives. In this manner, the RAID system handles most disk drive failures without interrupting any host computer I/O requests.
Fundamental to RAID technology is the concept of "striping," or dividing a body of data from a host computer into data segments and distributing the data segments in a well-defined manner across each disk drive in the disk drive array. In this manner, the disk drive array becomes, in effect, one logical storage unit as far as a host computer is concerned. There are a number of well known data striping techniques, or RAID levels, including RAID levels 0-6. A numerically higher RAID level does not imply an increase in the disk subsystem's fault tolerance (reliability), I/O performance, and scalability. Instead, the numerical levels refer to different techniques that balance various levels of reliability, I/O performance, and scalability.
To illustrate this balance, consider that RAID level 0 has exceptional I/O performance because, as data is written to or read from the disk drive array in response to a group, or ensemble, of I/O requests, each disk drive, or spindle, in the array comes into play to satisfy the I/O requests. Optimal I/O performance is realized only when every spindle in the array participates in this way.
However, RAID level 0 is redundant in name only, and offers no fault tolerance. If RAID level 0 were fault tolerant, the techniques typically used to provide fault tolerance would slow down the I/O performance typically available through the use of RAID level 0. Because RAID level 0 is not fault tolerant, it is not a viable solution in systems that require reliability.
Fault tolerance in case of disk drive failure is typically provided by a number of different techniques. These techniques include disk drive mirroring and data mirroring. Disk drive mirroring involves duplicating an original datum that is stored on a first disk drive, and storing the duplicate datum on a second disk drive. RAID levels 1 and 0+1 use disk drive mirroring to provide fault tolerance to a data storage subsystem. Disk drive mirroring also provides one-hundred percent redundancy of data that virtually eliminates RAID system interruption due to a single disk drive failure.
There are a number of problems with data striping techniques (RAID levels) that use disk drive mirroring to increase fault tolerance. One problem is that disk drive mirroring sacrifices I/O performance for fault tolerance. For example, consider that in a data storage subsystem implemented with either RAID level 1 or RAID level 0+1, only one-half of the disk drives are used to satisfy any read request from a host computer. The disk drives that are used to satisfy a read data request are the disk drives that have original data stored on them. (The other one-half of the disk drives come into play only if a primary disk drive fails, wherein the duplicate data is used to satisfy the read request). As noted above, optimal I/O performance is only realized if each disk drive, or spindle, in the array comes into play to satisfy the I/O request. Therefore, RAID levels that use disk drive mirroring are not viable solutions for systems that require fast response to read data requests.
RAID level 6 data striping techniques use data mirroring, as compared to disk drive mirroring. Data mirroring also means that each original datum is mirrored across the disk drives. However, using data mirroring, original data is typically not mirrored on a dedicated mirror disk drive, as is done in RAID levels that use disk drive mirroring. This means that it is possible to distribute the data across the disk drives in a manner that provides optimal read data request performance.
To illustrate data mirroring according to RAID level 6, refer to Table 1, where there are shown aspects of RAID level 6 data striping techniques according to the state of the art.
The first three vertical columns represent disk drives 1-3 and are respectively labeled "Drive 1", "Drive 2", and "Drive 3". Horizontal rows, stripes 0-3, represent "stripes of data," where original and duplicate data are respectively distributed across the disk drives 1-3 in the disk drive 1-3 array. Original data is stored on disk drives 1-3 respectively in data segments A, B, C, D, E, and F. Mirrored data, or duplicate data, are respectively stored on disk drives 1-3 in data segments A′, B′, C′, D′, E′, and F′. For example, data segment A′ contains a duplicate of the original data contained in data segment A, B′ contains a duplicate of the original data contained in B, C′ contains a duplicate of the original data contained in C, and the like.
Stripe 0 includes original data in data segments A-C, and stripe 1 contains respective duplicates of original data in data segments A′-C′. Stripe 2 includes original data in data segments D-F, and stripe 3 contains respective duplicates of original data in data segments D′-F′. As can be seen, RAID level 6 stores duplicate data in data segments A′-F′ on different disk drives 1-3 than the corresponding original data in data segments A-F. To accomplish this, the RAID level 6 data striping algorithm rotates a copy of the original data in each respective data segment of the immediately preceding stripe to the right by one data segment.
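The rotate-right placement described above can be sketched in a few lines of Python. This is a minimal illustration only, not the controller's actual algorithm: the function name `rotated_mirror_layout` is hypothetical, drives are indexed from 0 (so index 1 is "Drive 2"), and a trailing apostrophe stands in for the prime mark on duplicate segments.

```python
def rotated_mirror_layout(segments, num_drives):
    """Build stripes for the rotate-right mirroring described in the text.

    Hypothetical helper for illustration. Each returned stripe is a list
    with one segment label per drive (index 0 is Drive 1).
    """
    if len(segments) % num_drives != 0:
        raise ValueError("this sketch assumes only full stripes")
    stripes = []
    for start in range(0, len(segments), num_drives):
        original = segments[start:start + num_drives]
        # The duplicate on drive d copies the original held by drive d-1,
        # i.e. the original stripe rotated right by one data segment.
        duplicate = [original[(d - 1) % num_drives] + "'"
                     for d in range(num_drives)]
        stripes.append(original)
        stripes.append(duplicate)
    return stripes


layout = rotated_mirror_layout(list("ABCDEF"), 3)
for number, stripe in enumerate(layout):
    print(f"Stripe {number}:", "  ".join(stripe))
```

With segments A-F and three drives, stripe 1 comes out as C', A', B', so the duplicate of A lands on Drive 2, away from the drive holding the original.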
This rotation to the right by one data segment before writing the duplicate data introduces an undesirable amount of rotational delay into a data storage subsystem that uses RAID level 6. Such rotational delay slows down the data storage subsystem performance in response to sequential write data requests. To understand why this is the case, it is helpful to understand how a write data request is handled by a disk drive 1-3.
Each disk drive 1-3 is organized into a plurality of platters, each of which has two recordable disk surfaces. (Individual platters and disk surfaces are not shown). Each platter surface is divided into concentric circles called "tracks". Each track is in turn divided into a plurality of "sectors". Each sector has an associated logical block address (LBA). (Such disk drive 1-3 organization is known in the art).
The first step in writing data onto a platter of a disk drive 1-3 is for a read/write disk head (disk head) to move until it is over the proper track. (Individual read/write disk heads are not shown). This operation is called a "seek", and the time to move the disk head until it is over the proper track is called the "seek time". Once the correct track has been reached, it is necessary to wait for the desired sector to rotate under the disk head. This time is called the "rotational delay".
A simple example can be used to illustrate rotational delay. Referring to Table 1, it can be seen that before duplicate data can be written into data segment A′ in disk drive 2, the platter in disk drive 2 must be rotated until the correct logical block (LB) is under the disk head. Although individual LBs are not shown, the correct LB includes the start of a desired data segment A-F′. In this example, the correct LB contains the start of data segment A′. (LBs are organized in a disk drive 1-3 in a sequential manner, such that a first LB has a lower LBA than a second, subsequent LB).
To process a next, sequential write data request, illustrated by the data in data segment B, the platter in disk drive 2 must be rotated until a LB with a lower LBA is underneath the disk head. The amount of platter rotation required to write this next data into data segment B is nearly a complete, 360 degree platter rotation. Only at this point will the next data be written into data segment B.
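To give a rough sense of what a nearly complete platter rotation costs, the time for one full rotation follows directly from the spindle speed. The sketch below is illustrative only: the spindle speeds are assumed example values that do not come from the text, and the function name `rotation_time_ms` is hypothetical.

```python
def rotation_time_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm  # 60,000 ms per minute divided by rotations per minute


# Assumed example spindle speeds, for illustration only.
for rpm in (5400, 7200, 10000):
    print(f"{rpm} RPM: about {rotation_time_ms(rpm):.2f} ms per rotation")
```

Under these assumptions, each backward write adds close to one full rotation time before the write can begin.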
The procedure of writing data to a disk drive 1-3 by rotating the platter in a disk drive 1-3 from a LB with a higher LBA, to a LB with a lower LBA, is known as a backward write. As a general rule, when RAID level 6 is used to sequentially stripe data across disk drives 1-3, every disk drive other than the first disk drive 1 will be required to perform backward writes. For example, disk drives 2-3 are shown to have performed backward writes to write data into respective data segments B, C, E and F.
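The backward writes named above can be reproduced with a small simulation. This is a sketch under stated assumptions rather than a model of a real drive: the stripe number stands in for the LBA on each drive, segments are written in the order original then duplicate (A, A', B, B', and so on), the duplicate sits one drive to the right of its original in the next stripe, and the function name `backward_writes` is hypothetical.

```python
def backward_writes(num_drives, segments):
    """List the segments whose write lands on a lower stripe than the
    previous write to the same drive (a backward write in this sketch)."""
    position = {}  # label -> (drive index, stripe number used as a stand-in LBA)
    for n, seg in enumerate(segments):
        pair, drive = divmod(n, num_drives)
        position[seg] = (drive, 2 * pair)
        # Duplicate: next stripe, rotated right by one drive.
        position[seg + "'"] = ((drive + 1) % num_drives, 2 * pair + 1)

    last_stripe = {}  # drive index -> stripe of the most recent write
    backward = []
    for seg in segments:
        for label in (seg, seg + "'"):
            drive, stripe = position[label]
            if drive in last_stripe and stripe < last_stripe[drive]:
                backward.append(label)
            last_stripe[drive] = stripe
    return backward


print(backward_writes(3, list("ABCDEF")))
```

For three drives and segments A-F, the sketch reports backward writes for B, C, E and F, all on the second and third drives, which agrees with the example in the text.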
Ideally, the data, original or duplicate data, would always be stored in a respective disk drive 1-3 by rotating the platter in the respective disk drive 1-3 to a LB with a higher LBA for each subsequent write data operation, without requiring such backward writes. Otherwise, as is evidenced by using RAID level 6 techniques to stripe data across disk drives 1-3, such sequential backward writes slow down data storage subsystem performance by introducing undesirable amounts of rotational delay into the data storage subsystem. Therefore, RAID level 6 is not a viable solution for data storage subsystems that require high write data performance.
Another problem with the state of the art data striping techniques is that they are not typically scalable across either an even or an odd number of disk drives. It would be cost-efficient and desirable for a data striping technique to be scalable across either an even number or an odd number of disk drives, so that available hardware resources can be fully utilized. For example, RAID level 1 and RAID level 0+1 each require an even number of disk drives. Neither of these RAID levels is scalable across an odd number of disk drives.
In light of the above, what is needed is a new procedure for striping data across disk drives in a disk drive array that delivers exceptional, or RAID-0 levels of I/O performance for sequential I/O requests without sacrificing high levels of reliability. To accomplish this, the desired data striping technique will not perform backward writes in response to sequential write data requests. Additionally, the desired data striping technique will be scalable across either an even number of disk drives or an odd number of disk drives greater than two disk drives. (The number of disk drives is greater than 2 disk drives because at least 2 disk drives are required to provide data redundancy to a data storage system).
Heretofore, the state of the art was limited by data storage and retrieval procedures that: (a) while providing for 100% data redundancy, do not provide optimal performance for sequential write data requests; and, (b) are not typically scalable across either an even or an odd number of disk drives. The present invention provides a solution for these limitations.
In one aspect of the present invention, a controller receives a plurality of write data requests from a host computer. Each write data request includes data. In response to receiving the write data requests, the controller stores the data across the disk drives according to a data striping procedure. In a data stripe that includes substantially original data, the data are distributed across the disk drives according to a first rule. In a data stripe that includes substantially duplicate data, data are distributed across the disk drives according to a second rule. The data stripes that have substantially original data are interleaved with the data stripes that have substantially duplicate data.