1. Field of the Invention
This invention relates to computer data storage systems, and more particularly to arrays of storage devices that employ mirroring and striping of data, such as certain Redundant Array of Independent Disks (RAID) systems, and to mechanisms for load balancing in such storage systems when operating in a degraded mode such as when reconstructing a failed storage device.
2. Description of the Related Art
A continuing desire exists in the computer industry to consistently improve the performance of computer systems over time. For the most part this desire has been achieved for the processing or microprocessor components of computer systems. Microprocessor performance has steadily improved over the years. However, the performance of the microprocessor or processors in a computer system is only one component of the overall performance of the computer system. For example, the computer memory system must be able to keep up with the demands of the processor or the processor will become stalled waiting for data from the memory system. Generally computer memory systems have been able to keep up with processor performance through increased capacities, lower access times, new memory architectures, caching, interleaving and other techniques.
Another critical component to the overall performance of a computer system is the I/O system performance. For most applications the performance of the mass storage system or disk storage system is the critical performance component of a computer's I/O system. For example, when an application requires access to more data or information than it has room in allocated system memory, the data may be paged in/out of disk storage to/from the system memory. Typically the computer system's operating system copies a certain number of pages from the disk storage system to main memory. When a program needs a page that is not in main memory, the operating system copies the required page into main memory and copies another page back to the disk system. Processing may be stalled while the program is waiting for the page to be copied. If storage system performance does not keep pace with performance gains in other components of a computer system, then delays in storage system accesses may overshadow performance gains elsewhere.
One method that has been employed to increase the capacity and performance of disk storage systems is to employ an array of storage devices. An example of such an array of storage devices is a Redundant Array of Independent (or Inexpensive) Disks, RAID. A RAID system improves storage performance by providing parallel data paths to read and write information over an array of disks. By reading and writing multiple disks simultaneously the storage system performance may be greatly improved. For example, an array of four disks that can be read and written simultaneously may provide a data rate almost four times that of a single disk. However, using arrays of multiple disks comes with the disadvantage of increasing failure rates. In the example of a four disk array above, the mean time between failure (MTBF) for one of the four disks of the array will be one-fourth that of a single disk. It is not uncommon for storage device arrays to include many more than four disks, shortening the mean time between failure from years to months or even weeks. RAID systems address this reliability issue by employing parity or redundancy so that data lost from a device failure may be recovered.
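The failure-rate arithmetic above may be illustrated with a minimal sketch. It assumes independent, identical disks with constant failure rates, so that the mean time to the first failure in the array is the single-disk MTBF divided by the number of disks. The function name and the 400,000 hour figure are hypothetical and are used only for illustration.

```python
def array_mtbf_hours(single_disk_mtbf_hours, n_disks):
    """Approximate mean time to the first disk failure in an array.

    Assumes independent, identical disks with constant failure rates,
    so the array failure rate is n_disks times the single-disk rate.
    """
    if n_disks < 1:
        raise ValueError("n_disks must be at least 1")
    return single_disk_mtbf_hours / n_disks


# Hypothetical single-disk MTBF, chosen only to make the arithmetic easy to follow.
single = 400_000.0
for n in (1, 4, 40):
    print(n, "disks:", array_mtbf_hours(single, n), "hours")
```

Under these assumptions a four disk array sees its first failure, on average, in one-fourth the time of a single disk, and larger arrays shorten that interval proportionally.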
One common RAID technique or algorithm is referred to as RAID 1. In a RAID 1 system all data is mirrored within the storage system. In other words, a duplicate copy of all data is maintained within the storage system. Typically, a RAID 1 system performs mirroring by copying data onto two separate disks. Thus a typical RAID 1 system requires twice the number of disks. In general, one disadvantage of RAID 1 systems is that they may not provide for load balancing over multiple disks within the system. For example, the data used by a given application may be located all on one disk of the RAID 1 system. If the bulk of storage system accesses are being generated by that one application, then the storage system load will be concentrated on a single device, thus negating the performance advantage of having an array of disks.
RAID 0 is an example of a RAID algorithm used to improve performance by attempting to balance the storage system load over as many of the disks as possible. RAID 0 implements a striped disk array in which data is broken down into blocks and each block is written to a separate disk drive. This technique may be referred to as striping. Typically, I/O performance is improved by spreading the I/O load across multiple drives since blocks of data will not be concentrated on any one particular drive. However, a disadvantage of RAID 0 systems is that they do not provide for any data redundancy and are thus not fault tolerant.
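Striping as described above may be sketched as a simple round-robin mapping of logical blocks to drives. The sketch assumes 0-based block and drive numbering and one block per stripe unit; the function name `raid0_locate` is hypothetical.

```python
from collections import Counter


def raid0_locate(block, n_disks):
    """Return (disk index, block offset on that disk) for a logical block.

    Assumes 0-based numbering and round-robin placement of blocks.
    """
    return block % n_disks, block // n_disks


# Eight consecutive logical blocks on a four-disk array.
placements = [raid0_locate(block, 4) for block in range(8)]
load_per_disk = Counter(disk for disk, _ in placements)
print(placements)
print(load_per_disk)
```

Consecutive blocks land on different drives, so a sequential access of this kind is spread evenly over the array rather than concentrated on one drive.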
RAID 5 is an example of a RAID algorithm that provides some fault tolerance and load balancing. In RAID 5 systems both data and parity information are striped across the storage device array. RAID 5 systems can withstand a single device failure by using parity information to rebuild a failed disk. However, write performance may suffer in RAID 5 systems due to the overhead of calculating parity information. On the other hand, only one additional disk is required to store parity information, as opposed to the 2X number of disks required for typical RAID 1 systems.
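The rebuild of a lost stripe unit from parity may be sketched as follows. The sketch assumes bytewise XOR parity, as is commonly used for RAID 5, and uses small hypothetical byte strings in place of real stripe units; the helper name `xor_blocks` is an assumption.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data stripe units of one stripe, plus their parity unit.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
parity = xor_blocks(data)

# Suppose the device holding data[1] fails: XOR the survivors with the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

The same XOR that recovers the lost unit must be recomputed whenever a data unit is written, which is the source of the write overhead noted above.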
Another RAID technique referred to as RAID 10 or RAID 0+1 attempts to combine the advantages of both mirroring and striping. FIG. 1 illustrates how data is stored in a typical RAID 10 system. Data is stored in stripes across the devices of the array. FIG. 1 shows data stripes A, B, . . . X stored across n storage devices. Each stripe is broken into stripe units, where a stripe unit is the portion of a stripe stored on each device. FIG. 1 also illustrates how data is mirrored on the array. For example, stripe unit A(1) is stored on devices 1 and 2, stripe unit A(2) is stored on devices 3 and 4, and so on. Thus, devices 1 and 2 form a mirrored pair, as do devices 3 and 4, etc. As can be seen from FIG. 1, this type of system will always require an even number of storage devices (2X the number of drives with no mirroring). This may be a disadvantage for some users who have a system containing an odd number of disks. The user may be required to either not use one of his disks or buy an additional disk.
A storage array is said to enter a degraded mode when a disk in the array fails. This is because both the performance and reliability of the system (e.g. RAID) may become degraded. Performance may be degraded because the remaining copy (mirror copy) may become a bottleneck. To reconstruct a failed disk onto a replacement disk may require a copy operation of the complete contents of the mirror disk for the failed disk. The process of reconstructing a failed disk imposes an additional burden on the storage system. Also, reliability is degraded since if the second disk fails before the failed disk is replaced and reconstructed the array may unrecoverably lose data. Thus it is desirable to shorten the amount of time it takes to reconstruct a failed disk in order to shorten the time that the system operates in a degraded mode.
In the example of FIG. 1, if device 1 fails and is replaced with a new device, the data that was stored on device 1 is reconstructed by copying the contents of device 2 (the mirror of device 1) to the new device. During the time the new device is being reconstructed, if device 2 fails, data may be completely lost. Also, the load of the reconstruction operation is unbalanced. In other words, the load of the reconstruction operation involves read and write operations between only device 2 and the new device.
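The mirrored-pair layout of FIG. 1 and its unbalanced reconstruction may be sketched as follows, using the 1-based device and stripe unit numbering of the figure. The function names are hypothetical.

```python
def raid10_devices(unit):
    """Return the mirrored pair of devices holding 1-based stripe unit `unit`."""
    return 2 * unit - 1, 2 * unit


def mirror_partner(device):
    """Return the other device of the mirrored pair containing `device`."""
    return device + 1 if device % 2 == 1 else device - 1


print(raid10_devices(1))   # stripe unit A(1)
print(raid10_devices(2))   # stripe unit A(2)
print(mirror_partner(1))   # the only device read to rebuild device 1
```

Because every stripe unit of a failed device has its only other copy on the same partner device, all reconstruction reads fall on that one device.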
Turning now to FIG. 2, an example of a storage array is shown that attempts to overcome some of the disadvantages associated with the array of FIG. 1. In the array of FIG. 2, data is also striped across the devices of the array. Data stripes A through X are shown, where X may be any number of data stripes within the capacity of the array. Each stripe is divided into stripe units with each device storing a different stripe unit of a data stripe. The "original" stripe units are mapped sequentially to consecutive devices. Each data stripe is also mirrored across the array. However, instead of mirroring the data stripes by duplicating each disk as in FIG. 1, the mirrored data stripes are stored on the array skewed by one device position from the original data stripe. Note that the terms "original" and "mirrored" are simply used to provide a frame of reference, but in normal operation there is no difference between an "original" data stripe and a "mirrored" data stripe. As shown in FIG. 2, the mirrored data stripes are all skewed together by one device. For example, original stripe unit A(1) is stored on device 1 and the mirrored copy of stripe unit A(1) is stored on device 2. Likewise, original stripe unit B(1) is stored on device 1 and the mirrored copy of stripe unit B(1) is stored on device 2. Thus, all mirrored stripe units are skewed by one device position. By skewing the mirrored data several improvements over the system of FIG. 1 are achieved. Although the system of FIG. 2 still requires double the amount of storage capacity since all data is mirrored, the mirroring of data may be accomplished over an even or odd number of drives.
In the system of FIG. 2 each half of a device is mirrored in half of one other device. For example, for device 2, original stripe units A(2), B(2) . . . X(2) are mirrored in device 3 (not shown), and device 2 also contains the "mirror" copies of stripe units A(1), B(1) . . . X(1) from device 1. If device 2 fails the replacement for device 2 may be reconstructed by reading data from both device 3 and device 1 and writing that data to the replacement device 2. Thus, the system of FIG. 2 provides some load balancing improvement over the system of FIG. 1 in that reconstruction reads are now spread over two devices, although reconstruction writes are still focused on one device.
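The skew-by-one arrangement of FIG. 2 may be sketched as follows, using 1-based device numbers. The sketch assumes that the mirror of the last device wraps around to device 1, which the description above implies but does not state; the function names are hypothetical.

```python
def skewed_mirror_device(device, n):
    """Device holding the mirror copies of the originals on `device`.

    Assumes 1-based device numbers and wraparound from device n to device 1.
    """
    return device % n + 1


def reconstruction_sources(failed, n):
    """Devices that must be read to rebuild the failed device."""
    holds_mirrors_of_failed = skewed_mirror_device(failed, n)
    # The failed device held mirrors of the originals on the preceding device.
    holds_originals_mirrored_on_failed = (failed - 2) % n + 1
    return {holds_originals_mirrored_on_failed, holds_mirrors_of_failed}


print(sorted(reconstruction_sources(2, 5)))   # devices read to rebuild device 2
```

For a five-device array this reports devices 1 and 3 as the sources for rebuilding device 2, matching the example above; the remaining devices do not contribute to the reconstruction.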
For large arrays of storage devices, the MTBF for any one of the devices may be fairly short. Thus large arrays may spend a significant amount of time operating in a degraded mode. To improve degraded mode operation, it would be desirable to improve the load balancing of the reconstruction operation.
An array of storage devices may be provided in which data is both striped and mirrored across the array. Data may be organized in stripes in which each stripe is divided into a plurality of stripe units. The stripe units may be mapped sequentially to consecutive storage devices in the array for each data stripe. Each data stripe is also mirrored within the array as a mirrored data stripe. Each mirrored data stripe is also divided into a plurality of stripe units. The stripe units of the mirrored stripes are distributed throughout the array according to a mapping that provides for load balancing during a reconstruction operation. According to one embodiment, stripe units for mirrored stripes are distributed according to a rotational group such that each mirrored stripe is rotated on the array by one more position than the previous mirrored stripe and wherein the rotational group is repeated as necessary. Alternatively, the mirrored stripe units may be distributed according to other permutations to improve load balancing during reconstruction of a failed device. In other embodiments, in addition to mapping mirrored stripe units to balance read operations during reconstruction, one or more spare storage devices may be striped throughout the array to improve load balancing for write operations during reconstruction.
In one embodiment, an array of storage devices having at least three storage devices is configured to store stripes of data. A first stripe of data is stored as a plurality of stripe units stored consecutively across consecutive ones of the storage devices. One or more additional stripes of data are also stored as pluralities of stripe units in the same consecutive order as the first stripe of data across consecutive storage devices. A copy of the first stripe of data is stored as copies of the first stripe units. Each one of the copies of the stripe units from the first stripe of data is stored on a different one of the storage devices than the one of the stripe units of which it is a copy. A copy of each of the one or more additional stripes of data is also stored. Each one of the stripe unit copies for the copies of additional data stripes is stored on a different one of the storage devices than the stripe unit of which it is a copy. The copied, or mirrored, stripe units of the first data stripe are stored on the storage devices in a first order. The copied, or mirrored, stripe units for a second data stripe are stored on the storage devices in a second order, wherein the first order is different than the second order. In one embodiment, the order in which the copied or mirrored stripe units from the first data stripe are stored is the order by which the first data stripe is stored rotated by one storage device and the second order is that order rotated by two storage devices. For additional copied data stripes, the stripe units are stored in increasing rotational order until the rotational group is repeated. The rotational group may be repeated as often as necessary for additional data stripe copies (mirror copies).
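The rotational ordering of the mirrored stripe units may be sketched as follows. The sketch assumes 0-based numbering, that original stripe unit u is stored on device u, that rotation is toward higher device numbers with wraparound, and that the rotational group repeats after n-1 stripes so that no copy ever lands on the device holding its original. These conventions and the function name `mirror_layout` are assumptions made for illustration.

```python
def mirror_layout(stripe, n):
    """Return a list whose entry d is the original stripe unit copied onto device d.

    The first mirrored stripe is rotated by one device position, the next by
    two, and so on up to n-1, after which the rotational group repeats.
    """
    rotation = stripe % (n - 1) + 1
    return [(device - rotation) % n for device in range(n)]


n = 5
for stripe in range(n - 1):
    print("stripe", stripe, "mirror order:", mirror_layout(stripe, n))
```

Each printed order is a different rotation of the original order 0, 1, ..., n-1, and in no case does a device hold the copy of its own original stripe unit.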
One embodiment of a storage system may include an array of at least four storage devices and a storage controller coupled to the storage devices. A storage controller may be configured to store data in stripes across the storage devices. The storage controller may further be configured to mirror each stripe of data on the storage devices. Additionally, the storage controller may be configured to perform a reconstruction operation to reconstruct lost data from a failed storage device. The reconstruction operation may include reading different portions of the lost data from at least three of the remaining storage devices of the array. In a preferred embodiment, the reconstruction operation includes reading different portions of the lost data from all of the remaining storage devices of the array. The array may include an even or odd number of storage devices. The reconstruction operation may also include writing different portions of the lost data to two or more remaining storage devices of the array. Alternatively, lost data may be written to a single replacement device.
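The spreading of reconstruction reads over the remaining storage devices may be sketched as follows for a rotational mapping of the mirrored stripe units. The sketch assumes 0-based numbering, original stripe unit u stored on device u, a mirror rotation that increases by one per stripe and repeats after n-1 stripes, and a single replacement device receiving all writes. The function names are hypothetical.

```python
from collections import Counter


def mirror_device(stripe, unit_device, n):
    """Device holding the mirror copy of the original unit stored on unit_device."""
    rotation = stripe % (n - 1) + 1
    return (unit_device + rotation) % n


def reconstruction_reads(failed, n, n_stripes):
    """Count stripe units read from each surviving device to rebuild `failed`."""
    reads = Counter()
    for stripe in range(n_stripes):
        # The lost original unit is recovered from its mirror copy.
        reads[mirror_device(stripe, failed, n)] += 1
        # The lost mirror unit is recovered from the original it copied.
        for device in range(n):
            if device != failed and mirror_device(stripe, device, n) == failed:
                reads[device] += 1
    return reads


print(sorted(reconstruction_reads(failed=0, n=5, n_stripes=4).items()))
```

Under these assumptions, one full rotational group on a five-device array requires two stripe unit reads from each of the four surviving devices, so no single device becomes the read bottleneck.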
A storage system may include an array of n storage devices where n is greater than 2. The storage system may also include a storage controller coupled to the storage devices. The storage controller may be configured to store a group of data stripes across the storage array. Each data stripe may have a plurality of stripe units with each stripe unit stored on different ones of the storage devices. The storage controller may be further configured to store a copy of the group of data stripes on the storage devices. The copy of the group of data stripes may include a series of data stripe copies. Each data stripe copy includes stripe units stored on the storage devices in a rotated position from the stripe units of which they are copies. The stripe units of a first one of the series of data stripe copies are rotated by one storage device position and the stripe units of each one of the other data stripe copies for the group are rotated by one more position than the previous data stripe copy. Additional groups of data stripes and data stripe copies may be included in which this rotational positioning is repeated. The order of the group of data stripes is the same for each data stripe whereas the order for the group of data stripe copies follows the rotational group. The group of data stripes may have n−1 data stripes. Each data stripe may include n stripe units. Alternatively, each data stripe may include n−1 stripe units and a spare stripe unit. Also, embodiments may be included in which fewer stripe units and more spare units are included. The spare stripe unit for each data stripe in a group is stored on a different storage device in one embodiment.
A method for distributing copies of data in an array of storage devices may include storing a first data stripe across the storage devices. The first data stripe may have a plurality of first stripe units with each one of the first stripe units stored on different ones of the storage devices. The method also may include storing a copy of the first data stripe on the storage devices. Stripe units of the copy of the first data stripe are rotated by one position from the first data stripe. The method further may include storing a series of additional data stripes and copies of the additional data stripes where the copies of the additional data stripes are rotated by one more storage device position than a previous additional data stripe copy so that the ordering of the data stripe copies follows a rotational group.
A method for storing data in an array of storage devices may include storing data in stripes across the storage devices and mirroring each stripe of data on the storage devices. The method may further include reconstructing lost data from a failed one of the storage devices. The reconstructing may include reading different portions of the lost data from at least three of the remaining storage devices of the array. The reconstructing may also include writing different portions of the lost data to two or more devices of the array.
Generally speaking, a method for storing data on an array of storage devices in which data is striped across the storage devices may include mirroring the original data stripes in which mirrored data stripes are stored on the storage devices in a different position than other mirrored data stripes of a same group of mirrored data stripes and in a different position from the original data stripe of which it is a mirror copy. Each group may have at least three mirrored data stripes and the mirroring may be repeated for additional groups. The method may also include storing a spare data stripe across the storage devices for each group of original and mirrored data stripes.