1. Field of the Invention
Embodiments of the present invention relate to the field of data storage systems. More particularly, embodiments of the present invention relate to leveraging spare disks in a data storage system to provide redundant storage, and to reconstructing data in multiple arrays of disk drives after failure of a disk drive.
2. Related Art
Secondary data storage is an integral part of large data processing systems. A typical data storage system in the past utilized a single, expensive magnetic disk for storing large amounts of data. This single disk in general is accessed by the Central Processing Unit (CPU) through a separate Direct Memory Access (DMA) controller. The DMA controller then translates and executes the Input/Output (I/O) requests of the CPU. For single disk memory storage systems, the speed of data transfer to and from the single, large disk is much slower than the processing speed of the CPU and acts as a data processing bottleneck.
In response, redundant arrays of independent disks (RAIDs) have evolved from the single disk storage systems in order to match the speed of secondary storage access with the increasingly faster processing speeds of the CPU. To increase system throughput, the RAID architecture of secondary storage allows for the concurrent access of data from multiple disk drives.
The concept for the RAID architecture was first formalized in an article written by some members of the Department of Electrical Engineering and Computer Sciences at the University of California at Berkeley, entitled: “A Case for Redundant Arrays of Inexpensive Disks (RAID),” by D. A. Patterson, G. Gibson, and R. H. Katz, ACM SIGMOD Conference, Chicago, Ill., June 1988, hereinafter referred to as “Patterson et al.”
Typically, RAID architectures consist of one or more host interface controllers connected to several peripheral interface controllers via a high speed data bus. Each peripheral interface controller is, in turn, connected to several individual disk drives which provide the secondary storage for the connected hosts. Peripheral interface controllers, also referred to as array controllers herein, can be connected to the disk drives via common communication interfaces (e.g., SCSI). Generally, the speed of the data bus is greater than the speed of the interface between the disk drives and the peripheral interface controllers.
In order to reconstruct lost data in a redundancy group due to a failed disk, the system must define a reversible mapping from the data and its redundancy data in the group containing the lost data. Patterson et al. describe in their paper several such mappings. One such mapping is the RAID level 4 (RAID-4) mapping that defines a group as an arbitrary number of disk drives containing data and a single redundancy disk. The redundancy disk is a separate disk apart from the data disks.
Another mapping, RAID level 5 (RAID-5), distributes the redundancy data across all the disks in the redundancy group. As such, there is no single or separately dedicated parity disk. As the number of disks in a RAID-5 array increases, the potential for increasing the number of overlapped operations also increases. A RAID-5 array can support more disks than a RAID-4 array, which allows a RAID-5 array to achieve higher data storage capacity and to use a larger number of disks for better performance.
Some RAID storage systems contain spare disk drives. Storage units with additional spare disks are designed to operate continuously over a specified period of time, without requiring any repair of the unit due to failed disks. This is accomplished by carefully identifying and quantifying the components that are expected to fail during a given time period, and incorporating within the system sufficient hot-spare parts or disks. This internal spare disk architecture can automatically switch to the spare disks when a failure is encountered. Spares are incorporated so that compatible disk devices are always at hand upon a disk failure.
Prior Art FIG. 1 depicts a common implementation of a data storage system 100 containing spare disks. The data storage system is arranged in a RAID-5 configuration of disks 110-150 with three spare disks 162-166. In the data storage system 100, a data volume is divided into segments (e.g., 64 KB) called stripe units. Stripe units are mapped consecutively on a set of physical devices for parallel access purposes.
In order to recover from physical device failures (e.g., a disk failure), functions that compute redundancies over a group of stripe units are defined, and the resulting redundancies are mapped to distinct physical devices. Normally, each member of the group has to be mapped to a different physical device in order to make the recovery possible. The set of functions forms a set of equations with a unique solution. A single even parity function is commonly used and can recover from any single device failure in the group.
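The recovery property of a single even parity function can be illustrated with a minimal sketch. The helper names `xor_blocks` and `recover_missing` and the two-byte blocks are hypothetical and chosen only for illustration; they are not taken from FIG. 1.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equal-length byte strings (even parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_missing(surviving_blocks):
    """With single even parity, the lost block is the XOR of all survivors."""
    return xor_blocks(surviving_blocks)


# Four illustrative data blocks, each mapped to a different device.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]
parity = xor_blocks(data)

# Simulate the loss of the device holding data[2].
survivors = data[:2] + data[3:] + [parity]
recovered = recover_missing(survivors)
```

Because every member of the group, including the parity, sits on a different device, any one device can be lost and the remaining members still determine the missing block uniquely.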
For example, data storage system 100 contains eight disks. Five of the disks (disks 110, 120, 130, 140, and 150) contain data and their redundancies. The remaining three disks (disks 162, 164, and 166) are spare disks.
Further, in the RAID-5 configuration, system 100 stripes its data across groups of data stripe units. In the redundancy group of stripe unit-0, disk 110 contains data block-0, disk 120 contains data block-1, disk 130 contains data block-2, and disk 140 contains data block-3. Disk 150 in stripe unit-0 contains the redundancy data for blocks 0-3.
In the RAID-5 configuration, system 100 puts the redundancy data for the next redundancy group, associated with stripe unit-1, on disk 140 rather than on disk 150.
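The placement described for stripe unit-0 and stripe unit-1 is consistent with a parity position that moves one disk to the left on each successive redundancy group. The following sketch assumes that this rotation continues for later groups and that data blocks fill the remaining disks in order. Both are assumptions, since the text specifies only the first two groups. The function names are hypothetical.

```python
DISKS = [110, 120, 130, 140, 150]  # reference numerals of the non-spare disks


def parity_disk(stripe, disks=DISKS):
    """Disk holding the redundancy for a given group (assumed left rotation)."""
    n = len(disks)
    return disks[(n - 1 - stripe) % n]


def stripe_layout(stripe, disks=DISKS):
    """Map each disk to 'P' or to a data block label for one redundancy group.

    Assumes data blocks are numbered consecutively across groups and placed
    on the non-parity disks in ascending disk order.
    """
    p = parity_disk(stripe, disks)
    block = stripe * (len(disks) - 1)
    layout = {}
    for disk in disks:
        if disk == p:
            layout[disk] = "P"
        else:
            layout[disk] = f"D{block}"
            block += 1
    return layout


for s in range(3):
    print(s, stripe_layout(s))
```

Under these assumptions, group 0 places blocks 0-3 on disks 110-140 with redundancy on disk 150, and group 1 places its redundancy on disk 140, matching the description above.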
The disadvantage of the configuration illustrated in system 100 is the relatively large number of accesses required for performing a partial rewrite of the data in the redundancy group of stripe units. This drawback is especially noticeable in smaller RAID data storage systems. In a write operation, if all of the data in a redundancy group is to be written (e.g., a full stripe write), then the redundancy can be readily generated. However, in many cases, a write operation involves only part of the data in the group, with the remaining data staying the same.
Depending on the size of the data to be updated, one of the following two schemes can be used to perform a partial rewrite. In the first scheme, the remaining data in the group is read from the devices to help generate the required redundancies in conjunction with the new data to be written. This scheme is referred to as the “reconstruct write” scheme. It still requires accessing the entire group of stripe units to generate the new redundancies, and generally provides no additional efficiency benefits.
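A minimal sketch of the first scheme, assuming single XOR parity and an in-memory list standing in for the data disks of one group. The function name `reconstruct_write` and the access counters are hypothetical and serve only to show which devices are touched.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def reconstruct_write(stripe, updates):
    """Recompute parity from the unchanged data plus the new data.

    stripe  -- list of data blocks currently stored, one per data disk
    updates -- dict mapping a data-disk index to its new block
    Returns (new_blocks, new_parity, reads, writes).
    """
    reads = 0
    new_blocks = []
    for i, old in enumerate(stripe):
        if i in updates:
            new_blocks.append(updates[i])  # new data, no read needed
        else:
            new_blocks.append(old)         # unchanged data must be read
            reads += 1
    parity = bytes(len(new_blocks[0]))     # all-zero starting value
    for block in new_blocks:
        parity = xor_bytes(parity, block)
    writes = len(updates) + 1              # new data plus new parity
    return new_blocks, parity, reads, writes


stripe = [b"\x01", b"\x02", b"\x04", b"\x08"]
new_blocks, parity, reads, writes = reconstruct_write(stripe, {1: b"\xf0"})
```

In this example, updating one of four data blocks costs three reads and two writes, so every device in the group is accessed once.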
In the second scheme, the old data corresponding to the new data to be written is read along with the old redundancy to help generate the new redundancy in conjunction with the data to be written. This scheme is referred to as the “read-modify-write” scheme. This scheme is based on the fact that the functions used are generally self-inverse binary operations (e.g., the exclusive OR function, XOR, for which applying the same operand twice restores the original value).
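A minimal sketch of the second scheme, assuming single XOR parity. XOR-ing the old data into the old parity cancels its contribution, and XOR-ing the new data adds the new contribution, so the unchanged blocks never need to be read. The function names and block values are hypothetical.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def read_modify_write_parity(old_data, new_data, old_parity):
    """New parity = old parity XOR old data XOR new data."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


# One redundancy group with four illustrative data blocks.
blocks = [b"\x01", b"\x02", b"\x04", b"\x08"]
old_parity = b"\x0f"  # XOR of the four blocks above

# Update only block 1; blocks 0, 2, and 3 are never read.
new_block = b"\xf0"
new_parity = read_modify_write_parity(blocks[1], new_block, old_parity)

# Cross-check against a full recomputation over the updated group.
full = bytes(1)
for b in [blocks[0], new_block, blocks[2], blocks[3]]:
    full = xor_bytes(full, b)
```

The shortcut and the full recomputation give the same parity, which is why the small-write path only needs the old data and the old redundancy.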
The second scheme is efficient for “small writes” and is commonly used. However, for a RAID system with “r” redundancies, it requires 2(r+d) accesses to the disk drives, where “d” is the number of data disks involved in the small write. Specifically, it requires “r” accesses to read the old redundancies, “d” accesses to read the old data, “d” accesses to write the new data, and “r” accesses to write the new redundancies. For example, the commonly used single-redundancy scheme (r=1) requires four accesses for every partial write, if the data to be written fits entirely on one disk (d=1). For larger data storage systems, each additional access reduces throughput and the operating efficiency of the entire system.
Throughput is affected even more in a scheme with two redundancies. For example, the less frequently implemented P+Q (r=2) scheme requires a greater number of accesses (e.g., six accesses per partial write when the data fits on one disk). This is a barrier to the adoption of more fault-resilient schemes with more than one redundancy.
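The access counts quoted for the read-modify-write scheme follow directly from the 2(r+d) expression. The short sketch below itemizes them; the function names are hypothetical.

```python
def small_write_accesses(r, d):
    """Itemize disk accesses for a read-modify-write.

    r -- number of redundancies in the scheme
    d -- number of data disks touched by the small write
    """
    return {
        "read old data": d,
        "read old redundancy": r,
        "write new data": d,
        "write new redundancy": r,
    }


def total_accesses(r, d):
    """Total accesses, equal to 2 * (r + d)."""
    return sum(small_write_accesses(r, d).values())


for name, r in [("single parity", 1), ("P+Q", 2)]:
    print(name, total_accesses(r, d=1))
# prints:
# single parity 4
# P+Q 6
```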
Still another disadvantage is the inherent degradation in reliability of a RAID-5 system as it grows. As the number of disks in the array increases, the mean time to data loss (MTDL) becomes shorter, because of the higher probability that a second disk, or a block of data on a disk, will fail before a failed disk is repaired, regardless of the number of spare disks available.
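The trend can be illustrated with a widely used textbook approximation for a single-parity group, MTDL ≈ MTTF² / (N · (N − 1) · MTTR), where N is the number of disks in the group. This approximation is not given in the text above. It assumes independent disk failures at a constant rate and ignores block-level errors and spare-disk behavior. The MTTF and MTTR figures below are purely illustrative, and the function name is hypothetical.

```python
def mtdl_hours(n_disks, mttf_hours, mttr_hours):
    """Approximate mean time to data loss for one single-parity group.

    Data is lost when a second disk fails during the repair window of the
    first, so the estimate falls roughly with the square of the disk count.
    """
    return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)


MTTF = 500_000  # illustrative per-disk mean time to failure, in hours
MTTR = 24       # illustrative repair (rebuild) time, in hours

small_array = mtdl_hours(5, MTTF, MTTR)
large_array = mtdl_hours(10, MTTF, MTTR)
ratio = small_array / large_array  # how much shorter MTDL is with 10 disks
```

Under these assumptions, doubling the group from five to ten disks shortens the estimated MTDL by a factor of 4.5, because the factor N · (N − 1) grows from 20 to 90.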