Computer systems and networks require increasing quantities of storage capacity. The need for large amounts of storage, together with concerns about reliability, availability of data, and other issues, has led to the development of centralized file servers that store files for a number of users. File servers are usually managed by a single entity for many users, providing at least the obvious economies for IT managers. In file servers, many of the benefits that accrue in terms of reliability and availability arise because many individual hard disks are used in concert. File servers that seek to maximize these advantages through a variety of mechanisms fall under the umbrella term “Redundant Array of Independent Disks” or “RAID.” The operations of individual drives in a RAID array may be coordinated to provide fast, reliable service to many users with a minimum of burden on the systems supported by it, such as networks or individual hosts attached to it.
In a typical RAID configuration, data is divided among the drives at the bit, byte, or multiple-byte block level. This makes it possible to read and write data on multiple drives in parallel, even for individual file requests. Because the speed of the disk drives, not the memory or bus speed, is usually the bottleneck in file server systems, such parallel access can lead to a manifold increase in throughput. In fact, throughput enhancement due to this parallel access capability is a commonly enjoyed benefit of RAID storage systems.
The general technique used to divide data among multiple drives is called data “striping.” The reason for the term “striping” is that a single stream is written in uniformly sized blocks in a regular sequence that can be depicted diagrammatically as painting a stripe across an array of disks. Generally, in these systems, reliability advantages are obtained by adding redundancy to the stored data so that if a disk drive goes bad, the stored data can still be retrieved intact from the data written to other disks. Many types of self-correction systems are known in which even a small amount of additional data can be used to reconstruct a corrupted data sequence. A simple way of doing this is called mirroring, in which the same data is duplicated on more than one disk. A more sophisticated technique (and one that is highly prevalent) is to generate a relatively small quantity of parity data, which can be used to reconstruct a bad data stream. Often mirroring and striping are used in concert so that each “drive” is actually a small mirrored array consisting of more than one drive containing copies of the same data.
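The striping pattern described above can be illustrated with a minimal sketch. It assumes block-level striping in which uniformly sized blocks are dealt to the drives in round-robin order; the function names `stripe` and `unstripe`, and the use of in-memory lists to stand in for drives, are hypothetical and chosen only for illustration.

```python
def stripe(data, num_drives, block_size):
    """Cut data into fixed-size blocks and deal them round-robin to drives.

    Each "drive" is modeled as a list of blocks (an assumption made for
    illustration; a real array would write to physical disks).
    """
    drives = [[] for _ in range(num_drives)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        drives[block_index % num_drives].append(data[offset:offset + block_size])
    return drives


def unstripe(drives):
    """Reassemble the stream by taking one block from each drive in turn."""
    out = []
    depth = max(len(d) for d in drives)
    for row in range(depth):
        for d in drives:
            if row < len(d):
                out.append(d[row])
    return b"".join(out)


drives = stripe(b"ABCDEFGHIJ", num_drives=3, block_size=2)
restored = unstripe(drives)
```

Because consecutive blocks land on different drives, a single large request can be serviced by several drives at once, which is the source of the throughput benefit noted earlier.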
Significant processor capacity is required in RAID systems. The generation of the redundant data used for regenerating good data when a drive goes bad is computationally intensive. Whenever new data is written to a RAID array, parity data must be generated for each block of data. Parity computation uses a logical operation called “exclusive OR” or “XOR.” The “OR” logical operator is “true” (1) if either of its operands is true, and “false” (0) if neither is true. The exclusive OR operator is “true” if and only if exactly one of its operands is true. It differs from “OR” in that if both operands are true, “XOR” is false. One reason this operation is used is that it is relatively easy to design a processor that can perform a large number of XORs very quickly. Another is that XORing data twice with the same value undoes the first XOR operation: XOR A with B, then XOR the result with B, and the result is A. In RAID, parity data is generated from the data to be stored by XORing each data block with the next data block in a succession of, say, four data blocks. If the four blocks are written to four consecutive disks and the result of XORing them together is written to a fifth (redundant) disk, then when one of the four drives goes bad, XORing the three remaining good blocks with the block on the fifth disk regenerates the missing block. The operation works for blocks of any size. When data is read from the array, XOR calculations may or may not be performed, depending on the design of the system.
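The parity scheme described above can be shown as a short sketch, assuming four equal-length data blocks and one parity block held in memory. The helper names `xor_blocks` and `parity` and the sample byte values are hypothetical.

```python
from functools import reduce


def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(blocks):
    """XOR a sequence of blocks together into a single parity block."""
    return reduce(xor_blocks, blocks)


# XORing twice with the same value restores the original.
a, b = b"\x5a\xc3", b"\x0f\xf0"
round_trip = xor_blocks(xor_blocks(a, b), b)

# Four data blocks (one per data disk) and the parity block for the fifth disk.
data_blocks = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa", b"\xff\x00"]
parity_block = parity(data_blocks)

# Suppose the disk holding block 2 goes bad: XOR the three surviving
# data blocks with the parity block to regenerate the missing block.
lost = 2
survivors = [blk for i, blk in enumerate(data_blocks) if i != lost]
rebuilt = parity(survivors + [parity_block])
```

The same `parity` function serves both to generate the redundant block and to rebuild a lost one, which is why the choice of XOR keeps the hardware simple.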
There can be a big difference between read and write performance. Random reads may require only parts of a stripe from one or two disks, and these can be processed in parallel with other random reads that need only parts of stripes on different disks. For random writes, however, every time data in any block of a stripe is changed, the parity for that stripe has to be calculated anew. This requires not only the writes of the particular blocks being changed, but also reads of all the other pieces of the stripe plus the write of the new parity block.
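The write penalty described above can be made concrete by counting disk operations. The following is a minimal sketch, assuming an in-memory stripe of data blocks plus one parity block and the update strategy described in the text (read all other blocks of the stripe, then write the new data and the new parity). The `Stripe` class and its counters are hypothetical.

```python
from functools import reduce


def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class Stripe:
    """Hypothetical in-memory stripe: N data blocks plus one parity block."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.parity = reduce(xor_blocks, self.blocks)
        self.reads = 0   # simulated disk reads
        self.writes = 0  # simulated disk writes

    def read(self, index):
        """A random read touches only the one disk holding the block."""
        self.reads += 1
        return self.blocks[index]

    def write(self, index, new_block):
        """Update one block and recompute parity from the rest of the stripe."""
        others = [self.read(i) for i in range(len(self.blocks)) if i != index]
        self.blocks[index] = new_block
        self.writes += 1  # the data block itself
        self.parity = reduce(xor_blocks, others, new_block)
        self.writes += 1  # the new parity block


s = Stripe([b"\x01\x02", b"\x10\x20", b"\x0a\x0b", b"\xf0\x0f"])
s.write(1, b"\xaa\xaa")
io_for_one_write = (s.reads, s.writes)
```

In this four-block sketch, changing a single block costs three reads and two writes, whereas reading a single block costs one read.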
The generation of all the parity data takes a substantial quantity of processing capacity in both the read and write directions. Also, the reconstruction of data read from a bad disk takes a great deal of computation. When a disk goes bad, the hardware has to be fixed, but this is preferably done without halting access to data, so the system goes into what is called “degraded mode” or “degraded state,” continuing to operate, but without all of its disks. Modern RAID systems are designed so that bad data retrieved from a corrupted array is recognized, corrected, and delivered to users in such a degraded mode. But this degraded mode is normally significantly limited in ways that may or may not be apparent to a given user. Users requiring high throughput rates, in particular, may notice a significant loss in performance. For example, users requesting large files, such as streaming audio or video, may find that degraded mode provides significantly slower performance. The construction of parity data and the processes of error checking and data reconstruction are not the only computational burdens for a RAID array. Modern disk array architectures also perform operations such as request sorting, prioritizing, combining, and redundancy management, among others.
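The extra work imposed by degraded mode can be sketched by comparing how many disks a single block read must touch with and without a failed disk. This is an illustrative model only: it assumes one stripe of four data blocks plus a parity block, and the function `read_block` and its `failed` parameter are hypothetical.

```python
from functools import reduce


def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def read_block(disks, index, failed=None):
    """Return (block, disks_touched) for the block stored on disk `index`.

    `disks` holds the data blocks of one stripe followed by its parity block.
    `failed` is the position of an unavailable disk, or None if all are healthy.
    """
    if index != failed:
        # Healthy path: only the disk holding the block is read.
        return disks[index], 1
    # Degraded path: read every surviving disk and XOR the results.
    survivors = [blk for i, blk in enumerate(disks) if i != failed]
    return reduce(xor_blocks, survivors), len(survivors)


data = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b", b"\xf0\x0f"]
disks = data + [reduce(xor_blocks, data)]

normal = read_block(disks, 2)
degraded = read_block(disks, 2, failed=2)
```

In this model the degraded read returns the same data but occupies four disks instead of one and adds the XOR computation, which suggests why sustained, high-throughput reads are the ones most likely to slow down.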
The computer hardware behind RAID systems is usually fairly specialized, designed to maximize the performance of a very narrow range of processes, such as queuing, error detection, error correction, and high-speed data transfer including caching. Typically, the design requirements are based on performance benchmarks involving many simultaneous accesses to small file units. For example, performance might be indicated by input-output operations per second based on an average 2-kilobyte file size. This type of benchmark places more emphasis on latency and less on throughput and, as a result, RAID systems are typically designed with the same emphasis. However, as mentioned above, streaming data places very different demands on a storage system, and these are particularly difficult to meet when the system is operating in degraded mode. Systems designed to perform well in terms of the traditional benchmarks tend to do rather poorly in such situations. Thus, there is a need for storage system designs that are capable of providing high minimum performance guarantees for throughput under degraded mode operation.