Computer systems and networks require increasing quantities of storage capacity. The need for large amounts of storage together with concerns about reliability, availability of data, and other issues have led to the development of centralized file servers that store files for a number of users. File servers are usually managed by a single entity for many users, providing at least the obvious economies for IT managers. In file servers, many of the benefits that accrue in terms of reliability and availability arise because many individual hard disks are used in concert. File servers that seek to maximize these advantages through a variety of mechanisms fall under the umbrella term “Redundant Array of Independent Disks” or “RAID.” The operations of individual drives in a RAID array may be coordinated to provide fast reliable service to many users with a minimum of burden on the systems supported by it, such as networks or individual hosts attached to it.
In a typical RAID configuration, data is divided among the drives at the bit, byte or multiple-byte block level. This leads to the possibility of reading and writing data to multiple drives in parallel, even for individual file requests. Because the speed of disk drives, not the memory or bus speeds, is usually the bottleneck in file server systems, such parallel access can lead to a manifold increase in throughput. In fact, throughput enhancement due to this parallel access capability is a commonly-enjoyed benefit of RAID storage systems.
The general technique used to divide data among multiple drives is called data “striping.” The reason for the term “striping” is that a single stream is written in uniformly-sized blocks in a regular sequence that can be depicted diagrammatically as painting a stripe across an array of disks. Generally, in these systems, reliability advantages are obtained by adding redundancy to the stored data so that if a disk drive goes bad, the stored data can still be retrieved intact from the data written to other disks. Many types of self-correction systems are known in which even a small amount of additional data can be used to reconstruct a corrupted data sequence. A simple way of doing this is called mirroring, where the same data is duplicated on more than one disk. A more sophisticated technique (and one that is highly prevalent) is to generate a relatively small quantity of parity data, which can be used to reconstruct a bad data stream. Often mirroring and striping are used in concert so that each “drive” is actually a small mirrored array consisting of more than one drive containing copies of the same data.
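By way of illustration only, block-level striping can be sketched as follows. The sketch assumes a simple round-robin layout with no parity or mirroring; the names `locate_block` and `stripe_layout` are hypothetical and are not drawn from any particular RAID implementation.

```python
from typing import List, Tuple


def locate_block(logical_block: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk).

    Assumption: blocks are dealt out round-robin across the disks.
    """
    if num_disks <= 0:
        raise ValueError("num_disks must be positive")
    if logical_block < 0:
        raise ValueError("logical_block must be non-negative")
    disk = logical_block % num_disks      # which disk holds the block
    offset = logical_block // num_disks   # which stripe (row) it falls in
    return disk, offset


def stripe_layout(num_blocks: int, num_disks: int) -> List[List[int]]:
    """List the logical blocks that land on each disk, in on-disk order."""
    layout: List[List[int]] = [[] for _ in range(num_disks)]
    for block in range(num_blocks):
        disk, _ = locate_block(block, num_disks)
        layout[disk].append(block)
    return layout
```

Under this assumed layout, consecutive logical blocks fall on different disks, which is what allows a single sequential request to be serviced by several drives in parallel.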
Significant processor capacity is required in RAID systems. The generation of the redundant data used for regenerating good data when a drive goes bad is computationally intensive. Whenever new data is written to a RAID array, parity data must be generated for each block of data. Parity computation uses a logical operation called “exclusive OR” or “XOR”. The “OR” logical operator is “true” (1) if either of its operands is true, and false (0) if neither is true. The exclusive OR operator is “true” if and only if exactly one of its operands is true. It differs from “OR” in that if both operands are true, “XOR” is false. One reason this operation is used is that it is relatively easy to design a processor that can perform many XORs very quickly. Another is that XORing data twice with the same operand undoes the first XOR operation: XOR A with B, then XOR the result with B, and the result is A. In RAID, parity data is generated from the data to be stored by XORing each data block with the next data block in a succession of, say, four data blocks. Then, if the four blocks are written to four consecutive disks and the result of the XOR operations is written to a fifth (redundant) disk, when one of the four drives goes bad, XORing the remaining good blocks of the four with the fifth will regenerate the bad block. The operation works for blocks of any size. When data is read from the array, XOR calculations may or may not be performed depending on the design of the system.
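The parity scheme just described can be sketched in a few lines. This is a minimal illustration, assuming four equal-sized data blocks and a single parity block; the function names and the sample block contents are hypothetical.

```python
from functools import reduce
from typing import List


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(blocks: List[bytes]) -> bytes:
    """XOR every data block together to form the parity block."""
    return reduce(xor_blocks, blocks)


def rebuild_missing(surviving: List[bytes], parity: bytes) -> bytes:
    """Regenerate the one lost block from the surviving blocks and the parity."""
    return reduce(xor_blocks, surviving, parity)


# Four data blocks, as in the example above, plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = parity_of(data)

# Pretend the third disk has failed and rebuild its block from the rest.
lost_index = 2
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
rebuilt = rebuild_missing(survivors, parity)
```

Because XOR is its own inverse, the same routine that generates parity also regenerates a lost block; only the set of inputs differs.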
There can be a big difference between read and write performance. Random reads may only require parts of a stripe from one or two disks, and these can be processed in parallel with other random reads that only need parts of stripes on different disks. But for random writes, every time the data in any block of a stripe is changed, the parity for that stripe has to be calculated anew. This requires not only the writes for the particular blocks being changed, but also reads of all the other pieces of the stripe plus the writing of the new parity block.
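The cost of a small random write can be made concrete with a short sketch. It assumes the approach described above, in which the parity is recomputed from all of the other blocks in the stripe; the function name `write_block` and the I/O counters are hypothetical and serve only to tally the disk operations involved.

```python
from functools import reduce
from typing import Dict, List, Tuple


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def write_block(
    stripe: List[bytes], index: int, new_block: bytes
) -> Tuple[List[bytes], bytes, Dict[str, int]]:
    """Replace one data block and recompute the stripe's parity from scratch.

    Returns the updated data blocks, the new parity block, and a tally of
    the disk reads and writes the update would require.
    """
    if not 0 <= index < len(stripe):
        raise IndexError("index outside the stripe")
    # Every other data block in the stripe must be read back from disk.
    others = [blk for i, blk in enumerate(stripe) if i != index]
    new_parity = reduce(xor_blocks, others, new_block)
    new_stripe = stripe[:index] + [new_block] + stripe[index + 1:]
    # One write for the changed data block, one for the new parity block.
    io = {"reads": len(others), "writes": 2}
    return new_stripe, new_parity, io
```

For a stripe of four data blocks, changing a single block in this way costs three reads and two writes, whereas reading that same block from a healthy array costs one read.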
The generation of all the parity data takes a substantial quantity of processing capacity in both the read and write directions. Also, the reconstruction of data read from a bad disk takes a great deal of computation. When a disk goes bad, the hardware has to be fixed, but this is preferably done without halting access to data, so the system goes into what is called “degraded mode” or “degraded state,” continuing to operate, but without all of its disks. Modern RAID systems are designed so that bad data retrieved from a corrupted array is recognized, corrected, and delivered to users in such a degraded mode. But this degraded mode is normally significantly limited in some ways that may or may not be apparent to a given user. Users requiring high throughput rates, particularly, may notice a significant loss in performance. For example, users requesting large files, such as streaming audio or video, may find that degraded mode provides significantly slower performance. The construction of parity data and the processes of error checking and data reconstruction are not the only computational burdens for a RAID array. Modern disk array architectures also perform operations such as request sorting, prioritizing, combining and redundancy management, among others.
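The extra work imposed by degraded mode can be illustrated with a minimal sketch of a single-block read. It assumes one stripe held as a list with one block per disk, the last entry being the parity block, and `None` marking a failed disk; the name `read_block` and the read counter are hypothetical.

```python
from functools import reduce
from typing import List, Optional, Tuple


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def read_block(disks: List[Optional[bytes]], index: int) -> Tuple[bytes, int]:
    """Return (block contents, number of disk reads needed to obtain them)."""
    block = disks[index]
    if block is not None:
        return block, 1  # healthy disk: a single read suffices
    # Degraded mode: every surviving block, parity included, must be read.
    survivors = [blk for i, blk in enumerate(disks) if i != index]
    if any(blk is None for blk in survivors):
        raise ValueError("single parity cannot recover from two failed disks")
    return reduce(xor_blocks, survivors), len(survivors)
```

In this sketch, a read that lands on the failed disk of a five-disk stripe turns into four reads plus the XOR computation, which is one way to see why sustained streaming throughput suffers in degraded mode.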
The computer hardware behind RAID systems is usually fairly specialized, designed to maximize the performance of a very narrow range of processes, such as queuing, error detection, error correction, and high speed data transfer including caching. Typically these design requirements are based on performance benchmarks involving many simultaneous accesses to small file units. For example, performance might be indicated by input-output operations per second based on an average 2-kilobyte file size. This type of benchmark places more emphasis on latency and less on throughput and, as a result, RAID systems are typically designed with the same emphasis. However, as mentioned above, streaming data places very different demands on a storage system and these are particularly difficult to meet when the system is operating in degraded mode. Systems designed to perform well in terms of the traditional benchmarks tend to do rather poorly in such situations. Thus, there is a need for storage system designs that are capable of providing high minimum performance guarantees for throughput under degraded mode operation.
A storage processor for a block storage RAID array services disk storage block requests from one or more hosts. At its heart, an application-specific integrated circuit (ASIC) supports a store and forward data transfer regime in that host to disk transfers are made by placing data in storage processor memory under control of the storage processor, where it is operated on by the ASIC and sent to the disk array. Disk to host transfers are made by placing data in the same data store, where it is checked or regenerated by the ASIC and sent to the requesting host. The main data highway in this model is a host-memory-disk path and memory bandwidth is therefore critical. A single memory space is addressed, in the preferred embodiment, by multiple buses under software management to even the load and provide bandwidth approaching a multiple of that of a single bus.
The problem of achieving high throughput, even under degraded mode conditions, is addressed by providing parallel execution of certain operations that are identified, in the context of the chosen architecture, to be critical to a minimum throughput guarantee. To clear a path for parallel execution, coherence issues that would normally arise with caching are avoided by relying on a cacheless configuration. Point-to-point communications are defined using switches and a number of FIFOs in key ways that allow the sharing of channels between the various devices transferring addressing and user data. By avoiding the use of caches altogether, data traffic to support coherency (e.g. broadcasting of invalidates and such) is eliminated. In addition, the control processor or processors need never operate on the data transferred between the host and disks, allowing parity calculation and data transfers to be handled with a minimum burden on the control processor.
The control processor is further unburdened by providing parallel execution of data transfers and parity calculations, with prioritization and programming being prompted via interrupts. Efficient handling of ordering is, preferably, provided by hardware logic-based masking of interrupts and by other mechanisms described further below.
The invention or inventions will be described in connection with certain preferred embodiments, with reference to the following illustrative figures so that it may be more fully understood. With reference to the figures, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention or inventions only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention or inventions. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention or inventions, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention or inventions may be embodied in practice.