Data storage subsystems continue to provide increasing storage capacities to fulfill user demands from host computer system applications. Due to this critical reliance on large capacity mass storage, demands for enhanced reliability are also high. Various storage device configurations and geometries are commonly applied to meet the demands for higher storage capacity while maintaining or enhancing reliability of the mass storage subsystems.
A popular solution to these mass storage demands for increased capacity and reliability is the use of multiple smaller storage modules configured in geometries that permit redundancy of stored data to assure data integrity in case of various failures. In many such redundant subsystems, recovery from many common failures is automated within the storage subsystem itself due to the use of data redundancy, error codes, and so-called “hot spares” (extra storage modules which may be activated to replace a failed, previously active storage module). These subsystems are typically referred to as redundant arrays of inexpensive (or independent) disks (or more commonly by the acronym RAID).
For example, in the conventional system illustrated in FIG. 1, a RAID controller 10 controls a storage array 12 in a manner that enables such recovery. A host system 14 (e.g., a server or computer) stores data in and retrieves data from storage array 12 via RAID controller 10. That is, a processor 16, operating in accordance with an application program 18, issues requests for writing data to and reading data from storage array 12. Although for purposes of clarity host system 14 and RAID controller 10 are depicted in FIG. 1 as separate elements, it is common for a RAID controller 10 to be physically embodied as a card that plugs into a motherboard or backplane of such a host system 14.
It is known to incorporate data caching in a RAID protected storage system. In the storage system illustrated in FIG. 1, RAID controller 10 includes a RAID processing system 20 that caches data in units of blocks, which can be referred to as read cache blocks (RCBs) and write cache blocks (WCBs). The WCBs comprise data that host system 14 sends to RAID controller 10 as part of requests to store the data in storage array 12. In response to such a write request from host system 14, RAID controller 10 caches or temporarily stores a WCB in one or more cache memory modules 21, then returns an acknowledgement message to host system 14. At some later point in time, RAID controller 10 transfers the cached WCB (typically along with other previously cached WCBs) to storage array 12. The RCBs comprise data that RAID controller 10 has frequently read from storage array 12 in response to read requests from host system 14. Caching frequently requested data is more efficient than reading the same data from storage array 12 each time host system 14 requests it, since cache memory modules 21 are of a type of memory, such as flash or Double Data Rate (DDR) memory, that can be accessed much faster than the type of memory (e.g., disk drive) that storage array 12 comprises.
Various RAID schemes are known. The various RAID schemes are commonly referred to by a “level” number, such as “RAID-0,” “RAID-1,” “RAID-2,” etc. As illustrated in FIG. 1, storage array 12 in a conventional RAID-5 system can include, for example, four storage devices 24, 26, 28 and 30 (e.g., arrays of disk drives). In accordance with the RAID-5 scheme, data blocks, which can be either RCBs or WCBs, are distributed across storage devices 24, 26, 28 and 30. Distributing logically sequential data blocks across multiple storage devices is known as striping. Parity information for the data blocks distributed among storage devices 24, 26, 28 and 30 in the form of a stripe is stored along with that data as part of the same stripe. For example, RAID controller 10 can distribute or stripe logically sequential data blocks A, B and C across corresponding storage areas in storage devices 24, 26 and 28, respectively, and then compute parity information for data blocks A, B and C and store the resulting parity information P_ABC in another corresponding storage area in storage device 30.
A processor 32 in RAID processing system 20 is responsible for computing the parity information. Processing system 20 includes some amount of fast local memory 34, such as double data rate synchronous dynamic random access memory (DDR SDRAM) that processor 32 utilizes when performing the parity computation. To compute the parity in the foregoing example, processor 32 reads data blocks A, B and C from storage devices 24, 26 and 28, respectively, into local memory 34 and then performs an exclusive disjunction operation, commonly referred to as an Exclusive-Or (XOR), on data blocks A, B and C in local memory 34. Processor 32 then stores the computed parity P_ABC in data storage device 30 in the same stripe in which data blocks A, B and C are stored in data storage devices 24, 26 and 28, respectively. The above-described movement of cached data and computed parity information is indicated in a general manner in broken line in FIG. 1.
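By way of illustration only, the XOR parity computation described above can be sketched in Python as follows. The function name `xor_blocks` and the four-byte block contents are hypothetical and chosen for this sketch; an actual RAID controller would operate on much larger blocks, typically with hardware assistance.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Return the bytewise XOR of equally sized blocks (hypothetical helper)."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple per byte position across all blocks.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Illustrative contents for data blocks A, B and C (assumed values).
block_a = bytes([0x0F, 0xF0, 0xAA, 0x55])
block_b = bytes([0x33, 0xCC, 0x00, 0xFF])
block_c = bytes([0x5A, 0xA5, 0x11, 0x22])

# Parity block P_ABC for the stripe.
p_abc = xor_blocks(block_a, block_b, block_c)

# Because XOR is its own inverse, a lost block can be rebuilt from the
# surviving data blocks and the parity block.
recovered_b = xor_blocks(block_a, block_c, p_abc)
```

In this sketch, XORing the surviving blocks A and C with P_ABC reproduces block B, which is the property that allows a RAID-5 array to tolerate the failure of a single storage device.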
The RAID-5 scheme employs parity rotation, which means that RAID controller 10 does not store the parity information for every stripe on the same one of storage devices 24, 26, 28 and 30. Instead, the storage device that holds the parity information changes from stripe to stripe. For example, as shown in FIG. 1, parity information P_DEF for data blocks D, E and F is stored on storage device 28, while data blocks D, E and F are stored in the same stripe as parity information P_DEF but on storage devices 24, 26 and 30, respectively. Similarly, parity information P_GHJ for data blocks G, H and J is stored on storage device 26, while data blocks G, H and J are stored in the same stripe as parity information P_GHJ but on storage devices 24, 28 and 30, respectively. Likewise, parity information P_KLM for data blocks K, L and M is stored on storage device 24, while data blocks K, L and M are stored in the same stripe as parity information P_KLM but on storage devices 26, 28 and 30, respectively.
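By way of illustration only, the parity rotation described above can be sketched in Python as follows. The function name `stripe_layout` and the rotation formula are assumptions chosen so that the sketch reproduces the four stripes described with reference to FIG. 1; other RAID-5 implementations may rotate parity in a different order.

```python
def stripe_layout(stripe_index: int, data_blocks: list[str],
                  num_devices: int = 4) -> list[str]:
    """Return the block held by each storage device for one stripe.

    Index 0..3 of the result corresponds to storage devices 24, 26, 28
    and 30. The rotation rule below is an assumption for this sketch.
    """
    if len(data_blocks) != num_devices - 1:
        raise ValueError("a stripe holds num_devices - 1 data blocks")
    # Parity starts on the last device and moves one device to the left
    # with each successive stripe, wrapping around.
    parity_slot = (num_devices - 1 - stripe_index) % num_devices
    remaining = iter(data_blocks)
    layout = []
    for slot in range(num_devices):
        if slot == parity_slot:
            layout.append("P_" + "".join(data_blocks))
        else:
            layout.append(next(remaining))
    return layout


stripes = [list("ABC"), list("DEF"), list("GHJ"), list("KLM")]
layouts = [stripe_layout(i, blocks) for i, blocks in enumerate(stripes)]
for row in layouts:
    print(row)
```

Each printed row lists the contents of storage devices 24, 26, 28 and 30 for one stripe, with the parity block moving from storage device 30 to storage device 24 over the four stripes.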
The described parity calculation and storage of the parity block require time and resources to complete. A cache enabled storage controller provides maximum throughput from the host to the storage controller when a write-back cache policy is implemented. When such a write-back methodology is used, a host computer write operation is processed by temporarily storing the data associated with the write request to the cache. Once the information is saved in the cache, the storage controller reports to the host computer that the write operation is complete. Consequently, from the perspective of the host computer, the write operation is complete. Subsequent requests for information located in the cache are serviced by reading the information from the cache and forwarding it to the host computer.
Thereafter, the storage controller will locate, arrange and flush the information from the cache to the data storage devices supporting the RAID protected storage volume. The storage controller may perform these operations in a manner that minimizes overhead and hard disk drive write head movement.
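By way of illustration only, the write-back behavior described above can be sketched in Python as follows. The class name `WriteBackController`, its methods, and the use of ascending block addresses to approximate reduced head movement are all assumptions of this sketch and do not describe any particular controller.

```python
class WriteBackController:
    """Minimal write-back cache model (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.cache: dict[int, bytes] = {}   # block address -> cached data
        self.dirty: set[int] = set()        # cached blocks not yet on disk
        self.disk: dict[int, bytes] = {}    # stands in for the storage array
        self.disk_write_order: list[int] = []

    def write(self, address: int, data: bytes) -> str:
        # The write is reported complete as soon as the data is cached.
        self.cache[address] = data
        self.dirty.add(address)
        return "complete"

    def read(self, address: int):
        # Cached data is returned without touching the storage array.
        if address in self.cache:
            return self.cache[address]
        return self.disk.get(address)

    def flush(self) -> None:
        # Arrange dirty blocks by address before writing them out
        # (an assumed stand-in for minimizing head movement).
        for address in sorted(self.dirty):
            self.disk[address] = self.cache[address]
            self.disk_write_order.append(address)
        self.dirty.clear()


controller = WriteBackController()
statuses = [controller.write(addr, bytes([addr])) for addr in (7, 2, 5)]
disk_before_flush = dict(controller.disk)
read_before_flush = controller.read(7)
controller.flush()
```

In this sketch the three writes are acknowledged while the modeled storage array is still empty, and the later flush writes the blocks in address order rather than arrival order.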
There are multiple “levels” or types of standard geometries generally recognized for storage systems that use RAID. In RAID level 0, data blocks are stored in order across one or more storage devices without redundancy. That is, none of the data blocks are copies of another data block and there is no parity block to recover from a disk failure. In a RAID level 1 system, one or more disks are used for storing data and an equal number of additional “mirror” disks are used for storing copies of the information written to the data disks. Other RAID levels, identified as RAID levels 2, 3 and 4, segment the data into bits, bytes, or blocks for storage across several data disks. One or more additional disks are utilized to store error correction or parity information. A single unit of storage is spread across the several disk drives and is commonly referred to as a “stripe.” The stripe consists of the related data written in each of the disk drives containing data plus the parity (error recovery) information written to the parity disk drive. In RAID level 5, as described, the data is segmented into blocks for storage across several disks with a single parity block for each stripe distributed in a pre-determined configuration across each of the several disks. In RAID level 6, dual parity blocks are calculated for a stripe and are distributed across each of the several disks in the array in a pre-determined configuration. In RAID level 10 or 1+0, data blocks are mirrored and striped. In RAID level 01 or 0+1, data blocks are striped and the stripes are mirrored.
RAID storage subsystems typically utilize a control module that shields the user or host system from the details of managing the redundant array. The controller or control module makes the subsystem appear to the host computer as a single, highly reliable, high capacity disk drive. In fact, the RAID controller may distribute the host computer system supplied data across a plurality of the small independent drives with redundancy and error checking information so as to improve subsystem reliability. Frequently RAID subsystems provide large cache memory structures to further improve the performance of the RAID subsystem. The cache memory is associated with the control module such that the storage blocks on the disk array are mapped to blocks in the cache. This mapping is also transparent to the host system. The host system simply requests blocks of data to be read or written and the RAID controller manipulates the disk array and cache memory as required.
In RAID level 5 subsystems (as well as other RAID levels) there is a penalty in performance paid when less than an entire stripe is written to the storage array. If a portion of a stripe is written to the RAID subsystem, portions of the same stripe may need to be read so that a new parity block may be computed and re-written to the parity disk of the array. In particular, the old data stored in the portion of the stripe which is to be overwritten as well as the old parity block associated therewith needs to be read from the storage subsystem so that the new parity block values may be determined therefrom. This process is often referred to as a read-modify-write cycle due to the need to read old data from the stripe, modify the intended data blocks and associated parity data, and write the new data blocks and new parity block back to the storage array. This performance penalty is avoided if the entire stripe is written. When an entire stripe is written (often referred to as a stripe write or full-stripe write), the old data and old parity stored in the stripe to be overwritten are ignored. The new stripe data is written and a new parity block determined therefrom is written without need to reference the old data or old parity. A stripe write therefore avoids the performance penalty of read-modify-write cycles.
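By way of illustration only, the difference between a read-modify-write cycle and a full-stripe write can be sketched in Python as follows. The function names and the two-byte block contents are hypothetical. The sketch relies on the XOR identity that the new parity equals the old parity XOR the old data XOR the new data.

```python
from functools import reduce


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def full_stripe_parity(data_blocks: list[bytes]) -> bytes:
    # Full-stripe write: parity is computed from the new data alone,
    # so nothing needs to be read back from the storage array.
    return reduce(xor, data_blocks)


def rmw_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    # Read-modify-write: the old data and the old parity must first be
    # read so that the old data's contribution can be cancelled out.
    return xor(xor(old_parity, old_data), new_data)


# Assumed contents of a three-block stripe and its parity.
stripe = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
old_parity = full_stripe_parity(stripe)

# Overwrite only the second block of the stripe.
new_block = b"\xff\x00"
parity_via_rmw = rmw_parity(stripe[1], old_parity, new_block)
parity_via_full = full_stripe_parity([stripe[0], new_block, stripe[2]])
```

Both paths yield the same parity block. For a single-block update, however, the read-modify-write path requires two reads (old data and old parity) before its two writes, whereas the full-stripe path requires no reads.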
U.S. Pat. No. 6,760,807 to Brant et al. discloses a data storage system and method that applies an adaptive write policy for handling host write commands to write-back system drives in a dual active controller environment. The data storage system includes a host computer, a primary controller and an alternate controller. The primary and alternate controllers are coupled to one or more disk storage devices. When a write command is communicated from the host, the primary controller determines if the data encompasses an entire RAID stripe, and if so, parity data is calculated for the stripe and the data and parity data are written to the disk storage devices. Otherwise, the write data is stored in a cache and processed in accordance with a write-back policy.
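By way of illustration only, the decision rule summarized above can be sketched in Python as follows. The function name `route_write`, its parameters, and the returned labels are hypothetical, and the alignment test is an assumed interpretation of a write that encompasses an entire stripe. The sketch is not the implementation disclosed in the referenced patent.

```python
def route_write(start_block: int, block_count: int,
                blocks_per_stripe: int) -> str:
    """Decide how a host write is handled (hypothetical sketch).

    A write is treated as covering an entire stripe only if it starts
    on a stripe boundary and spans exactly one stripe of data blocks.
    """
    if blocks_per_stripe <= 0:
        raise ValueError("blocks_per_stripe must be positive")
    covers_full_stripe = (
        start_block % blocks_per_stripe == 0
        and block_count == blocks_per_stripe
    )
    if covers_full_stripe:
        # Parity is computed from the new data and written with it.
        return "write_full_stripe_to_disk"
    # Partial-stripe data is cached and handled under a write-back policy.
    return "cache_write_back"


# With three data blocks per stripe, as in the four-device example:
aligned_full = route_write(start_block=3, block_count=3, blocks_per_stripe=3)
unaligned = route_write(start_block=1, block_count=3, blocks_per_stripe=3)
partial = route_write(start_block=0, block_count=1, blocks_per_stripe=3)
```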
U.S. Pat. No. 6,629,211 to McKnight et al. discloses a system and method for improving RAID controller performance through adaptive write back or write through caching. The system includes a host computer system and a RAID subsystem. The RAID subsystem includes a cache supported controller and a plurality of disk drives. The method uses the cache in write back mode when the RAID controller is lightly loaded and uses the cache in write through mode when the RAID controller is heavily loaded. In the write back mode, the data is written to the cache prior to storing the data to at least one disk drive of the plurality of disk drives. In the write through mode, the data is written directly to the one or more disk drives without going through the cache buffer.
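By way of illustration only, a load-based selection between write back and write through modes can be sketched in Python as follows. The function name `choose_write_mode`, the use of outstanding requests as the load measure, and the threshold value are all assumptions of this sketch and are not taken from the referenced patent.

```python
def choose_write_mode(outstanding_requests: int,
                      busy_threshold: int = 32) -> str:
    """Select a cache mode from a simple load measure (hypothetical).

    The controller is treated as heavily loaded once the number of
    outstanding requests reaches the assumed threshold.
    """
    if outstanding_requests < 0:
        raise ValueError("outstanding_requests cannot be negative")
    if outstanding_requests >= busy_threshold:
        # Heavy load: bypass the cache and write directly to the drives.
        return "write_through"
    # Light load: cache the data first and acknowledge the host early.
    return "write_back"


light_load_mode = choose_write_mode(outstanding_requests=4)
heavy_load_mode = choose_write_mode(outstanding_requests=64)
```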
U.S. Pat. No. 6,922,754 to Liu et al. discloses a data flow manager and a method for determining what data should be cached and what data should be sent directly to a data store. The decision to cache data or to send the data directly to the data store is determined based on the type of data requested, the state of the cache, the state of I/O components, or system policies. In one aspect, the data flow manager tries to predict data access patterns. In another aspect, the data flow manager attempts to group writes together. In still another aspect, the data flow manager receives an input responsive to the content contained in a data access. In this aspect, the data flow manager is a content aware data flow manager.
Conventional data storage controllers configured to operate in a write back or data caching mode send the data associated with write operations to a relatively fast memory (e.g., a dynamic random access memory or DRAM). When configured to operate in a write through mode, the data storage controllers forward all write operations to the backend or long term storage devices. Each mode has performance advantages under some workloads and performance disadvantages under others. In general, the write back mode can provide relatively short latency when the storage controller is lightly loaded. However, as the workload increases, so does the overhead associated with managing cached data. Thus, when the storage controller is heavily loaded, it is desirable to avoid the additional overhead that results from caching write data. Consequently, a write through mode is more appropriate for heavy workloads.
However, in a multi-workload environment a single cache policy or switching policy for a storage volume may not provide a desired performance for each workload.