1. Field of the Invention
This invention relates to the field of data storage devices, and more particularly relates to a method and system for improving the performance of storage systems via the cloning of units of storage.
2. Description of the Related Art
Information drives business. A disaster affecting a data center can cause days or even weeks of unplanned downtime and data loss that could threaten an organization's productivity. For businesses that increasingly depend on data and information for their day-to-day operations, this unplanned downtime can also hurt their reputations and bottom lines. Businesses are becoming increasingly aware of these costs and are taking measures to plan for and recover from disasters. This is particularly true of the data storage systems that maintain such businesses' information.
With this growing focus on data storage, it has become desirable to implement commercial-grade storage performance and reliability akin to those of mainframe disk subsystems in a more cost-effective manner. In response to this need, techniques have been developed that abstract multiple disks into single storage objects, using commodity disks (such as SCSI and IDE drives) and system buses (such as ISA, EISA, PCI, and SBus). A data storage system employing this abstraction of multiple disks into a single storage object is referred to generically as a Redundant Array of Independent Disks (RAID), or RAID array.
A number of RAID types (referred to as RAID levels) have been defined, each offering a unique set of performance and data-protection characteristics. (RAID levels are often abbreviated as RAID-x; for example, RAID level 5 may be abbreviated as RAID-5.) In addition to RAID levels 2 through 6, which use parity calculations to provide redundancy, two other disk configurations were retroactively labeled as RAID: striping, or interleaving data across disks with no added redundancy, was identified as RAID level 0, and mirroring, or maintaining full redundant data copies, was identified as RAID level 1.
The important characteristics of each major RAID level are now presented. While RAID-0 offers no increased reliability, it can supply performance acceleration at no increased storage cost by sharing I/O accesses among multiple disks. By contrast, RAID-1 provides the highest performance for redundant storage, because read-modify-write cycles are not required when updating data (as required in RAID-5 storage systems). Moreover, multiple copies of data may be used in order to accelerate read-intensive applications. However, RAID-1 requires at least double the disk capacity (and therefore, at least double the disk expenditures) of a non-RAID-1 solution. RAID-1 is most advantageous in high-performance and write-intensive applications. Also, since more than two copies may be used, RAID-1 arrays can be constructed to withstand loss of multiple disks without suffering an interruption in service.
Use of mirroring (or RAID level 1) increases data availability and read I/O performance, at the cost of sufficient storage capacity for fully redundant copies. RAID levels 2 through 5 address data redundancy by storing a calculated value (commonly called parity, or parity information), which can be used to reconstruct data after a drive failure or system failure, and to continue servicing input/output (I/O) requests for the failed drive.
In order to increase reliability while preserving the performance benefits of striping, it is possible to configure objects which are both striped and mirrored. While not explicitly numbered as standard RAID configurations, such a combination is sometimes called RAID-1+0, RAID-0+1, or RAID-10. This configuration is achieved by striping several disks together, then mirroring the striped sets to each other, producing mirrored stripes. When striped objects are mirrored together, each striped object is viewed as if it were a single disk. If a disk becomes unavailable due to error, that entire striped object is disabled. A subsequent failure on the surviving copy would make all data unavailable. It is, however, extremely rare that this would occur before the disk could be serviced. In addition, use of hot spares makes this even less likely.
Among the parity RAID configurations, RAID-2 uses a complex Hamming code calculation for parity, and is not typically found in commercial implementations. RAID levels 3, 4 and 5 are, by contrast, often implemented. Each uses an exclusive-or (XOR) calculation to check and correct missing data. RAID-3 distributes bytes across multiple disks. RAID-4 and RAID-5 arrays compute parity on an application-specific block size, called an interleave or stripe unit, which is a fixed-size data region that is accessed contiguously. All stripe units at the same depth on each drive (called the altitude) are used to compute parity. This allows applications to be optimized to overlap read access by reading data off a single drive while other users access a different drive in the RAID. These types of parity striping require write operations to be combined with read and write operations for disks other than the ones actually being written, in order to update parity correctly. RAID-4 stores parity on a single disk in the array, while RAID-5 removes a possible bottleneck on the parity drive by rotating parity across all drives in the set.
RAID 5 protects the data in an array of n disks using parity capacity equal to that of a single disk, namely the size of the smallest disk in the array. RAID 5 usable capacity therefore equals s*(n−1), where s is the capacity of the smallest disk in the array and n is the total number of disks in the array. Not only does a RAID 5 array offer a very efficient way to protect data, such an array also has read performance similar to a RAID 0 array, although write performance suffers in comparison to a single disk (due to the read/modify/write cycle for writes, discussed subsequently). Because of its combination of data protection and performance, RAID 5 is popular for general-purpose servers such as file and Web servers.
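By way of illustration only, the capacity relationship above can be sketched as follows. The function name and the three-disk minimum are assumptions made for this example and are not taken from the foregoing description.

```python
def raid5_usable_capacity(disk_sizes):
    """Return s * (n - 1), where s is the smallest disk and n the disk count.

    Assumption for this sketch: an array needs at least three disks.
    """
    n = len(disk_sizes)
    if n < 3:
        raise ValueError("this sketch assumes at least three disks")
    s = min(disk_sizes)
    return s * (n - 1)


# Four equal 500 GB disks: one disk's worth of space holds parity.
print(raid5_usable_capacity([500, 500, 500, 500]))  # 1500

# Mixed sizes: every disk contributes only as much as the smallest one.
print(raid5_usable_capacity([400, 500, 500]))  # 800
```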
The parity information generated is simply the result of an XOR operation on all the data elements in the stripe. Because XOR is an associative and commutative operation, to find the XOR result of multiple operands, one starts by simply performing the XOR operation of any two operands. Subsequently, one performs an XOR operation on the result with the next operand, and so on with all of the operands, until the final result is reached. Additionally, parity rotation is implemented to improve performance, as discussed subsequently.
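The pairwise accumulation described above may be sketched as follows. This is a minimal illustration that assumes each strip is an equal-length byte string; the names xor_blocks and compute_parity are hypothetical.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(strips):
    """Fold XOR over the strips: ((s0 ^ s1) ^ s2) ^ ... in the order given."""
    return reduce(xor_blocks, strips)


strips = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = compute_parity(strips)
print(parity.hex())  # 6969

# Because XOR is associative and commutative, operand order does not matter.
print(compute_parity(strips[::-1]) == parity)  # True
```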
A RAID 5 volume can thus tolerate the loss of any one disk without data loss. The missing data for any stripe is easily determined by performing an XOR operation on all of the remaining data elements for that stripe. If the host requests a RAID controller to retrieve data from a disk array that is in a degraded state, the RAID controller first reads all of the other data elements on the stripe, including the parity data element. It then performs all of the XOR calculations before returning the data that would have resided on the failed disk. All of this happens without the host being aware of the failed disk, and array access continues.
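A degraded-mode read of this kind may be sketched as follows, again assuming equal-length byte strings for the strips. The helper names are hypothetical, and the sketch models only the XOR reconstruction, not the controller's I/O path.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def reconstruct_missing(surviving_elements):
    """XOR every surviving element of the stripe, parity included."""
    return reduce(xor_blocks, surviving_elements)


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = reduce(xor_blocks, data)

# Suppose the disk holding data[1] has failed: read the rest and XOR them.
recovered = reconstruct_missing([data[0], data[2], parity])
print(recovered == data[1])  # True
```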
The RAID 5 write operation is responsible for generating the requisite parity data, an operation which is typically referred to as a read/modify/write operation (alternatively, a read/modify/log/write operation, if logging is implemented). This process, as will be appreciated, is time-consuming, in comparison to simply writing data to disk. This is substantial overhead, even in the case where the entire stripe is being written (a full stripe write).
FIG. 1A is a flow diagram illustrating a process of a full stripe write according to methods of the prior art. As will be appreciated, a typical write operation, for example in a RAID-5 volume, can be very expensive, both in terms of the resources required and the computational loads placed on the storage system and computing facilities involved. If the write is a full stripe write (as in the example here), the write is usually performed in three phases. In the first phase, parity information is computed (step 100). This is also referred to as the modify phase. In the next phase, the data and parity are logged (step 110). A RAID-5 volume may also have a log device that logs data, as well as parity information, during a write operation. Such a log allows for fast parity resynchronization and avoids data loss during one-and-a-half failures. This phase is also referred to as the logging phase. In the last phase, data and parity information are written to the storage system (step 120). This phase is also referred to as the write phase.
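The three phases of FIG. 1A may be sketched as follows. The in-memory dictionary standing in for a stripe and the list standing in for the log device are assumptions made purely for illustration. Note that no read of old data is needed, because every strip in the stripe is being replaced.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_stripe_write(stripe: dict, log: list, new_strips: list) -> None:
    # Modify phase (step 100): compute parity from the new data alone.
    parity = reduce(xor_blocks, new_strips)
    # Logging phase (step 110): record data and parity before writing.
    log.append({"data": list(new_strips), "parity": parity})
    # Write phase (step 120): write data and parity to the stripe.
    stripe["data"] = list(new_strips)
    stripe["parity"] = parity


stripe = {"data": [b"\x00", b"\x00", b"\x00"], "parity": b"\x00"}
log = []
full_stripe_write(stripe, log, [b"\x01", b"\x02", b"\x04"])
print(stripe["parity"].hex())  # 07
```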
The overhead associated with this kind of operation is even more onerous in the case of a partial stripe write, in terms of overhead per unit of data written. Consider a stripe composed of a number of strips (i.e., stripe units) of data and one strip of parity information, as is the normal case. Suppose the host wants to change just a small amount of data that takes up the space on only one strip within the stripe. The RAID controller cannot simply write that small portion of data and consider the request complete. It must also update the parity data. One must remember that the parity data is calculated by performing XOR operations on every strip within the stripe. So when one or more strips change, parity needs to be recalculated using all strips, regardless of the amount of data actually changed. This mandates reading the information from the other (unchanging) strips, even though the data read, once used, will simply be discarded.
FIG. 1B is a flow diagram illustrating a partial stripe write according to methods of the prior art. A partial stripe write typically involves four phases. In the first phase, old data and parity information are read from the storage system (step 130). This phase is also referred to as the read phase. Next, parity information is computed (step 140). This phase is, as before, referred to as the modify phase. Next, also as before, data logging is performed (step 150); this phase is again referred to as the logging phase. Finally, data and parity information are written to the storage system (step 160), which is referred to as the write phase, as before.
In greater detail, the read/modify/write operation can be broken down into the following actions:
1. Read new data from application.
2. Read old data from target disk for new data.
3. Read old parity from target stripe for new data.
4. Calculate new parity with an XOR calculation on the data from steps 1, 2, and 3.
5. Indicate potential lack of coherency. (Since it is not possible to guarantee that the new target data and new parity can be written to separate disks atomically, the RAID subsystem must identify that the stripe being processed is inconsistent.)
6. Write new data to target location.
7. Write new parity.
8. Indicate coherency restored. (The new target data and new parity information have been successfully written.)
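The eight actions above may be sketched as follows. The dictionary layout and the "inconsistent" flag are hypothetical stand-ins for on-disk structures, and the sketch omits logging. New parity is obtained as old parity XOR old data XOR new data, so the unchanged strips need not be read.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def partial_stripe_write(stripe: dict, index: int, new_data: bytes) -> None:
    # Step 1: new_data has been received from the application.
    old_data = stripe["data"][index]          # step 2: read old data
    old_parity = stripe["parity"]             # step 3: read old parity
    # Step 4: XOR the data from steps 1, 2, and 3.
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    stripe["inconsistent"] = True             # step 5: mark stripe inconsistent
    stripe["data"][index] = new_data          # step 6: write new data
    stripe["parity"] = new_parity             # step 7: write new parity
    stripe["inconsistent"] = False            # step 8: coherency restored


stripe = {
    "data": [b"\x01", b"\x02", b"\x04"],
    "parity": b"\x07",
    "inconsistent": False,
}
partial_stripe_write(stripe, 1, b"\x0a")
print(stripe["parity"].hex())  # 0f
```

The resulting parity equals the XOR of all three current data strips (0x01, 0x0a, and 0x04), even though only one data strip and the parity strip were read.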
It will be noted that the parity disk is involved in every write operation (steps 3 and 7). This is why parity is rotated to a different disk with each stripe. If the parity were all stored on the same disk all of the time, that disk could become a performance bottleneck.
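One possible rotation scheme is sketched below. The particular mapping (parity starting on the last disk and moving one disk to the left with each successive stripe) is an assumption chosen for illustration; actual implementations may use other layouts.

```python
def parity_disk(stripe_index: int, n_disks: int) -> int:
    """Return the disk holding parity for a stripe (assumed rotation)."""
    return (n_disks - 1) - (stripe_index % n_disks)


# With four disks, parity visits every disk once per four stripes.
print([parity_disk(i, 4) for i in range(8)])  # [3, 2, 1, 0, 3, 2, 1, 0]
```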
Because of its combination of data protection and performance, RAID-5 is a popular alternative for a variety of commercial applications. However, read operations are an integral part of RAID-5 writes (as well as of reads, naturally), making read performance an important criterion in implementing a RAID-5 array. This issue of read performance also impacts the performance of storage systems implementing copy-on-write snapshots, as will be appreciated from the following discussion in connection with FIG. 1C.
FIG. 1C is a flow diagram illustrating a process of making a copy-on-write snapshot according to methods of the prior art. Such copy-on-write snapshots can be, for example, full-image snapshots, space-optimized snapshots or the like. As will be appreciated, writing to a volume having snapshots requires that the volume manager read the old contents of the region to be written, and copy those old contents to the snapshots being made, before actually overwriting the region with the new application data. The process begins with the reading of the region's old contents (step 170). Next, the old contents are copied to the snapshots being made (step 180). Finally, the new application data can be written to the region (step 190). As with read operations in RAID arrays, the performance of read operations in a storage system implementing copy-on-write snapshots is an important consideration in providing acceptable performance levels in such systems.
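The three steps of FIG. 1C may be sketched as follows. The Volume class, its in-memory region list, and the per-snapshot dictionaries are hypothetical constructs for this example. The sketch resembles a space-optimized snapshot in that only overwritten regions are stored, which is an assumption of the example rather than a requirement of the process.

```python
class Volume:
    """In-memory stand-in for a volume with copy-on-write snapshots."""

    def __init__(self, regions):
        self.regions = list(regions)
        self.snapshots = []  # each snapshot maps region index -> old contents

    def create_snapshot(self) -> dict:
        snapshot = {}
        self.snapshots.append(snapshot)
        return snapshot

    def write(self, index: int, new_data: bytes) -> None:
        old = self.regions[index]             # step 170: read old contents
        for snapshot in self.snapshots:       # step 180: copy to snapshots
            # Keep only the contents as of snapshot time (first overwrite).
            snapshot.setdefault(index, old)
        self.regions[index] = new_data        # step 190: write new data

    def read_snapshot(self, snapshot: dict, index: int) -> bytes:
        # Regions never overwritten are still read from the volume itself.
        return snapshot.get(index, self.regions[index])


vol = Volume([b"a", b"b"])
snap = vol.create_snapshot()
vol.write(0, b"x")
vol.write(0, b"y")
print(vol.read_snapshot(snap, 0))  # b'a'
print(vol.regions[0])  # b'y'
```

Every write in this sketch begins with a read of the old region contents, which is why read latency bears directly on write performance in such systems.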
As will be appreciated, therefore, it is desirable to improve the performance of storage systems, such as those described above, for a variety of reasons. What is therefore needed is a technique that addresses the delays inherent in such storage systems, and, in particular, delays related to the read operations that must be performed in supporting techniques such as those discussed above. Moreover, such a solution should preferably do so without affecting the basic storage paradigm employed.