1. Field of the Invention
This invention generally relates to information storage technology and, more particularly, to a system and method for efficiently initializing and writing to a redundant array of independent disks (RAID).
2. Description of the Related Art
RAID is a technology that stores data by using multiple hard drives, connected in parallel to share and duplicate data. The data is stored in such a way that all stored data can be recovered if one of the hard drives fails. There are many configurations of the RAID, which are typically referred to as the RAID level. As seen from the point of view of a host-connected operating system, the RAID combination of multiple hard drives appears as a single logical unit (e.g., a single hard drive).
As used herein, the term “striplet” is used to describe a user-defined size block of data written to one drive. The term “stripe” describes a plurality of adjacent, related striplets across each disk. In RAID 5 and RAID 6, a collection of striplets forms a consistent, identifiable stripe with some of the striplets comprising data and the others comprising parity data. For RAID 5, one of the striplets in each stripe is designated as a parity striplet. This striplet is the product of an exclusive-or (XOR) operation that has been performed with all the other striplets in the stripe. The operation for XOR'ing data to create a parity striplet is referred to as P-calculation. The purpose of the parity is to provide a level of redundancy. Since the RAID presents a single virtual disk using multiple physical disks, there is a higher probability that one of the individual physical disks will fail. If one of the striplets cannot be read due to an individual disk error or failure, the data for that striplet can be reassembled by XOR'ing all the other striplets in the stripe.
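The P-calculation and the reassembly of a lost striplet can be illustrated with a minimal sketch. The function name xor_striplets and the four-byte striplet contents are hypothetical, chosen only for illustration; real striplets are far larger.

```python
from functools import reduce

def xor_striplets(striplets):
    """XOR equal-length byte strings together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*striplets))

# Three hypothetical data striplets of one stripe (illustrative values only).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# P-calculation: the parity striplet is the XOR of all data striplets.
parity = xor_striplets(data)

# Simulate losing the disk holding data[1]: XOR'ing every surviving
# striplet (data and parity) reassembles the missing one.
survivors = [data[0], data[2], parity]
rebuilt = xor_striplets(survivors)
```

Because XOR is its own inverse, the same routine serves both to create the parity and to rebuild a missing striplet.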
FIG. 1 is a schematic block diagram of a RAID 5 array depicting the organization of parity and data striplets into stripes (prior art). The redundancy of RAID 5 can accommodate one failure within a stripe. RAID 6, in addition to the “P-striplet”, allocates one or more “Q-striplets” to accommodate two or more failures. The operation for calculating Q data involves Galois arithmetic applied to the contents of the other striplets in the stripe.
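As a rough sketch of what a Q-calculation may look like, the following computes a single Q-striplet over the field GF(2^8) and uses it to recover one data striplet without the help of the P-striplet. The field polynomial 0x11D, the generator value 2, and all function names are assumptions reflecting one widely used convention; the description above does not specify them.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8), reducing by the assumed polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def gf_pow(base, exponent):
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, base)
    return result

def gf_inv(a):
    """Brute-force multiplicative inverse; adequate for a sketch."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

def q_striplet(data_striplets, generator=2):
    """Q[j] is the XOR over i of (generator**i) * D_i[j] (assumed convention)."""
    q = bytearray(len(data_striplets[0]))
    for i, striplet in enumerate(data_striplets):
        coeff = gf_pow(generator, i)
        for j, byte in enumerate(striplet):
            q[j] ^= gf_mul(coeff, byte)
    return bytes(q)

# Hypothetical two-byte data striplets.
data = [b"\x01\x02", b"\x80\x03", b"\x04\xff"]
q = q_striplet(data)

# Recover data[1] from Q and the surviving data striplets alone.
lost = 1
partial = bytearray(q)
for i, striplet in enumerate(data):
    if i != lost:
        coeff = gf_pow(2, i)
        for j, byte in enumerate(striplet):
            partial[j] ^= gf_mul(coeff, byte)
inverse = gf_inv(gf_pow(2, lost))
recovered = bytes(gf_mul(inverse, byte) for byte in partial)
```

Because each data striplet is weighted by a distinct coefficient, Q provides information that is independent of the plain XOR parity, which is what allows a second failure to be tolerated.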
As compared to earlier RAID levels, RAID 5 and 6, in addition to offering increased fault resiliency, also provide better performance when reading from the virtual disk. When multiple read commands are queued for the RAID'ed disks, the operations can be performed in parallel, which can result in a significant increase in performance as compared to similar operations to a single disk. If, however, there is a failure reading the requested data, then all the remaining data of the stripe needs to be read to calculate the requested data.
For operations that write data to the RAID'ed disks, performance can be adversely affected due to the P and Q calculations necessary to maintain redundant information per stripe of data. In RAID 5, for every write to a striplet, the previously written data to that striplet needs to be XOR'ed with the P-striplet, effectively removing the redundant information of the “old” data that is to be overwritten. The resulting calculation is then XOR'ed with the new data, and both the new data and the new P-calculation are written to their respective disks in the stripe. Therefore, a RAID 5 write operation may require two additional reads and one additional write over that of a single disk write operation. For RAID 6, there is an additional read and write operation for every Q-striplet.
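The RAID 5 write sequence described above can be sketched as follows. The list named disks stands in for the physical drives of one stripe, and the function name raid5_small_write and the byte values are hypothetical.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(disks, data_disk, parity_disk, new_data):
    """Overwrite one data striplet and keep the P-striplet consistent."""
    old_data = disks[data_disk]          # additional read 1
    old_parity = disks[parity_disk]      # additional read 2
    # Remove the redundant information of the "old" data from the parity.
    stripped = xor_bytes(old_parity, old_data)
    # Fold the new data into the parity.
    new_parity = xor_bytes(stripped, new_data)
    disks[data_disk] = new_data          # the write a single disk also needs
    disks[parity_disk] = new_parity      # additional write

# An already initialized stripe: three data striplets, parity on disk 3.
disks = [b"\x0f\x0f", b"\xf0\xf0", b"\x55\xaa", b"\xaa\x55"]
raid5_small_write(disks, data_disk=1, parity_disk=3, new_data=b"\x12\x34")
```

Note that the sketch starts from a stripe whose parity already matches its data; the sequence only preserves consistency, it does not create it.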
Therefore, if a RAID array becomes degraded due to a failed disk, and the P/Q parity mechanisms must be invoked to restore the data, it is crucial that the array be properly initialized. Conventionally, RAID arrays are initialized using inefficient brute force approaches.
FIG. 2 is a schematic block diagram of a conventional RAID array controller (prior art). On a “write to disk” operation, the host-generated data is retrieved from the host interface memory through a DMA (direct memory access) mechanism into a write cache. For example, the host may be a computer or other microprocessor driven device enabled through the use of an operating system. The array controller microprocessor moves the data from the write cache through the disk controller onto the disk media. The array controller microprocessor executes its own programs out of the ROM. This simplified block diagram is used for illustrative purposes. An actual RAID controller would additionally include a number of hardware acceleration features built into it, including P & Q arithmetic calculators, high speed processor RAM, and NVRAM for storing data during power fail, to name a few.
FIG. 3 is a schematic diagram illustrating conventional means for initializing a disk array (prior art). Typically, there are two usage modes, “offline” and “online”. While offline, an array remains unavailable for normal read/write data access during initialization. Only after initialization is complete may the array be accessed to write/read content data. There are many advantages to this mode. First, this is the fastest way of initializing. Second, when the initialization completes, the disk array can be verified (a verification that the parity matches the data) as all the stripes in the array have a consistent parity. Since the only writes to the array (and there are no reads) are for initialization, the firmware can initialize the array in the most efficient mode possible, writing large quantities of data with each single command, and writing all drives simultaneously. Since the array data is undetermined prior to initialization, zeroing the array is the only logical mechanism for initialization. There are also a few disadvantages to this method. One of the main downsides is that the array is not available until initialization is complete. Nor is this method entirely scalable if the processor has to be highly involved in the individual disk initializations. Another drawback is that this method is destructive and therefore not suitable for online initialization where host I/Os may coexist.
In the online mode, an array is available for normal read/write data access during initialization. The data written during this usage mode is fully redundant and protected against a disk failure. The advantage of this mode is immediate array availability for normal read/write access. The downside is lower performance of the array until initialization is complete, as the initialization process competes for the disks with host I/Os. Further, initialization is much slower than in offline mode. While online, if a write to the array contains less data than a full stripe, the remainder of the stripe must be read, the parity calculated, and then a full stripe write performed. This process is referred to as a peer-read process.
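A minimal sketch of the peer-read process follows, assuming a single stripe modeled as a list of byte strings. The name peer_read_write, the updates mapping, and the byte values are hypothetical.

```python
from functools import reduce

def xor_all(striplets):
    """XOR equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*striplets))

def peer_read_write(disks, parity_disk, updates):
    """Write a partial stripe; `updates` maps disk index to new striplet.

    Returns the number of peer reads that were needed.
    """
    # Peer reads: data striplets that the host is not writing.
    peers = {i: disks[i] for i in range(len(disks))
             if i != parity_disk and i not in updates}
    full_data = {**peers, **updates}
    # Parity is calculated from the complete set of data striplets,
    # so the old (possibly uninitialized) parity is never consulted.
    parity = xor_all(list(full_data.values()))
    # Full stripe write.
    for i, striplet in full_data.items():
        disks[i] = striplet
    disks[parity_disk] = parity
    return len(peers)

# An uninitialized stripe: the parity on disk 3 is arbitrary.
disks = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\xde\xad"]
peer_reads = peer_read_write(disks, parity_disk=3, updates={0: b"\xa0\xb0"})
```

In this example a one-striplet host write costs two peer reads, and that cost grows with the number of drives in the stripe.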
FIG. 4 is a diagram illustrating a peer-read input/output (I/O) write operation (prior art). This process allows for full and immediate redundancy of all written data. However, the inefficiency of the disk accesses (reads from other drives not involved in the host transfer, performed to form a full stripe) increases with the number of drives in the array. One of the main drawbacks of this scheme is that the parity is consistent with the data only for the portions of the stripes that have been written by the host system. Since the disk array has not been previously initialized, a verification operation on the disk array is not possible until a parity reconstruction operation has been performed.
Returning briefly to FIG. 1, the RAID 5 array includes M disks. The parity block is shown rotating from striplet to striplet with the progression of stripes through the array. This diagram very generally illustrates the steps used by the online parity reconstruction method. This method reconstructs the parity one stripe at a time, but the reconstruction of multiple stripes can be combined through a higher-level operation to minimize disk seeks. Since this reconstruction method does not know what piece of data/parity set is consistent, it has to reconstruct parities for the whole array.
FIG. 5 is a diagram illustrating the process for parity reconstruction (prior art). If the array is not already initialized, and a Read-Modify-Write I/O process is used to write data, there is no guarantee that the data is truly redundant. That is, the parity matches the data only for the portions of the stripe that have been written, but not for the entire stripe. For this reason, a parity reconstruction operation must be used to enable a verification operation to be performed. Once the parity reconstruct operation passes the written stripe, the full stripe is consistent. However, the operation requires that a stripe written by a full stripe write be rewritten, even though that stripe is already redundant.
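The following sketch illustrates why a verification operation fails on an uninitialized stripe after a Read-Modify-Write, and how parity reconstruction restores consistency. The helper names is_consistent and reconstruct_parity and the byte values are hypothetical.

```python
from functools import reduce

def xor_all(striplets):
    """XOR equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*striplets))

def is_consistent(disks):
    """Verification: XOR of every striplet, parity included, must be zero."""
    return not any(xor_all(disks))

def reconstruct_parity(disks, parity_disk):
    """Recalculate the parity striplet from all data striplets of the stripe."""
    data = [s for i, s in enumerate(disks) if i != parity_disk]
    disks[parity_disk] = xor_all(data)

# An uninitialized stripe: contents are arbitrary, parity lives on disk 3.
disks = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\xde\xad"]

# Read-Modify-Write of disk 0: new parity = old parity XOR old data XOR new data.
new_data = b"\xa0\xb0"
disks[3] = xor_all([disks[3], disks[0], new_data])
disks[0] = new_data
after_rmw = is_consistent(disks)            # False: stripe was never consistent

reconstruct_parity(disks, parity_disk=3)
after_reconstruct = is_consistent(disks)    # True
```

The Read-Modify-Write changes the data and the parity by the same amount, so whatever inconsistency existed before the write is carried forward unchanged.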
The above-mentioned processes all suffer from a number of bottlenecks related to array initialization. First, the array controller microprocessor must transfer every byte of data to every disk drive. In a five-drive array of 500 gigabyte (GB) disks, this is approximately 5 billion sectors of data. Second, the disk controllers must handle every byte of data. Although there is one disk controller for each drive, each disk controller must still handle (continuing the example) approximately 1 billion sectors of data. This data is all identical, consisting entirely of zeros.
Returning to FIG. 2, the processor on the RAID controller detects the number, type, and capacity of connected disk drives, and begins to write a known data pattern (either zeroes, or the data already existing on the drive) with proper parity (in the case of RAID 5 and 6) to each of the drives, either sequentially or in parallel. Normally the initialization data is written in parallel to allow for simultaneous transfer of real application data to the array. In the offline case, no data may be written to the drives until after initialization. In essence, the RAID controller processor is directing the write of every block to every drive even though the data “blocks” that are written might be an aggregated collection of the 512-byte blocks recognized by the disk drive. For example, it would be reasonable to aggregate 128 of the 512-byte blocks into one striplet and write the entire striplet (64K bytes) in a single command. While conceptually easy, this approach uses a lot of the RAID controller processor and I/O bandwidth, for identical data being written to every block on every drive. A 500 gigabyte drive has approximately one billion 512-byte blocks, which corresponds to a lot of commands just to initialize one drive. This activity is especially significant if there is ongoing I/O to the array during initialization, as a result of some user application reading and writing the array prior to the completion of initialization.
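The scale of this workload can be checked with a short calculation. The sketch assumes a decimal gigabyte (10^9 bytes), the five-drive array of the earlier example, and one write command per 128-block striplet; the constant names are hypothetical.

```python
DRIVE_BYTES = 500 * 10**9        # assumed decimal 500 GB drive
BLOCK_BYTES = 512                # block size recognized by the disk drive
BLOCKS_PER_STRIPLET = 128        # aggregation from the example above
DRIVES = 5                       # five-drive array from the earlier example

blocks_per_drive = DRIVE_BYTES // BLOCK_BYTES
striplet_bytes = BLOCK_BYTES * BLOCKS_PER_STRIPLET

# One command per striplet, rounding up for a final partial striplet.
commands_per_drive = -(-blocks_per_drive // BLOCKS_PER_STRIPLET)

blocks_in_array = blocks_per_drive * DRIVES
commands_in_array = commands_per_drive * DRIVES

print(f"{blocks_per_drive:,} blocks per drive")
print(f"{striplet_bytes:,} bytes per striplet")
print(f"{commands_per_drive:,} striplet writes per drive")
print(f"{commands_in_array:,} striplet writes for the array")
```

Under these assumptions, even with 64K-byte striplets the controller processor still issues several million commands per drive, all carrying identical data.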
It would be advantageous if a RAID array could be efficiently initialized concurrently with host I/O writes in the online mode, such that the RAID controller processor and I/O can devote themselves to the non-initialization data movement while the initialization goes on in parallel, in the background.
It would be advantageous if a RAID array could be initialized in the offline mode using a minimal number of zeros data transfer commands.