A storage array or disk array is a data storage device that includes multiple disk drives or similar persistent storage units. A storage array can allow large amounts of data to be stored in an efficient manner. A storage array also can provide redundancy to promote reliability, as in the case of a RAID system. In general, RAID systems simultaneously use two or more hard disk drives, referred to herein as physical disk drives (PDs), to achieve greater performance, greater reliability and/or larger data volume sizes. The term “RAID” is generally used to describe computer data storage schemes that divide and replicate data among multiple PDs. In RAID systems, one or more PDs are set up as a RAID virtual disk drive (VD). In a RAID VD, data might be distributed across multiple PDs, but the VD is seen by the user and by the operating system of the computer as a single disk. The VD is “virtual” in that storage space in the VD maps to the physical storage space in the PDs, but the VD usually does not itself represent a single physical storage device.
Although a variety of different RAID system designs exist, all have two key design goals, namely: (1) to increase data reliability and (2) to increase input/output (I/O) performance. RAID has seven basic levels corresponding to different system designs. The seven basic RAID levels are typically referred to as RAID levels 0-6. RAID level 5 uses striping in combination with distributed parity. The term “striping” means that logically sequential data, such as a single data file, is fragmented and assigned to multiple PDs in a round-robin fashion. Thus, the data is said to be “striped” over multiple PDs when the data is written. The term “distributed parity” means that the parity bits that are calculated for each stripe of data are distributed over all of the PDs rather than being stored on one or more dedicated parity PDs. Striping improves performance because the data fragments that make up each data stripe are written in parallel to different PDs and read in parallel from the different PDs. Distributing the parity bits also improves performance in that the parity bits associated with different data stripes can be written in parallel to different PDs using parallel write operations as opposed to having to use sequential write operations to a dedicated parity PD. For a RAID level 5 system to operate, all but one of the PDs must be present. Failure of any one of the PDs necessitates replacement of the PD, but does not cause the system to fail. Upon failure of one of the PDs, the data requested by any subsequent read can be calculated from the remaining data and the distributed parity such that the PD failure is masked from the end user. The system is vulnerable, however, until the data that was on the failed PD is reconstructed on a replacement PD: if a second one of the PDs fails before then, the system will suffer a loss of data.
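By way of illustration only, striping with distributed parity as described above can be sketched in Python as follows. The function names, the fragment size and the particular rule used to rotate the parity block from one PD to the next are assumptions made for this sketch; actual RAID level 5 implementations may use other layouts.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

def stripe_layout(data, num_pds, fragment_size):
    """Split data into stripes of (num_pds - 1) data fragments plus one parity block.

    Returns a list of stripes. Each stripe is a list of num_pds blocks indexed
    by PD number. The parity position rotates per stripe (assumed rotation rule).
    """
    per_stripe = (num_pds - 1) * fragment_size
    stripes = []
    for s, start in enumerate(range(0, len(data), per_stripe)):
        # Pad the final stripe with zero bytes so every fragment has equal length.
        chunk = data[start:start + per_stripe].ljust(per_stripe, b"\0")
        fragments = [chunk[i:i + fragment_size]
                     for i in range(0, per_stripe, fragment_size)]
        parity_pd = (num_pds - 1 - s) % num_pds  # hypothetical rotation
        row = fragments[:]
        # Data fragments fill the remaining PDs in order (round-robin).
        row.insert(parity_pd, xor_blocks(fragments))
        stripes.append(row)
    return stripes

stripes = stripe_layout(b"ABCDEFGHIJKL", num_pds=3, fragment_size=2)
for number, row in enumerate(stripes):
    print(number, row)
```

In this sketch the parity block lands on a different PD for each successive stripe, so parity writes for different stripes are spread over all of the PDs rather than being directed to a single dedicated parity PD.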
One of the well-documented problems of RAID systems that use parity (i.e., RAID systems that use RAID levels 3-6) is that they are susceptible to the occurrence of write holes, which can result in data corruption. Write holes are possible due to the lack of atomicity, i.e., multiple writes per stripe are needed to write the data and the associated parity bits. The manner in which a write hole can result in data corruption will now be described with reference to the block diagram shown in FIG. 1. FIG. 1 is a block diagram of a RAID level 5 array 2 having multiple PDs, namely, PD0, PD1 and PD2. Each stripe of data is made up of an A data fragment and a B data fragment. For each pair of A and B data fragments, parity bits, P, are computed by performing an exclusive OR (XOR) operation on the A data fragment and the B data fragment. As can be seen in FIG. 1, the P blocks are distributed across all of the PDs, PD0, PD1 and PD2, to provide distributed parity. The P bits are calculated by reading the data fragments A and B and performing the operation represented by the equation: A XOR B = P. The P bits are then written to one of PD0, PD1 or PD2. If one of PD0, PD1 and PD2 fails, the data in the failed PD can be reconstructed by using the parity bits and the data from the non-failing PDs in accordance with the following equations: A = B XOR P, B = A XOR P and P = A XOR B.
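The three equations above can be checked with a short Python sketch. The helper name xor and the sample byte values are assumptions chosen for illustration only.

```python
def xor(x: bytes, y: bytes) -> bytes:
    """Byte-wise exclusive OR of two equal-length fragments."""
    return bytes(a ^ b for a, b in zip(x, y))

# Hypothetical fragment contents for a single stripe.
A = b"\x0f\xf0\xaa"
B = b"\x33\x55\x0f"

# Parity as written to the array: A XOR B = P.
P = xor(A, B)

# Any one of the three blocks can be rebuilt from the other two.
rebuilt_A = xor(B, P)  # A = B XOR P
rebuilt_B = xor(A, P)  # B = A XOR P
rebuilt_P = xor(A, B)  # P = A XOR B
```

The reconstruction works because XOR is its own inverse: XORing P with either data fragment cancels that fragment and leaves the other.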
An example of the manner in which a write hole may result in data corruption will now be described with reference to sector 3 of the RAID level 5 array 2. For purposes of this example, it will be assumed that (1) during the process of updating the B data fragment of sector 3, B3, a power failure occurs so that B3 is incompletely written, and (2) PD0 fails at some point in time after power to the system has been restored. When PD0 fails, the RAID reconstruction algorithm will attempt to reconstruct the data of PD0 using the data and parity bits from PD1 and PD2. Due to the occurrence of the power failure, the parity bits P3 will be inconsistent with the data fragment B3. For this reason, the data that was stored in PD0 will not be properly reconstructed. It should be noted that this results in silent data corruption, meaning that the storage subsystem has no way to detect or correct the inconsistency.
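The scenario above can be simulated with the following Python sketch. The assignment of A3, B3 and P3 to PD0, PD1 and PD2, the fragment contents and the modeling of the interrupted write as a half-written fragment are all assumptions made for illustration, since FIG. 1 is not reproduced here.

```python
def xor(x: bytes, y: bytes) -> bytes:
    """Byte-wise exclusive OR of two equal-length fragments."""
    return bytes(a ^ b for a, b in zip(x, y))

# Consistent starting state of sector 3 (assumed placement on the PDs).
a3 = b"AAAA"          # on PD0
b3_old = b"BBBB"      # on PD1
p3 = xor(a3, b3_old)  # on PD2, consistent with a3 and b3_old

# Update of B3 is interrupted by a power failure: only the first half of
# the new fragment reaches PD1, and P3 is never rewritten.
b3_new = b"bbbb"
b3_torn = b3_new[:2] + b3_old[2:]
array = {"PD0": a3, "PD1": b3_torn, "PD2": p3}

# Later, PD0 fails and its contents are rebuilt from PD1 and PD2.
rebuilt_a3 = xor(array["PD1"], array["PD2"])

print("original A3:", a3)
print("rebuilt  A3:", rebuilt_a3)
```

Nothing in the rebuild step signals an error: the XOR completes normally and simply returns the wrong bytes for A3, which is why the corruption is silent.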
A variety of techniques have been used or proposed to reduce or eliminate the potential for data corruption to be caused by the occurrence of a write hole. Most of the known or proposed techniques involve setting a lock in a nonvolatile random access memory (NVRAM) element immediately before a stripe is written and removing the lock from the NVRAM element immediately after the stripe has been written. A RAID controller sets and removes the locks. The NVRAM element is typically part of the RAID controller. The lock identifies the addresses in the PDs that are allocated to the stripe. Just prior to the first data fragment of the stripe being written, the lock is set. Just after the last bit of the stripe has been written, the lock is removed. If a catastrophic event such as a power failure occurs, upon reboot, the RAID controller reads the contents of the NVRAM element and identifies all existing locks. Then, the RAID controller re-computes the parity bits for all of the stripes that were locked at the time that the power failure occurred. This technique of locking and unlocking stripes ensures that parity bits that were potentially corrupted due to the power failure are re-computed so as to be consistent with the data should the need to reconstruct other data in the same stripe arise.
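The lock-based technique described above can be sketched in Python as follows. The class and method names are hypothetical, a Python set stands in for the NVRAM element, and a stripe number stands in for the PD addresses that a real lock would identify. The sketch is a simplified model under those assumptions and does not represent any particular controller.

```python
from functools import reduce

class PowerFailure(Exception):
    """Simulated loss of power in the middle of a stripe write."""

def parity_of(fragments):
    """XOR all data fragments of a stripe together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*fragments))

class Controller:
    def __init__(self):
        self.nvram_locks = set()  # models NVRAM; survives the simulated failure
        self.data = {}            # stripe number -> list of data fragments
        self.parity = {}          # stripe number -> parity block

    def write_stripe(self, stripe, fragments, fail_before_parity=False):
        self.nvram_locks.add(stripe)          # set lock before the first write
        self.data[stripe] = list(fragments)
        if fail_before_parity:
            raise PowerFailure(stripe)        # parity is left stale
        self.parity[stripe] = parity_of(fragments)
        self.nvram_locks.discard(stripe)      # remove lock after the last write

    def recover(self):
        """On reboot, re-compute parity for every stripe that is still locked."""
        repaired = sorted(self.nvram_locks)
        for stripe in repaired:
            self.parity[stripe] = parity_of(self.data[stripe])
            self.nvram_locks.discard(stripe)
        return repaired

ctrl = Controller()
ctrl.write_stripe(0, [b"\x01", b"\x02"])
try:
    ctrl.write_stripe(0, [b"\x01", b"\x04"], fail_before_parity=True)
except PowerFailure:
    pass
locks_after_failure = set(ctrl.nvram_locks)
stale_parity = ctrl.parity[0]
repaired = ctrl.recover()
```

Because only the stripes that were locked at the time of the failure are revisited, recovery work in this model is proportional to the number of writes in flight rather than to the size of the array.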
One of the disadvantages of the technique described above is that the local NVRAM element is relatively expensive and represents a significant portion of the cost of a parity-based RAID system. Another disadvantage of this approach is that it does not work in cases where the RAID controller has failed and needs to be replaced. These disadvantages can be overcome by replacing the NVRAM element with some other type of storage medium, such as, for example, one of the PDs or a solid-state flash memory element that is local to the RAID controller. These other solutions, however, also have drawbacks. If the lock is stored in a reserve area of one of the PDs, the number of accesses to the PD per stripe will increase from four to six, i.e., one read to get the old data, one read to get the old parity (both necessary to compute the new parity), one write to write the data, one write to write the parity bits, one write to set the lock and one write to remove the lock. The additional writes would effectively reduce the write performance of the RAID system by one third.
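The one-third figure follows from simple arithmetic, sketched below in Python. The sketch assumes that every PD access costs the same and that write throughput is inversely proportional to the number of accesses per stripe; both are simplifications, and the names are hypothetical.

```python
from fractions import Fraction

# Accesses per stripe update without a lock stored on a PD.
BASE_ACCESSES = ["read old data", "read old parity", "write data", "write parity"]
# Additional accesses when the lock is kept in a reserve area of a PD.
LOCK_ACCESSES = ["write lock", "remove lock"]

def relative_write_throughput(base: int, extra: int) -> Fraction:
    """Throughput with the extra accesses, relative to throughput without them."""
    return Fraction(base, base + extra)

ratio = relative_write_throughput(len(BASE_ACCESSES), len(LOCK_ACCESSES))
reduction = 1 - ratio
print(f"relative throughput: {ratio}, reduction: {reduction}")
```

With four base accesses and two lock accesses, the relative throughput is 4/6, i.e., two thirds of the original, which corresponds to the one-third reduction stated above.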
Storing the lock in a flash memory element that is local to the RAID controller would be faster than storing the lock in a reserve area of one of the PDs, and therefore would provide much better write performance. However, flash memory elements have limited life expectancies when measured in terms of Program/Erase (P/E) cycles. Typically, flash memory elements have life expectancies of about 10,000 to 100,000 P/E cycles. Because a RAID system may have, for example, one hundred thousand input/output transactions (IOs) per second, a flash memory element that is used for this purpose would be exhausted in as little as a few minutes.
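A rough estimate of how quickly such a flash memory element would wear out can be sketched in Python as follows. The function name and its parameters are hypothetical. The sketch assumes that each lock set or lock removal consumes one P/E cycle at one location, that there are two such writes per IO, and that the writes are spread evenly over a number of wear-leveled locations; the figure of 1,000 locations is purely illustrative and is not taken from any device.

```python
def flash_lifetime_seconds(pe_cycles: int,
                           ios_per_second: int,
                           lock_writes_per_io: int = 2,
                           wear_leveled_locations: int = 1) -> float:
    """Seconds until every location has used up its P/E budget (idealized)."""
    total_writes_available = pe_cycles * wear_leveled_locations
    writes_per_second = ios_per_second * lock_writes_per_io
    return total_writes_available / writes_per_second

# Endurance range and IO rate taken from the text above; the location count
# is an assumption made only for this example.
low = flash_lifetime_seconds(10_000, 100_000, wear_leveled_locations=1_000)
high = flash_lifetime_seconds(100_000, 100_000, wear_leveled_locations=1_000)
print(f"{low:.0f} s to {high:.0f} s")
```

Under these assumptions the estimated lifetime ranges from under a minute to several minutes, and it scales linearly with the number of locations over which the lock writes are spread.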
Another approach that would avoid the need to use local NVRAM to store locks would be to simply mark the RAID array as “open” at first write and as “closed” after a clean shutdown. In the case of a power failure, upon reboot, the RAID controller would detect that there was not a clean shutdown and would start regenerating parity for all stripes. The main disadvantage of this approach is that parity may have to be regenerated for hundreds of millions or billions of stripes, which could take days or weeks to complete. Therefore, this approach would not be practical from a duration standpoint or from a life-expectancy standpoint.
A need exists for a way to provide write hole protection in a parity-based RAID system that does not require the use of an expensive local NVRAM element in the RAID controller. A need also exists for a way to provide write hole protection in a parity-based RAID system that does not reduce write performance or detrimentally impact storage device life expectancy.