The present invention, in some embodiments thereof, relates to data protection in a random access disk array, and, more particularly, but not exclusively, to a variation of a RAID system to provide for data protection.
RAID is an acronym for Redundant Array of Independent Disks, and is a system for storing data on multiple disks in which redundancy of data storage between the disks ensures recovery of the data in the event of failure. This is achieved by combining multiple disk drive components into a logical unit, where data is distributed across the drives in one of several ways called RAID levels.
RAID is now used as an umbrella term for computer data storage schemes that can divide and replicate data among multiple physical disk drives. The terms disks and drives will be used interchangeably henceforth. The physical disks are said to be in a RAID array, which is accessed by the operating system as one single disk. The different schemes or architectures are named by the word RAID followed by a number (e.g., RAID 0, RAID 1). Each scheme provides a different balance between three key goals: increasing data reliability, decreasing capacity overhead and increasing input/output performance.
The most basic form of RAID, a building block for the other levels but not used for data protection, is RAID 0, which has high performance but no redundancy. The data is spread evenly between N disks. RAID 0 gives maximum performance since data retrieval is carried out on all N disks in parallel. However, each data item is stored exactly once, so a disk failure always loses some data.
RAID 1 requires mirroring of all the data. Capacity drops by 50% since all data is stored twice, but excellent performance is still achieved since the data is still spread between disks in the same way, allowing for parallel reads. RAID 1 can support failure of one of each pair of disks; however, the price is the loss of half of the capacity. Although multiple disk failures can be tolerated, only one failure is possible per mirrored pair without loss of data.
In greater detail, RAID 1 is mirroring. Mirroring comprises writing each block of data to two disks, D0 and D1, and reconstructing a disk by copying its mirror disk upon failure. This method requires performing two disk writes per user write, and consumes an overhead of 100% in capacity. Its rebuild requires performing reads and writes in proportion to the size of the failed disk, without additional computation penalties. Additionally, reading data which resided on the failed disk while in degraded mode requires a single disk read, just as under normal system operation.
In general, RAID-1 protects from single disk failure. It may protect from more than one failure if no two failed disks are part of the same pair, known as a “RAID group”. RAID-1 may also be implemented in “n-way mirroring” mode to protect against any n−1 disk failures. An example is RAID 1.3 which introduced three way mirroring, so that any two disks could fail and all the data could still be recovered. The cost however is that there is only 33% utilization of the disks.
A need thus became apparent for a system that could recover all data after the failure of any disk at the cost of a more reasonable overhead, and as a result RAID 4 was developed.
RAID 4 uses a parity bit to allow data recovery following failure of a bit. In RAID 4 data is written over a series of N disks and then a parity bit is set on the (N+1)th disk. Thus if N is 9, then data is written to 9 disks, and on the tenth, a parity of the nine bits is written. If one disk fails, the parity allows for recovery of the lost bit. The failure problem is solved without any major loss of capacity. The utilization rate is 90%. However, the tenth disk has to be updated with every change of every single bit on any of the nine disks, thus causing a system bottleneck.
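The utilization figures quoted so far (50% for mirroring, 33% for three way mirroring, 90% for nine data disks plus one parity disk) all follow from the same ratio of data disks to total disks. The following is a minimal sketch of that arithmetic; the function name `utilization` and its parameters are illustrative assumptions, not terms used in this description.

```python
from fractions import Fraction


def utilization(data_disks: int, redundant_disks: int) -> Fraction:
    """Fraction of raw capacity left for user data (hypothetical helper)."""
    return Fraction(data_disks, data_disks + redundant_disks)


# One data copy plus one mirror copy (RAID 1).
raid1 = utilization(1, 1)
# One data copy plus two mirror copies (three way mirroring).
three_way = utilization(1, 2)
# Nine data disks plus one parity disk (the RAID 4 example with N = 9).
raid4 = utilization(9, 1)

print(float(raid1), float(three_way), float(raid4))
```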
In greater detail, a RAID-4 group contains k data disks and a single parity disk. Each block i in the parity disk P contains the XOR of the blocks at location i in each of the data disks. Reconstructing a failed disk is done by computing the parity of the remaining k disks. The capacity overhead is 1/k. This method contains two types of user writes: full stripe writes, known as “encode”, and partial stripe modifications, known as “update”. When encoding a full stripe, an additional disk write must be performed for every k user writes, and k−1 XORs must be performed to calculate the parity. When modifying a single block in the stripe, two disk reads (the old data block and the old parity block) and two disk writes (the new data block and the new parity block) must be performed, as well as two XORs to compute the new parity value. The rebuild of a failed block requires reading k blocks, performing k−1 XORs, and writing the computed value. Reading data which resided on the failed disk while in degraded mode also requires k disk reads and k−1 XOR computations. RAID-4, like RAID-1, protects from a single disk failure.
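The encode, update and rebuild operations described above can be sketched with bytewise XOR over in-memory blocks. This is an illustrative sketch only: the function names (`xor_blocks`, `encode`, `update`, `rebuild`) and the use of Python `bytes` objects to stand in for disk blocks are assumptions made for the example, and no disk IO is modeled.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Full stripe write: parity is the XOR of all k data blocks (k-1 XORs)."""
    return xor_blocks(*data_blocks)


def update(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Partial stripe write: fold the old data out of the parity and the
    new data into it (two XORs)."""
    return xor_blocks(old_parity, old_data, new_data)


def rebuild(surviving_blocks):
    """Recover the lost block from the k surviving blocks (data and parity)."""
    return xor_blocks(*surviving_blocks)


# Example stripe with k = 3 data blocks of two bytes each.
stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = encode(stripe)

# Simulate losing the second data block and rebuilding it.
recovered = rebuild([stripe[0], stripe[2], parity])

# Simulate overwriting the second data block without a full stripe write.
new_block = b"\xaa\x55"
new_parity = update(stripe[1], new_block, parity)
```

The update path touches only the modified data block and the parity block, which is why it needs two reads and two writes regardless of k.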
RAID 5 solves the bottleneck problem of RAID 4 in that parity stripes are spread over all the disks. Thus, although some parity bit somewhere has to be changed with every single change in the data, the changes are spread over all the disks and no bottleneck develops.
However RAID 5 still only allows for a single disk failure.
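The spreading of parity over all the disks in RAID 5 can be illustrated with a simple rotation. The exact placement rule is not specified in this description, so the formula in `parity_disk` below is an assumption chosen for the example; the names `parity_disk` and `stripe_layout` are likewise hypothetical.

```python
def parity_disk(stripe: int, num_disks: int) -> int:
    """Disk index holding the parity block of the given stripe.

    Assumed rotation: parity starts on the last disk and moves one disk
    to the left with every stripe, wrapping around.
    """
    return (num_disks - 1 - stripe) % num_disks


def stripe_layout(stripe: int, num_disks: int):
    """Per-disk roles for one stripe: 'P' for parity, 'D' for data."""
    p = parity_disk(stripe, num_disks)
    return ["P" if disk == p else "D" for disk in range(num_disks)]


# Print the layout of the first five stripes on a five-disk array.
for s in range(5):
    print(s, " ".join(stripe_layout(s, 5)))
```

Over any run of consecutive stripes equal in length to the number of disks, each disk holds parity exactly once, so parity updates are not concentrated on a single disk as they are in RAID 4.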
In order to combine the multiple disk failure tolerance of RAID 1.3 with the high utilization rates of RAID 4 and 5, and in addition to avoid system bottlenecks, RAID 6 was specified to use an N+2 parity scheme that allows failure of two disks. RAID 6 defines block-level striping with double distributed parity and provides fault tolerance of two drive failures, so that the array continues to operate with up to two failed drives, irrespective of which two drives fail. Larger RAID disk groups become more practical, especially for high-availability systems. This becomes increasingly important as large-capacity drives lengthen the time needed to recover from the failure of a single drive. Following loss of a drive, single-parity RAID levels are as vulnerable to data loss as a RAID 0 array until the failed drive is replaced and its data rebuilt, and the larger the drive, the longer the rebuild takes, causing a large vulnerability interval. The double parity provided by RAID 6 gives time to rebuild the array without the data being at risk if a single additional drive fails before the rebuild is complete.
Reference is now made to FIGS. 1A to 1C which show three stages of a method for data protection using a spare disk, known as a hot spare. In traditional disk arrays, using physical magnetic disks, data protection often involved having a hot spare disk. As shown in FIG. 1A, this hot spare disk is not used during normal system operation, but rather is kept empty, and used only when a regular disk fails. At this point an exact copy of the failed disk is recovered and written to the spare disk, as shown in FIG. 1B. During recovery, the lost data is written to the new disk exactly in the same fashion as it resided on the old disk. When the old disk is replaced, as shown in FIG. 1C, its replacement becomes the new hot spare disk. The hot spare method cannot handle the recovery of more than a single disk without human intervention, namely manually replacing the failed disk with an empty disk, unless several hot spare disks are kept. The cost of keeping this spare disk is that it is not used during normal system operation and thus it reduces the total performance of the system. Another downside of having a single disk replace the failed disk is that the hot spare disk receives a storm of writes during recovery and becomes a system bottleneck, lengthening the recovery, or rebuild, process.
Reference is now made to FIGS. 2A to 2C, which show a variation of the hot spare disk system in which space for the rewrite is reserved, or dedicated, across all the disks of the array, as is common in more contemporary arrays. Keeping dedicated spare space across all the disks is slightly more complex than keeping a dedicated spare disk. A coarse granularity, possibly static, mapping must be held between sections of the failed disk and hot spare sections distributed across the rest of the disks. This mapping should be smart in the sense that lost sections are not written to disks which have other sections in the same stripe as the lost section. FIG. 2A illustrates the initial state or state during normal operation. During normal operation, the dedicated spare sections are reserved and are not written to. As shown in FIG. 2B, during recovery, the lost data in each section is copied to a hot spare section on one of the remaining disks. This method mitigates some of the faults of the previous option. The cost of keeping spare space is lower, since there is no performance penalty of having disks which are not used. Writing the lost data is also distributed across all the disks, reducing the recovery bottleneck and thus decreasing the recovery time. However, the overhead of the method of FIGS. 2A-2C is that when the old disks are replaced, the sections must be copied back to them, thus doubling the number of writes needed. Half of the writes are distributed across all disks, and the remaining half go to a single disk.
FIG. 2C illustrates such a recovery process. This also implies that a rebuild abort process, in case a failed (removed) disk is reinserted, will actually need to undo the work which was already performed and copy back the data. If dedicated spare space which is equal to the size of x disks is kept, x recoveries can be performed without human intervention. This x must be decided upon in advance and cannot change dynamically.
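The mapping constraint described for the dedicated spare space method, namely that a lost section must not be placed on a disk already holding another section of the same stripe, can be sketched as follows. This is only an illustration under stated assumptions: it assumes that each stripe occupies a subset of the disks, it computes the mapping at failure time rather than holding a static table, and it balances load with a simple least-loaded rule. The function name `map_spare_sections` and the dictionary-based stripe description are hypothetical.

```python
def map_spare_sections(stripes, failed_disk, num_disks):
    """Choose a spare location for every section lost with failed_disk.

    stripes: dict mapping stripe id -> set of disk ids holding its sections.
    Returns a dict mapping stripe id -> surviving disk that receives the
    rebuilt section.
    """
    # Count of spare sections already assigned to each surviving disk.
    load = {d: 0 for d in range(num_disks) if d != failed_disk}
    mapping = {}
    for stripe_id, disks in sorted(stripes.items()):
        if failed_disk not in disks:
            continue  # this stripe lost nothing
        # Exclude disks that already hold a section of the same stripe.
        candidates = [d for d in load if d not in disks]
        if not candidates:
            raise ValueError(f"no eligible spare disk for stripe {stripe_id}")
        # Least-loaded surviving disk, ties broken by lowest disk id.
        target = min(candidates, key=lambda d: (load[d], d))
        load[target] += 1
        mapping[stripe_id] = target
    return mapping


# Six disks, stripes of three sections each, disk 1 fails.
stripes = {0: {0, 1, 2}, 1: {1, 3, 4}, 2: {1, 2, 5}, 3: {0, 3, 5}}
mapping = map_spare_sections(stripes, failed_disk=1, num_disks=6)
print(mapping)
```

Spreading the targets over the surviving disks reflects the point made above that distributing the rebuild writes reduces the recovery bottleneck.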
Previously Used IO Reduction Methods
The main problem with N+K RAID schemes such as RAID 4/5/6 (as opposed to RAID 1) is the IO overhead incurred upon user writes during regular system operation. RAID 1 has a single write overhead per user write, while RAID 4/5 have a penalty of 2 reads and 1 write on top of the user write, and RAID 6 schemes have a penalty of 3 reads and 2 writes. Thus, the main method used for reducing IO overhead and increasing performance was to use a RAID 1 scheme.
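The per-write penalties listed above can be tabulated to compare the total number of disk IOs issued for one user write under each scheme. The sketch below simply encodes the figures from the preceding paragraph; the names `WriteCost`, `OVERHEAD` and `total_ios` are illustrative assumptions.

```python
from typing import NamedTuple


class WriteCost(NamedTuple):
    reads: int   # extra disk reads on top of the user write
    writes: int  # extra disk writes on top of the user write


# Overhead per single-block user write, as stated in the text.
OVERHEAD = {
    "RAID 1": WriteCost(reads=0, writes=1),
    "RAID 4/5": WriteCost(reads=2, writes=1),
    "RAID 6": WriteCost(reads=3, writes=2),
}


def total_ios(scheme: str) -> int:
    """Total disk IOs for one user write: the write itself plus overhead."""
    cost = OVERHEAD[scheme]
    return 1 + cost.reads + cost.writes


for scheme in OVERHEAD:
    print(scheme, total_ios(scheme))
```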
Reference is now made to FIGS. 3A-3C, which illustrate the dedicated spare space method of FIGS. 2A-2C in a RAID 5 scheme. The S stripes contain data and the P stripes contain parity. FIG. 3A shows the initial state, during normal system operation. FIG. 3B shows the state during rebuild after failure of disk D1, and FIG. 3C shows the system after insertion of a new disk to replace the failed D1.
In all these N+K RAID schemes, encoding a full stripe of redundant data for protection is much more efficient in terms of IOs and computation than updating a single block in that stripe. In fact, it is even more efficient than the RAID 1 alternative. However, forcing the writing of full stripes on magnetic drives, using various log structured approaches, severely degrades performance from a different perspective. The problem with this approach on magnetic drives is that grouping random access user writes into a full stripe harms subsequent sequential read operations by literally randomizing the application's access pattern to the underlying media. In fact, if the underlying media is not naturally random access, this will most likely degrade performance to a greater extent than using the naïve approach with the added IO overhead it entails.
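The efficiency claim above can be checked with simple IO counting for k user blocks written to a stripe with k data blocks. The sketch assumes that nothing is cached, that a full stripe write needs no reads, and that each individual update reads and rewrites the data block and every parity block; the function names are hypothetical.

```python
def full_stripe_ios(k: int, parity_blocks: int = 1) -> int:
    """Encode: k data writes plus one write per parity block, no reads."""
    return k + parity_blocks


def update_ios(k: int, parity_blocks: int = 1) -> int:
    """k independent single-block updates, each reading and then writing
    the data block and every parity block."""
    reads = 1 + parity_blocks
    writes = 1 + parity_blocks
    return k * (reads + writes)


def mirror_ios(k: int) -> int:
    """RAID 1: every user write is written twice."""
    return 2 * k


k = 9
print(full_stripe_ios(k), update_ios(k), mirror_ios(k))
```

Under these assumptions, a full stripe write costs k+1 IOs against 2k for mirroring and 4k for block-by-block updates with single parity, so the advantage grows with the stripe width.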
Under both of the methods of FIGS. 2A-C and FIGS. 3A-C, dedicated spare space must be pre-allocated, and the RAID stripe size is kept constant.
A solution to the general problem, which is agnostic of the user access pattern, does not seem to coincide with the nature of sequential media. Thus, much more complicated heuristics, which were in many cases tailored to specific user access patterns, were used to try to alleviate the problems described above.