In this specification, the words data “restoration”, data “reconstruction”, data “rebuilding” or data “recovery” are used interchangeably to designate the activity of rebuilding data lost due to failure of a data storage device such as a solid state storage device. References to “disk” or “drive” or “device” failures are used interchangeably, although it is well understood that not all storage drives use rotating disks. RAID arrays can be implemented using solid state drive (SSD) devices, for example. The present invention enables data recovery regardless of the cause of failure of one of the devices within an array of storage devices.
A RAID storage architecture is an architecture that combines a plurality of physical disks connected to an array controller, which is connected via one or more high bandwidth buses to one or more host computers.
RAID stands for “Redundant Array of Independent Disks” or “Redundant Array of Inexpensive Disks”. The links between the controller and each storage device in the array may include Small Computer System Interface (SCSI) links. The array controller is typically responsible for controlling the individual disks or solid state drives, maintaining redundant information, executing requested transfers, and recovering from disk failures. The array combines the plurality of storage devices in a logical unit so that the array appears to the or each host computer as a linear sequence of data units, numbered for example 1 to N×B, where N is the number of devices in the array and B is the number of units of user data on each device.
Fundamental to all RAID arrays is the concept of striping consecutive units of data across the devices of the array. As introduced in “RAIDframe: A Rapid Prototyping Tool for RAID Systems”, by William V. Courtright II, August 1996, striping is defined as breaking up the linear address space exported by the array controller into blocks of some size and assigning consecutive blocks to consecutive devices, rather than filling each device with consecutive data before switching to the next. The striping unit or stripe unit, which is set by the controller, is the maximum amount of consecutive data assigned to a single device. The striping unit can be, for example, a single bit or byte or some other data size smaller than the entire storage capacity of a physical device. Striping has two main benefits: automatic load balancing in concurrent workloads and high bandwidth for large sequential transfers by a single process. An N-disk coarse-grain striped array can service N I/O (Input/Output) requests in parallel.
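By way of illustration only, the striping definition above can be sketched as a mapping from a logical block number to a device and an offset on that device. The function name `locate_block`, the 0-based numbering and the parameter `blocks_per_stripe_unit` are assumptions made for this sketch; parity placement is deliberately ignored, and no particular controller is implied.

```python
def locate_block(logical_block: int, num_devices: int,
                 blocks_per_stripe_unit: int = 1) -> tuple[int, int]:
    """Map a 0-based logical block to (device index, block offset on device).

    Hypothetical helper: plain striping without parity, for illustration.
    """
    # Index of the stripe unit that contains this block, counted over the array.
    stripe_unit = logical_block // blocks_per_stripe_unit
    # Position of the block inside its stripe unit.
    within_unit = logical_block % blocks_per_stripe_unit
    # Consecutive stripe units go to consecutive devices, wrapping around.
    device = stripe_unit % num_devices
    # Each full pass over the devices advances every device by one stripe unit.
    row = stripe_unit // num_devices
    return device, row * blocks_per_stripe_unit + within_unit


# With 4 devices and a one-block stripe unit, logical block 5 lands on
# device 1 at offset 1.
print(locate_block(5, num_devices=4))
```

With a stripe unit of one block, N consecutive logical blocks fall on N different devices, which is why an N-disk array can service N independent requests in parallel.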
RAID arrays as defined in “A Case for Redundant Arrays of Inexpensive Disks (RAID)” were introduced by David Patterson, Garth A. Gibson, and Randy Katz in 1987. The authors originally conceived five standard schemes, which are referred to as RAID levels 1 through 5. Many more variations, for example nested levels, have evolved in the standards or as proprietary solutions. RAID levels and their associated data formats are standardized by the Storage Networking Industry Association (SNIA) in the Common RAID Disk Drive Format (DDF) standard. Each scheme provides a different balance between three key goals: resilience, performance, and capacity.
For example, in RAID level 4, data is distributed across multiple devices and parity data for protecting against data loss is confined to a single dedicated parity disk or equivalent device. Each device in the array operates independently, allowing I/O requests to be performed in parallel. However, the use of a dedicated parity disk can create a performance bottleneck. As the parity data must be written to a single, dedicated parity disk for each block of non-parity data, the overall write performance may largely depend on the performance of this parity disk.
In RAID level 5, there are a variety of ways to lay out data and parity such that the parity is evenly distributed over the disks. FIG. 1 illustrates graphically an exemplary RAID Level 5 array 10 having 5 independent disks labeled Disk 1 to 5 in the figure. The left-symmetric organization shown in FIG. 1 is an example of a typical RAID level 5 layout. Each disk has 5 blocks. The RAID level 5 array 10 is formed by placing the parity units along the diagonal and then placing the consecutive user data units on consecutive disks at the lowest available offset on each disk. In RAID level 5, the parity blocks are distributed throughout the array rather than being concentrated on a single disk. This avoids throughput loss encountered due to having only one parity disk. The data integrity of the array is not destroyed by a single drive failure. Upon drive failure, any data lost in the failed drive can be calculated using the distributed parity such that the drive failure is not visible to the end user.
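The placement rule described above (parity along the diagonal, consecutive data units on consecutive disks) can be sketched as follows. This is a minimal illustration assuming the commonly used left-symmetric convention, in which parity for the first stripe sits on the last disk and moves one disk to the left on each subsequent stripe. The function name `left_symmetric_layout` is hypothetical, and the exact placement in FIG. 1 may differ from this sketch.

```python
def left_symmetric_layout(num_disks: int, num_stripes: int) -> list[list[str]]:
    """Return one row of labels per stripe, one column per disk.

    Assumed convention: parity starts on the last disk and shifts left;
    data for a stripe starts on the disk just after the parity disk.
    """
    rows = []
    next_block = 1
    data_per_stripe = num_disks - 1
    for stripe in range(num_stripes):
        parity_disk = (num_disks - 1 - stripe) % num_disks
        row = [""] * num_disks
        first, last = next_block, next_block + data_per_stripe - 1
        row[parity_disk] = f"Parity {first}-{last}"
        for k in range(data_per_stripe):
            # Consecutive data units go to consecutive disks, wrapping around.
            disk = (parity_disk + 1 + k) % num_disks
            row[disk] = f"Block {next_block}"
            next_block += 1
        rows.append(row)
    return rows


for row in left_symmetric_layout(num_disks=5, num_stripes=5):
    print(" | ".join(f"{cell:<12}" for cell in row))
```

Under this assumed convention, every disk holds exactly one parity block per five stripes, so parity writes are spread evenly over the array.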
In FIG. 1, “block i” (where i is an integer between 1 and 20 inclusive) represents a block of user data of unspecified size and “Parity i-j” represents a parity block computed over data blocks i through j. The parity blocks, which represent redundant information for recovering data blocks, hold the cumulative XOR over the corresponding data units. For example, Parity 1-4 = Block 1 XOR Block 2 XOR Block 3 XOR Block 4. Following a single drive failure, the failed drive is replaced and the associated data rebuilt. As illustrated in FIG. 2, if Disk 2 fails, block 2 will be lost. Block 2 is then reconstructed from the redundant data available on the remaining working disks. For example, block 2 is recovered as Block 2 = Parity 1-4 XOR Block 1 XOR Block 3 XOR Block 4. The recovered data may be rebuilt on a dedicated existing spare drive 23 or distributed across the remaining drives of the array. Some storage systems implement a swap action to replace a failed drive with another drive, and the data on the failed drive can be rebuilt after the failed drive is replaced, but many applications require a very fast rebuild that cannot wait for drive replacement.
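The XOR relationship above can be checked with a short sketch. The helper name `xor_blocks` and the two-byte block contents are arbitrary values chosen for illustration; they are not taken from the figures.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of one or more equally sized blocks (hypothetical helper)."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Arbitrary example contents for Block 1 to Block 4.
block1, block2, block3, block4 = b"\x0f\xa0", b"\x33\x5c", b"\x55\x01", b"\xf0\xff"

# Parity 1-4 = Block 1 XOR Block 2 XOR Block 3 XOR Block 4
parity_1_4 = xor_blocks(block1, block2, block3, block4)

# Disk 2 fails: Block 2 = Parity 1-4 XOR Block 1 XOR Block 3 XOR Block 4
recovered = xor_blocks(parity_1_4, block1, block3, block4)
assert recovered == block2
```

The recovery works because XOR is its own inverse: XOR-ing the parity with every surviving block cancels those blocks and leaves the missing one.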
One implementation of RAID level 5 is RAID level-5 Enhanced (or RAID 5E), which has a built-in spare disk. This RAID implementation stripes data and parity across all of the disks in the array. In a traditional RAID 5 configuration with a hot spare or dedicated spare disk 23, the spare disk 23 sits next to the array waiting for a drive to fail, at which point the spare disk 23 is made available and the array rebuilds the data set with the new hardware. In RAID level 5E, by contrast, the spare disk is actually part of the RAID level-5E array.
FIG. 3 shows an example of a RAID level-5E logical drive. A RAID 5E array comprises five physical disks (Disks 1 to 5). A logical disk is created over the 5 physical disks. The data is striped across the disks, creating blocks (Blocks 1 to 16) in the logical disk. It should be noted that the “EMPTY” space in this figure is shown at the end of the array (i.e. the end block of each disk). The “EMPTY” space is the free space corresponding to the distributed spare disk. The storage of the data parity (denoted by “Parity”) is striped, and it shifts from disk to disk as it does in RAID level-5.
Referring to FIG. 4, when a disk 42 in a RAID 5E array fails, the data that was on the failed disk is reconstructed into remaining disks through use of the empty space at the end of the array. The array undergoes compression, and the distributed spare disk becomes part of the array. The logical disk remains RAID level-5E with parity blocks distributed across the disks. When the failed disk is replaced, the array is once again expanded to return the array to the original striping scheme (not shown on FIG. 4).
RAID level 6 is referred to as block-level striping with double distributed parity and provides fault tolerance of two drive failures as the array continues to operate with up to two failed drives.
Turning to the actual types of storage disks that can be provided in a RAID, Solid State Drive (SSD) devices are data storage devices that use nonvolatile flash memory to store data persistently. In contrast to traditional magnetic disks such as Hard Disk Drives (HDDs) or floppy disks, which are electromechanical devices containing spinning disks and movable read/write heads, SSDs do not employ any moving mechanical components and have lower latency than a spinning hard disk. If a hard disk has to read data from multiple locations, the drive heads are required to move between tracks and then typically have to wait some milliseconds for the correct blocks to rotate underneath them to be read.
A modern Solid State Drive performs much more quickly, as it is a storage drive consisting of a collection of NAND (NOT AND) flash memories. Solid State Drives do not have moving heads or rotating platters. Every block of flash memory is accessible at the same speed as every other block of flash memory, whether the blocks are stored right next to each other or in different physical NAND chips. As a result, SSDs offer much lower latency and faster data access time compared to electromechanical disks. For example, when an HDD retrieves a large file, the above-described searches for the file may result in an access time of 10-15 ms, whereas an SSD may retrieve the same file in as little as 0.1 ms. An SSD is typically about 10 times faster than the spinning disks in an HDD. In terms of Input/Output operations per second, SSDs can be used to replace multiple spinning disks. In addition to lower access time, SSDs can read and write data faster, offering quicker responses and faster transfer speeds and therefore higher throughput. SSD technology is therefore suitable for applications having high performance requirements. This makes SSD servers ideal for applications where throughput is important, such as video distribution or financial applications.
Several Solid State Drive devices can be installed in a server to form a RAID. SSDs and supported RAID controllers can be installed on several servers (e.g. System x and IBM iDataPlex® servers and BladeCenter® and IBM Flex System™ servers, which are all available from IBM Corporation). RAID arrays consisting of SSDs combine the benefits of a RAID storage array with those of SSD devices, mainly fault tolerance and very fast data storage. Fault tolerance is provided in SSD RAID arrays by typical mechanisms for reconstructing data onto spare disks, as illustrated in the above examples of RAID levels 5 and 5E.
Generally, the inherent speed of SSDs allows for fast data reconstruction of an SSD RAID array when an SSD fails. However, SSDs exhibit some limitations: they can typically be read many times faster than they can be written to. As an example, SSD drives that are currently available from IBM® Corporation have a ratio of write speed to read speed which is either 1/4 or 3/20.
SSDs may be based on SLC (Single-Level Cell) or MLC (Multi-Level Cell) NAND flash memory technology. SLC flash memory stores data in arrays of floating-gate transistors, or cells, with 1 bit of data in each cell. MLC flash memory, in contrast to SLC flash memory, stores two bits of data per cell. MLC flash memory can be further divided into two categories: Consumer-grade MLC (cMLC), used in consumer (single user) devices; and Enterprise-grade MLC (eMLC), designed specifically for use in enterprise (multiple user) environments. Each of SLC, cMLC and eMLC has different characteristic read and write speeds and a different ratio between read and write speeds. For example, for an SLC device reading and writing 4 kB blocks of data, the read speed is 4,000 operations per second and the write speed is 1,600 operations per second (i.e. the read speed is 2.5 times the write speed). This compares with an HDD, for which a typical read speed is 320 operations per second and a typical write speed is 180 operations per second (a ratio of approximately 1.78). This asymmetry in read speed and write speed is even higher for cMLC and eMLC technologies, which typically achieve read speeds of 20,000 operations per second for 4 kB blocks of data and write speeds of 3,000 operations per second for the same size blocks, a ratio of approximately 6.7. Thus, write operations are much slower than read operations when using current SSDs.
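The read/write asymmetry figures quoted above can be recomputed directly from the stated rates. The dictionary keys and the function name `read_write_ratio` are labels chosen for this sketch only. When rounded rather than truncated, the quotients come out as 2.5, roughly 1.78 and roughly 6.67.

```python
# (read, write) rates for 4 kB blocks, in operations per second,
# as quoted in the paragraph above.
rates = {
    "SLC": (4_000, 1_600),
    "HDD": (320, 180),
    "cMLC/eMLC": (20_000, 3_000),
}


def read_write_ratio(read_rate: float, write_rate: float) -> float:
    """How many times faster reads are than writes."""
    return read_rate / write_rate


ratios = {name: read_write_ratio(r, w) for name, (r, w) in rates.items()}
for name, ratio in ratios.items():
    print(f"{name}: reads are {ratio:.2f} times as fast as writes")
```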
In some RAID implementations, the difference between read speeds and write speeds during array reconstruction is partly due to the fact that data to be read is striped across several disks and the data is rebuilt on one dedicated spare disk. It is faster to read data in parallel from multiple disks, than to write the rebuilt data onto one dedicated spare disk. For example in RAID 5, during RAID array reconstruction, data is read from several remaining disks while the recovered data is being written to only one spare disk.
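The effect of rebuilding onto a single dedicated spare can be illustrated with a deliberately idealized model. It assumes that all surviving disks are read fully in parallel, that reading and writing overlap, and that controller overhead, XOR cost and host I/O are negligible; none of these assumptions comes from a measured system. The function name and the figure of 1,200,000 blocks per disk are hypothetical, while the rates are the cMLC/eMLC figures quoted earlier in this section.

```python
def rebuild_time_dedicated_spare(blocks_per_disk: int,
                                 read_rate: float,
                                 write_rate: float) -> float:
    """Idealized rebuild time in seconds for a dedicated-spare rebuild.

    Each surviving disk contributes blocks_per_disk reads, performed in
    parallel; every rebuilt block is written to the one spare disk.
    """
    read_time = blocks_per_disk / read_rate
    write_time = blocks_per_disk / write_rate
    # With overlapped reads and writes, the slower stage sets the pace.
    return max(read_time, write_time)


# Hypothetical disk of 1,200,000 blocks; 20,000 reads/s and 3,000 writes/s.
seconds = rebuild_time_dedicated_spare(1_200_000, 20_000, 3_000)
print(f"idealized rebuild time: {seconds:.0f} s")
```

In this model the reads would finish in 60 s, but the rebuild takes 400 s because every rebuilt block must pass through the single spare disk's write path.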
As will be understood by the person skilled in the art, distributed sparing does not suffer from the same level of asymmetry of read/write speeds. This is because a distributed sparing mechanism, such as the one in a RAID 5E storage system, involves a spare storage space distributed amongst the disks of the array (see FIGS. 3 and 4). Hence Input/Output operations that are required for the rebuild are spread across the remaining working disks, reducing the asymmetry in speeds of read and write operations. The asymmetry in read/write speeds does not, however, disappear in RAID arrays using a distributed sparing mechanism, because in the state of the art the number of disks being read from is generally equal to the number of disks being written to as the distributed spare space is on the same set of disks that are being read. It should be noted that distributed sparing schemes such as RAID 5E have not become ubiquitous as they have their own complexities and disadvantages. For example, reading from and writing to the same storage drive can be problematic. Dedicated sparing where a single disk is used as a spare is still the most widely used option for organizing the spare storage space.
In this context, the speed of writing data to a spare disk is a bottleneck during the rebuild of a RAID array. The time for rebuilding a RAID array is critical because, when a disk fails, there is a period of vulnerability which is characterized by intensive disk processing. During this time, the array reconstruction is vulnerable to a second failure. The longer it takes to rebuild the array, the longer this vulnerability period lasts. The speed of the array reconstruction is therefore critical when a disk fails.
The speed of reconstruction is also critical for an SSD RAID array because, generally, applications for which SSD technology is used are critical applications which do not tolerate high latencies (e.g. video distribution and financial analysis). The write bandwidth of current SSDs is a bottleneck which limits the speed of write operations and therefore impedes the speed of SSD RAID array reconstruction. Therefore, there is a need to minimize the reconstruction time for storage arrays, including high speed SSD arrays, in the event of a disk failure. Also, SSD devices have a limited lifespan in terms of number of accesses, so data rebuilds are to be expected when SSDs are used for long-term persistent data storage.