Computer clusters, or groups of linked computers, have been widely used to improve performance over that provided by a single computer, especially in extended computations, for example, simulations of complex physical phenomena. Conventionally, in a computer cluster, computer nodes (also referred to herein as client nodes, or data generating entities) are linked by a high speed network which permits the sharing of computer resources and memory. Data transfers to or from the computer nodes are performed through the high speed network and are managed by additional computer devices, also referred to as file servers. The file servers file data from multiple computer nodes and assign a unique location for each computer node in the overall file system. Typically, the data migrates from the file servers to be stored on rotating media such as, for example, common disk drives arranged in storage disk arrays, or on solid-state storage devices, for storage and retrieval of large amounts of data. Arrays of solid-state storage devices such as flash memory, phase change memory, memristors, or other non-volatile storage units, are also broadly used in data storage systems.
The most common type of storage device array is the RAID (Redundant Array of Inexpensive (Independent) Drives). The main concept of the RAID is the ability to virtualize multiple drives (or other storage devices) in a single drive representation. A number of RAID schemes have evolved, each designed on the principles of aggregated storage space and data redundancy.
Most of the RAID schemes employ an error protection scheme called “parity” which is a widely used method in information technology to provide fault tolerance in a given set of data.
For example, in the RAID-5 data structure, data is striped across the hard drives, with a dedicated parity block for each stripe. The parity blocks are computed by running the XOR comparison of each block of data in the stripe. The parity is responsible for the data fault tolerance. In operation, if one disk fails, a new drive can be put in its place, and the RAID controller can rebuild the data automatically using the parity data.
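The XOR parity computation and single-drive rebuild described above can be illustrated with a minimal sketch. The block contents and the helper name `xor_blocks` are hypothetical and chosen only for illustration; an actual RAID controller operates on much larger blocks and typically in hardware or firmware.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four hypothetical data blocks belonging to one stripe (contents are made up).
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]

# The parity block is the XOR of every data block in the stripe.
parity = xor_blocks(data)

# Simulate the loss of the drive holding block 2, then rebuild that block
# from the surviving data blocks plus the parity block.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])

print(rebuilt == data[2])
```

Because XOR is its own inverse, XOR-ing the surviving blocks with the parity block cancels every term except the missing one, which is why a single failed drive can be rebuilt.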
As an alternative to the RAID-5 data structure, the RAID-6 scheme uses block-level striping with double distributed parity (P1 and P2), and thus tolerates two drive failures. A RAID-6 array can continue to operate with up to two failed drives. This makes larger RAID groups more practical, especially for high availability systems.
Ever since the adoption of RAID technology in data centers, there has been the problem of one application (or one host) dominating the usage of the drives involved in the RAID. As a result, other hosts (or applications) are resource-starved and their performance may decrease. A typical solution in the past was to dedicate a certain number of drives to a particular host (or application) so that it does not affect the others.
With the introduction of de-clustered RAID organizations, virtual disks are dynamically created out of a large pool of available drives with the intent that a RAID rebuild will involve a large number of drives working together, and thus reduce the window of vulnerability for data loss. An added benefit is that random READ/WRITE I/O (input/output) performance is also improved.
Parity de-clustering for continued operation in redundant disk arrays has advanced the operation of data storage systems. The principles of parity de-clustering are known to those skilled in the art and presented, for example, in Edward K. Lee, et al., “Petal: Distributed Virtual Disks”, published in the Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996; and Mark Holland, et al., “Parity De-clustering for Continuous Operation in Redundant Disk Arrays”, published in Proceedings of the Fifth Conference on Architectural Support for Programming Languages and Operating Systems, 1992.
FIG. 1A, as shown in Mark Holland, et al., represents the principle of parity and data layout in a traditional RAID-5 organization. Di,j, shown in FIG. 1A, represents one of the four data units in parity stripe number i, and Pi represents the parity unit for parity stripe i. Parity units are distributed across the disks of the array to avoid the write bottleneck that would occur if a single disk contained all parity units. The disk array's data layout provides an abstraction of a linear (“logical block”) address space to the file system. In addition to mapping the data units to parity stripes, the illustrated RAID-5 organization also specifies the data layout: data is mapped to stripe units Di,j according to ascending j within ascending i, meaning that user data is logically D0.0, D0.1, D0.2, D0.3, D1.0, D1.1, etc.
In FIG. 1A, parity is computed over the entire width of the array, that is, P0 is the cumulative parity (XOR) of data units D0.0-D0.3. When a disk is identified as failed, any data unit can be reconstructed by reading the corresponding units in the parity stripe, including the parity unit, and computing the cumulative XOR of this data. All the disks in the array are needed by every access that requires reconstruction.
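The logical ordering described above (ascending j within ascending i) can be sketched as a simple index calculation. The function name `locate` and the constant below are hypothetical; the sketch covers only the mapping from a logical data-unit number to the label Di.j and makes no claim about which physical disk holds each unit in FIG. 1A.

```python
DATA_UNITS_PER_STRIPE = 4  # four data units per parity stripe, as in FIG. 1A


def locate(logical_unit):
    """Map a logical data-unit number to (parity stripe i, data unit j)."""
    return divmod(logical_unit, DATA_UNITS_PER_STRIPE)


# The first six logical data units, labelled in the D<i>.<j> style of the text.
labels = ["D%d.%d" % locate(n) for n in range(6)]
print(labels)  # ['D0.0', 'D0.1', 'D0.2', 'D0.3', 'D1.0', 'D1.1']
```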
Let G be the number of units in a parity stripe, including the parity unit, and consider the problem of decoupling G from the number of disks in the array. This reduces to a problem of finding a parity mapping that will allow parity stripes of size G units to be distributed over some larger number of disks, C. The larger set of C disks is considered to be a whole array. For comparison purposes, the RAID-5 example in FIG. 1A has G=C=5. This property (G=C) defines RAID-5 mappings.
One perspective of the concept of parity de-clustering in redundant disk arrays is demonstrated in FIG. 1B where a logical RAID-5 array with G=4 is distributed over C=7>G disks, each containing fewer units. The advantage of this approach is that it reduces the reconstruction workload applied to each disk during failure recovery. Here, for any given stripe unit on a failed (physical) disk, the parity stripe to which it belongs includes units on only a subset of the total number of disks in the array. In FIG. 1B, for example, disks 2, 3 and 6 do not participate in the reconstruction of the parity stripe marked “S”. Hence, these disks are called on less often in the reconstruction of one of the other disks. In contrast, a RAID-5 array has C=G, and so all disks participate in the reconstruction of all units of the failed disk.
FIG. 1C represents a de-clustered parity layout for G=4 and C=5. It is important to note that fifteen data units are mapped onto five parity stripes in the array's first 20 disk units, while in the RAID-5 organization shown in FIG. 1A, sixteen data units are mapped onto four parity stripes in the same number of disk units.
More disk units are consumed by parity, but not every parity stripe is represented on each disk, so a smaller fraction of each surviving disk is read during reconstruction. For example, if in FIG. 1C, disk 0 fails, parity stripe 4 will not have to be read during reconstruction. Note that the successive stripe units in a parity stripe occur at varying disk offsets.
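The unit counts quoted above follow directly from the stripe size G, since each parity stripe of G units carries G-1 data units and one parity unit. The short check below is only an arithmetic sketch; the function name `stripe_budget` is hypothetical.

```python
def stripe_budget(disk_units, G):
    """Return (parity stripes, data units, parity units) that fit in
    `disk_units` disk units when each parity stripe holds G units,
    exactly one of which is parity."""
    stripes = disk_units // G
    return stripes, stripes * (G - 1), stripes


for label, G in (("de-clustered, G=4", 4), ("RAID-5, G=5", 5)):
    stripes, data_units, parity_units = stripe_budget(20, G)
    print(f"{label}: {stripes} stripes, {data_units} data units, "
          f"{parity_units} parity units ({parity_units / 20:.0%} parity)")
```

For the same 20 disk units, the G=4 layout devotes five units (25%) to parity, whereas the G=5 RAID-5 layout devotes four (20%).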
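The reduced reconstruction load can be illustrated with a small sketch. It assumes, purely for illustration, that one parity stripe is placed on every G-element subset of the C disks (a complete block design) and that stripes are numbered in lexicographic order; the exact unit placement of FIG. 1C is not reproduced, and the function name `reconstruction_load` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def reconstruction_load(C, G, failed):
    """Return (per-disk read fraction, stripes not touched by the failure).

    Assumption: one parity stripe per G-subset of the C disks, numbered in
    lexicographic order. The read fraction of a surviving disk is the share
    of its stripe units that belong to stripes which also include the
    failed disk.
    """
    stripes = list(combinations(range(C), G))
    untouched = [i for i, s in enumerate(stripes) if failed not in s]
    load = {}
    for disk in range(C):
        if disk == failed:
            continue
        held = sum(1 for s in stripes if disk in s)
        read = sum(1 for s in stripes if disk in s and failed in s)
        load[disk] = Fraction(read, held)
    return load, untouched


load, untouched = reconstruction_load(C=5, G=4, failed=0)
print(untouched)  # [4]
print(load[1])    # 3/4
```

Under this enumeration, the stripe with index 4 lies on disks 1 through 4 only, so it is not read when disk 0 fails, and each surviving disk is read for 3/4 of its units, that is, (G-1)/(C-1). With G=C the fraction becomes 1, which corresponds to the RAID-5 case where every surviving disk is read in full.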
As presented in Edward K. Lee, et al., clients use the de-clustered redundant disk arrays as abstract virtual disks each providing a determined amount of storage space built with data storage units (blocks) of physical disks included in the virtual disk.
Unfortunately, using all of the available drives in de-clustered RAID architectures precludes the option of isolating ill-behaved applications (or hosts) from the rest of the system.
Virtual disks are provided in de-clustered RAID organizations in an attempt to evenly distribute the data over as many drives as possible. Unfortunately, not all host activity for a specific virtual disk is evenly distributed. As a result, certain sections of a virtual disk have more activity than others, and some virtual disks will in general have more activity than others as well. Compounding the activity inequality is that changes in activity may occur over periods of days or weeks, which means that a previously inactive virtual disk may suddenly become very active, and a virtual disk that had been active for weeks might suddenly become inactive for months.
Currently, this problem is approached in the field by moving the contents of entire virtual disks, or subsections of a virtual disk, to another storage tier (such as solid-state disks versus fast drives versus near-line drives) based on activity, rather than by resolving activity conflicts within a tier of data storage disks.
Another approach is to employ a solid-state disk READ cache. This performance improvement is typically carried out via hierarchical storage management which moves data sets from one type of media to another, e.g., from SATA (Serial ATA) physical disks to SAS (Serial Attached SCSI) physical disks, Fibre Channel physical disks, or solid-state disks. Data that is not in current use is often pushed out from fast media to slower media. Block storage devices such as the SFA (Storage Fusion Architecture) often do not have visibility into the data stored on them. As a result, an SFA device must move the entire contents of a virtual disk to slower or faster media in order to improve overall system performance.
It is therefore clear that a more efficient approach, one that requires no movement of large data volumes from media to media and provides evenly distributed I/O activity in a de-clustered RAID data storage system, would greatly benefit RAID technology.