Computer clusters, or groups of linked computers, have been widely used to improve performance over that provided by a single computer, especially in extended computations, for example, involving simulations of complex physical phenomena, etc. Conventionally, in a computer cluster, computer nodes (also referred to herein as client nodes, or data generating entities) are linked by a high speed network which permits the sharing of computer resources and memory. Data transfers to or from the computer nodes are performed through the high speed network and are managed by additional computer devices, also referred to as file servers. The file servers file data from multiple computer nodes and assign a unique location for each computer node in the overall file system.
Typically, the data migrates from the file servers to be stored either on rotating media, such as common disk drives arranged in storage disk arrays, or on solid-state storage devices, for storage and retrieval of large amounts of data. Arrays of solid-state storage devices (such as flash memory, phase change memory, memristors, and other non-volatile storage units) are also broadly used in data storage systems.
The most common type of storage device array is the RAID (Redundant Array of Inexpensive (Independent) Drives). The main concept of the RAID is the ability to virtualize multiple drives (or other storage devices) into a single drive representation. A number of RAID schemes have evolved, each designed on the principles of aggregated storage space and data redundancy.
Most of the RAID schemes employ an error protection scheme called “parity”, which is a widely used method in information technology to provide fault tolerance for a given set of data.
For example, in the RAID-5 data structure, data is striped across a number of hard drives, with a dedicated parity block for each stripe. The parity blocks are computed by applying the XOR operation across the blocks of data in the stripe. The parity is responsible for the data fault tolerance. In operation, if one disk fails, a new drive can be put in its place, and the RAID controller can rebuild the data automatically using the parity data.
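The XOR parity computation and the single-drive rebuild described above can be illustrated with a minimal sketch. The function names (`xor_blocks`, `compute_parity`, `rebuild_missing`) and the 4-byte sample blocks are hypothetical and chosen only for illustration; an actual RAID controller operates on much larger blocks, typically in hardware or firmware.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    """Parity block of a stripe: XOR of all data blocks in the stripe."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical stripe of four data blocks (one per data drive).
stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = compute_parity(stripe)

# Simulate the failure of the drive holding block 2 and rebuild its contents.
lost_index = 2
survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
recovered = rebuild_missing(survivors, parity)
```

Because XOR is its own inverse, XOR-ing the parity with the surviving blocks yields the lost block; the same property means that a single parity block cannot recover from two simultaneous failures.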
As an alternative to the RAID-5 data structure, the RAID-6 scheme uses block-level striping with double distributed parity P1+P2, and thus provides fault tolerance against two drive failures. A RAID-6 array can continue to operate with up to two failed drives. This makes larger RAID groups more practical, especially for high availability systems.
Disk drives are mechanical devices which are built with rotating media that requires a read-and-write head to be moved along the rotating media's surface in order to store or retrieve data. Nowadays, disk drives with a storage capacity of up to 10 TB (Terabytes) are capable of rotating at speeds of up to 15 K RPM (Revolutions per Minute). In addition to spinning, the drive must also move the read/write head back and forth between the tracks. Unfortunately, even with a rotation speed of 15 K RPM and a head seek time of about 8 ms, most contemporary disk drives can only maintain a sustained transfer rate of about 143 MBs (Megabytes per second).
In order to eliminate complex timing algorithms from the computer server, the disk drives are provided with an on-board cache which acts as an elastic buffer. This buffer permits a timing disconnect between the commands “read” or “write” so that the server can issue several commands in a rapid order without having to wait for the read-and-write head to arrive at the correct destination on the rotating media, i.e., while the read-and-write head is still in the “seek” mode.
One of the primary advantages of RAID is that data is striped across multiple drives. Since each drive has the on-board cache, the server can sequentially issue commands across multiple drives in rapid succession, and rotation (or execution) of commands across multiple drives is carried out in the striping process. Thus, by the time the server finishes issuing commands to the last drive in the sequence, the first drive advantageously may be ready to accept another command. In this manner, the server is able to continually issue commands to the drives without having to wait for a command to be completed. This speed advantage exists as long as the connection speed to the drive is faster than the maximum transfer rate of the disk drive.
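The rotation of commands across the drives can be pictured as a round-robin mapping of consecutive chunks onto drives. The following is a minimal sketch under simplifying assumptions: the helper `locate_chunk` is hypothetical, parity placement is ignored, and all drives are treated as identical.

```python
def locate_chunk(chunk_index, n_drives):
    """Map a logical chunk number to (drive number, chunk offset on that drive).

    Consecutive chunks land on consecutive drives, so the server returns to
    the first drive only after it has issued a command to every other drive.
    """
    return chunk_index % n_drives, chunk_index // n_drives


# Placement of the first eight chunks on a hypothetical four-drive pool.
placement = [locate_chunk(i, 4) for i in range(8)]
for chunk, (drive, offset) in enumerate(placement):
    print(f"chunk {chunk} -> drive {drive}, offset {offset}")
```

With four drives, chunks 0 to 3 go to drives 0 to 3 at offset 0, and chunk 4 wraps around to drive 0 at offset 1.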
In a traditional RAID scheme, shown in FIG. 1, a pool 10 includes a number of drives 12, which accommodate parity stripes a1, a2, . . . , an, each of which includes data chunks 14 and parity data, for example, 16 and 18.
The on-drive cache sizes vary, with 64 MB being a typical size. With a SAS (Serial Attached SCSI) connection speed of moving data to the drives of about 600 MBs, it would take approximately 133 ms to fill the on-board cache on a disk drive. Thus the server can overflow the cache on a single drive rather quickly.
By writing sequentially (horizontally) to each drive 12 in the RAIDset (pool) 10, the server can extend the amount of time it takes to fill all of the caches by a factor of 2, 5, or 10, in proportion to the number of drives. Even with 10 drives, it only takes about 1.3 seconds to reach the overflow condition in the RAID pool.
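The fill-time figures above can be checked with a short calculation. This is a sketch only: the function `fill_time_ms` and the drain rate are assumptions introduced here. Dividing 64 MB by 600 MBs alone gives roughly 107 ms; the quoted figure of approximately 133 ms is consistent with a net fill rate of 480 MBs, which would result if the cache were drained at 120 MBs while being filled. The text does not state this explicitly, so the drain rate is treated as an assumption.

```python
def fill_time_ms(cache_mb, input_mb_s, drain_mb_s=0.0):
    """Time in milliseconds to fill a cache, given input and drain rates in MB/s."""
    net_rate = input_mb_s - drain_mb_s
    if net_rate <= 0:
        return float("inf")  # the cache never overflows
    return cache_mb / net_rate * 1000.0


CACHE_MB = 64    # typical on-drive cache size from the text
SAS_MB_S = 600   # SAS connection speed from the text
DRAIN_MB_S = 120 # ASSUMPTION: drive-side drain rate, not stated for this figure

no_drain = fill_time_ms(CACHE_MB, SAS_MB_S)
with_drain = fill_time_ms(CACHE_MB, SAS_MB_S, DRAIN_MB_S)
pool_of_10 = 10 * with_drain  # caches of a 10-drive pool filled in turn

print(f"single drive, no drain:   {no_drain:.1f} ms")
print(f"single drive, with drain: {with_drain:.1f} ms")
print(f"10-drive pool:            {pool_of_10 / 1000:.2f} s")
```

Under the assumed drain rate, the 10-drive result of about 1.3 seconds agrees with the figure given above.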
During I/O activities, the on-board caches are also drained on the drive side, thus moderating the overflow. When reading or writing in a sequential (“horizontal”) fashion, a typical hard drive can sustain a transfer rate of 80 to 150 MBs. Thus it is the ratio of the input speed to the output speeds that determines the overflow. With an input speed of 600 MBs and an outflow speed of 150 MBs, a set of 4 drives is sufficient to saturate the SAS connection. If the maximum transfer rate drops, for example, to 120 MBs, then 5 drives will be needed to maintain saturation.
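The drive counts quoted above follow from dividing the channel speed by the per-drive sustained rate and rounding up. A minimal sketch, with the hypothetical helper `drives_to_saturate`:

```python
import math


def drives_to_saturate(channel_mb_s, drive_mb_s):
    """Smallest number of drives whose combined sustained rate reaches the channel speed."""
    return math.ceil(channel_mb_s / drive_mb_s)


# Rates in MB/s, taken from the figures in the text.
for drive_rate in (150, 120, 80):
    print(f"{drive_rate} MB/s per drive -> {drives_to_saturate(600, drive_rate)} drives")
```

For the slowest quoted sustained rate of 80 MBs, the same calculation gives 8 drives.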
With a RAID-5 (4 data and 1 parity) redundancy scheme, a single SAS channel is saturated at 600 MBs. A RAID-6 (8 data and 2 parity) redundancy scheme will require 2 separate SAS channels in order to maintain saturation. One solution for increasing the bandwidth to a RAIDset is to place each drive on its own SAS channel. However, this once again raises the issue of overflowing the cache of a single drive after approximately 133 ms of the I/O operation.
It has been observed that the write/read speed temporarily bursts at a high data rate until the overflow occurs, and then it slows down to the maximum transfer rate multiplied by the number of drives in the RAIDset. A RAID-6 (8 data and 2 parity) redundancy scheme would top out at a transfer rate of about 10×120 MBs=1.2 GBs.
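The sustained rate after overflow, and the number of SAS channels needed to carry it, can be estimated as follows. This is a rough sketch: the helpers `sustained_rate_mb_s` and `channels_needed` are hypothetical, and the model assumes every drive sustains the same 120 MBs rate and every channel carries 600 MBs.

```python
import math


def sustained_rate_mb_s(n_drives, drive_mb_s):
    """Aggregate post-overflow rate: per-drive sustained rate times drive count."""
    return n_drives * drive_mb_s


def channels_needed(n_drives, drive_mb_s, channel_mb_s=600):
    """Number of channels required to carry the aggregate sustained rate."""
    return math.ceil(sustained_rate_mb_s(n_drives, drive_mb_s) / channel_mb_s)


# RAID-5 (4 data + 1 parity) and RAID-6 (8 data + 2 parity) at 120 MB/s per drive.
raid5_rate = sustained_rate_mb_s(5, 120)
raid6_rate = sustained_rate_mb_s(10, 120)
raid5_channels = channels_needed(5, 120)
raid6_channels = channels_needed(10, 120)

print(f"RAID-5: {raid5_rate} MB/s over {raid5_channels} channel(s)")
print(f"RAID-6: {raid6_rate} MB/s over {raid6_channels} channel(s)")
```

Under these assumptions the RAID-5 set exactly fills one channel, while the RAID-6 set reaches 1200 MBs (1.2 GBs) and needs two channels.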
Currently, the only way to leverage multiple drives in a RAIDset, other than through the current stripe shown in FIG. 1, is to define multiple Virtual Disks (VD), as is done in de-clustered RAID organizations, where virtual disks are dynamically created out of a large pool of available drives with the intent that a RAID rebuild will involve a large number of drives working together, and thus reduce the window of vulnerability for data loss. An added benefit is that random READ/WRITE I/O (input/output) performance is also improved.
The principles of parity de-clustering are known to those skilled in the art. For example, they are presented in E. K. Lee, et al., “Petal: Distributed Virtual Disks”, published in the Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996; and Mark Holland, et al., “Parity De-clustering for Continuous Operation in Redundant Disk Arrays”, published in Proceedings of the Fifth Conference on Architectural Support for Programming Languages and Operating Systems, 1992. As presented in Edward K. Lee, et al., clients use the de-clustered redundant disk arrays as abstract virtual disks, each providing a predetermined amount of storage space built with data storage units (blocks) of Physical Disks (PD) included in the virtual disk.
The multiple VDs can be defined within the same storage system or across multiple storage systems. The downside to multiple VDs is that some additional piece of software is needed for striping and merging the multiple VDs into a single presentation for server access. For example, in Linux this is normally accomplished with LVM (Logical Volume Manager), and in MS Windows a Disk Manager is used for the same purpose. As minor an effort as it might be to merge the VDs, it still increases the CPU load for the servers and adds complexity to their management, especially in a server de-clustered environment.
One of the advantages of parallel file systems, such as GPFS and Lustre, is that they stripe their I/O across multiple storage systems, which has the effect of increasing the number of Physical Drives in a VD and tends to maximize data transfer rates for sequential access. However, both GPFS and Lustre operate on an external server independent of the storage. Linux and MS Windows can stripe VDs together, but they also run on external servers, which complicates the system structure and operation.
A more efficient approach for preventing the overflowing of the drives' on-board cache in the de-clustered RAID organization and accomplishing the striping without the need for multiple storage systems or additional software and/or external servers would greatly benefit the RAID technology.