The hard disk remains the primary mass storage device for small to mid-sized computers. Because the hard disk is an electromechanical device, its performance is limited by the physical characteristics of the drive, primarily seek times and rotational delays. Once the mechanical portion of a read or write access has been performed, the data transfer rate is limited by characteristics of the media, the head, and the R/W electronics. This data transfer rate is in the range of 50 to 100 MBPS (megabytes per second) for currently available products.
The Disk Array
A system might use multiple disk drives, i.e. an array of drives, if the required capacity, performance or reliability exceeds that available from a single drive. Capacity enhancement is the most common motivation. Two drives of a given size can store twice as much data as either single drive; compare FIG. 1A to FIG. 1B. Reliability enhancement is less obvious. Two drives of a given type will together have twice the failure rate of a single drive. On the other hand, a system may be arranged so that the second drive always has an exact copy of the data on the first drive. (This is often called “mirroring”.) If either drive should fail, the data is not lost as there is still a copy; see FIG. 1C (where data A′ is a copy of data A). If a failed drive is immediately replaced and the copy rebuilt, the probability of data loss drops to the probability of two drive failures within the time required to replace a failed drive and to rebuild the copy. This is much lower than the probability of any single drive failing.
Performance is the third reason that a system might require the use of a drive array. There are two main cases. First, a high speed streaming application may require a higher sustained bandwidth than a single drive can deliver. A system with N drives can potentially provide N times the sustained bandwidth of a single drive. And second, the access time for a given drive, determined primarily by seek and rotational delays, limits the number of IO operations that can be performed per second (“IOPS”). An N-drive array can potentially support N times the IOPS performance of a single drive, in the best case. An example is illustrated in FIG. 1D.
Data Architectures
The simple addition of a second drive to a system will immediately double the capacity, but there may not be a performance improvement. Should the bulk of the accesses be to one of the drives or the other, the performance will be limited by performance of the more popular drive. Striping is a known technique which distributes data over the available drives so that retrieving the data will require the participation of all of the available drives, thereby allowing the system to attain performance approaching the aggregate performance of the drives.
The smallest addressable unit of storage on a typical disk drive is the sector. The length of a sector in bytes is typically a power of two. In this application, a sector size of 512 bytes will be used for purposes of illustration but not limitation. To stripe data, an order or sequence is assigned to the drives and a stripe width is selected. A pair of drives may be identified as 0 and 1. The stripe width might be 4K bytes, which is eight sectors of 512 bytes each. With these selections, the first 4K block of user data (“User 0” in the drawing) is stored in the first 4K block of drive 0. The second 4K block of user data (“User 1”) is stored in the first 4K block of drive 1. The third 4K block of user data is stored in the second 4K block of drive 0 and the fourth 4K block is stored in the second 4K block of drive 1. This arrangement is illustrated in FIG. 2. This process is repeated, alternating the storage of 4K blocks between the two drives, until the ends of the drives are reached. If the system has a large number of small accesses of one or two sectors, the two drives may be accessed concurrently to attain twice the random access performance of a single drive. If the system is accessing relatively large data blocks, say of 100K, the two drives may once again be operated concurrently to attain nearly twice the sustained performance of a single drive. Consequences of the stripe width selection will be discussed below.
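The block placement just described can be sketched as a simple address mapping. This is only an illustrative sketch: the function names `locate_block` and `locate_sector` are hypothetical, and the constants are the 512-byte sector, 4K stripe width and two-drive array used above for illustration.

```python
SECTOR_BYTES = 512     # sector size used for illustration in the text
STRIPE_SECTORS = 8     # 4K stripe width = eight 512-byte sectors
NUM_DRIVES = 2         # drives identified as 0 and 1


def locate_block(user_block, num_drives=NUM_DRIVES):
    """Map a 4K user block index to (drive, block index on that drive)."""
    drive = user_block % num_drives          # blocks alternate between drives
    drive_block = user_block // num_drives   # position of the block on its drive
    return drive, drive_block


def locate_sector(user_sector, num_drives=NUM_DRIVES,
                  stripe_sectors=STRIPE_SECTORS):
    """Map a logical sector number to (drive, sector number on that drive)."""
    user_block, offset = divmod(user_sector, stripe_sectors)
    drive, drive_block = locate_block(user_block, num_drives)
    return drive, drive_block * stripe_sectors + offset
```

With these assumptions, `locate_block(2)` places the third user block in the second block of drive 0, and `locate_block(3)` places the fourth user block in the second block of drive 1, matching the arrangement of FIG. 2.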
Redundancy
As described above, a drive may be added to keep a continuously updated backup copy of a primary drive. For this simple approach, any disk write operation is simply duplicated on the backup drive. The backup drive is an exact copy of the primary drive. This technique is often known as “mirroring” or RAID1. Data may be read from either drive until one of the drives fails at which point the remaining drive is selected for reads. The increased reliability results in a 100% increase in the cost of storage, i.e. one mirror drive is required for each primary drive.
There are techniques for protecting data with an incremental cost of less than 100%. Consider the two-drive array previously described with the 4K data stripe. An additional drive, the same size as the original two, may be added to the array. This drive is referred to as the “redundant” drive. See FIG. 3. In this arrangement, each 4K block of the redundant drive receives the XOR of the corresponding 4K blocks of the other two drives. For any single drive failure, the contents of any 4K block of the failed drive can be reconstructed by computing the XOR of the corresponding 4K blocks of the remaining data drive and the redundant drive. In general, for an array with data striped across N drives, the XOR of all of the data blocks in the stripe is stored on the redundant drive. Once again, any block in the stripe can be reconstructed by XORing the remaining blocks of the stripe (including the redundant drive block). The added cost of the redundancy is reduced to 1/N of the cost of the data storage, where N is the number of data drives. FIG. 4 illustrates an array with three data drives plus a redundant drive. For each stripe, the redundant drive contains the XOR of the corresponding blocks in the three data drives. In general, there is no physical difference between “data drives” and “redundant drives”. We use those labels herein as a convenient reference to a drive's designated function in the stripe. These drive functional assignments are typically rotated between stripes because the redundant (parity) drive tends to become the bottleneck for applications with a high percentage of writes, and this rotation tends to balance the load.
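The XOR redundancy and reconstruction described above can be illustrated with a short sketch. The helper names `xor_blocks`, `make_parity` and `reconstruct` are hypothetical and chosen only for illustration; blocks are modeled as Python `bytes` objects of equal length.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks in a stripe must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the redundant block for one stripe of data blocks."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks):
    """Rebuild the one missing block from all remaining blocks of the stripe,
    including the redundant block."""
    return xor_blocks(surviving_blocks)
```

For example, with three 4K data blocks `d0`, `d1`, `d2` and `p = make_parity([d0, d1, d2])`, the call `reconstruct([d0, d1, p])` returns the contents of `d2`. The same function serves both purposes because XOR is its own inverse.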
For read accesses, assuming no drive failure, the performance of the redundant array is the same as the striped array performance without redundancy. The reconstruction of a data block of a failed drive, however, requires additional disk activity to access each of the remaining drives in the array and additional processing of the data for the XOR computation. Also, the updating of any block will invalidate the redundant block for that stripe, requiring an update of the redundant block as well.
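The text does not say how the redundant block is brought up to date after a write. One commonly used approach, shown here only as a sketch and not as the method of this application, relies on the XOR identity that the new redundant block equals the old redundant block XORed with both the old and the new contents of the updated data block. The names `xor_bytes` and `update_parity` are hypothetical.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def update_parity(old_parity, old_data, new_data):
    """Return the redundant block after one data block changes.

    XORing out the old data and XORing in the new data gives the same result
    as recomputing the XOR over the whole stripe, without reading the other
    data drives.
    """
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)
```

Either way, a single-block update costs additional disk activity beyond the write of the data block itself, which is the overhead the paragraph above refers to.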
As noted above, a system with two drives can either provide redundancy by using one drive to mirror the other, or double the capacity and provide up to a 2× performance improvement. The issuing of extra disk write commands required to maintain a copy, or the extra operations required to distribute or collect data striped between two drives, can easily be handled by the driver software using a disk controller that does not provide any specialized array functions. For systems with a redundant array of three or more drives, however, the XOR computations and the additional disk activity can significantly benefit from specialized hardware with or without local intelligence. In today's market, the two-drive arrays typically are handled in software. Larger arrays utilize specialized disk controller hardware which may be located on the motherboard, in a plug-in card, or in an external box.
Redundancy Hardware
There are industry standards which describe in detail the mechanical, electrical, and logical interfaces of disk drives. A drive may be attached to a system by providing an interface commonly called a controller or an adapter meeting the requirements of the interface standard. For any system in which performance is an issue, Direct Memory Access or DMA is used by the controller to transfer disk data between the drive and system memory.
As a context for examining acceleration hardware, consider the array consisting of three data drives plus a redundant drive; see FIG. 4. Before a drive has failed, accessing a block of data requires only that the target drive be read and the data transferred by DMA to memory. There is one disk access and 4K bytes of data are transferred into the memory. If this drive should fail, accessing the same data block will require reading the balance of the stripe, i.e. the corresponding blocks of the same stripe from all of the other drives.
Each of the remaining drives is read with the data transferring by DMA to memory. Even though the three drives may have identical average access characteristics, the read operations will actually complete at different times for various reasons, including the fact that the initial states of the head position and rotational position are independent. FIG. 35A shows this asynchronous data transfer from the Data 0, Data 1 and PAR (parity or redundant) drives via respective DMA channels to corresponding buffer memory. The Data 2 drive has failed. Once the three blocks are stored in a buffer, the XOR operation can be performed to reconstruct the missing data. Referring now to FIG. 35B, to XOR the three streams, one element is read from each of the streams, the three elements are XORed in logic 620, and the resulting element is stored in a new block of the memory 622. Note that an element may be of any convenient size for the memory and DMA hardware involved. This process required three disk accesses: 12K of data was transferred into the memory from the disks (using 4K blocks for illustration); 12K of data was read back out of the memory for XOR computation; and 4K bytes of data were written back into the memory 622 for a total of 28K of data transfers into or out of the memory.
From the foregoing example, we observe the following:
1. While the accessing of a 4K data block required only 4K of data transfer, the post-failure access required 28K bytes total of buffer access, a seven-fold increase in system bus and memory bandwidth loading.
2. The XOR computation could not begin until data blocks had been received from all three drives. Thus the entire XOR process adds to the total latency of the read operation, creating an incentive to make the buffer memory and XOR engine as fast as might be practical. Note that while the XOR process might have been started on the first two streams, extra bandwidth would be required to store the intermediate results and to fetch them once again to be XORed with the final data block.
3. The post-failure access essentially tripled the overhead for drive management.
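The 28K figure in observation 1 follows from simple arithmetic, sketched below. The function name `buffered_rebuild_traffic` is hypothetical; the 4K block size and the three surviving drives are the values used in the example above.

```python
def buffered_rebuild_traffic(surviving_drives, block_bytes=4096):
    """Bytes moved into or out of buffer memory to rebuild one block when
    every surviving block is first stored in memory and then read back."""
    disk_to_memory = surviving_drives * block_bytes   # DMA from each drive
    memory_to_xor = surviving_drives * block_bytes    # read back for the XOR
    xor_to_memory = block_bytes                       # reconstructed block
    return disk_to_memory + memory_to_xor + xor_to_memory


normal_read = 4096                           # one 4K block, no failure
degraded_read = buffered_rebuild_traffic(3)  # Data 0, Data 1 and PAR survive
ratio = degraded_read // normal_read
```

Here `degraded_read` is 28672 bytes (28K) and `ratio` is 7, the seven-fold increase noted above. More generally, under this buffered scheme the traffic is (2M + 1) blocks for M surviving drives.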
Synchronous Redundant Data Transfers
Typically disk drives are internally buffered in order to decouple the data transfer rate of the R/W head from the transfer rate of the drive interface. This internal buffer, and its ability to accommodate various interface speeds, can be exploited to enhance redundancy operations while significantly reducing the hardware requirements. Consider the ATA/ATAPI interface in its original parallel Programmed Input Output or PIO mode of operation. In this mode, a single sixteen-bit word of data was read from or written to the drive's internal buffer using a read or write strobe (DIOR or DIOW) provided by the controller/adapter. In the discussion of redundancy hardware above, recall that reconstructing a block of data that was on a failed drive required reading the remaining three drives, transferring their data into a local buffer, and then reading the three streams from the buffers in order to compute the XOR function. This was because the drives, while all operating concurrently, were not synchronized to each other, so each transferred data at different times. (We use “internal buffer” to refer to a disk drive's internal buffer, as distinguished from buffer memory in a controller, adapter or host.)
An alternative technique is known as Synchronous Redundant Data Transfers or SRDT. With SRDT:
1. The read commands are issued to all three (or N) drives. Read data is not immediately transferred when fewer than three (or N) drives have data available in their internal buffers.
2. However, when the read data from all three drives is available in their respective internal buffers, the XOR process can begin. An XOR engine fetches a first element from each drive; computes the XOR of the three elements; and outputs the first element of the result to the buffer within the controller/adapter. This redundancy operation is “on the fly” as it occurs as data is moved from the drive to the buffer, as distinguished from first storing data in the buffer, and then having to read it out to do the redundancy operations as described above.
For the ATA/ATAPI drive in PIO mode, the element size is a single sixteen bit word, the width of the interface. The element fetching is accomplished by asserting the DIOR strobe to the three drives simultaneously. The use of the common DIOR strobe makes the data transfer “synchronous”. In the scheme described above under redundancy hardware, the XOR process could not start until the data from the last drive had been transferred to the memory.
In the Synchronous (SRDT) scheme, the process begins as soon as the data from the last drive is available in that drive's internal buffer. Assuming that the read strobes are generated at the maximum rate supported by the drives, the advantages of the Synchronous Data Transfer and the On-The-Fly redundancy computation are as follows:
1. From the time when the last-to-finish drive has the read data ready in its buffer, the XOR is computed and the result is transferred with the same latency as the transferring of data from a single drive prior to the failure. The additional latency of fetching the three blocks from the buffer, computing the XOR, and storing the result to the buffer is eliminated.
2. The total amount of data transferred to the buffer is the 4K block that was originally transferred. The total buffer bandwidth required in this example is that bandwidth required to support a single drive.
3. The data from the three drives is reduced to a single stream. Only a single DMA context (address and count) is required for the operation rather than one DMA context per drive as was required in the original example. This efficient operation, however, is dependent on using a storage element size equal to the width of the drive interface (“narrow striping”), and it is limited to synchronous transfers invoked by applying a common DIOR strobe to all of the drives in the array.
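The synchronous, on-the-fly scheme can be modeled in software as a toy sketch. SRDT as described is a hardware technique, so the `Drive` class and `srdt_read` function below are hypothetical stand-ins: each drive's internal buffer is modeled as a queue of sixteen-bit words, and a call to `strobe` stands in for one assertion of the common DIOR strobe.

```python
from collections import deque


class Drive:
    """Toy model of a drive with an internal buffer of 16-bit words."""

    def __init__(self):
        self.internal_buffer = deque()

    def load(self, words):
        """Simulate the drive completing its read into the internal buffer."""
        self.internal_buffer.extend(words)

    def ready(self, count):
        """True once at least `count` words are waiting in the buffer."""
        return len(self.internal_buffer) >= count

    def strobe(self):
        """Return the next buffered word, as one DIOR strobe would."""
        return self.internal_buffer.popleft()


def srdt_read(drives, word_count):
    """XOR words from all drives on the fly into one output stream.

    Returns None, consuming nothing, if any drive is not yet ready.
    """
    if not all(d.ready(word_count) for d in drives):
        return None
    output = []                      # the single stream sent to the buffer
    for _ in range(word_count):
        element = 0
        for d in drives:             # common strobe: one word from each drive
            element ^= d.strobe()
        output.append(element & 0xFFFF)
    return output
```

In this model only the XOR result reaches the controller's buffer, so a single destination address and count describe the whole transfer, mirroring the single DMA context noted in item 3.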
In view of the above background discussion, several problems remain. First, with the PATA technology, a controller could synchronously access multiple drives, allowing it to perform redundancy computations on-the-fly for improved RAID performance with markedly reduced hardware complexity. The technologies which have since evolved, first UDMA and then SATA, are source-synchronous, which does not allow a controller to transfer read data from multiple drives synchronously.
Second, the prior art used synchronous data transfers with on-the-fly redundancy with stripe widths of a few bytes or words, merging the drive data into a single stream which could be transferred to or from a buffer with a single DMA channel. Techniques are needed to extend the use of synchronous data transfers and on-the-fly redundancy to arrays with stripe widths of a sector or more.
In addition, current non-synchronous techniques for implementing wide striped controllers transfer disk data to memory before any XOR computations can be performed. This is wasteful of buffer bandwidth relative to techniques that might perform redundancy operations without first transferring all of the data to memory and having to read it back again for computations.