Redundant Array of Independent Disks (RAID) is a technology that uses a collection of independent disks in a coordinated fashion to achieve better performance, greater reliability, increased capacity, or a combination of these features. RAID levels 0, 1, 5 and 6 are the most commonly used. RAID level 0 stripes data across all disks in the array to achieve improved performance. Each disk is a single point of failure: if one disk fails, all data on the array is lost. RAID level 1, on the other hand, targets improved reliability. Data on the array is mirrored across all disks in the array. If one disk fails, data can be accessed through any of the remaining mirrored disks in the array. RAID level 5 combines improved reliability and performance. For each stripe of data blocks, a parity block is computed from the data blocks and written to a separate disk in the array. There is no dedicated parity disk, and in case of a single drive failure, data can be accessed and reconstructed using the remaining disks and the corresponding parity blocks. RAID level 6 provides a reliability improvement over RAID level 5, as it uses two independent parity blocks for each stripe of data. It can protect against two drive failures. Other RAID levels are defined in the literature as well. For more details on RAID technology, see “A Case for Redundant Arrays of Inexpensive Disks (RAID)”, by David A. Patterson, Garth Gibson, and Randy H. Katz of the University of California, Berkeley, dated 1988.
Solid State Disks (SSDs), such as NAND Flash memory-based SSDs, are popular storage devices alongside magnetic disks. Our experiments using an array of various SSDs in a RAID configuration have revealed a fundamental performance bottleneck. For details on this performance bottleneck, see “An Empirical Study of Redundant Array of Independent Solid-State Drives (RAIS)”, Y. Kim, S. Oral, D. Dillow, F. Wang, D. Fuller, S. Poole, and G. Shipman, Technical Report, ORNL/TM-2010/61, National Center for Computational Sciences, March 2010.
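By way of illustration only, the RAID level 5 parity scheme described above can be sketched in a few lines of Python. The sketch assumes bytewise XOR parity, which is the conventional choice for RAID level 5; the function name xor_blocks, the four-disk layout, and the block contents are hypothetical and chosen only for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-disk array: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate the loss of the disk holding data[1]; rebuild it from the
# surviving data blocks and the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```

Because XOR is its own inverse, the same routine serves both to compute the parity block and to reconstruct a single missing block. It cannot recover from two missing blocks in the same stripe, which is the case RAID level 6 addresses with a second, independent parity block.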
SSDs are compatible with existing disk technologies including disk drivers, input/output (I/O) buses, system software and operating systems. This compatibility allows easy replacement of individual magnetic disks with SSDs in existing storage systems. SSDs are pure semiconductor devices and do not have any mechanical moving parts (unlike magnetic disks, which are mechanical devices). This eliminates disk head seek latencies and increases performance for various I/O workloads. SSDs are also more resilient to mechanical disturbances compared to magnetic disks. As SSD technologies mature, mass production costs are dropping. This triggers reduced market prices, making SSDs more available to consumers. Altogether, these factors are making SSDs an attractive alternative to magnetic disks.
Current SSD technology supports three basic I/O operations: write, read and erase (magnetic disk technology supports only the first two). The basic unit of data storage in SSDs is a page (a group of flash memory cells, typically 4 kilobytes (KB) in capacity). Pages are further grouped into blocks. In SSDs, the granularity of reads and writes is at the page level, whereas the granularity of an erase operation is at the block level.
As stated above, SSDs are purely electronic devices (no mechanically rotating or moving parts such as disk heads, rotator arms, etc.). SSDs have consistent read performance (the spatial locality of data on an SSD is irrelevant to the read operation as there is no disk head). However, writing to SSDs is slower and more complicated than reading, as explained below.
File system delete operations only flag data blocks as “not in use” at the file system level, using the file system's block usage map. Storage devices (SSDs and magnetic disks) lack an accurate view of this block map (indicating which data blocks are actually in use and which have become available). When the operating system writes to a block that was recently freed by the file system (but not by the storage device), the write is treated as an overwrite operation at the storage device level. This is not a problem for magnetic disks because there is no difference between writing to a free block and overwriting a used one. Unlike conventional magnetic disks, SSDs require a block to be erased before it is written. Given this constraint, a simple approach to updating data within a block on an SSD would be to read the block into volatile memory, modify the block in memory with the updated data, erase the underlying block, and finally write the updated data from volatile memory. This approach is referred to as read-modify-erase-write. Unfortunately, erase operations on SSDs have higher overhead than read and write operations, making this read-modify-erase-write cycle inefficient. To overcome this inefficiency, SSDs use a copy-on-write operation in which the contents of a block are copied into memory and modified there, then written to a known free block. The original target block for the overwrite operation is then marked as “invalid.” Although this is more efficient than the read-modify-erase-write method (since it does not require an erase operation), the number of available free blocks decreases over time, and the invalidated blocks must be reclaimed. SSDs solve this problem by using a mechanism called garbage collection (GC). GC is the process of reclaiming “invalidated” pages and creating usable free space on an SSD.
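The copy-on-write and garbage collection behavior described above can be illustrated with a toy model in Python. This is a minimal sketch under stated assumptions, not a representation of any vendor's firmware: the class ToySSD and its methods are hypothetical, mapping is kept at block granularity to mirror the description above, and GC runs only when no free block remains.

```python
class ToySSD:
    """Toy block-mapped SSD: overwrites go to a free block, GC erases stale ones."""

    def __init__(self, num_blocks, pages_per_block=4):
        self.pages_per_block = pages_per_block
        self.blocks = [None] * num_blocks    # physical block contents (None = erased)
        self.free = list(range(num_blocks))  # erased blocks ready to be written
        self.invalid = set()                 # stale blocks awaiting erase
        self.mapping = {}                    # logical block -> physical block
        self.erase_count = 0

    def garbage_collect(self):
        # Erase every invalidated block and return it to the free list.
        for pb in sorted(self.invalid):
            self.blocks[pb] = None
            self.free.append(pb)
            self.erase_count += 1
        self.invalid.clear()

    def write_page(self, logical_block, page, value):
        if not self.free:
            self.garbage_collect()           # assumed policy: GC only when out of space
        if not self.free:
            raise RuntimeError("device full: no free or invalid blocks")
        old = self.mapping.get(logical_block)
        # Copy the current contents into memory and modify the copy there.
        if old is None:
            contents = [None] * self.pages_per_block
        else:
            contents = list(self.blocks[old])
        contents[page] = value
        # Write the modified copy to a known free block; no erase is needed here.
        new = self.free.pop(0)
        self.blocks[new] = contents
        self.mapping[logical_block] = new
        if old is not None:
            self.invalid.add(old)            # the original block is now stale

    def read_page(self, logical_block, page):
        return self.blocks[self.mapping[logical_block]][page]


ssd = ToySSD(num_blocks=3)
ssd.write_page(0, 0, "a")   # first write: no block is invalidated
ssd.write_page(0, 1, "b")   # overwrite: physical block 0 becomes invalid
ssd.write_page(0, 0, "c")   # overwrite: physical block 1 becomes invalid
ssd.write_page(0, 2, "d")   # no free block left, so GC erases blocks 0 and 1 first
assert ssd.erase_count == 2
```

In this model the cost of erasing is deferred rather than eliminated: the first three writes complete without any erase, and the fourth pays for two erases before it can proceed. That deferred cost is what surfaces as GC overhead.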
Current SSD technology uses GC processes controlled by the SSD, with algorithms and policies that are vendor specific. Generally, during an ongoing GC process, incoming requests targeted at the same Flash chip that is busy with GC are stalled, placed in a queue, and scheduled for service after the GC process completes. This stalling can degrade performance when incoming requests are bursty.
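The effect of such stalling on a burst of requests can be sketched as follows. The model is deliberately simplified and entirely hypothetical: a single Flash chip serves requests in arrival order with a fixed service time, one GC window occupies the chip from gc_start to gc_end, and a request that would begin service inside that window waits until the window closes. The function name completion_times and the time units are assumptions made for the example.

```python
def completion_times(arrivals, service_time, gc_start, gc_end):
    """Completion time of each request on one chip, served in arrival order."""
    done = []
    chip_free_at = 0.0
    for t in sorted(arrivals):
        start = max(t, chip_free_at)
        # A request that would start inside the GC window is queued until GC ends.
        if gc_start <= start < gc_end:
            start = gc_end
        finish = start + service_time
        chip_free_at = finish
        done.append(finish)
    return done


arrivals = [0.0, 1.0, 2.0, 3.0]
finished = completion_times(arrivals, service_time=0.5, gc_start=1.0, gc_end=5.0)
latencies = [f - a for f, a in zip(finished, arrivals)]
assert latencies == [0.5, 4.5, 4.0, 3.5]
```

Only the first request completes at the nominal service time. The remaining three arrive while GC is in progress, accumulate in the queue, and are served back to back once GC finishes, so every request in the burst sees its latency inflated.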
Fragmentation caused by small random writes increases the GC overhead. It has been empirically observed that GC activity is directly correlated with the frequency of write operations, the amount of data written, and the free space on the SSD. Under certain circumstances, the GC process can significantly impede SSD I/O performance (e.g., when writes overlap with an ongoing GC process). See “An Empirical Study of Redundant Array of Independent Solid-State Drives (RAIS)”, Y. Kim, S. Oral, D. Dillow, F. Wang, D. Fuller, S. Poole, and G. Shipman, Technical Report, ORNL/TM-2010/61, National Center for Computational Sciences, March 2010.
Using SSDs in a RAID configuration for increased storage capacity and performance is an attractive idea, since a collection of SSDs presents a cost-effective solution in terms of price/performance and price/capacity ratios for various I/O workloads compared to a single SSD device of similar capacity and performance.
With current SSD technology, GC processes for individual SSDs are local, and there is no coordination at the RAID-controller level. This lack of coordination causes individual GC processes to execute independently, resulting in aggregate performance degradation at the RAID level.
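A back-of-the-envelope calculation suggests why independent GC activity compounds at the array level. The sketch below rests on assumptions that are not taken from the cited report: each SSD is busy with GC for the same fraction of time p_gc, the devices are statistically independent, every striped request touches all devices, and a request is delayed whenever at least one device is in GC. The function name stall_probability is hypothetical.

```python
def stall_probability(p_gc, num_devices):
    """Probability that at least one of num_devices independent SSDs is in GC."""
    return 1.0 - (1.0 - p_gc) ** num_devices


# With each device in GC 5% of the time, the chance that a full-stripe
# request meets at least one busy device grows with the size of the array.
for n in (1, 2, 4, 8):
    print(n, round(stall_probability(0.05, n), 4))
```

Under these assumptions a single device delays about 5% of requests, whereas an eight-device array delays roughly a third of them, because the array is held back by whichever device happens to be collecting at the time.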