While data storage capacity and central processing unit (CPU) processing power have grown rapidly, improvements in the data bandwidth and access times of disk input/output (I/O) systems have not kept pace. As a result, there is an ever-widening speed gap between CPUs and disk I/O systems. Disk arrays can improve overall I/O throughput, but random access latency remains very high because of the mechanical operations involved. Large buffers and deep cache hierarchies can improve latency, but the reduction in access time has so far been very limited because of poor data locality at the disk I/O level.
Recent developments in flash memory-based solid state drives (SSDs) have been very promising, with rapid increases in capacity and decreases in cost. Because an SSD is on a semiconductor chip, it provides great advantages in terms of high-speed random reads, low power consumption, compact size, and shock resistance. Researchers in both academia and industry have been very enthusiastic about adopting this technology.
However, most existing research on SSDs focuses either on using an SSD in largely the same way as a hard disk drive (HDD), with various management algorithms at the file system level and device level, or on using an SSD as an additional cache in the storage hierarchy. The physical properties of SSDs impose constraints on both approaches that limit significant advances in the speed and reliability of disk I/O systems.
The limitations of SSDs result from their physical properties. A typical NAND-gate array flash memory chip of the kind widely used in SSDs consists of a number of blocks, each block containing a number of pages (e.g., a block with 64 pages of 2 KB each). Blocks are the smallest erasable units. Pages are the smallest programmable units. When a system performs a write operation, it first needs to find a free page to write to. If no free page is available, an erase operation is necessary to make free pages. A read operation usually takes a few to tens of microseconds, whereas a write operation takes hundreds of microseconds and an erase operation takes from 1.5 to 3 milliseconds.
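The page and block behavior described above can be illustrated with a minimal sketch. The class name `FlashBlock` and its methods are hypothetical, chosen only for illustration, and the model is a deliberate simplification: a real flash translation layer would copy any still-valid pages elsewhere before erasing a block.

```python
PAGES_PER_BLOCK = 64      # example geometry from the text: 64 pages per block
PAGE_SIZE = 2 * 1024      # 2 KB per page


class FlashBlock:
    """Toy model of one NAND flash block (hypothetical, for illustration only)."""

    def __init__(self):
        # None marks a free (erased) page; anything else is programmed data.
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count = 0

    def free_page(self):
        """Return the index of the first free page, or None if the block is full."""
        for index, page in enumerate(self.pages):
            if page is None:
                return index
        return None

    def erase(self):
        """Erase the whole block: the block is the smallest erasable unit."""
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1

    def program(self, data):
        """Program one free page: the page is the smallest programmable unit.

        If no page is free, the block is erased first. This simplification
        discards the old contents; a real device would relocate valid pages.
        """
        if len(data) > PAGE_SIZE:
            raise ValueError("data does not fit in one page")
        index = self.free_page()
        if index is None:
            self.erase()
            index = 0
        self.pages[index] = data
        return index
```

In this sketch, the first 64 calls to `program` fill the block without any erase, and the 65th call forces an erase before the write can proceed, which is the expensive case the text describes.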
A more important limitation is imposed by the maximum number of erase operations that may be performed on a block of flash memory during the lifetime of a flash memory chip. Typically, a block can be erased only about 10K times in a multi-level cell (MLC) memory element or 100K times in a single-level cell (SLC) memory element. After that, the block becomes bad. For example, a block of MLC memory that is erased and reprogrammed every minute will be dead within 7 days, because 60×24×7 = 10,080 erase operations exceeds the life cycle of the memory element. The lifetime of a flash memory is typically extended by wear leveling, which distributes erase operations evenly across all blocks. As a result, write operations in flash memory SSDs are not done in place, as they are in HDDs, and are much slower than read operations.
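The endurance arithmetic above can be checked with a short sketch. The function name `days_to_wear_out` and its parameters are hypothetical, and the `num_blocks` argument assumes ideal wear leveling, in which erase operations are spread perfectly evenly across blocks. Real wear leveling only approximates this.

```python
def days_to_wear_out(endurance_cycles, erases_per_minute, num_blocks=1):
    """Estimate days until a block reaches its erase limit.

    endurance_cycles:  erase limit per block (about 10,000 for MLC and
                       100,000 for SLC, per the figures in the text).
    erases_per_minute: total erase rate issued by the workload.
    num_blocks:        blocks sharing the load under ideal wear leveling
                       (an assumption of this sketch).
    """
    erases_per_day_per_block = erases_per_minute * 60 * 24 / num_blocks
    return endurance_cycles / erases_per_day_per_block


mlc_days = days_to_wear_out(10_000, 1)           # one MLC block, one erase per minute
slc_days = days_to_wear_out(100_000, 1)          # one SLC block, same workload
leveled_days = days_to_wear_out(10_000, 1, 1024) # same workload over 1024 MLC blocks
```

With one erase per minute, a single MLC block reaches its limit in slightly under 7 days, which matches the 10,080-erase figure in the text. Under the ideal wear leveling assumed here, the estimated lifetime grows in proportion to the number of blocks sharing the erases.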
It is clear from the above discussion that allowing random writes to SSDs in the same way as to HDDs is not an optimal approach. Using an SSD as another level of storage cache cannot avoid random writes either. In addition, a lower-level storage cache provides limited performance benefits because data locality at the disk I/O level is very weak. The best cache hit ratios of second-level disk caches (in theoretically optimal caches with off-line caches managed manually in an optimal way) range from 16.5% to 86.4% for cache sizes between 16 MB and 2 GB, depending on the application.
High-performance, low-cost multi-core graphics processing units (GPUs)/CPUs represent another dramatic technology advance. GPUs have traditionally been thought of as commodity chips to drive consumer video games. However, the push for realism in such games, along with the rapid development of semiconductor technologies, has made GPUs capable of supercomputing performance for many applications at very low cost. There are many low-end to medium GPU controller cards available on the market for under $100 that deliver extraordinary computation power. There has already been extensive research into using GPUs for general purpose computing (GPGPU). Besides high performance and low cost, there has also been a technology drive toward reliable and low-power GPUs. For example, an embedded system using the ATI Radeon HD 3650 GPU draws very little power but delivers performance levels of hundreds of GFLOPS. The next-generation mobile GPUs are expected to nearly double this performance with a similar power envelope.
With such rapid development of GPUs/CPUs, experiments have been carried out on GPU cards such as the NVIDIA 9500GT and ATI Radeon HD 2400 PRO. Specifically, the execution time of computing Adler-32 and Rabin fingerprint values of large data blocks in parallel was measured on multi-core GPUs, and it was observed that a straightforward program implementation takes 60 to 90 microseconds to compute hash values of all 128 B chunks in an entire data block of size 4 KB to 32 KB. This promising computing speed makes it possible to do on-the-fly computation for disk I/O operations.
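The per-chunk hashing described above can be sketched as follows. This is a sequential CPU illustration, not the GPU implementation that was measured, and it makes no claim about timing. It uses the standard-library `zlib.adler32` for the Adler-32 checksum and omits the Rabin fingerprint; the function name `chunk_adler32` is hypothetical.

```python
import zlib

CHUNK_SIZE = 128  # bytes per chunk, as in the experiment described above


def chunk_adler32(block, chunk_size=CHUNK_SIZE):
    """Return the Adler-32 checksum of each fixed-size chunk of `block`.

    Each chunk is hashed independently of the others, which is what makes
    the computation easy to spread across many cores.
    """
    return [
        zlib.adler32(block[offset:offset + chunk_size])
        for offset in range(0, len(block), chunk_size)
    ]


block = bytes(4096)            # a 4 KB data block filled with zeros
hashes = chunk_adler32(block)  # one checksum per 128 B chunk, 32 in total
```

Because no chunk depends on any other, each list element could in principle be computed by a separate GPU thread, which is the property the experiment relies on.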
Researchers in computer systems have long observed the strong regularity and content locality that exist in memory pages. Memory pages contain data structures, numbers, pointers, and programs that process data in a predefined way. Such strong regularity and content-locality have been successfully exploited for in-memory data compression. Large files and collections of files also show strong content locality with large amounts of data redundancy that can be eliminated by efficient compression algorithms. Delta encoding has been successfully used to eliminate redundancy of one object relative to another, suggesting that many data blocks can be represented as small patches/deltas with respect to reference blocks. Furthermore, recent research has shown strong content locality in many data-intensive applications, with only 5% to 20% of bits inside a data block being actually changed on a typical block write operation.
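The idea of representing a block as a small delta against a reference block can be illustrated with a minimal sketch. A bytewise XOR is used here as a stand-in for delta encoding. That choice is an assumption made for brevity, not the specific encoding used in the research cited above, and the function names are hypothetical.

```python
def xor_delta(reference, block):
    """Return the bytewise XOR of two equal-length blocks.

    Unchanged bytes become zero, so a block that differs only slightly
    from its reference yields a delta that is mostly zeros.
    """
    if len(reference) != len(block):
        raise ValueError("blocks must have the same length")
    return bytes(a ^ b for a, b in zip(reference, block))


def changed_bit_fraction(delta):
    """Return the fraction of bits set in the delta, i.e. bits that changed."""
    if not delta:
        return 0.0
    changed = sum(bin(byte).count("1") for byte in delta)
    return changed / (8 * len(delta))


def apply_delta(reference, delta):
    """Recover the block from the reference; XOR is its own inverse."""
    return xor_delta(reference, delta)


reference = bytes(4096)        # reference block (all zeros, for the example)
modified = bytearray(reference)
modified[0] = 0xFF             # a write that changes a single byte
delta = xor_delta(reference, bytes(modified))
```

In this example only 8 of the 32,768 bits in the block differ, so the delta is almost entirely zeros and would compress well, while `apply_delta(reference, delta)` reproduces the modified block exactly.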
Besides the strong regularity and content locality inherent in block data, virtual machines, the most popular computing platform, provide additional opportunities for exploiting content locality. The emergence of cloud computing requires hundreds or even thousands of virtual machines running on servers and clients. Such widespread use of virtual machines creates a problem of virtual machine image sprawl, where each virtual machine needs to store the entire stack of software and data as a disk image. These disk images contain a large amount of redundant data. Gupta et al. have recently presented a powerful Difference Engine that has successfully exploited such content locality to perform memory page compression with substantial performance gains. This strong content locality again suggests the possibility of organizing data differently in data storage to obtain optimal performance.