As the speed and size of networked computer systems have continued to increase, so has the amount of data stored within, and exchanged between, such systems. While a great deal of effort has been focused on developing larger and denser storage devices, as well as faster networking technologies, the continually increasing demand for storage space and networking bandwidth has resulted in the development of technologies that further optimize the storage space and bandwidth currently available on existing storage devices and networks. One such technology is data compression, wherein the data saved to a storage device, or transmitted across a network, is manipulated by software to reduce the total number of bytes required to represent the data, and thus reduce the storage space and bandwidth required to store and/or transmit it.
Data compression can be divided into two general categories: lossy data compression and lossless data compression. As the terms imply, lossy data compression (sometimes referred to as perceptual coding) allows for some loss of fidelity in the encoded information, while lossless data compression requires that the decompressed data be an exact copy of the original data, with no alterations or errors. While lossy data compression may be suitable for applications that process audio, image and/or video data, a great many other data processing applications require the fidelity provided by lossless data compression.
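The lossless requirement can be illustrated with a short round trip. The following is a minimal sketch only, using Python's standard `zlib` module (a DEFLATE implementation) as a stand-in for any lossless compressor; it is not the technique discussed in this document, and the sample data is hypothetical.

```python
import zlib

# Hypothetical, highly repetitive sample data (chosen so it compresses well).
original = b"ABABABABAB" * 100

# Compress, then decompress, using zlib's DEFLATE implementation.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the restored bytes must be an exact copy of the original.
is_exact_copy = restored == original

# Fewer bytes are needed to store or transmit the compressed form.
bytes_saved = len(original) - len(compressed)
```

A lossy codec, by contrast, would be permitted to return data that differs from `original`, which is why it is unsuitable wherever exact reconstruction is required.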
Most existing lossless data compression techniques are iterative in nature, and generally are optimized for software implementations. These software-based lossless compression techniques are typically not well suited for use in applications requiring high speed/low latency data throughput, where even small processing delays may be unacceptable. Some hardware-based implementations do exist, but many such implementations process one byte at a time, and are thus limited by the clock frequency at which the hardware can be operated. Other hardware implementations are capable of processing multiple bytes at one time, but these implementations do so at the expense of compression efficiency.
While data compression techniques attempt to address storage space and bandwidth concerns by reducing the amount of data that is stored on (and transmitted to and from) a storage device, other techniques attempt to address bandwidth concerns by limiting the number of times data is read from and written to the storage device. One such technique is “caching,” wherein a copy of the desired data on the storage device is maintained in memory after an initial read or write, and subsequent accesses to the data are directed to the in-memory copy. Caching works well for data that is stored together in one area of a disk (e.g., within adjacent sectors) or in related areas (e.g., on different platters but within the same cylinder), wherein the data is retrieved within either a single access or a small number of sequential accesses with minimal repositioning of the read/write head of the storage device. It does not work well, however, with data that is distributed over different areas within a storage device, or even across different storage devices. Such a distribution can occur in data that is heavily modified after its initial storage, particularly in systems that use “thin provisioning” combined with “sparse mapping.”
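The caching behavior described above can be sketched as follows. This is an illustrative model only: the `CachedDevice` class, its dictionary-backed "device," and the write-through policy are all assumptions made for the example and are not taken from any particular storage system.

```python
class CachedDevice:
    """Hypothetical block cache in front of a storage device.

    The device is modeled as a dict mapping block number -> bytes.
    """

    def __init__(self, device):
        self.device = device
        self.cache = {}          # in-memory copies of blocks
        self.device_reads = 0    # number of reads that reached the device

    def read(self, block):
        # Only the first access to a block goes to the device.
        if block not in self.cache:
            self.device_reads += 1
            self.cache[block] = self.device[block]
        return self.cache[block]

    def write(self, block, data):
        # Assumption: write-through, so the device and cache stay in sync.
        self.device[block] = data
        self.cache[block] = data


device = {0: b"alpha", 1: b"beta"}
cached = CachedDevice(device)

first = cached.read(0)       # initial read: served by the device
second = cached.read(0)      # subsequent read: served from memory
cached.write(1, b"gamma")    # write also populates the cache
third = cached.read(1)       # served from memory, no device read
```

In this sketch only one of the three reads reaches the device; the benefit depends on the same blocks being accessed repeatedly while their copies remain in memory.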
In systems that combine thin provisioning with sparse mapping, storage is virtualized and appears to be allocated when requested (e.g., by opening a file or creating a directory), but the actual physical storage is only allocated on an “as-needed” basis when the data is actually written to disk (i.e., allocated on an I/O basis). Further, different files and file systems are “sparsely” distributed (i.e., mapped) over the logical block address space of the virtual disk (i.e., separated by large unused areas within the address space), but are sequentially allocated physically adjacent storage blocks on the physical disk. As a result, adjacent blocks on the physical disk can be associated with different files on the virtual disk. Further, as files are modified and expand, the additional file extents could be allocated anywhere on the physical disk, frequently within unrelated areas that are not anywhere near the originally allocated portions of the file (a condition sometimes referred to as “file fragmentation”).
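The allocation pattern described above can be modeled in a few lines. The following is a simplified sketch under stated assumptions: the `ThinVolume` class, the one-block allocation granularity, and the logical block addresses used for the two files are hypothetical and chosen only to make the effect visible.

```python
class ThinVolume:
    """Hypothetical thin-provisioned volume with a sparse logical address space."""

    def __init__(self):
        self.mapping = {}        # logical block address -> physical block
        self.next_physical = 0   # next unallocated physical block

    def write(self, lba):
        # Physical storage is allocated only on the first write to an address,
        # and always from the next sequential physical block.
        if lba not in self.mapping:
            self.mapping[lba] = self.next_physical
            self.next_physical += 1
        return self.mapping[lba]


volume = ThinVolume()

# Two files placed far apart in the logical address space (assumed addresses).
file_a = [1000, 1001]
file_b = [500000, 500001]

a_blocks = [volume.write(lba) for lba in file_a]   # physical blocks 0, 1
b_blocks = [volume.write(lba) for lba in file_b]   # physical blocks 2, 3

# File A later grows by one block; the new extent lands after file B's data.
a_blocks.append(volume.write(1002))                # physical block 4
```

Here file A ends up on physical blocks 0, 1 and 4, while file B occupies blocks 2 and 3: physically adjacent blocks belong to different files, and file A is no longer contiguous.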
While thin provisioning combined with sparse mapping can result in efficient use of available storage resources, which can be expanded as needed rather than pre-allocated in bulk up front (sometimes referred to as “fat provisioning”), over time thin provisioning can produce significant file fragmentation. This fragmentation can negate any performance gains achieved by caching, and can even impose a performance penalty, wherein the system performs worse with caching enabled than with caching disabled. Such a performance penalty is due to the overhead associated with updating the cache each time old data is flushed from the cache and new data is read into the cache from the storage device (or written into the cache from a host device writing to the storage device).
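The loss of caching benefit can be illustrated by counting cache misses for two access patterns. This is a toy model under stated assumptions: the least-recently-used eviction policy, the cache capacity of four blocks, and the access sequences are all hypothetical, and the sketch counts misses only rather than measuring actual overhead.

```python
from collections import OrderedDict


def count_misses(accesses, capacity):
    """Count misses for a hypothetical LRU cache holding `capacity` blocks."""
    cache = OrderedDict()
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1                    # block must come from the device
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # flush the least recently used block
    return misses


# Working set fits in the cache: only the first pass misses.
clustered = [0, 1, 2, 3] * 5
# Working set is one block larger than the cache: every access misses.
scattered = [0, 1, 2, 3, 4] * 4

clustered_misses = count_misses(clustered, capacity=4)
scattered_misses = count_misses(scattered, capacity=4)
```

In this model the first pattern misses 4 times in 20 accesses, while the second misses on all 20. In the second case every access pays both the device read and the cost of flushing and refilling the cache, which corresponds to the situation in which caching is slower than no caching at all.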