As the speed and size of networked computer systems have continued to increase, so has the amount of data stored within, and exchanged between, such systems. While a great deal of effort has been focused on developing larger and more dense storage devices, as well as faster networking technologies, the continually increasing demand for storage space and networking bandwidth has resulted in the development of technologies that further optimize the storage space and bandwidth currently available on existing storage devices and networks. One such technology is data compression, wherein the data saved to a storage device, or transmitted across a network, is manipulated by software to reduce the total number of bytes required to represent the data, and thus reduce the storage and bandwidth required to store and/or transmit the data.
Data compression can be divided into two general categories: lossy data compression and lossless data compression. As the terms imply, lossy data compression (sometimes referred to as perceptual coding) allows for some loss of fidelity in the encoded information, while lossless data compression requires that the decompressed data be an exact copy of the original data, with no alterations or errors. While lossy data compression may be suitable for applications that process audio, image and/or video data, a great many other data processing applications require the fidelity provided by lossless data compression.
Most existing lossless data compression techniques are iterative in nature, and generally are optimized for software implementations. These software-based lossless compression techniques are typically not well suited for use in applications requiring high speed/low latency data throughput, where even small processing delays may be unacceptable. Some hardware-based implementations do exist, but many such implementations process one byte at a time, and are thus limited to the clock frequency at which the hardware can be operated. Other hardware implementations are capable of processing multiple bytes at one time, but these implementations do so at the expense of compression efficiency.
While data compression techniques attempt to address storage space and bandwidth concerns by reducing the amount of data that is stored on (and transmitted to and from) a storage device, other techniques attempt to address bandwidth concerns by limiting the number of times data is read from and written to the storage devices. One such technique is “caching,” wherein a copy of the desired data on the storage device is maintained in memory after an initial read or write, and subsequent accesses to the data are directed to the in-memory copy. While caching works well for data that is stored together in one area of a disk (e.g., within adjacent sectors) or related areas (e.g., different platters but within the same cylinder), wherein the data is retrieved within either a single access or a small number of sequential accesses with minimal repositioning of the read/write head of the storage device, it does not work well with data that is distributed over different areas within a storage device or even different storage devices. Such a distribution can occur in data that is heavily modified after its initial storage, particularly in systems that use “thin provisioning” combined with “sparse mapping.”
In systems that combine thin provisioning with sparse mapping, storage is virtualized and appears as being allocated when requested (e.g., by opening a file or creating a directory), but the actual physical storage is only allocated on an “as-needed” basis when the data is actually written to disk (i.e., allocated on an I/O-basis). Further, different files and file systems are sparsely distributed (i.e., mapped) over the logical block address space of the virtual disk (i.e., separated by large unused areas within the address space), but are sequentially allocated physically adjacent storage blocks on the physical disk. As a result, adjacent blocks on the physical disk can be associated with different files on the virtual disk. Further, as files are modified and expand, the additional file extents could be allocated anywhere on the physical disk, frequently within unrelated areas that are not anywhere near the originally allocated portions of the file (a condition sometimes referred to as “file fragmentation”).
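The allocation behavior described above can be illustrated with a minimal sketch. The class name `ThinVolume`, the one-block-per-write granularity, and the specific logical block addresses are assumptions chosen only for illustration; the sketch is not intended to represent any particular storage system.

```python
class ThinVolume:
    """Toy thin-provisioned volume (hypothetical): physical blocks are
    allocated sequentially, and only on the first write to a logical block."""

    def __init__(self):
        self.mapping = {}        # logical block address -> physical block number
        self.next_physical = 0   # next free physical block

    def write(self, lba):
        # Allocate physical storage only when the logical block is first written.
        if lba not in self.mapping:
            self.mapping[lba] = self.next_physical
            self.next_physical += 1
        return self.mapping[lba]


vol = ThinVolume()

# Two files placed far apart in the logical address space (sparse mapping).
file_a = [0, 1]
file_b = [1_000_000, 1_000_001]
for lba in file_a + file_b:
    vol.write(lba)

# File A later grows by one block; the new extent lands after file B's blocks.
vol.write(2)

physical_a = [vol.mapping[lba] for lba in (0, 1, 2)]
physical_b = [vol.mapping[lba] for lba in file_b]
print(physical_a)  # [0, 1, 4]
print(physical_b)  # [2, 3]
```

In this sketch the two files are widely separated logically yet occupy adjacent physical blocks, and the later extension of the first file is physically separated from its original blocks, which is the fragmentation condition described above.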
While thin provisioning combined with sparse mapping can result in efficient use of available storage resources which can be expanded as needed, rather than pre-allocated in bulk up front (sometimes referred to as “fat provisioning”), over time thin provisioning can result in significant file fragmentation. This fragmentation can result in the loss of any performance gains achieved by caching, and can even result in a performance penalty, wherein the system performs worse with caching enabled than with caching disabled. Such a performance penalty is due to the overhead associated with updating the cache each time old data is flushed from the cache and new data is read into the cache from the storage device (or written into the cache from a host device writing to the storage device).
Lossless data compression can be performed at two different levels: 1) between blocks of data, wherein duplicate blocks of data are identified and replaced with a pointer to a single copy of the data block saved on the storage system; and 2) within a block of data, wherein duplicate byte sequences within a single block of data are identified and replaced with a pointer to a single copy of the sequence within the data block. As the system receives data to be stored on the storage system, the data is grouped into data blocks referred to as “chunks.” If all of the data within a chunk is identified as having already been stored onto the storage system, the descriptor of the object being stored is modified to point to the chunk already stored on the storage system, rather than to point to a new chunk that would needlessly store a duplicate copy of an existing chunk. Such elimination of duplicated chunks is referred to as “deduplication” (also sometimes referred to as “capacity optimization” or “single-instance storage”). Additional structures (described below) keep track of the number of references to the chunk, thus preventing its deletion until the last object referencing the chunk is deleted.
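The relationship between object descriptors, shared chunks, and reference counts can be sketched as follows. The fixed four-byte chunk size, the use of a SHA-256 digest as a stand-in for determining that a chunk has already been stored, and the names `ChunkStore`, `store_object`, and `delete_object` are all assumptions made for illustration only.

```python
import hashlib


class ChunkStore:
    """Toy single-instance chunk store with reference counting (hypothetical)."""

    def __init__(self):
        self.chunks = {}     # fingerprint -> chunk bytes (one copy per unique chunk)
        self.refcount = {}   # fingerprint -> number of descriptor references

    def put(self, chunk):
        # Assumption: equal SHA-256 digests are treated as identical chunks.
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in self.chunks:
            self.chunks[fp] = chunk
            self.refcount[fp] = 0
        self.refcount[fp] += 1
        return fp

    def release(self, fp):
        # Delete the chunk only when the last reference to it goes away.
        self.refcount[fp] -= 1
        if self.refcount[fp] == 0:
            del self.chunks[fp]
            del self.refcount[fp]


def store_object(store, data, chunk_size=4):
    """Split data into fixed-size chunks; return a descriptor of fingerprints."""
    return [store.put(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def delete_object(store, descriptor):
    for fp in descriptor:
        store.release(fp)


store = ChunkStore()
obj1 = store_object(store, b"AAAABBBBCCCC")
obj2 = store_object(store, b"AAAADDDD")
print(len(store.chunks))  # 4 (AAAA is stored once and shared)

delete_object(store, obj1)
print(len(store.chunks))  # 2 (AAAA survives because obj2 still references it)
```

The descriptor of each object holds only references, so the shared chunk remains stored after the first object is deleted and is removed only when its reference count reaches zero.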
Although the elimination of duplicated blocks and of duplicated data within a block are both considered forms of lossless data compression, different terms are used herein for each in order to distinguish between the two forms of lossless compression. Thus, throughout the remainder of this disclosure the term “deduplication” is used to refer to the elimination of duplicate chunks by storing one instance of a chunk that is referenced by multiple occurrences of the chunk within a virtualized storage device. Further, the term “compression” is used throughout the disclosure to refer to the elimination of duplicate byte sequences within a chunk, and the term “decompression” is used to refer to the reconstruction or regeneration of the original data within a previously “compressed” chunk.
After data has been grouped into chunks, the chunks are generally forwarded to fingerprint and Bloom filters, where a fingerprint is generated to identify each chunk and is applied to the Bloom filter to determine if the chunk has already been stored onto a corresponding storage device. The chunk information is stored in a file that is often referred to as a dictionary, as it defines the various chunks. The information includes the boundaries, fingerprint and Bloom filter lookup results for each new chunk, and the location information for those chunks that already exist.
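A minimal sketch of the fingerprint and Bloom filter lookup is given below. The filter size, the number of hash functions, the use of SHA-256 for both the fingerprint and the filter positions, and the simplified dictionary (a mapping from fingerprint to a hypothetical location index) are assumptions for illustration; an actual dictionary would also record chunk boundaries and lookup results as described above.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter (hypothetical parameters): it may report false
    positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, fingerprint):
        # Derive num_hashes bit positions by salting the fingerprint.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))


def ingest(chunk, bloom, dictionary):
    """Return (fingerprint, is_new) for a chunk."""
    fp = hashlib.sha256(chunk).digest()
    # The filter rules out most new chunks cheaply; a positive result is
    # confirmed against the dictionary because it may be a false positive.
    if bloom.might_contain(fp) and fp in dictionary:
        return fp, False
    bloom.add(fp)
    dictionary[fp] = len(dictionary)  # hypothetical storage location
    return fp, True


bloom = BloomFilter()
dictionary = {}
results = [ingest(c, bloom, dictionary)[1] for c in (b"alpha", b"beta", b"alpha")]
print(results)  # [True, True, False]
```

Because a Bloom filter can return a false positive, the sketch treats a chunk as already stored only when the dictionary confirms the match; a negative result from the filter is sufficient on its own to classify the chunk as new.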
The same principles can be applied to data being transmitted over a WAN, where deduplication is desirable because it reduces the bandwidth needed to carry the data. Deduplication units can be provided at each end of the WAN to deflate and inflate the data. However, if multiple deduplication units are used at each end of a WAN connection, this could result in inefficient use of memory, as each deduplication unit would have to maintain a full dictionary. As a result, the number of stored chunks would be reduced due to the effectively smaller amount of memory available for the dictionaries.
Thus, what is needed is an efficient method for using the dictionary in a network system without having to maintain a full dictionary for each deduplication engine.