The amount of data being stored continues to expand, as does the amount of data being transmitted from place to place. Both the storage used to hold data and the bandwidth used to transmit data are limited resources and thus can be expensive. Therefore, techniques like deduplication have been developed to address both space and bandwidth usage.
Space-centric deduplication reduces the amount of storage required to store data by reducing the number of times duplicate data is stored. Bandwidth-centric deduplication reduces the bandwidth required to transmit data by reducing the number of times duplicate data is transmitted. While the two deduplication (dedupe) approaches are similar in some respects, differences between them have maintained boundaries between space-centric dedupe and bandwidth-centric dedupe. These boundaries lead to the two approaches using different apparatus, using different algorithms, and having different event horizons.
Space-centric dedupe may be performed as part of a replication process, as part of a backup process, or as part of another process that generally has an adequate time period to complete. This time period may be measured in hours or even days. Space-centric dedupe may be performed using a post-processing model where data is deduplicated after it is stored. When given adequate time and computing resources, space-centric dedupe can produce significant reductions in the amount of storage required to store data because duplicate data can be found and removed.
Bandwidth-centric dedupe may be performed as part of a data transmission process, as part of a data communication process, or as part of another process that generally has a very short time window to complete. This time window may be measured in milliseconds. Even though only milliseconds are available and even though limited computing resources may be available, bandwidth-centric dedupe can still yield reductions in bandwidth consumption. Bandwidth-centric dedupe may be performed using an inline model where data is deduplicated before it is stored.
In one example, bandwidth-centric dedupe may be performed in a wide area network (WAN) accelerator (WANAX). More generally, bandwidth-centric dedupe is performed in a communication accelerating device, although such a device is more commonly referred to as a WANAX. A major challenge for bandwidth-centric dedupe concerns the amount of duplicate data that can be stored in a WANAX and the amount of data that a WANAX can examine in a communication-relevant time frame. These limitations result in a WANAX having a limited dedupe horizon. The dedupe horizon can be limited by the amount of memory available to store data that can be compared to data transiting the WANAX, by the amount of memory available to index that data, by the processing power available to compare stored data to transiting data in different ways, and by the maximum delay acceptable for data transiting the WANAX. The horizon may therefore limit bandwidth-centric dedupe to comparing data transiting the WANAX to a small, finite number of recently sent messages and to a small, finite amount of recently seen data. For example, the horizon may be limited to data that has been encountered in the last few seconds or minutes, to data that has been encountered in the last dozens of messages, or to the last few megabytes of data that has been encountered. Additionally, since a bandwidth-centric device may only have time to look at data from one point of view, the horizon may be limited by the perspective from which the data is seen.
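The effect of a limited dedupe horizon can be illustrated with a minimal sketch. It assumes a hypothetical HorizonCache that remembers the SHA-256 fingerprints of only the most recently seen chunks and evicts the least recently seen one when full. The class name, the count_sent helper, and the max_entries parameter are illustrative assumptions and do not describe any particular WANAX.

```python
import hashlib
from collections import OrderedDict


class HorizonCache:
    """Hypothetical bounded cache: remembers only the most recent fingerprints."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._seen = OrderedDict()  # fingerprint -> None, oldest entry first

    def is_duplicate(self, chunk: bytes) -> bool:
        fp = hashlib.sha256(chunk).digest()
        if fp in self._seen:
            # Seen within the horizon: refresh its position and report a hit.
            self._seen.move_to_end(fp)
            return True
        # Not seen (or already evicted): remember it, evicting the oldest if full.
        self._seen[fp] = None
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)
        return False


def count_sent(chunks, horizon):
    """Count the chunks that would have to be transmitted in full."""
    cache = HorizonCache(horizon)
    return sum(1 for chunk in chunks if not cache.is_duplicate(chunk))


chunks = [b"a", b"b", b"c", b"a"]
small = count_sent(chunks, horizon=2)  # b"a" is evicted before it repeats
large = count_sent(chunks, horizon=3)  # b"a" is still remembered when it repeats
```

In this sketch the repeated chunk is only recognized as a duplicate when the horizon is large enough to still hold its fingerprint, which is the limitation described above.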
A post-processing dedupe device can alter the order in which it processes data, while an inline device may be more limited in its ability to do so. Generally, an inline device may process data in the order in which it is received. This may limit the amount of processing that may be achieved per unit of time. Additionally, an inline device may be encountering multiple independent sets of data associated with different traffic flows. These traffic flows and sets of data may be logically distinct. Therefore, even though an individual flow or set theoretically might dedupe temporally with respect to itself, the limited resources of the inline device may not be able to take advantage of temporal relationships that are lost due to the intermingling of the separate flows.
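The loss of temporal relationships through intermingling can be sketched as follows. The example assumes a hypothetical inline device that remembers only the last few distinct chunks it has seen, and it compares raw chunks rather than fingerprints for brevity. The transmitted function and the two sample flows are illustrative assumptions only.

```python
from collections import OrderedDict
from itertools import chain


def transmitted(chunks, horizon):
    """Count chunks sent in full when only `horizon` recent chunks are remembered."""
    recent = OrderedDict()  # chunk -> None, oldest entry first
    sent = 0
    for chunk in chunks:
        if chunk in recent:
            recent.move_to_end(chunk)  # duplicate within the horizon
        else:
            sent += 1
            recent[chunk] = None
            if len(recent) > horizon:
                recent.popitem(last=False)  # forget the oldest chunk
    return sent


# Two logically distinct flows, each repeating itself every two chunks.
flow_a = [b"A1", b"A2"] * 3
flow_b = [b"B1", b"B2"] * 3

# The same flows as an inline device might see them: interleaved chunk by chunk.
interleaved = list(chain.from_iterable(zip(flow_a, flow_b)))

separate = transmitted(flow_a, horizon=2) + transmitted(flow_b, horizon=2)
mixed = transmitted(interleaved, horizon=2)
```

Under these assumptions each flow dedupes well when examined by itself, but once the flows are interleaved every repeat falls outside the two-chunk horizon and nothing is deduplicated.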
Space-centric dedupe may be performed, for example, in an apparatus referred to as a replication server or a dedupe device (DD). More generally, space-centric dedupe is performed in a computing device that has a processor, a memory, and an interface connecting the processor to the memory and to a set of logics. The logics may provide access to a chunking process, the chunk pool that results from the chunking process, a fingerprinting (e.g., hashing) process, an indexing process, the index that results from the indexing process, and storage for storing the chunk pool. The storage may be a single device or may take the form of a set of cooperating devices arranged in a single tier or in multiple tiers. One challenge for space-centric dedupe concerns the amount of time that it takes to examine data to root out as much duplication as possible. Another challenge concerns the organization and placement of an index into the chunk pool, which in turn affects the amount of time it takes to determine whether a chunk of data duplicates a chunk that already resides in the chunk pool.
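The relationship between chunking, fingerprinting, the index, and the chunk pool can be sketched as below. This is a minimal illustration that assumes fixed-size chunking, SHA-256 fingerprints, and an in-memory dictionary as the index. The DedupeStore class and its methods are hypothetical names, and a real device may use different chunking, hashing, and index placement.

```python
import hashlib


class DedupeStore:
    """Hypothetical store: each unique chunk is kept once and located via an index."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunk_pool = []  # unique chunks, in order of first appearance
        self.index = {}       # fingerprint -> position in chunk_pool

    def write(self, data: bytes):
        """Chunk the data, store unseen chunks, and return a recipe of fingerprints."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.index:
                # First time this chunk is seen: add it to the pool and index it.
                self.index[fp] = len(self.chunk_pool)
                self.chunk_pool.append(chunk)
            recipe.append(fp)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its recipe."""
        return b"".join(self.chunk_pool[self.index[fp]] for fp in recipe)


store = DedupeStore(chunk_size=4)
data = b"abcdabcdxyz"
recipe = store.write(data)
restored = store.read(recipe)
```

In this sketch the duplicate check is a single dictionary lookup. The index challenge described above arises when the index is too large to hold in memory, so that each lookup may require slower storage.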