Data de-duplication is a data compression technique for eliminating duplicate or repeating data from a data stream, and can be applied to network data transfers to reduce the number of bytes that must be sent. In the de-duplication process, unique chunks of data are identified and stored as historical data on both the transmit and receive sides of the network. Thereafter, incoming data is compared with the historical data on the transmit side of the network, and when a redundant chunk is found, the redundant chunk is replaced with a reference indicator that points to the matching historical data. The receiver then uses the reference indicator to identify the matching historical data stored on the receive side, which is used to reconstruct the duplicate data. Since the reference indicator is much smaller than the redundant chunk of data, and the same byte pattern may occur dozens, hundreds, or even thousands of times, the amount of data that must be stored or transferred can be greatly reduced.
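The exchange described above can be illustrated with a minimal sketch. It is not taken from any particular implementation: the fixed 64-byte chunk size, the use of SHA-256 to match chunks, the integer index used as the reference indicator, and the names `Sender`, `Receiver`, `encode`, and `decode` are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 64  # assumed fixed chunk size in bytes (illustrative only)


def split_chunks(data, size=CHUNK_SIZE):
    """Split a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Sender:
    """Transmit side: replaces chunks already in history with a small reference."""

    def __init__(self):
        self.index = {}  # chunk digest -> position of that chunk in the history

    def encode(self, data):
        tokens = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).digest()
            if digest in self.index:
                # Redundant chunk: send only the reference indicator.
                tokens.append(("ref", self.index[digest]))
            else:
                # Unique chunk: record it in history and send it in full.
                self.index[digest] = len(self.index)
                tokens.append(("raw", chunk))
        return tokens


class Receiver:
    """Receive side: keeps the same history, in the same order, as the sender."""

    def __init__(self):
        self.history = []  # chunks in the order they were first received

    def decode(self, tokens):
        parts = []
        for kind, payload in tokens:
            if kind == "raw":
                self.history.append(payload)
                parts.append(payload)
            else:
                # Reference indicator: look up the matching historical chunk.
                parts.append(self.history[payload])
        return b"".join(parts)
```

In this sketch both sides add unique chunks to their histories in the same order, so an index produced by the sender always resolves to the same chunk at the receiver. A message that repeats an earlier 64-byte chunk is sent as one small integer instead of 64 bytes.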
The efficiency of data de-duplication depends on a number of factors, including the algorithm used to store incoming data chunks and match them against historical data. Conventional de-duplication techniques use a generic algorithm for all protocols, and hence the algorithm parameters are not tailored for specific protocols. Other factors include the length of the hash table, the length of the history, and the size of the blocks used for generating hash values. For instance, shorter hash tables generally allow for quicker searching, while longer histories typically allow for more frequent matching (since more historical data is available for de-duplication). Additionally, larger blocks provide for higher compression ratios (as more data is replaced for each matched block). However, smaller blocks provide for an increased likelihood that a match will be found. Other factors affecting the efficiency of data de-duplication include the amount of historical data stored, the method used for storing historical data, and the mechanism used to discard some of the old data to make room for newer data. New techniques and mechanisms for improving the efficiency of data de-duplication are desired.
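As one illustration of the last factor mentioned above, the mechanism used to discard old data to make room for newer data, the following sketch shows a history of bounded length that discards the least recently used entry when full. The least-recently-used policy, the class name `BoundedHistory`, and its `capacity` parameter are assumptions chosen for illustration; the text above does not prescribe any particular discard policy.

```python
from collections import OrderedDict


class BoundedHistory:
    """History limited to `capacity` chunks; discards the least recently used."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # key -> chunk, least recently used first

    def lookup(self, key):
        """Return the stored chunk for `key`, or None if it is not in history."""
        chunk = self._entries.get(key)
        if chunk is not None:
            # A match counts as a use, so the entry becomes the most recent.
            self._entries.move_to_end(key)
        return chunk

    def store(self, key, chunk):
        """Add or refresh a chunk, discarding old entries if over capacity."""
        self._entries[key] = chunk
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            # Drop the least recently used entry to make room for newer data.
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)
```

A larger `capacity` keeps more historical data available for matching at the cost of memory and lookup state. In a network setting, the transmit and receive sides would presumably need to apply the same discard decisions so that every reference the sender emits can still be resolved by the receiver.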