It is common to transfer large volumes of data over a computer network, or between storage devices over an I/O (input/output) interface. For example, a user may transfer an entire home directory from a hard drive to a non-volatile memory device (e.g., a flash drive) to perform a periodic backup of the hard drive, or transfer a large document file over the Internet. The data transferred can include redundant data, i.e., data that the recipient already possesses. For example, when the user creates a periodic backup of the hard drive on the flash drive, the backup data to be transmitted to the flash drive typically contains data that already exists on the flash drive. Similarly, when the user transfers the document file over the Internet, the user may be downloading the file from a network source (e.g., a server), modifying it, and uploading it back to the network source. Unless the document file is completely modified, the uploaded version of the file shares common data with the downloaded version. Transmitting redundant data that is already stored at both the source and the destination leads to inefficient utilization of I/O interface and network bandwidth. Existing compression and decompression methods fail to take advantage of such data redundancies, since locating redundant data across gigabytes to terabytes of stored data is generally considered to be time-consuming and low-yield.
Hence, there is a need for a technique that searches for redundant data within a huge volume of data efficiently and with a high probability of locating the redundancies, so as to minimize the transmission of redundant data and improve the utilization of limited I/O interface and network bandwidth.