1. Field
The described aspects relate to identifying potentially similar content for performing or enabling data reduction.
2. Background
Data reduction or compression techniques may be utilized to reduce the amount of data in a piece of content, such as a digital document or file, to improve the efficiency of transferring or storing the content. Data compression may be utilized in applications such as file transfer, file synchronization, content storage de-duplication, or any application where minimizing the size of the data is desirable.
In one specific example, data compression is utilized in the transfer of documents between two or more locations, referred to as “file transfer.” Because the communications links between the locations may have low bandwidth or high latency, or both, the time it takes to transfer the documents can be significant. Alternatively, even with a fast network, the file transfer may take a long time if the files are large or if many files are being sent. By utilizing data compression techniques, the amount of data that needs to be transmitted can be reduced, thus reducing the transmission time. Further, a reduction in the amount of data to be transmitted reduces the total bandwidth required for the transmission, and thus frees up bandwidth for other types of communication.
There are a number of different data compression techniques, including compressing a file based on the same data content being already known. For example, these techniques may compare data content within a single file, among a plurality of files to be transferred, or between one or more files to be transferred and a plurality of files known by the destination or otherwise known in the system. In general, the focus of the existing solutions is on calculating the “distance” or “difference” between files or documents using “document fingerprinting,” in which hashing algorithms are applied to sections of the file or document. Further, with each document represented by a collection of “document fingerprints” or hashes, the existing solutions attempt to find similarities between the fingerprints as a way to sift through the universe of known documents.
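By way of illustration only, the fingerprinting approach described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular existing solution: the fixed 64-byte section size, the use of SHA-256, the Jaccard overlap used as the similarity measure, and the function names `fingerprint`, `similarity`, and `rank_known` are all hypothetical choices made for the example.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def fingerprint(data: bytes, chunk_size: int = 64) -> Set[str]:
    """Hash each fixed-size section of a document.

    The resulting set of hashes stands in for the document's
    "fingerprints". Fixed-size sections are an assumption of this sketch.
    """
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two fingerprint sets (0.0 = disjoint, 1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_known(candidate: bytes, known: Dict[str, bytes]) -> List[Tuple[str, float]]:
    """Score every known document against the candidate, most similar first.

    Every known document is read and hashed in full, so the work grows
    with both the number and the size of the known documents.
    """
    candidate_fp = fingerprint(candidate)
    scores = [
        (name, similarity(candidate_fp, fingerprint(content)))
        for name, content in known.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Two 256-byte documents that share three of their four 64-byte sections.
    doc_a = b"A" * 64 + b"B" * 64 + b"C" * 64 + b"D" * 64
    doc_b = b"A" * 64 + b"B" * 64 + b"C" * 64 + b"E" * 64
    unrelated = b"X" * 256
    for name, score in rank_known(doc_a, {"doc_b": doc_b, "unrelated": unrelated}):
        print(name, round(score, 2))
```

In this sketch, `doc_a` and `doc_b` share three section hashes out of five distinct hashes in total, so their score is 3/5 = 0.6, while the unrelated document scores 0.0. Note that `rank_known` must hash the full content of every known document before any candidate can be ruled out.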
However, applying hashing or fingerprinting to a large universe of documents in an efficient manner, such as in terms of CPU, memory, or disk utilization and overall execution time, is infeasible in many cases. This is especially so when working with very large documents and/or a very large number of documents, such as thousands or hundreds of thousands of documents, and/or when the transfer is time-critical or subject to a CPU or memory constraint.
Thus, improved systems are desired for efficiently reducing the potential set of similar documents that are used as inputs to algorithms for reducing the amount of data in content to be transferred or stored.