Storing large amounts of data efficiently, in terms of both time and space, is of paramount concern in the design of a backup and restore system, particularly where a large repository of digital data must be preserved. For example, a user or group of users might wish to periodically (e.g., daily or weekly) back up all of the data stored on their computer(s) to a repository as a precaution against possible crashes, corruption or accidental deletion of important data. It commonly occurs that most of the data, at times more than 99%, has not changed since the last backup was performed, and therefore much of the current data can already be found in the repository, with only minor changes. If the data in the repository that is similar to the current backup data can be located efficiently, then there is no need to store the data again; rather, only the changes need be recorded. This process of storing common data only once is known as data factoring.
A large-scale backup and restore system that implements factoring may have one petabyte (PB) or more in its repository. For example, banks that record transactions performed by customers, or Internet Service Providers that store email for multiple users, typically have a repository size ranging from hundreds of gigabytes to multiple petabytes. It may be recalled that 1 PB=1024 TB (terabyte), 1 TB=1024 GB (gigabyte), 1 GB=1024 MB (megabyte), 1 MB=1024 KB (kilobyte), 1 KB=1024 bytes. In other words, a petabyte (PB) is 2^50 bytes, or about 10^15 bytes.
In such large systems, the input (backup) data stream to be added to the repository may be, for instance, up to 100 GB or more. It is very likely that this input data is similar to, but not exactly the same as, data already in the repository. Further, the backup data stream may not be arranged on the same data boundaries (e.g., block alignment) as the data already in the repository. In order to make a subsequent factoring step more efficient, the backup and restore system must be able to efficiently find the location of the data in the repository that is sufficiently similar to the input stream without relying on any relative alignment of the data in the repository and the data in the input stream. The backup and restore system must also be able to efficiently add the input stream to the repository and remove from the repository old input streams that have been deleted or superseded.
Generally, it can be assumed that data changes are local. Thus, for instance, if 1% of the data has been changed, then such changes are concentrated in localized areas and in those areas there are possibly major changes, while the vast majority of the data areas have remained the same. Typically (although not necessarily) if, for example, 1% of the data has changed, then viewing the data as a stream of 512-byte blocks rather than as a stream of bytes, a little more than 1% of the blocks have changed. However, because there is no predetermined alignment of the data in the input stream and repository, finding the localized data changes is a significant task.
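The effect of alignment on a block view of the data can be illustrated with a small sketch. It is only an illustration under stated assumptions: the synthetic data, the offset 1000, and the helper name changed_blocks are hypothetical and not taken from any particular system. The sketch compares a localized in-place change of 1% of the data with the insertion of a single byte, counting in each case how many aligned 512-byte blocks differ.

```python
BLOCK = 512  # block size used in the example in the text

def changed_blocks(old: bytes, new: bytes, size: int = BLOCK) -> int:
    """Count aligned block positions whose contents differ."""
    n_blocks = -(-max(len(old), len(new)) // size)  # ceiling division
    return sum(
        old[i * size:(i + 1) * size] != new[i * size:(i + 1) * size]
        for i in range(n_blocks)
    )

# Deterministic stand-in for repository data: 100 blocks of 512 bytes.
old = bytes(i % 251 for i in range(100 * BLOCK))

# Case 1: a localized in-place change of 512 bytes (1% of the data).
# The changed region starts at a hypothetical offset that straddles two blocks.
start = 1000
patched = bytearray(old)
for i in range(start, start + BLOCK):
    patched[i] ^= 0xFF  # flipping all bits guarantees each byte changes
in_place = changed_blocks(old, bytes(patched))

# Case 2: a single byte inserted at the same offset shifts all later data.
shifted = old[:start] + b"\x00" + old[start:]
after_insert = changed_blocks(old, shifted)

print(in_place, after_insert)  # prints "2 100"
```

In this sketch the in-place change touches 2 of 100 blocks, a little more than 1%, whereas the one-byte insertion alters every block from the insertion point onward, which is why a comparison that relies on block alignment cannot by itself locate localized changes.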
Searching for similar data may be considered an extension of the classical problem of pattern matching, in which a text T of length n is searched for the appearance of a string P of length m. Typically, text length n is much larger than search string length m. Many publications present search methods which attempt to solve this problem efficiently, that is, faster than the naïve approach of testing each location in text T to determine if string P appears there. By preprocessing the pattern, some algorithms achieve better complexity; see, for example:
    Knuth D. E., Morris J. H., Pratt V. R., Fast pattern matching in strings, SIAM Journal on Computing 6 (1977) 323-350.
    Boyer R. S., Moore J. S., A fast string searching algorithm, Communications of the ACM 20 (1977) 762-772.
    Karp R., Rabin M., Efficient randomized pattern matching algorithms, IBM Journal of Research and Development 31 (1987) 249-260.
All of these algorithms run in time of order O(n+m), which means that the search time grows linearly with the size of the text. One problem with these algorithms is that they do not scale beyond some restrictive limit. For example, if searching a 1 GB text (the size of about 300 copies of the King James Bible) can be done in 1 second, searching a one-petabyte text would require more than 12 days of CPU time. A backup and restore system with one petabyte (PB) or more in its repository could not use such an algorithm. Another disadvantage of the above algorithms is that they announce only exact matches, and are not easily extended to perform approximate matching.
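The 12-day figure follows directly from linear scaling, as the short calculation below shows. It assumes the binary units recalled earlier and the scan rate of 1 GB per second given in the example; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the linear-scan estimate in the text.
GB = 2**30  # bytes, using binary units
PB = 2**50

seconds_per_gb = 1.0  # scan rate assumed in the text: 1 GB of text per second
scan_seconds = (PB / GB) * seconds_per_gb
scan_days = scan_seconds / (24 * 60 * 60)

print(f"{scan_seconds:,.0f} s = {scan_days:.1f} days")  # 1,048,576 s = 12.1 days
```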
Instead of preprocessing the pattern, one may preprocess the text itself, building a data structure known as a suffix tree; this is described in the following publications:
    Weiner P., Linear pattern matching algorithm, Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, (1973) 1-11.
    Ukkonen E., On-line construction of suffix trees, Algorithmica 14(3) (1995) 249-260.
If preprocessing is done off-line, then the preprocessing time may not be problematic. Subsequent searches can then be performed, using a suffix tree, in time O(m) only (i.e., depending only on the pattern size, not on the text size). But again, only exact matches can be found; moreover, the size of the suffix tree, though linear in the size of the text, may be prohibitive, as it may be up to 6 times larger than the original text.
For backup and restore, it would be desirable to use an algorithm for approximate pattern matching, because it will usually be the case that the repository contains not an exact replica of the input data, but rather a copy that is, strictly speaking, different but nevertheless very similar according to some defined similarity criterion. Approximate pattern matching has been extensively studied, as described in:
    Fischer M. J., Paterson M. S., String matching and other products, in Complexity of Computation, R. M. Karp (editor), SIAM-AMS Proceedings 7 (1974) 113-125.
    Landau G. M., Vishkin U., Fast parallel and serial approximate string matching, Journal of Algorithms 10(2) (1989) 157-169.
    Navarro G., A Guided Tour to Approximate String Matching, ACM Computing Surveys, 33(1) (2001) 31-88.
One recent algorithm works in time O(n√(k log k)), where n is the size of the text and k is the number of allowed mismatches between the pattern and the text; see for example:
    Amir A., Lewenstein M., Porat E., Faster algorithms for string matching with k mismatches, Journal of Algorithms 50(2) (2004) 257-275.
For large-scale data repositories, however, O(n√(k log k)) is not an acceptable complexity. An input data stream entering the backup and restore system may be, for instance, of length up to 100 GB or more. If one assumes that an almost identical copy of this input stream exists in the repository, with only 1% of the data changed, there are still about 1 GB of differences, that is, k=2^30 bytes. To find the locations of approximate matches in the repository, this algorithm will consume time proportional to about 180,000 times the size of the text n. This is unacceptable, given the premise that the text length n alone is so large that even an algorithm scanning the text only once may be too slow.
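The factor of about 180,000 can be reproduced with a short calculation. The sketch assumes that the logarithm is taken base 2, which the text does not state explicitly, and it ignores the constant hidden in the O-notation.

```python
import math

k = 2**30  # allowed mismatches: 1% of a 100 GB stream is about 1 GB

# Assumption: log is base 2, so log k = 30 for k = 2^30.
factor = math.sqrt(k * math.log2(k))

print(f"sqrt(k log k) is roughly {factor:,.0f}")
```

Under this assumption the multiplier on n comes out slightly below 180,000, consistent with the estimate above.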
Another family of algorithms is based on hashing functions. These are known in the storage industry as CAS (Content Addressed Storage), as described in:
    Moulton G. H., Whitehill S. B., Hash file system and method for use in a commonality factoring system, U.S. Pat. No. 6,704,730.
The general paradigm is as follows: The repository data is broken into blocks, and a hash value, also called a fingerprint or a signature, is produced for each block; all of these hash values are stored in an index. To locate some given input data, called the version, the given input data is also broken into blocks and the same hash function (that has been applied to the repository blocks) is applied to each of the version blocks. If the hash value of a version block is found in the index, a match is announced.
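The paradigm just described can be sketched in a few lines. This is a minimal illustration and not the method of any cited reference: the choice of SHA-256 as the hash function, the tiny block size of 8 bytes, and the names blocks, build_index and find_matches are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 8  # tiny block size chosen only to keep the example readable

def blocks(data: bytes, size: int = BLOCK_SIZE):
    """Yield (offset, block) pairs for fixed-size, aligned blocks."""
    for offset in range(0, len(data), size):
        yield offset, data[offset:offset + size]

def build_index(repository: bytes) -> dict:
    """Map each block's hash value to the offset of its first occurrence."""
    index = {}
    for offset, block in blocks(repository):
        index.setdefault(hashlib.sha256(block).digest(), offset)
    return index

def find_matches(version: bytes, index: dict) -> list:
    """Return (version_offset, repository_offset) for each announced match."""
    matches = []
    for offset, block in blocks(version):
        repo_offset = index.get(hashlib.sha256(block).digest())
        if repo_offset is not None:
            matches.append((offset, repo_offset))
    return matches

repository = b"AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD"
version = b"AAAAAAAAXXXXXXXXCCCCCCCCDDDDDDDD"  # second block has changed

index = build_index(repository)
matches = find_matches(version, index)
print(matches)  # [(0, 0), (16, 16), (24, 24)]
```

Only the three unchanged blocks are announced as matches; the changed block has no entry in the index and would have to be stored anew.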
The advantage of CAS over the previous methods is that the search for similar data is now performed on the index, rather than on the repository text itself, and if the index is stored using an appropriate data structure, the search time may be significantly reduced. For instance, if the index is stored as a binary tree, or a more general B-tree, the search time will only be O(log (n/s)), where n is the size of the text, and s is the size of the blocks. If the index is stored in a sorted list, an interpolation search of the sorted list has an expected time of O(log (log(n/s))). If the index is stored in a hash table, the expected time could even be reduced to O(1), meaning that searching the index could be done in a constant expected time, in particular in time independent of the size of the repository text.
There are, however, disadvantages to this scheme. As before, only exact matches are found, that is, only if a block of input data is identical to a block of repository data will a match be announced. One of the requirements of a good hash function is that when two blocks are different, even only slightly, the corresponding hash values should be completely different, which is required to assure a good distribution of the hash values. But in backup and restore applications, this means that if two blocks are only approximately equal, a hashing scheme will not detect their proximity. Searching in the vicinity of the found hash value will also not reveal approximate matches. Moreover, an announced match does not necessarily correspond to a real match between two blocks: a hash function h is generally not one-to-one, so one can usually find blocks X and Y such that X≠Y and h(X)=h(Y).
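Both disadvantages can be made concrete with a small sketch. SHA-256 stands in here for a generic good hash function, and tiny_hash is a deliberately weak, hypothetical 8-bit hash introduced only so that a collision can be found quickly; neither is taken from the cited system.

```python
import hashlib

def digest(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

# 1. Two blocks that differ in a single character get unrelated hash values,
#    so the hash values reveal nothing about the blocks' similarity.
x = b"The quick brown fox jumps over the lazy dog"
y = b"The quick brown fox jumps over the lazy cog"
hx, hy = digest(x), digest(y)
same_positions = sum(a == b for a, b in zip(hx, hy))
print(hx[:16], hy[:16], "matching hex digits:", same_positions, "of", len(hx))

# 2. A hash function is not one-to-one. With only 256 possible hash values,
#    any 257 distinct blocks must contain two with the same hash value.
def tiny_hash(block: bytes) -> int:
    return hashlib.sha256(block).digest()[0]  # keep only the first 8 bits

seen = {}
collision = None
for i in range(257):
    block = i.to_bytes(2, "big")
    h = tiny_hash(block)
    if h in seen:
        collision = (seen[h], block)  # X != Y but h(X) == h(Y)
        break
    seen[h] = block
print("colliding blocks:", collision)
```

A practical hash function has far more than 8 bits, which makes false matches rare but does not rule them out.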
Still further, the bandwidth required for repository updates and for transmitting data over a network also presents opportunities for improvement.
These problems create a dilemma of how to choose the size s of the blocks: if a large block size is chosen, one achieves a smaller index (since the index needs to store n/s elements) and the probability of a false match is reduced, but at the same time, the probability of finding a matching block is reduced, which ultimately reduces the compression ratio (assuming the hashing function is used in a compression method, which stores only non-matching blocks, and pointers to the matching ones). If, on the other hand, a small block size is chosen, the overall compression efficiency may increase, but the probability of a false match also increases, and the increased number of blocks may require an index so large that the index itself becomes a storage problem.
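The index-size side of this dilemma can be quantified with a rough calculation. The figure of 24 bytes per index entry (for example, a 16-byte hash value plus an 8-byte repository offset) is an assumption chosen for illustration, as are the block sizes shown; a real index would also carry data-structure overhead.

```python
PB = 2**50
TB = 2**40
ENTRY_BYTES = 24  # assumption: 16-byte hash value + 8-byte repository offset

def index_size_bytes(n: int, s: int, entry_bytes: int = ENTRY_BYTES) -> int:
    """Size of an index holding one entry per block, i.e. n/s entries."""
    return (n // s) * entry_bytes

for s in (512, 4096, 65536, 2**20):
    entries = PB // s
    size_tb = index_size_bytes(PB, s) / TB
    print(f"s = {s:>8} bytes: {entries:>16,} entries, index is about {size_tb:.3f} TB")
```

Under these assumptions, 512-byte blocks over a one-petabyte repository give an index of 48 TB, while 4096-byte blocks reduce it to 6 TB at the cost of fewer matching blocks.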
In summary, many elegant methods have been suggested to address these problems, but they all ultimately suffer from not being scalable, in reasonable time and space, to the amount of data in a large data repository.