Computer search algorithms are used by many programs, including data compression engines and database search engines. For example, an LZ77-based data compression algorithm transforms a stream of data into Huffman codes, each representing either a byte in the stream or a number of bytes which have previously appeared in the data stream and which are within a sliding history window of finite size. LZ77 compression engines thus require a search engine to search previous locations in the data stream in order to find the longest and closest possible match, if any, with the current data string that is to be compressed.
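By way of illustration only, the search requirement described above can be sketched as an exhaustive scan of the history window. The function name, the default window of thirty-two kilobytes, and the maximum match length of 258 bytes are assumptions made for this sketch and are not taken from any particular encoder.

```python
def longest_closest_match(data: bytes, pos: int,
                          window: int = 32 * 1024,   # assumed window size
                          max_len: int = 258):       # assumed length cap
    """Return (length, offset) of the longest match for data[pos:].

    Among matches of equal length, the one with the smallest offset wins.
    Returns (0, 0) when no earlier byte matches.
    """
    best_len, best_off = 0, 0
    start = max(0, pos - window)
    # Scan from the nearest candidate to the farthest, so that a later
    # candidate replaces the best only when it is strictly longer.
    for cand in range(pos - 1, start - 1, -1):
        length = 0
        while (length < max_len
               and pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_len, best_off = length, pos - cand
    return best_len, best_off


print(longest_closest_match(b"abcabcabcd", 6))  # (3, 3)
```

Such a scan compares against every position in the window and is therefore impractical for large windows; it serves only to define the result that a faster search engine is expected to produce.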
One commonly used search algorithm is the hash chain search algorithm, which linearly searches through a chain of buffer locations having the same hash value. In general, the hash chain search algorithm breaks a complete linear search into a number of smaller linear searches. The hash chain search algorithm provides acceptable results with short search windows, e.g., wherein the hash value is twelve to fifteen bits in length and the search window is limited in size to thirty-two to sixty-four kilobytes. However, with larger window sizes and/or hash values, the time required for the hash chain search becomes significant and forms a substantial bottleneck in the compression process.
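A minimal sketch of a hash chain search, under stated assumptions, is given below. The class name, the three-byte hash function, and the parameter defaults (a fifteen-bit hash and a thirty-two kilobyte window) are hypothetical choices made for illustration; actual encoders differ in these details.

```python
MIN_MATCH = 3  # assumed number of bytes hashed at each position


class HashChainMatcher:
    def __init__(self, data: bytes, window: int = 32 * 1024,
                 hash_bits: int = 15):
        self.data = data
        self.window = window
        self.mask = (1 << hash_bits) - 1
        self.head = {}                 # hash -> most recent position
        self.prev = [-1] * len(data)   # position -> earlier position, same hash

    def _hash(self, pos: int) -> int:
        b = self.data
        # Illustrative hash of three bytes; not any specific encoder's hash.
        return ((b[pos] << 10) ^ (b[pos + 1] << 5) ^ b[pos + 2]) & self.mask

    def insert(self, pos: int) -> None:
        """Link position pos at the front of the chain for its hash value."""
        if pos + MIN_MATCH > len(self.data):
            return
        h = self._hash(pos)
        self.prev[pos] = self.head.get(h, -1)
        self.head[h] = pos

    def find(self, pos: int):
        """Walk one chain, newest first; return (length, offset)."""
        if pos + MIN_MATCH > len(self.data):
            return 0, 0
        best_len, best_off = 0, 0
        cand = self.head.get(self._hash(pos), -1)
        while cand >= 0 and pos - cand <= self.window:
            length = 0
            while (pos + length < len(self.data)
                   and self.data[cand + length] == self.data[pos + length]):
                length += 1
            if length > best_len:      # strict: ties keep the closer offset
                best_len, best_off = length, pos - cand
            cand = self.prev[cand]
        return best_len, best_off


matcher = HashChainMatcher(b"abcabcabcd")
for i in range(6):
    matcher.insert(i)
print(matcher.find(6))  # (3, 3)
```

Each call to find() is a linear walk over one chain only, which is how the complete linear search is divided into smaller ones. The length of a chain grows with the number of window positions sharing a hash value, and the walk visits every such position.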
Binary search algorithms search for a data pattern by traversing a tree of nodes, using one pointer to a subtree of all nodes smaller than the current node and another pointer to a subtree of all nodes larger than the current node. Although binary searches can approach log(n) search times and are thus faster than linear searches in most cases dealing with large search windows, they are difficult to realize in many types of data compression encoders, including an LZ77 encoder. More particularly, inserting new nodes into a binary tree and/or deleting old nodes which exceed the window size requires a complete search of the tree and thus makes the search costly. The overall cost is significant because, once the encoder's input stream reaches the window size, every time the stream pointer is advanced (and a node is inserted into the binary search tree), a node must also be deleted from the tree.
Moreover, in an LZ77-based encoder, finding the closest offset is statistically important for providing improved compression. Thus, an LZ77-based encoder seeks to locate the closest match of a certain length, but the ordering of the nodes in a binary search tree makes it difficult to do so. By way of example, consider the conventional binary search tree structure of FIG. 1, wherein the offset from the current string pointer is represented by the value in parentheses. Note that in the conventional binary search tree of FIG. 1, new nodes are inserted as leaves of the tree, and thus the most-recently inserted nodes, which represent strings having the smallest offsets, are located at the tree leaves. If a search commences beginning with the character string "CAD . . . ," the search progresses from the root "CAN . . . " to the left subtree rooted at "BAT . . . " and on to "CAB . . . " before the search is terminated at the leaf node. Match lengths of two ("CA") are thus found at offsets eighty (80) and sixty (60). However, there are two other strings in the tree which have a match length of two, namely, "CAT . . . " and "CAR . . . " at offsets seventy (70) and twenty (20), respectively. Thus, although a normal binary search finds the largest match length, the binary search does not necessarily find the largest match length with the lowest offset. Accordingly, such a search must be modified (e.g., nodes of the same match length are flagged and all possible subtree paths with the same lengths are searched until the closest offset is found) in order to be used with a proper LZ77-based encoder. As can be appreciated, such a modified search is complex and is often relatively slow, failing to approach log(n) performance.
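The foregoing example can be reproduced with a short sketch. FIG. 1 is not reproduced here, so several details are assumptions made for illustration: the strings are truncated to three letters, offset eighty (80) is assigned to "CAN" and sixty (60) to "CAB", and the offset of seventy-five (75) for "BAT" is invented. The exhaustive traversal is a simple stand-in that visits every node; it is not the flagged-subtree search mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: str
    offset: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, offset):
    """Conventional insertion: each new node becomes a leaf."""
    if root is None:
        return Node(key, offset)
    if key < root.key:
        root.left = insert(root.left, key, offset)
    else:
        root.right = insert(root.right, key, offset)
    return root


def common_prefix(a: str, b: str) -> int:
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def path_search(root, target):
    """Ordinary descent: (best length, offsets seen on the path with it)."""
    best_len, offsets = 0, []
    node = root
    while node is not None:
        n = common_prefix(node.key, target)
        if n > best_len:
            best_len, offsets = n, [node.offset]
        elif n == best_len and n > 0:
            offsets.append(node.offset)
        node = node.left if target < node.key else node.right
    return best_len, offsets


def exhaustive_closest(root, target):
    """Visit every node: (best length, smallest offset with that length)."""
    best = (0, 0)
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        n = common_prefix(node.key, target)
        if n > best[0] or (n == best[0] and n > 0 and node.offset < best[1]):
            best = (n, node.offset)
        stack.extend((node.left, node.right))
    return best


root = None
# Oldest string (largest offset) first; the offset for "BAT" is invented.
for key, off in [("CAN", 80), ("BAT", 75), ("CAT", 70), ("CAB", 60), ("CAR", 20)]:
    root = insert(root, key, off)

print(path_search(root, "CAD"))         # (2, [80, 60])
print(exhaustive_closest(root, "CAD"))  # (2, 20)
```

The ordinary descent never enters the right subtree of "CAN" and therefore does not see "CAR" at offset twenty (20), whereas the stand-in finds it only by visiting every node, which forfeits the log(n) behavior.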
In short, hash chain algorithms provide poor performance when searching large compression windows. At the same time, existing binary search algorithms have a number of drawbacks associated therewith that make using binary search trees for data compression purposes rather cumbersome.