1. Field of the Invention
The present invention relates generally to string matching techniques, and more specifically to an improved method and apparatus for finding, in a buffer, a target data string which matches a current data string to a length greater than or equal to all other target data strings which are located in the buffer prior to the current string. This improved method and apparatus may advantageously be used in a data compressor, for example.
2. Description of the Background
Data compression is useful in a variety of circumstances. Compressed data occupy less storage on a permanent storage medium. Compressed data take less time to transfer over a communication link. Compressed data in network packets have an improved effective data content, increasing the effective packet content and the network throughput. Pages containing compressed data in a virtual memory system can be swapped in and out more quickly.
Data compressors take advantage of repeated strings in a series of data. Second and subsequent instances of a given string are represented by a relatively short reference to a previous instance of the string. Prior techniques, such as Lempel-Ziv compressors, maintain "dictionaries" of previously-encountered strings of bytes. In order to determine whether a given string has previously been encountered, prior techniques generally employ some sort of hashing function to access the dictionary. Hashing consumes computational resources, and slows down the compression of data.
The new method and apparatus of data compression and decompression described in the cross-referenced application "Fast Data Compressor with Direct Lookup Table Indexing into History Buffer" eliminate the need for a dictionary and also eliminate the need for hashing. A direct lookup table (DLT[]) is indexed or addressed by the first two bytes of a current data string in a history buffer (HB[]). A given entry in the DLT[] contains an offset or address in the HB[] at which the most recent prior string beginning with the same two bytes as the current string was found. When a previous matching string exists, the data compressor determines the length of the matching string, and outputs a vector reference back to the previous string. The DLT[] is then updated to reflect the newly-found current string. A subsequent string will later be referenced back to this new string, rather than to the previous string.
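The table probe and update described above can be sketched as follows. This is only an illustrative sketch: the function names, the flat 65536-entry list, and the -1 sentinel are assumptions made here, not details taken from the cross-referenced application.

```python
# Hypothetical sketch of one DLT[] probe; all names here are illustrative.
NO_ENTRY = -1  # assumed sentinel: no prior string began with these two bytes


def new_dlt():
    # One slot for every possible two-byte prefix: 256 * 256 = 65536 entries.
    return [NO_ENTRY] * 65536


def probe_and_update(dlt, hb, pos):
    """Return the HB[] offset of the most recent prior string that begins
    with hb[pos:pos+2] (or NO_ENTRY), then record pos as the newest one."""
    index = (hb[pos] << 8) | hb[pos + 1]  # the two bytes form the index directly
    previous = dlt[index]
    dlt[index] = pos  # later strings will be referenced back to this one
    return previous


hb = b"ABXABYAB"
dlt = new_dlt()
first = probe_and_update(dlt, hb, 0)   # "AB" has not been seen before
second = probe_and_update(dlt, hb, 3)  # "AB" most recently began at offset 0
third = probe_and_update(dlt, hb, 6)   # "AB" most recently began at offset 3
print(first, second, third)  # -1 0 3
```

Because the two bytes themselves form the index, no hash computation and no collision handling are involved; the cost is that each entry remembers only one prior position.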
This method is extremely fast, offering a substantial improvement over its predecessors. More significant, perhaps, is the fact that the disclosed decompressor can decompress the resulting compressed data at dramatically improved speeds, as is detailed in the cross-referenced application. However, the compressor of the cross-referenced application does not always obtain a theoretically maximum compression ratio.
For example, consider the string "ABCDABCABCD". According to the compressor of the cross-referenced application, the first "ABCD" bytes will be output as literal data, and the DLT[] entry addressed by "AB" (i.e. DLT["AB"]) will point to the first instance of "AB". Then, when the second instance of "ABC" is encountered, a vector will be output referring back to the first instance, with an indicated duplicate length of three (second "ABC" matches first "ABC"). The DLT["AB"] entry will be updated to point to the second, most recent instance of "AB". Subsequently, the third instance of "ABC" will be found, and will be matched back against the second instance of "ABC", where the entry DLT["AB"] is then pointing. A vector will be output referring back to the second "ABC", indicating a length of three (third "ABC" matches second "ABC"). Finally, the last "D" will be output as a literal.
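The walk-through above can be reproduced with a minimal greedy sketch. The token shapes ("lit", byte) and ("ref", distance back, length), the use of a dict in place of DLT[], and the choice to update the table only where each string begins are assumptions made for illustration; the cross-referenced application may differ in these details.

```python
def compress_most_recent(data):
    """Greedy sketch: each string is matched only against the most recent
    prior string that begins with the same two bytes."""
    dlt = {}     # stands in for DLT[]; keyed by the two-byte prefix
    tokens = []
    pos = 0
    while pos < len(data):
        key = data[pos:pos + 2]
        prev = None
        if len(key) == 2:        # a lone final byte cannot index the table
            prev = dlt.get(key)
            dlt[key] = pos       # the table now points at the current string
        if prev is None:
            tokens.append(("lit", data[pos:pos + 1]))
            pos += 1
            continue
        length = 2               # the two indexing bytes are known to match
        while (pos + length < len(data)
               and data[prev + length] == data[pos + length]):
            length += 1
        tokens.append(("ref", pos - prev, length))
        pos += length
    return tokens


tokens = compress_most_recent(b"ABCDABCABCD")
for token in tokens:
    print(token)
# Four literals, ('ref', 4, 3), ('ref', 3, 3), then the trailing literal b'D'.
```

The second reference has distance 3 because DLT["AB"] points at the second "ABC" by the time the third is reached, so the longer match against the first "ABCD" is never examined.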
It is desirable that this last, literal "D" be avoided. What is needed, therefore, is an improved string matching apparatus and method which finds optimal previous target strings having maximum matching string lengths with a current string. For example, it is desirable that such an improved data compressor reference the second "ABCD" back to the first "ABCD", rather than merely referencing the third "ABC" back to the second "ABC" and then outputting an excess "D" literal.
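The desired result can be illustrated with a deliberately naive exhaustive search that, for each current string, compares against every earlier position in the buffer and keeps the longest match. This is not the method of the present invention, which is not described in this section; it is only a brute-force sketch, with assumed token shapes and an assumed minimum match length of two, showing the output that an optimal matcher would produce.

```python
def compress_longest(data):
    """Brute-force sketch: try every earlier start position and keep the
    longest match. Ties go to the earliest position, an arbitrary choice."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_start = 0, None
        for start in range(pos):
            length = 0
            while (pos + length < len(data)
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_start = length, start
        if best_len >= 2:        # assumed minimum worthwhile match length
            tokens.append(("ref", pos - best_start, best_len))
            pos += best_len
        else:
            tokens.append(("lit", data[pos:pos + 1]))
            pos += 1
    return tokens


tokens = compress_longest(b"ABCDABCABCD")
for token in tokens:
    print(token)
# Four literals, ('ref', 4, 3), then ('ref', 7, 4); no trailing literal.
```

Here the final "ABCD" is referenced back to the first "ABCD" with a length of four, so the excess "D" literal disappears. The exhaustive scan costs time proportional to the buffer size at every position, which is why a faster way of locating the longest match is needed.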