Adaptive data transform algorithms are well known in the fields of data compression, encryption and message digest generation. The “history buffer” versions of these adaptive data transform algorithms, for example the Lempel-Ziv 1 (or LZ1) compression algorithm, have become particularly popular in hardware implementations, where their relatively modest buffer requirements and predictable performance make them a good fit for most underlying technologies.
The LZ1 algorithm works by examining the input string of characters and keeping a record of the characters it has encountered. Then, when a string appears that has occurred before in recent history, it is replaced in the output string by a “token”: a code indicating where in the past the string has occurred and for how long. Both the compressor and decompressor must use a “history buffer” of a defined length, but otherwise no more information need be passed between them.
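By way of illustration only, the following minimal Python sketch shows how a decompressor might expand such tokens using a bounded history buffer. The token layout (a single character for a literal, a `(distance, length)` pair for a back-reference), the name `decode` and the value of `HISTORY_SIZE` are assumptions made for this sketch; they are not the IBMLZ1 token format.

```python
from collections import deque

HISTORY_SIZE = 8  # hypothetical buffer length; real implementations use far larger buffers


def decode(tokens, history_size=HISTORY_SIZE):
    """Expand a toy token stream.

    A one-character str is a literal; a (distance, length) tuple copies
    `length` characters starting `distance` characters back in the history.
    """
    history = deque(maxlen=history_size)  # oldest characters drop off automatically
    output = []
    for token in tokens:
        if isinstance(token, str):
            history.append(token)
            output.append(token)
        else:
            distance, length = token
            if not 1 <= distance <= len(history):
                raise ValueError("token points outside the history buffer")
            for _ in range(length):
                ch = history[-distance]  # read before appending, so overlapping copies work
                history.append(ch)
                output.append(ch)
    return "".join(output)


print(decode(["a", "b", "c", (3, 3)]))  # prints "abcabc"
```

Because the decompressor rebuilds the same history from its own output, nothing beyond the agreed buffer length has to be transmitted alongside the tokens.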
Like many compression and other data transform algorithms, LZ1 describes the format of the compressed data, rather than how the compression should be performed. It is quite common for two or more LZ1 compressed data streams of different lengths to decompress to the same data; a valid compressed data stream is therefore not necessarily coded in its most efficient (i.e. most compressed) form. The same applies to data streams that have been encrypted using an adaptive transform to increase the entropy of the information. In many cases, there are efficiencies to be gained by optimizing the overall length of the tokens used to encode the data.
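The point that differently coded streams can represent the same data may be seen in the following Python sketch. The toy token layout (a character for a literal, a `(distance, length)` pair for a back-reference) and the helper name `expand` are assumptions for illustration only, and token count is used here merely as a rough stand-in for encoded length.

```python
def expand(tokens):
    """Expand toy tokens: str = literal, (distance, length) = copy from earlier output."""
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.append(tok)
        else:
            distance, length = tok
            for _ in range(length):
                out.append(out[-distance])
    return "".join(out)


# Three valid encodings of the same nine characters.
literal_only = list("abcabcabc")                # 9 tokens, no back-references
two_copies = ["a", "b", "c", (3, 3), (3, 3)]    # 5 tokens
one_copy = ["a", "b", "c", (3, 6)]              # 4 tokens, one overlapping copy

for stream in (literal_only, two_copies, one_copy):
    print(len(stream), expand(stream))
# prints:
# 9 abcabcabc
# 5 abcabcabc
# 4 abcabcabc
```

All three streams are valid under this toy format, but only the choice made by the parser determines which of them is produced.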
Some variations in the basic LZ1 algorithm have emerged, in particular using variable-length tokens to improve coding efficiency. For the purposes of this description, the variation known as IBMLZ1 will be used, but any version of the LZ1 algorithm would serve equally well. It will be clear to one skilled in the data processing art that many adaptive data transforms for encryption and for message digest generation exhibit the same need for optimal economy in parsing and tokenizing their respective input data streams.
The traditional method of finding occurrences of input strings in a history buffer in, for example, LZ1 compression can be described as “greedy” parsing. This is because the conventional parsing method always prefers the longest candidate string for encoding.
For example, suppose the history buffer contains the words “consensus” and “contagious”, and a new string, “contact” appears for processing (as shown in FIG. 5). The first three letters, “con”, will be matched with both strings in the buffer, and both will be regarded as candidates for substitution.
The fourth letter, “t”, however, matches only with “contagious”, and so “consensus” is abandoned as a potential replacement pointer. The fifth letter, “a”, also matches with “contagious”, but the match fails at the sixth, “c”. The matched string therefore terminates at this point, and a pointer to the string “conta” is substituted in the output stream. The parser (the apparatus or process that compares input characters with the contents of the history buffer and finds the best match) has thus been greedy in using the longest string it could find.
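The candidate-narrowing behaviour just described can be sketched in Python as follows. This is a simplified software illustration under stated assumptions, not the hardware parser of any cited application: the function name `greedy_match` is hypothetical, the history buffer is modelled as a plain string, and matches are not allowed to run past the end of the buffer into the pending input.

```python
def greedy_match(history, pending):
    """Return (start, length) of the longest prefix of `pending` found in `history`.

    Every buffer position starts as a candidate; candidates are discarded one
    input character at a time, as in the "contact" example. Ties are resolved
    in favour of the earliest surviving position. Returns None if nothing matches.
    """
    candidates = list(range(len(history)))
    length = 0
    while length < len(pending) and candidates:
        ch = pending[length]
        survivors = [s for s in candidates
                     if s + length < len(history) and history[s + length] == ch]
        if not survivors:
            break  # the match fails here; keep the candidates from the previous step
        candidates = survivors
        length += 1
    if length == 0:
        return None
    return candidates[0], length


history = "consensus contagious"
start, length = greedy_match(history, "contact")
print(start, length, history[start:start + length])  # prints: 10 5 conta
```

After “con” both words survive as candidates; “t” eliminates “consensus”, and the mismatch at “c” ends the match with the five-character string “conta”.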
If the parser processes input bytes one at a time, as is the case in conventional LZ1 processing, then the greedy algorithm is the best to use: the longer the encoded string, the greater the compression. A further reason for describing the algorithm as greedy is that it chooses the first such string it can find.
Co-pending PCT patent application number WO/GB03/00384, assigned to the same assignee, describes a hardware method of implementing LZ1 compression that processes an indefinite number of bytes per cycle. A further refinement providing a reduced gate cost and capable of processing three bytes per cycle is disclosed in co-pending PCT patent application number WO/GB03/00388, assigned to the same assignee.
However, these and all the parsers presently known in the art employ the greedy algorithm described above. Although this algorithm is best in the single-byte situation, there are many circumstances in which it does not produce optimum compression.
The Applicant believes that it would be desirable to alleviate this problem by providing an improved parser capable of providing greater compression efficiency.