This invention relates to string matching.
String matching, in which the characters of one data string are matched against characters of another data string, is useful in, e.g., data compression algorithms of the kind found in data communication systems using statistical multiplexers and modems.
Referring to FIG. 1, in a classic data communication system 10 using data compression techniques, data from a sender 12 undergoes compression 14, and the compressed data is transmitted via a communication medium 16 (in FIG. 1 assumed to be an error-free communication channel). At the other end of the channel, the compressed data is decompressed 20 to recover the original data stream.
In high-bandwidth communication networks the very substantial benefits of real-time data compression tend to be offset by the computational cost associated with the compression and decompression processes.
Among known data compression algorithms is the Ziv-Lempel '77 algorithm (ZL77 for short), which belongs to the class of variable-length input and fixed-length output (V-F class) data compression algorithms.
The ZL77 algorithm is based on the simple observation that, in a continuous data stream, some data occur more than once and in particular may occur more than once within a local region of the data stream. For example, in a stream of text data, the word "the" is likely to appear frequently. If the data source keeps track of the history of data recently sent (by storing it in a so-called history buffer), it can find redundant occurrences of a sequence of data by comparing successive elements of the current data to be sent with successive elements of the stored historical data. This process of comparison is called variable-length string matching. When a match is found, the sender, instead of again sending the full redundant sequence, encodes that sequence as a codeword which points to the location of the earlier occurrence of the redundant data sequence in the history buffer and refers to its length. Data compression is achieved if the number of bits required to represent the codeword is less than the number of bits required to represent the redundant data sequence. At the other end of the channel, a decoder, which similarly maintains a history buffer of recently sent data, decodes the codeword by referring to the specified place in the history buffer.
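The compression condition stated above can be illustrated with a short calculation. The following Python sketch is illustrative only: the function names, the fixed-width binary encoding of the location and length fields, and the 8-bit character size are assumptions made for the example and are not specified in the text.

```python
def bits_for(n_values: int) -> int:
    """Bits needed to distinguish n_values different values (at least 1)."""
    return max(1, (n_values - 1).bit_length())


def codeword_bits(history_size: int, max_match: int) -> int:
    """Size of a codeword holding a buffer location and a match length.

    Assumption: both fields are plain fixed-width binary numbers.
    """
    location_bits = bits_for(history_size)   # one value per buffer cell
    length_bits = bits_for(max_match + 1)    # lengths 0..max_match
    return location_bits + length_bits


def saves_bits(match_length: int, history_size: int, max_match: int,
               char_bits: int = 8) -> bool:
    """True if the codeword is shorter than the sequence it replaces."""
    return codeword_bits(history_size, max_match) < match_length * char_bits
```

Under these assumptions, a 16-cell history buffer with matches of up to 15 characters gives an 8-bit codeword, so a one-character match yields no saving while any match of two or more characters does.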
Referring to FIG. 2, for example, a history buffer 11 has 16 cells (13), which are numbered to indicate the order in which the characters of the data stream have appeared (lower-numbered cells hold more recent data characters). Data characters waiting to be sent are shown to the right of the history buffer. The next six characters to be sent are S U P E R B. ZL77 determines that the first five of these waiting characters, S U P E R, are redundant with a like string in the history buffer and can be encoded as a codeword 15 consisting of Index, Length, and Innovation Character fields. Index 17 has a value of 12, indicating how many characters back in the history buffer the matching string begins; Length 19 has a value of 5 and shows the length in characters of the match; and Innovation Character 21 is the first character in the waiting input string of characters that did not match the string in the history buffer.
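The formation of such a codeword can be sketched in Python as follows. This is a minimal illustration, not the encoder of any particular implementation: the function name `encode_one`, the brute-force search, the restriction of matches to characters already in the history buffer, and the buffer contents used in the example (which are hypothetical and are not taken from FIG. 2) are all assumptions. The history is modeled as a string whose last character is the most recent, and Index counts characters back from that end.

```python
def encode_one(history: str, pending: str) -> tuple[int, int, str]:
    """Return (index, length, innovation) for the next codeword.

    index  : how many characters back in the history the match begins
             (0 when there is no match)
    length : number of matched characters
    innovation : first pending character that was not matched
    """
    if not pending:
        raise ValueError("no characters waiting to be sent")
    best_index, best_length = 0, 0
    max_len = len(pending) - 1  # always leave one character as the innovation
    for start in range(len(history)):
        length = 0
        # Extend the match while characters agree and stay inside the history.
        while (length < max_len
               and start + length < len(history)
               and history[start + length] == pending[length]):
            length += 1
        if length > best_length:
            best_index, best_length = len(history) - start, length
    return best_index, best_length, pending[best_length]


# Hypothetical 16-character history in which "SUPER" begins 12 characters back.
print(encode_one("ABCDSUPEREFGHIJK", "SUPERB"))  # (12, 5, 'B')
```

The search here examines every starting position in turn, which is simple but slow; it is shown only to make the meaning of the three codeword fields concrete.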
Referring to FIG. 3, after transmitting the codeword, the data source updates its history buffer by effectively sliding old data 23 to the left and inserting the recent input data into the right of the history buffer. The process then begins again with the data source encoding the next input data.
Referring to FIG. 4, the data receiver maintains a duplicate history buffer 25 and updates it in the same way as the sender updated its history buffer. Upon receiving a codeword, the receiver uses the Index field to find the location of the longest match, and the Length to determine how many characters to read from the history buffer, also taking the Innovation Character as it appears in the codeword. Then, as shown in FIG. 5, having decoded the codeword, the receiver updates its history buffer by effectively sliding the characters in the history buffer to the left and inserting the decoded characters and Innovation Character from the right.
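The receiver's steps can be sketched in Python in the same spirit. The function name `decode_one`, the string model of the history buffer (most recent character last), the convention that Index counts characters back from the most recent end, and the sample buffer contents are assumptions made for illustration only.

```python
def decode_one(history: str, codeword: tuple[int, int, str],
               size: int = 16) -> tuple[str, str]:
    """Decode one codeword and return (decoded_text, updated_history)."""
    index, length, innovation = codeword
    start = len(history) - index            # where the earlier occurrence begins
    matched = history[start:start + length]  # copy Length characters from history
    decoded = matched + innovation           # then append the Innovation Character
    # Slide the buffer: append the new characters on the right and keep
    # only the most recent `size` characters.
    updated = (history + decoded)[-size:]
    return decoded, updated


# Hypothetical 16-character history; the codeword is (Index, Length, Innovation).
text, new_history = decode_one("ABCDSUPEREFGHIJK", (12, 5, "B"))
print(text)         # SUPERB
print(new_history)  # PEREFGHIJKSUPERB
```

Because the sender applies the same update after transmitting each codeword, the two buffers stay identical as long as the channel introduces no errors.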
One hardware architecture capable of implementing string matching algorithms is content addressable memory (CAM). A CAM consists of a number of cells that store data and that can be accessed by their contents rather than by their addresses. During a read cycle, the CAM takes data as input and outputs the address where the data is found.
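The read behavior of a CAM can be modeled in software as follows. This Python sketch is a behavioral model only; the class name `SimpleCAM` and its methods are hypothetical, and returning every matching address (rather than a single one) is an assumption. A hardware CAM compares all cells simultaneously, whereas this model visits them one by one.

```python
class SimpleCAM:
    """Behavioral model of a content addressable memory."""

    def __init__(self, n_cells: int) -> None:
        self.cells: list = [None] * n_cells  # None marks an empty cell

    def write(self, address: int, value: str) -> None:
        """Store a value in the cell at the given address."""
        self.cells[address] = value

    def search(self, value: str) -> list[int]:
        """Return the addresses of all cells whose contents equal value."""
        return [addr for addr, cell in enumerate(self.cells) if cell == value]


cam = SimpleCAM(16)
cam.write(3, "S")
cam.write(7, "S")
cam.write(8, "U")
print(cam.search("S"))  # [3, 7]
print(cam.search("Q"))  # []
```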