Data compression is a technique that can be used when either storing or transmitting a block of data which contains some redundancy. By compressing such a block of data, its effective size can be significantly reduced without reducing the amount of information that is carried by the particular data block. Data compression increases the density of information that is to be stored or communicated by reducing the amount of memory needed to store the block of data or the transmission time necessary to transmit such a block of data. Three significant characteristics are used to evaluate data compressors: how efficient the compressor is, how fast the compressor is, and whether the compressor can fully reproduce the block of data without introducing any error.
The efficiency of a data compressor is measured by its compression ratio, which is calculated by dividing the number of uncompressed characters by the number of compressed characters. The higher the compression ratio, the greater the density of the compressed data. A compression ratio of 2 denotes that the number of characters after compression is half of the number of characters before compression.
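The calculation can be illustrated with a short sketch. The function name and the character counts below are hypothetical and chosen only to reproduce the ratio of 2 mentioned above.

```python
def compression_ratio(uncompressed_count: int, compressed_count: int) -> float:
    """Number of uncompressed characters divided by number of compressed characters."""
    if compressed_count <= 0:
        raise ValueError("compressed_count must be positive")
    return uncompressed_count / compressed_count


# Hypothetical counts: 1000 characters before compression, 500 after.
ratio = compression_ratio(1000, 500)
print(ratio)
```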
Another important characteristic of a data compressor is how closely the output from the decompressor matches the original input. Compression techniques can be divided into two subdivisions, lossless and lossy. Lossless methods allow the exact reconstruction of the original data from the compressed data. Lossless methods are most appropriate for text compression applications or other applications where it is essential that the data be fully restored to its original condition. Lossy methods allow for some error to occur during compression and decompression. These types of methods are used where a good approximation is sufficient, such as on digitally sampled analog data.
The speed of a data compressor is also a very important characteristic to be considered. Devices that interact with a computer must be fast enough to allow the computer to function efficiently without creating a bottleneck in the system. In order to be beneficial to the system a data compressor must interface with the computer without slowing down its operation.
A common method of adaptive data compression is dictionary-based compression. Dictionary-based compression begins with an empty table of symbol strings and builds the table as the data is compressed, so that the contents of the string table will reflect the characteristics of the particular data block. Using this method, a compression ratio above 1 can be achieved if the number of bits required to represent a symbol string is less than the average length, in bits, of the repeated symbol strings.
Two adaptive data compression schemes which construct a dictionary of codes representing unique strings of previous data symbols were proposed by Jacob Ziv and Abraham Lempel in 1977 and 1978. The first is known as LZ1 or LZ77. The LZ1 algorithm was analyzed and improved upon by T. C. Bell in his doctoral thesis entitled "A Unifying Theory and Improvements for Existing Approaches to Text Compression," Department of Computer Science, University of Canterbury, Christchurch, New Zealand, 1986. A summary of Bell's work can also be found in the book "Text Compression", by Bell, Cleary & Witten, Prentice Hall, 1990. One of the most useful of Bell's enhancements to LZ1 is what is known as the LZB algorithm.
The LZB algorithm involves storing the last n symbols from the input data stream in a first-in-first-out (FIFO) buffer. The input data stream is compared to the contents of the buffer, and each time an input symbol string matches a string in the buffer, the symbol string is encoded as a code pair. The code pair consists of a first value representing the length of the string and a second value representing the string's position in the buffer.
The LZB data compression algorithm compares the input data stream to its previous n symbols stored in a FIFO buffer. Strings of input symbols that match strings in the buffer and are at least two symbols in length are encoded as code pairs. Strings that do not are simply transmitted unaltered. Code pairs consist of two values, a length l and a position p. The length l represents the length in symbols of the match string, and the position p is the distance from the current input to the most recent instance of the match string in the input stream.
Table 1 illustrates the encoding of the stream of symbols RINTINTIN. The first four symbols, RINT, are not found in the buffer and are output unaltered. The next three symbols, INT, match the previous three at a displacement of -3. The final two input symbols, IN, continue to match at a displacement of -3, so that the last five symbols INTIN are encoded with the pair (5, -3). Note that the value of a code pair length may exceed that of the position.
Code pairs and unencoded bytes (i.e., when no match has occurred) can be further encoded using either fixed length or variable length codes. Bell's thesis shows that encoding the length, l, and the position, p, of code pairs using variable length codes generally results in further compression. Bell also points out that a flag bit must be sent with each code pair or unencoded byte so that the decompressor can distinguish between them. The flag bits along with the fixed or variable length encodings will be referred to as the postcode.
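A fixed length postcode of this kind might be sketched as follows. The flag polarity (0 for an unencoded byte, 1 for a code pair), the 4-bit field widths, and the function name are assumptions made for illustration; they are not taken from Bell's thesis.

```python
def postcode_fixed(codes, length_bits=4, position_bits=4):
    """Pack literals and (length, position) pairs into a bit string.

    Assumed layout (hypothetical): '0' + 8-bit symbol for an unencoded byte,
    '1' + length field + position field for a code pair.
    """
    bits = []
    for code in codes:
        if isinstance(code, tuple):
            length, position = code
            distance = -position  # positions are written as negative displacements
            if length >= 2 ** length_bits or distance >= 2 ** position_bits:
                raise ValueError("code pair does not fit the assumed field widths")
            bits.append("1"
                        + format(length, f"0{length_bits}b")
                        + format(distance, f"0{position_bits}b"))
        else:
            bits.append("0" + format(ord(code), "08b"))
    return "".join(bits)


# Codes for the RINTINTIN example: four unencoded symbols and one pair.
stream = postcode_fixed(["R", "I", "N", "T", (5, -3)])
print(len(stream), "bits, versus", 9 * 8, "bits uncompressed")
```

Under these assumed widths, each unencoded byte costs 9 bits and each code pair costs 9 bits, so a pair only pays off when it replaces two or more symbols.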
TABLE 1
______________________________________
LZ1 Compression Example
Input Char      Output Code
______________________________________
R               "R"
I               "I"
N               "N"
T               "T"
I N T I N       (5,-3)
______________________________________
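The encoding of Table 1 can be reproduced with a minimal software sketch of the matching procedure. This is an illustrative greedy encoder written under stated assumptions, not the implementation of any cited reference: the function name, the default window size of 16 symbols, and the representation of output codes as Python strings and tuples are all hypothetical.

```python
def lz1_encode(data: str, window: int = 16, min_match: int = 2):
    """Greedy LZ1/LZB-style encoder: literals or (length, -distance) pairs."""
    out = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Nearest candidates are tried first, so on a tie the most recent
        # instance of the match string is kept.
        for dist in range(1, min(i, window) + 1):
            length = 0
            # The match may run past the current input position, which is
            # why a length can exceed its position.
            while (i + length < len(data)
                   and data[i + length - dist] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, dist
        if best_len >= min_match:
            out.append((best_len, -best_dist))
            i += best_len
        else:
            out.append(data[i])  # no usable match: symbol is sent unaltered
            i += 1
    return out


print(lz1_encode("RINTINTIN"))
```

The exhaustive inner search is the step that hardware implementations aim to accelerate; it is written here for clarity rather than speed.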
Decompression also utilizes a FIFO buffer; however, no searching is necessary. Decoding of the output of the previous example is shown in Table 2. It is assumed that any postcode has already been decoded. The first four primitive symbols, RINT, are output unaltered and stored in the buffer. Next, the code pair (5, -3) is expanded by copying five symbols starting three positions back in the buffer: the last three symbols in the buffer, INT, followed by the two symbols, IN, that begin the newly copied string. The resulting string INTIN is output and added to the buffer.
TABLE 2
______________________________________
LZ1 Decompression Example
Input           Output Char
______________________________________
"R"             R
"I"             I
"N"             N
"T"             T
(5,-3)          INTIN
______________________________________
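The decoding of Table 2 can be sketched in a few lines. As assumptions for illustration, the function name is hypothetical, codes are represented as Python strings and tuples, and the buffer is modeled as an unbounded list rather than a FIFO of n symbols.

```python
def lz1_decode(codes):
    """Expand literals and (length, position) pairs; no searching is needed."""
    out = []
    for code in codes:
        if isinstance(code, tuple):
            length, position = code
            start = len(out) + position  # position is a negative displacement
            # Copy one symbol at a time so that a pair whose length exceeds
            # its position can reuse symbols it has just produced.
            for k in range(length):
                out.append(out[start + k])
        else:
            out.append(code)  # primitive symbol, output unaltered
    return "".join(out)


print(lz1_decode(["R", "I", "N", "T", (5, -3)]))
```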
The second Lempel-Ziv algorithm was introduced in an article entitled "Compression of Individual Sequences via Variable Rate Coding", IEEE Transactions on Information Theory, Vol. 24, No. 5, pages 530-536 (September 1978). This method constructs a table or dictionary of symbol strings from the data as it is input to the compressor. Then the next time that a specific string is encountered, its corresponding dictionary index will be transmitted instead of the symbol string. This compression scheme is referred to as LZ78 or LZ2.
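A minimal sketch of this dictionary construction follows. It uses the common formulation in which each output is a pair of a dictionary index and the symbol that extends the indexed string; the function names, the use of index 0 for the empty string, and the handling of a trailing partial string are assumptions for illustration.

```python
def lz78_encode(data: str):
    """Emit (dictionary index, next symbol) pairs while building the dictionary."""
    dictionary = {"": 0}  # index 0 stands for the empty string
    out = []
    prefix = ""
    for symbol in data:
        candidate = prefix + symbol
        if candidate in dictionary:
            prefix = candidate  # keep extending the current match
        else:
            out.append((dictionary[prefix], symbol))
            dictionary[candidate] = len(dictionary)
            prefix = ""
    if prefix:
        out.append((dictionary[prefix], ""))  # flush a trailing match
    return out


def lz78_decode(pairs):
    """Rebuild the same dictionary from the pairs and concatenate the strings."""
    strings = [""]
    out = []
    for index, symbol in pairs:
        entry = strings[index] + symbol
        out.append(entry)
        strings.append(entry)
    return "".join(out)


pairs = lz78_encode("RINTINTIN")
print(pairs)
print(lz78_decode(pairs))
```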
In 1984 Terry Welch proposed a variation on the LZ2 procedure in "A Technique For High-Performance Data Compression", IEEE Computer, Vol. 17, No. 6, pages 8-19 (June 1984). This data compression scheme is referred to as the LZW algorithm. It is organized around a table, made up of strings of characters, where each string is unique. Each string is referenced by a fixed length code which represents the longest matching string seen thus far in the previous input plus the one byte that makes this string different from prior strings. In U.S. Pat. No. 4,558,302 Terry Welch presents a hardware implementation of the LZW algorithm utilizing a Random Access Memory (RAM) and a limited search hashing procedure to search through the string table and enter extended strings in the random access memory.
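The string table behavior can be sketched in software as follows. This is an illustrative encoder only, not the hardware of the cited patent: the function name is hypothetical, the table is a Python dictionary rather than a hashed RAM, and the table is allowed to grow without the limit that a fixed length code would impose.

```python
def lzw_encode(data: bytes):
    """Emit one table code per longest matching string, extending the table as it goes."""
    # The table starts with every single-byte string.
    table = {bytes([b]): b for b in range(256)}
    out = []
    prefix = b""
    for b in data:
        candidate = prefix + bytes([b])
        if candidate in table:
            prefix = candidate  # longest match so far
        else:
            out.append(table[prefix])
            # New entry: the matched string plus the one byte that made it differ.
            table[candidate] = len(table)
            prefix = bytes([b])
    if prefix:
        out.append(table[prefix])
    return out


print(lzw_encode(b"RINTINTIN"))
```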
Both the LZ1 and LZ2 algorithms rely on searching a large buffer or dictionary for occurrences of previously encountered symbol strings. Previous implementations of LZ1 and LZ2 compressors utilized conventional static RAMs for buffer or dictionary storage and hash coding methods to accelerate the search process.
The first known implementation of a content addressable memory (CAM) based LZ2 system was constructed by Advanced Hardware Architectures, Inc. in 1991 and is the subject of U.S. patent application Ser. No. 07/924,293 filed on Aug. 3, 1992. An LZB system utilizing a shift register was proposed by Whiting and George in U.S. Pat. No. 5,003,307. In the Whiting system, the FIFO buffer required by the LZ1 algorithm is implemented by a hardware shift register of n words and n comparators, so that the current input symbol is compared with each word in the buffer concurrently. When multiple string matches are found in the buffer, the longest running match is identified by a controller state machine which accumulates the number of matches (corresponding to string length) for each entry in the buffer. When there is more than one instance of the longest matching string in the buffer, a priority encoding network determines the position of the most recent matching string.
Associated with each buffer word in the Whiting system is a flip-flop whose output, String(j), indicates whether the buffer string ending in location j has continually matched the current input string. At the beginning of each string search, all n flip-flops are preset to 1 by a StartString signal from the controller state machine. String(j) is asserted for each word in the array until a mismatch is detected in location j. The end of a symbol string is reached when all String(j) outputs have been disabled. This is reported to the controller state machine by a signal Match?, which is the logic OR of all String(j) signals. While the control logic for the StartString signal is not disclosed in the Whiting patent, presumably Match? is used to assert StartString for at least a portion of a clock cycle to initiate a search for a new string in the array.
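The interaction of the String(j) flags with the shift register can be modeled in software. The sketch below is a behavioral approximation under several assumptions that are not taken from the Whiting patent: location 0 holds the most recent symbol, empty locations hold None, the flags from the last matching cycle are retained so that the position can be read out after the mismatch, and the function name is hypothetical.

```python
def longest_match(buffer, lookahead):
    """Model of n comparators and n String(j) flags over a shift register.

    buffer[0] is assumed to be the most recent symbol.
    Returns (length, position), or (0, None) when nothing matches.
    """
    buf = list(buffer)
    n = len(buf)
    string = [True] * n  # StartString presets every flip-flop to 1
    length = 0
    for symbol in lookahead:
        # All n comparisons happen concurrently in hardware.
        new = [string[j] and buf[j] == symbol for j in range(n)]
        if not any(new):  # Match? (OR of all String(j)) is deasserted
            break
        string = new
        length += 1
        buf = [symbol] + buf[:-1]  # shift the input symbol into the register
    if length == 0:
        return (0, None)
    # Priority encoding: the lowest asserted location is the most recent match.
    j = string.index(True)
    return (length, -(j + 1))


# Buffer after RINT has been shifted in, newest symbol first, in an 8-word register.
print(longest_match(["T", "N", "I", "R", None, None, None, None], "INTIN"))
```

Because the register shifts by one word per symbol, the flag at a fixed location j keeps tracking the same displacement, which is why a single flip-flop per word suffices in this model.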
In U.S. Pat. No. 5,016,009 to Whiting et al., another implementation of the LZB algorithm is presented. In this case, rather than utilizing a shift register means, a hashing table is used.
What is needed is an adaptive data compression system which utilizes an enhanced version of the LZ1 algorithm and uses a memory which does not require the data to be shifted and does not require a hashing table, but will allow compression or decompression to occur at a maximum rate of one symbol per clock cycle.