Data compression refers to the process of reducing the amount of data needed to represent a given information. The underlying basis of the reduction process is the removal of redundant or unnecessary data. Data compression techniques reduce the costs for information storage and transmission. Data compression techniques are used in many applications, ranging from simple file size reduction to speech and video encoding.
There are two different types of compression: lossless and lossy. In lossless compression, the source message at the encoder input is retrieved exactly at the output of the decoder. In lossy compression, the message is not retrieved exactly, but the information loss is tolerable for the type of application targeted by the compression schemes. Lossy compression is mainly used for speech, audio, image and video signals. The aim of the compression algorithm is to represent the signal with a minimum number of bits while maintaining the signal intelligibility and perceptual quality. All the information that cannot be perceived by human sensors can be removed.
Lossless compression techniques are used in applications where no information loss is tolerable such as compressing executable and source code files, satellite imaging and medical imaging. The techniques are also used as part of lossy compression schemes for better compression ratios.
One well-known technique for performing lossless compression is the LZW Lempel-Ziv-Welch (“LZW”) algorithm. The LZW algorithm is a universal algorithm based on string parsing according to a fixed rule. It is based on the concept that often used sequences can be encoded in a lesser number of bits than would be required to spell out the entire sequence. The LZW algorithm requires the initialization of a table with the alphabet of the source. A symbol width is selected and the source alphabet is created for the symbols and stored in a coding table in the encoder and the decoder before the start of the encoding process. The LZW algorithm adds selected sequences of symbols (vocabulary) to a dictionary as it encodes received sequences of symbols. Sequences contained in the dictionary can be encoded with a lesser number of bits than that required to spell out the entire sequence with symbols. The size of the source alphabet is dependent on the width of the symbol. For example, a symbol width of 1 byte (8 bits) requires a source alphabet of 28 (256) entries and a symbol width of 2 bytes (16 bits) requires a source alphabet of 216 (64K) entries. Typically, the LZW algorithm is implemented with a symbol width of one byte (8 bits). The LZW algorithm searches the coding table for the longest match in a received sequence of symbols and transmits the index of the longest match stored in the dictionary.
FIG. 1 illustrates a prior art LZW coding table 100 in an encoder and decoder for performing lossless data compression. The LZW coding table 100 can be a ternary Contents Addressable Memory (“CAM”). The input sequence of symbols 102 is translated to a sequence of indexes by the encoder using the source alphabet 106 and dictionary 108 stored in the LZW coding table 100. The coding table 100 in the encoder 110 and the decoding table 120 in decoder 112 include the source alphabet 106 and dictionary 108. The sequence of indexes 114 is transmitted by the encoder 110 and decoded by the decoder 112. The decoder 112 provides an output string 104 with the same symbols as the input sequence of symbols 102. The source alphabet 106 is stored in the LZW coding table 100 in the encoder 110 and the decoder 112 before the encoder 110 starts to encode the input sequence of symbols 102. The sequence of indexes 114 transmitted from the encoder 110 to the decoder 112 are indexes of plain text symbols stored in the source alphabet 106 or indexes of strings of symbols stored in the dictionary 108. The encoder 110 and the decoder 112 independently create entries in their respective dictionaries by learning new sequences of symbols dependent on the initial source alphabet. The encoder 110 adds a new sequence of symbols in the dictionary but transmits the index of the previously learned symbols or sequence of symbols to the decoder 112 in the sequence of indexes 114. The decoder also learns the new sequence of symbols and stores the new sequence of symbols at a new index in the LZW decoding table 120 in the dictionary 108.
FIG. 2 illustrates a prior art LZW compression of an input string in the encoder 110 shown in FIG. 1. The source alphabet 106 (FIG. 1) is stored in the LZW coding table 100 (FIG. 1) before the encoder 110 (FIG. 1) starts parsing the input sequence of symbols 102 or before the decoder starts decoding. The source alphabet 106 (FIG. 1) for an 8-bit symbol is stored at indexes 0-255 in the coding table 100 (FIG. 1) and the decoding table 120 (FIG. 1). The contents of five of the 256 locations in the source alphabet 106 (FIG. 1) are shown. Symbol ‘/’ is stored at index 47, symbol ‘b’ is stored at index 98, symbol ‘d’ is stored at index 100, symbol ‘e’ is stored at index 101, symbol ‘t’ is stored at index 116 and symbol ‘w’ is stored at index 119. An additional entry 256 at index 256 in the source alphabet 106 in the LZW coding table 100 stores End Of String (“EOS”), and entry 257 at index 257 in the dictionary 108 in the LZW coding table 100 stores a Flush code.
An input sequence of symbols 102 is received by the encoder 110 (FIG. 1). The encoder 110 (FIG. 1) parses the input sequence of symbols 102 and transmits the sequence of indexes 114 (FIG. 1). The input sequence of symbols 102 is encoded by the encoder 110 (FIG. 1) by parsing the input sequence of symbols 102 and searching the LZW coding table 100 for the longest match for the symbols and transmitted as a sequence of indexes (code words) for entries in the LZW coding table 100. An index can be a pointer to an entry in the source alphabet 106 or the dictionary 108.
As shown in the LZW coding table 100, the index for the entry in the source alphabet 106 storing the symbol ‘/’is 47. Initially, the coding table 100 stores only the source alphabet 106. As a sequence of symbols 102 is received by the encoder 110 (FIG. 1), the encoder 110 (FIG. 1) parses the sequence of symbols 102 dependent on the symbol width. The encoder 110 (FIG. 1) selects a symbol in the sequence of symbols 102 (FIG. 1) and searches the LZW coding table 100 for the symbol. The encoder learns vocabulary by concatenating known symbols and sequences of symbols. If the symbol is found, the symbol is concatenated with the next symbol, and the LZW coding table 100 is searched for a sequence of symbols formed by the two symbols. If the sequence of symbols is not stored in the LZW coding table 100, the index of the previously identified symbol or sequence of symbols is transmitted and the new sequence of symbols is added to the LZW coding table 100.
The operation of the encoder using the LZW algorithm is illustrated using the input sequence of symbols 102: /wed/we/wee/web/wet/ as shown in FIG. 2 and a symbol width of one character (8 bits). The coding table 100 stores the initial alphabet which includes an entry for each 8-bit symbol including ‘/’, ‘w’, ‘e’, ‘d’, ‘b’ and ‘t’.
The parsing of the input sequence of symbols 102 starts with symbol ‘/’. Symbol ‘/’ is stored in the LZW coding table 100 at index 47, ‘/’ is concatenated with the next symbol ‘w’, and the coding table is searched for the sequence of symbols ‘/w’; since ‘/w’ is not then stored in the LZW coding table 100, ‘/w’ is learned by storing ‘/w’ at the next sequential index 258. The index for ‘/’; that is, 47 the previously identified symbol is transmitted in the sequence of indexes 104.
Parsing starts again at symbol ‘w’ in the input sequence of symbols 102. The LZW coding table 100 is searched for symbol ‘w’. Symbol ‘w’ is stored in the LZW coding table 100 at index 119, symbol ‘w’ is concatenated with the next symbol ‘e’ in the input sequence of symbols 102 and coding table is searched for the sequence of symbols ‘we’. Since ‘we’ is not then stored in the coding table 100, ‘we’ is learned by storing ‘we’ in the coding table at the next sequential index 259. The index for ‘w’; that is, 119 the previously identified symbol is transmitted in the sequence of indexes 104.
Parsing starts again at symbol ‘e’ in the input sequence of symbols 102. Symbol ‘e’ is stored in the LZW coding table 100 at index 101. Symbol ‘e’ is concatenated with the next symbol ‘d’ in the input sequence of symbols and the LZW coding table 100 is searched for the sequence of symbols ‘ed’. Since, ‘ed’ is not stored in the LZW coding table 100, ‘ed’ is learned by storing ‘ed’ in the next sequential entry in the coding table at index 260. The index for ‘e’, that is, 101, the previously identified symbol is transmitted in the sequence of indexes 104.
Parsing starts again at symbol ‘d’ in the input sequence of symbols 102. Symbol ‘d’ is stored in the coding table 100 at index 100. Symbol ‘d’ is concatenated with the next symbol ‘/’ in the input sequence of symbols 102 and the LZW coding table 100 is searched for the sequence of symbols ‘d/’. Since, ‘d/’ is not stored in the LZW coding table 100, ‘d/’ is learned by storing ‘d/’ in the next sequential entry in the coding table at index 261. The index for the previously identified symbol ‘d’, 100, is transmitted in the sequence of indexes 104.
Parsing starts again from the symbol ‘/’ in the input sequence of symbols 102. Symbol ‘/’ is stored in the LZW coding table 100 at index 47. Symbol ‘/’ is concatenated with the next symbol ‘w’ in the input sequence of symbols 102 and the LZW coding table 100 is searched for the sequence of symbols ‘/w’. ‘/w’ is stored in the coding table 100 at index 258, ‘/w’ is concatenated with ‘e’ in the input sequence of symbols 102 and the coding table is searched for the sequence of symbols ‘/we’. Since ‘/we’ is not stored in the coding table 100, ‘/we’ is learned by storing ‘/we’ in the next sequential entry in the LZW coding table at index 262. The index for the previously identified sequence of symbols ‘/w’, 258, is transmitted in the sequence of indexes 104.
For example, for a symbol width of 8 bits, upon finding a match for the 24 bit (3 bytes×8 bits) per byte string of characters ‘/we’, a 9-bit index (the address of the string of symbols ‘/we’ stored in the dictionary) is transmitted from the encoder to the decoder. This reduces the number of bits transmitted from 24 to 9. Upon receiving the 9-bit index the decoder regenerates the string of characters ‘/we’ stored at the 9-bit index in its copy of the dictionary. If no corresponding sequence (prefix) had been found in the dictionary, indexes for the individual symbols ‘/’, ‘w’ and ‘e’ would be transmitted.
Transmission of the input sequence of twenty symbols 102 requires 160 bits (20 symbols×8 bits per symbol). The LZW algorithm reduces the number of bits transmitted to 126 bits (14 indices×9 bits). As the input sequence of symbols 102 is parsed, the vocabulary stored in the dictionary 108 in the coding table 100 increases and the lengths of the sequences of symbols stored in the coding table increase. For example, index 264 represents a sequence of four symbols ‘/wee’.
The longer the sequence of symbols stored in the coding table, the better the compression because the number of indexes transmitted is decreased. Compression can also be improved by increasing the symbol width. However, the initial source alphabet required by the LZW algorithm increases by two for each bit added to the symbol width and thus requires an impractical table size for encoding an initial source alphabet for symbol widths of several bytes.