Data compression refers to the process of reducing the amount of data needed to represent given information. The reduction is achieved by removing redundant or unnecessary data. Data compression techniques reduce the costs of information storage and transmission and are used in many applications, ranging from simple file size reduction to speech and video encoding.
There are two types of compression: lossless and lossy. In lossless compression, the source message at the encoder input is retrieved exactly at the output of the decoder. In lossy compression, the message is not retrieved exactly, but the information loss is tolerable for the type of application targeted by the compression scheme. Lossy compression is mainly used for speech, audio, image and video signals. The aim of the compression algorithm is to represent the signal with a minimum number of bits while maintaining the signal intelligibility and perceptual quality. All the information that cannot be perceived by human senses can be removed.
Lossless compression techniques are used in applications where no information loss is tolerable, such as compressing executable and source code files, satellite imaging and medical imaging. The techniques are also used as part of lossy compression schemes for better compression ratios.
One well-known technique for performing lossless compression is the Lempel-Ziv-Welch (“LZW”) algorithm. The LZW algorithm is a universal algorithm based on string parsing according to a fixed rule. It is based on the concept that frequently used sequences can be encoded in fewer bits than would be required to spell out the entire sequence. The LZW algorithm requires the initialization of a table with the alphabet of the source. A symbol width is selected, and the source alphabet is created for the symbols and stored in a coding table in the encoder and the decoder before the start of the encoding process. The LZW algorithm adds selected sequences of symbols (vocabulary) to a dictionary as it encodes received sequences of symbols. Sequences contained in the dictionary can be encoded with fewer bits than are required to spell out the entire sequence with symbols. The size of the source alphabet is dependent on the width of the symbol. For example, a symbol width of 1 byte (8 bits) requires a source alphabet of 2^8 (256) entries and a symbol width of 2 bytes (16 bits) requires a source alphabet of 2^16 (64K) entries. Typically, the LZW algorithm is implemented with a symbol width of one byte (8 bits). The LZW algorithm searches the coding table for the longest match in a received sequence of symbols and transmits the index of the longest match stored in the dictionary.
FIG. 1 illustrates a prior art LZW coding table 100 in an encoder and decoder for performing lossless data compression. The LZW coding table 100 can be a ternary Content Addressable Memory (“CAM”). The input sequence of symbols 102 is translated to a sequence of indexes by the encoder using the source alphabet 106 and dictionary 108 stored in the LZW coding table 100. The coding table 100 in the encoder 110 and the decoding table 120 in the decoder 112 include the source alphabet 106 and dictionary 108. The sequence of indexes 114 is transmitted by the encoder 110 and decoded by the decoder 112. The decoder 112 provides an output string 104 with the same symbols as the input sequence of symbols 102. The source alphabet 106 is stored in the LZW coding table 100 in the encoder 110 and the decoder 112 before the encoder 110 starts to encode the input sequence of symbols 102. The indexes in the sequence of indexes 114 transmitted from the encoder 110 to the decoder 112 are indexes of plain text symbols stored in the source alphabet 106 or indexes of strings of symbols stored in the dictionary 108. The encoder 110 and the decoder 112 independently create entries in their respective dictionaries by learning new sequences of symbols dependent on the initial source alphabet. The encoder 110 adds a new sequence of symbols to the dictionary but transmits the index of the previously learned symbol or sequence of symbols to the decoder 112 in the sequence of indexes 114. The decoder also learns the new sequence of symbols and stores the new sequence of symbols at a new index in the dictionary 108 in the LZW decoding table 120.
FIG. 2 illustrates a prior art LZW compression of an input string in the encoder 110 shown in FIG. 1. The source alphabet 106 (FIG. 1) is stored in the LZW coding table 100 (FIG. 1) before the encoder 110 (FIG. 1) starts parsing the input sequence of symbols 102 or before the decoder starts decoding. The source alphabet 106 (FIG. 1) for an 8-bit symbol is stored at indexes 0-255 in the coding table 100 (FIG. 1) and the decoding table 120 (FIG. 1). The contents of six of the 256 locations in the source alphabet 106 (FIG. 1) are shown. Symbol ‘/’ is stored at index 47, symbol ‘b’ is stored at index 98, symbol ‘d’ is stored at index 100, symbol ‘e’ is stored at index 101, symbol ‘t’ is stored at index 116 and symbol ‘w’ is stored at index 119. An additional entry 256 at index 256 in the source alphabet 106 in the LZW coding table 100 stores End Of String (“EOS”), and entry 257 at index 257 in the dictionary 108 in the LZW coding table 100 stores a Flush code.
An input sequence of symbols 102 is received by the encoder 110 (FIG. 1). The encoder 110 (FIG. 1) parses the input sequence of symbols 102, searches the LZW coding table 100 for the longest match for the symbols, and transmits the result as a sequence of indexes 114 (FIG. 1), that is, code words for entries in the LZW coding table 100. An index can be a pointer to an entry in the source alphabet 106 or the dictionary 108.
As shown in the LZW coding table 100, the index for the entry in the source alphabet 106 storing the symbol ‘/’ is 47. Initially, the coding table 100 stores only the source alphabet 106. As a sequence of symbols 102 is received by the encoder 110 (FIG. 1), the encoder 110 (FIG. 1) parses the sequence of symbols 102 dependent on the symbol width. The encoder 110 (FIG. 1) selects a symbol in the sequence of symbols 102 (FIG. 1) and searches the LZW coding table 100 for the symbol. The encoder learns vocabulary by concatenating known symbols and sequences of symbols. If the symbol is found, the symbol is concatenated with the next symbol, and the LZW coding table 100 is searched for a sequence of symbols formed by the two symbols. If the sequence of symbols is not stored in the LZW coding table 100, the index of the previously identified symbol or sequence of symbols is transmitted and the new sequence of symbols is added to the LZW coding table 100.
The operation of the encoder using the LZW algorithm is illustrated using the input sequence of symbols 102, /wed/we/wee/web/wet/, as shown in FIG. 2, and a symbol width of one character (8 bits). The coding table 100 stores the initial alphabet, which includes an entry for each 8-bit symbol, including ‘/’, ‘w’, ‘e’, ‘d’, ‘b’ and ‘t’. The parsing of the input sequence of symbols 102 starts with symbol ‘/’. Symbol ‘/’ is stored in the LZW coding table 100 at index 47. Symbol ‘/’ is concatenated with the next symbol ‘w’, and the coding table is searched for the sequence of symbols ‘/w’. Since ‘/w’ is not yet stored in the LZW coding table 100, ‘/w’ is learned by storing ‘/w’ at the next sequential index 258. The index for the previously identified symbol ‘/’, 47, is transmitted in the sequence of indexes 114.
Parsing starts again at symbol ‘w’ in the input sequence of symbols 102. The LZW coding table 100 is searched for symbol ‘w’. Symbol ‘w’ is stored in the LZW coding table 100 at index 119. Symbol ‘w’ is concatenated with the next symbol ‘e’ in the input sequence of symbols 102 and the coding table is searched for the sequence of symbols ‘we’. Since ‘we’ is not yet stored in the coding table 100, ‘we’ is learned by storing ‘we’ in the coding table at the next sequential index 259. The index for the previously identified symbol ‘w’, 119, is transmitted in the sequence of indexes 114.
Parsing starts again at symbol ‘e’ in the input sequence of symbols 102. Symbol ‘e’ is stored in the LZW coding table 100 at index 101. Symbol ‘e’ is concatenated with the next symbol ‘d’ in the input sequence of symbols 102 and the LZW coding table 100 is searched for the sequence of symbols ‘ed’. Since ‘ed’ is not stored in the LZW coding table 100, ‘ed’ is learned by storing ‘ed’ in the next sequential entry in the coding table at index 260. The index for the previously identified symbol ‘e’, 101, is transmitted in the sequence of indexes 114.
Parsing starts again at symbol ‘d’ in the input sequence of symbols 102. Symbol ‘d’ is stored in the coding table 100 at index 100. Symbol ‘d’ is concatenated with the next symbol ‘/’ in the input sequence of symbols 102 and the LZW coding table 100 is searched for the sequence of symbols ‘d/’. Since ‘d/’ is not stored in the LZW coding table 100, ‘d/’ is learned by storing ‘d/’ in the next sequential entry in the coding table at index 261. The index for the previously identified symbol ‘d’, 100, is transmitted in the sequence of indexes 114.
Parsing starts again from the symbol ‘/’ in the input sequence of symbols 102. Symbol ‘/’ is stored in the LZW coding table 100 at index 47. Symbol ‘/’ is concatenated with the next symbol ‘w’ in the input sequence of symbols 102 and the LZW coding table 100 is searched for the sequence of symbols ‘/w’. Since ‘/w’ is stored in the coding table 100 at index 258, ‘/w’ is concatenated with ‘e’ in the input sequence of symbols 102 and the coding table is searched for the sequence of symbols ‘/we’. Since ‘/we’ is not stored in the coding table 100, ‘/we’ is learned by storing ‘/we’ in the next sequential entry in the LZW coding table at index 262. The index for the previously identified sequence of symbols ‘/w’, 258, is transmitted in the sequence of indexes 114.
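The parsing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation shown in FIG. 2: the table is a plain Python dictionary rather than a CAM, the names `lzw_encode`, `EOS`, `FLUSH` and `FIRST_FREE` are hypothetical, and appending the EOS index at the end of the input is an assumption about how the string is terminated.

```python
EOS = 256          # End Of String entry, as described for index 256
FLUSH = 257        # Flush code, as described for index 257 (not used in this sketch)
FIRST_FREE = 258   # first index available for learned sequences

def lzw_encode(data):
    """Return (list of indexes, coding table) for a bytes input."""
    # Source alphabet: every 8-bit symbol is stored at the index equal to its value.
    table = {bytes([value]): value for value in range(256)}
    next_index = FIRST_FREE
    codes = []
    current = b""  # longest match found so far
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            # Known sequence: keep extending the match.
            current = candidate
        else:
            # Transmit the previously identified match and learn the new sequence.
            codes.append(table[current])
            table[candidate] = next_index
            next_index += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    codes.append(EOS)  # assumption: the encoder terminates the stream with EOS
    return codes, table

codes, table = lzw_encode(b"/wed/we/wee/web/wet/")
print(codes[:5])        # first five indexes: 47, 119, 101, 100, 258
print(table[b"/we"])    # 262, the index at which '/we' is learned
```

Running the sketch on the example input reproduces the first five transmitted indexes walked through above (47, 119, 101, 100, 258) and stores ‘/we’ at index 262.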
For example, for a symbol width of 8 bits, upon finding a match for the 24-bit (3 bytes × 8 bits per byte) string of characters ‘/we’, a 9-bit index (the address of the string of symbols ‘/we’ stored in the dictionary) is transmitted from the encoder to the decoder. This reduces the number of bits transmitted from 24 to 9. Upon receiving the 9-bit index, the decoder regenerates the string of characters ‘/we’ stored at the 9-bit index in its copy of the dictionary. If no corresponding sequence (prefix) had been found in the dictionary, indexes for the individual symbols ‘/’, ‘w’ and ‘e’ would be transmitted.
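The way the decoder rebuilds its own copy of the dictionary can be sketched as follows. This is an illustrative sketch only: the function name `lzw_decode` is hypothetical, the index list is the one obtained by applying the parsing rule described above to the example input (it is not copied from FIG. 2), the Flush code is not handled, and the branch for an index that the decoder has not yet learned is the standard LZW handling, which the text does not discuss.

```python
EOS = 256          # End Of String entry at index 256
FIRST_FREE = 258   # first index used for learned sequences

def lzw_decode(codes):
    """Return (decoded bytes, decoding table) for a list of indexes."""
    # The decoder starts from the same source alphabet as the encoder.
    table = {value: bytes([value]) for value in range(256)}
    next_index = FIRST_FREE
    output = bytearray()
    previous = None  # sequence regenerated from the previous index
    for code in codes:
        if code == EOS:
            break
        if code in table:
            entry = table[code]
        elif code == next_index and previous is not None:
            # The encoder used a sequence it learned one step earlier:
            # it must be the previous sequence plus that sequence's first symbol.
            entry = previous + previous[:1]
        else:
            raise ValueError("unexpected index %d" % code)
        output += entry
        if previous is not None:
            # Learn the sequence the encoder learned when it sent the previous index.
            table[next_index] = previous + entry[:1]
            next_index += 1
        previous = entry
    return bytes(output), table

indexes = [47, 119, 101, 100, 258, 101, 262, 263, 259, 98, 262, 116, 47, 256]
text, decoding_table = lzw_decode(indexes)
print(text)                  # b'/wed/we/wee/web/wet/'
print(decoding_table[262])   # b'/we'
```

The decoder never receives the learned sequences themselves. It reconstructs each new entry from the previous output and the first symbol of the current one, which is why its table stays one entry behind the encoder's.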
Transmission of the input sequence of twenty symbols 102 requires 160 bits (20 symbols × 8 bits per symbol). The LZW algorithm reduces the number of bits transmitted to 126 bits (14 indexes × 9 bits). As the input sequence of symbols 102 is parsed, the vocabulary stored in the dictionary 108 in the coding table 100 increases and the lengths of the sequences of symbols stored in the coding table increase. For example, index 264 represents a sequence of four symbols ‘/wee’.
The longer the sequence of symbols stored in the coding table, the better the compression, because the number of indexes transmitted is decreased. Compression can also be improved by increasing the symbol width. However, the initial source alphabet required by the LZW algorithm doubles for each bit added to the symbol width, and thus an impractical table size is required to store an initial source alphabet for symbol widths of several bytes.
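The growth of the initial source alphabet with symbol width can be made concrete with a short calculation. The helper name `initial_alphabet_entries` is hypothetical and the sketch counts table entries only, not the memory each entry occupies.

```python
def initial_alphabet_entries(symbol_width_bits):
    # One table entry is needed for every possible symbol value.
    return 2 ** symbol_width_bits

for width in (8, 9, 16, 24, 32):
    print(width, "bits ->", initial_alphabet_entries(width), "entries")
```

A one-byte symbol needs 256 entries, a two-byte symbol 65,536, and a four-byte symbol more than four billion, all of which would have to be stored before encoding starts.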
The present invention does not require initialization of a source alphabet in the dictionary. Instead, both the alphabet and vocabulary are learned and stored in the dictionary during the encoding of the input string of symbols.
Furthermore, in the prior art LZW, a large alphabet requires wide indexes. An alphabet of 256 one-byte symbols requires the indexes to start with 9 bits; an alphabet of 65,536 two-byte symbols requires an initial index of 17 bits. The width of the index directly affects the compression ratio. In the present invention, the size of the alphabet has no direct effect on the index width. Furthermore, only symbols which are used by a source are learned and stored in the dictionary. Thus, the invention is suitable for sparse sources.
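The relationship between symbol width and initial index width in the prior art scheme can be checked with a short calculation. The helper name `initial_index_width` is hypothetical, and the sketch assumes that the first learned sequence is stored immediately past a fully populated alphabet.

```python
def initial_index_width(symbol_width_bits):
    alphabet_entries = 2 ** symbol_width_bits
    # The first index past the alphabet equals alphabet_entries,
    # so an index must be wide enough to hold that value.
    return alphabet_entries.bit_length()

print(initial_index_width(8))    # 9 bits for one-byte symbols
print(initial_index_width(16))   # 17 bits for two-byte symbols
```

Under this assumption the initial index is always one bit wider than the symbol itself, so a pre-loaded alphabet sets a floor on the index width regardless of how few distinct symbols the source actually uses.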
The dictionary is searched for a symbol or sequence of symbols received in a string of symbols. Upon detecting that the symbol is not stored in the dictionary, the symbol is learned by storing the symbol in the dictionary, and the plain symbol is transmitted in a code word.
Upon detecting that a symbol or sequence of symbols is stored in the dictionary, the index at which the symbol or sequence of symbols is stored in the dictionary is transmitted in the code word. A state of a prefix field in the code word may identify the contents of the code word as either a plain symbol to be learned or an index. The dictionary index may be of variable width dependent on the number of symbols and sequences of symbols that have been learned.
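One possible reading of this scheme is sketched below. It is a sketch under several assumptions rather than the claimed method: the names `encode`, `decode` and `index_width` are hypothetical, code words are modelled as tagged tuples with the tag standing in for the prefix field, sequences are learned with an LZW-style rule that this passage does not specify, and the index width is simply the number of bits needed to address the entries learned so far.

```python
def index_width(entries):
    # Assumed rule: bits needed to address `entries` dictionary slots.
    return max(1, (entries - 1).bit_length())

def encode(symbols):
    dictionary = {}  # tuple of symbols -> index; starts empty, no pre-loaded alphabet
    words = []
    current = ()     # longest known sequence matched so far

    def emit_index(sequence):
        words.append(("index", dictionary[sequence], index_width(len(dictionary))))

    for symbol in symbols:
        if (symbol,) not in dictionary:
            if current:
                emit_index(current)
            # Unknown symbol: learn it and transmit it as a plain symbol.
            dictionary[(symbol,)] = len(dictionary)
            words.append(("plain", symbol))
            current = ()
        else:
            candidate = current + (symbol,)
            if candidate in dictionary:
                current = candidate
            else:
                emit_index(current)
                dictionary[candidate] = len(dictionary)  # learn the new sequence
                current = (symbol,)
    if current:
        emit_index(current)
    return words

def decode(words):
    entries = []     # index -> tuple of symbols, rebuilt independently
    output = []
    previous = None
    for word in words:
        if word[0] == "plain":
            entries.append((word[1],))
            output.append(word[1])
            previous = None
        else:
            index = word[1]
            # An index one past the table is a sequence the encoder just learned.
            entry = entries[index] if index < len(entries) else previous + previous[:1]
            if previous is not None:
                entries.append(previous + entry[:1])
            output.extend(entry)
            previous = entry
    return output

data = list(b"/wed/we/wee/web/wet/")
words = encode(data)
print(words[:5])
print(bytes(decode(words)))
```

In this sketch each distinct symbol is sent as a plain symbol exactly once, and every later occurrence is covered by an index whose width depends only on how many entries have been learned, not on the symbol width. A bit-level decoder would have to derive the same width from its own entry count. The sketch sidesteps that by carrying the width in the tuple.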
The symbol width may be selected dependent on the type of data. For example, for transmission of text, a single byte can readily define an individual character. However, for audio or video data, longer symbol widths by which streams of data are parsed may provide better compression. Because the symbols are not all stored initially, but are stored in the dictionary only as they occur, large symbol widths can be used without fear of initially overloading memory.