The invention generally relates to data compression, and more specifically relates to a form of entropy coding.
In a typical coding environment, input data is encoded by an encoder, transmitted over a communication channel (or simply stored), and decoded by a decoder. During encoding, an input signal is typically pre-processed, sampled, converted, compressed or otherwise manipulated into a form for transmission or storage. After transmission or storage, the decoder attempts to reconstruct the original input.
One fundamental limitation of this simple model is that a given communication channel has a certain capacity or bandwidth. Consequently, it is frequently necessary to reduce the information content of input data in order to allow it to be reliably transmitted, if at all, over the communication channel. When the input signal is composed of symbols that are equally likely to occur, an optimal encoding is to use equal length code words, where each bit of an n-bit code allows distinguishing among 2^n equally probable input possibilities. Thus, a single bit (i.e., single entry code book) can distinguish two possibilities, two bits can distinguish four possibilities, etc.
However, it is not typical for every message in an input stream to be equally probable. In practice, due to semantic or syntactic structure in the input, certain messages are more likely than others. The objective of entropy coding is to take advantage of such data correlation and minimize the average length of the code words for expected inputs in which certain inputs are highly probable. When one message is more likely than others, a shorter code book entry is used for probable messages, and longer entries for the less probable.
Over time, tremendous effort has been invested in developing lossless and lossy compression techniques for reducing the size of data to transmit or store. One popular lossless technique is known as Huffman encoding, which is a particular form of entropy encoding. Entropy encoding operates by assigning variable-length codes (e.g., code book entries) to fixed-sized blocks of input. That is, a random variable X, which is known to take on values x1 . . . xm with corresponding probability p1 . . . pm, is mapped to an entry within a set of code words {Y}. Each code word yik in {Y} of length k, in turn, consists of a finite sequence of code alphabet entries {A}={a1 . . . ad}. For simplicity, yik will be referenced simply as yi, with k implied. For digital applications, the code alphabet is likely to be a series of binary digits {0, 1}, with code lengths measured in bits. It is assumed code words are constructed so only a single scan of a compressed representation needs to be inspected in order to reconstruct appropriate output. The difficulty in entropy encoding the source signal depends on the number m of possible values X may take. For small m, there are few possible messages, and therefore the code book for the messages can be very small (e.g., only a few bits need to be used to unambiguously represent all possible messages).
If it is known that a sender will provide a source signal containing one of four messages x1 . . . x4, having respective probabilities p1 . . . p4 of 1/2, 1/4, 1/8, and 1/16, one can inefficiently represent the four possible messages with only two bits. However, this wastes bandwidth since the unequal probabilities indicate data characteristics that can be taken advantage of. A more efficient method, known as scalar Huffman encoding, requires an average of fewer than 2 bits to encode each message. A short code book entry is assigned to x1, since it is most probable, and increasingly longer codes for the less probable inputs. One possible set of code book entries is assigning "1" to represent message x1, "01" for message x2, "000" for message x3, and "001" for message x4. This gives an average code length of 1.56 bits instead of 2 for encoding the random variable X, a significant savings.
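The average code length quoted above can be checked with a short calculation. The following is only an illustrative sketch in Python; the function names are hypothetical, and the probabilities are used exactly as listed in the example (note that, as listed, they sum to 15/16 rather than 1).

```python
from fractions import Fraction

# Code book and probabilities exactly as given in the example above.
code_book = {"x1": "1", "x2": "01", "x3": "000", "x4": "001"}
probabilities = {
    "x1": Fraction(1, 2),
    "x2": Fraction(1, 4),
    "x3": Fraction(1, 8),
    "x4": Fraction(1, 16),
}

def average_code_length(codes, probs):
    """Sum of (probability x code word length) over all messages."""
    return sum(probs[m] * len(codes[m]) for m in codes)

def is_prefix_free(codes):
    """True if no code word is a prefix of a different code word."""
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

avg = average_code_length(code_book, probabilities)
print(float(avg))                 # 1.5625, i.e. about 1.56 bits
print(is_prefix_free(code_book))  # True
```

The prefix-free check reflects the requirement that a compressed representation can be decoded in a single scan: no code word can be mistaken for the beginning of another.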
The compression performance of this approach is limited by virtue of each source symbol being assigned a code word with an integer number of bits. Traditional methods of overcoming this limitation include arithmetic coding and vector Huffman coding. These methods group X's together into blocks or vectors which are themselves treated as another source symbol. That is, instead of simply encoding each individual message X, sequences of multiple X's are encoded. Just as the individual symbol encoding methods take advantage of a symbol's frequency of occurrence in an input stream, this latter approach of grouping symbols allows the encoder to take advantage of dependencies between input symbols. Coding sequences of X's produces further savings, as it is known from information theory that the entropy of a coherent series X1 . . . Xn is less than or equal to the sum of each individual X's entropy.
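The entropy relationship stated above can be illustrated numerically. The sketch below assumes a hypothetical joint distribution for two adjacent binary symbols that tend to repeat; the distribution and the helper names are illustrative only and are not taken from any particular source.

```python
import math

def entropy(dist):
    """Shannon entropy, in bits, of a mapping from outcome to probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint_dist, index):
    """Distribution of the symbol at position `index` of each pair."""
    out = {}
    for pair, p in joint_dist.items():
        out[pair[index]] = out.get(pair[index], 0.0) + p
    return out

# Hypothetical coherent source: a symbol is usually followed by itself.
joint = {("a", "a"): 0.4, ("a", "b"): 0.1, ("b", "a"): 0.1, ("b", "b"): 0.4}

h_joint = entropy(joint)
h_sum = entropy(marginal(joint, 0)) + entropy(marginal(joint, 1))
print(round(h_joint, 3), round(h_sum, 3))  # about 1.722 and 2.0
```

Under this assumed distribution, coding the pair jointly needs about 1.72 bits on average, whereas coding each symbol separately needs 2 bits.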
Theoretically, vector Huffman coding can compress a coherent source much more efficiently than scalar Huffman coding. The efficiency of vector Huffman coding is only limited by practical concerns. In order to achieve higher compression ratios, bigger vector dimensions are needed. Higher dimension, however, increases code book sizes beyond practical limits. For example, for source symbols having 30 possible values, a dimension of only 6 corresponds to a code book of 729 million entries.
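The growth in code book size can be seen with a one-line calculation. This is a minimal sketch; the function name is hypothetical, and it simply assumes one code book entry per possible vector.

```python
def vector_code_book_entries(alphabet_size, dimension):
    """Number of distinct vectors, assuming one entry per possible vector."""
    return alphabet_size ** dimension

for d in range(1, 7):
    print(d, vector_code_book_entries(30, d))
# The last line printed is: 6 729000000
```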
Note also that the above examples of entropy coding are characterized as fixed-to-variable length coding, as the source symbols have fixed length and the code words have variable length depending on the probability of the corresponding source symbol. Other methods of entropy coding have been attempted which take the opposite approach, where a variable number of source symbols are grouped together and then translated into code words having equal length. When the source is composed of independent X's, and symbol groupings achieve equal probability, such a reverse scheme is provably optimal. However, as with large dimension vector Huffman encoding, such solutions require resources exceeding those practically (if at all) available. Additionally, for sources with data coherence, such as audio or visual data streams, this variable-to-fixed length approach is not useful.
The invention relates to a method of assigning variable length codes to variable length input sequences. In particular, entropy-type codes are assigned to probable input sequences, thus allowing a particular input stream to be encoded in a compressed format. When the invention is implemented in a resource-limited environment, it may be configured so as to reduce the size of the code book required for performing encoding and decoding. For example, variable length code words might only be assigned to inputs that are highly probable, with default codes assigned to less probable sequences. The degree of probability required for assignment of a specific code to a specific input is adjusted according to a desired code book size.
The input stream to encode can be of any data type, such as numbers, characters, or a binary data stream which encodes audio, video or other types of data. For simplicity, the input stream is referenced herein as a series of symbols, where each "symbol" refers to the appropriate measurement unit for the particular input.
In particular, a code book is constructed for groupings of symbols, in which variable-sized groups of symbols are each assigned a variable length code based on probability of occurrence of symbol groupings. To construct the code book, possible groupings of symbols are generated and compared against the probability of the generated grouping occurring in exemplary input used to generate the code book. Such exemplary input is assumed to approximate arbitrary input likely to be received and require encoding. Note that the input may be used to drive generation of groupings, or all possible groupings can be tested and compared against the exemplary input; it is assumed herein that the groupings are input driven.
A data structure (or equivalent) may be used to track symbol combinations (e.g., the groupings). This structure is used to associate each new symbol with previously received symbols, so that arbitrarily long groupings of previously received symbols are tracked. One possible configuration for the data structure is a tree-type data structure, in which successive symbol groupings form new leaf nodes. These nodes may contain an entire grouping or just the single symbol extension to a previous parent node. In this latter configuration, the path from the root of the tree corresponds to a particular grouping.
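One way such a tree might be organized is sketched below, using the latter configuration in which each node stores only the single symbol that extends its parent. The class and method names are hypothetical and chosen for illustration; this is not presented as the required structure.

```python
class GroupingNode:
    """One node of a grouping tree; holds only the symbol extending its parent."""

    def __init__(self, symbol=None, parent=None):
        self.symbol = symbol    # None for the root
        self.parent = parent
        self.children = {}      # symbol -> GroupingNode
        self.count = 0          # occurrences observed in exemplary input

    def extend(self, symbol):
        """Return the child for `symbol`, creating a new leaf if needed."""
        if symbol not in self.children:
            self.children[symbol] = GroupingNode(symbol, self)
        return self.children[symbol]

    def grouping(self):
        """Recover the full grouping by walking back up to the root."""
        symbols = []
        node = self
        while node.parent is not None:
            symbols.append(node.symbol)
            node = node.parent
        return tuple(reversed(symbols))

root = GroupingNode()
node = root.extend("A").extend("B").extend("A")
node.count += 1
print(node.grouping())  # ('A', 'B', 'A')
```

Storing one symbol per node keeps each node small, at the cost of a walk to the root whenever the full grouping is needed.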
In order to start the grouping and probability assignments, preferably one or more trivial groupings are selected, such as single symbol "groups" containing symbols from the input alphabet. The probability of these initial groupings is evaluated to determine the grouping most likely to occur as input, where such probability is necessarily computed with respect to exemplary inputs. The most probable grouping is then expanded with symbols from the alphabet to form tentative groupings. The probability of these tentative groupings is then evaluated to identify the most probable tentative expansions, and the least probable groupings are combined into a single grouping.
The concept of a code book is to assign code words to symbol groupings. In a resource limited environment, the invention can be configured so that code book size is restricted. One method of doing so is to avoid assigning codes to all input sequences. Instead, only probable input sequences are stored in the code book and assigned an entropy-type code. Improbable sequences are represented in the code book as an input sequence prefix followed by a special expansion character suffix. This suffix character represents all possible input sequence extensions to the prefix. The prefix-suffix pairing represents all possible input sequences beginning with the prefix that do not have an entry in the code book. Thus, after evaluating the tentative extensions, two code book entries result, one for the most probable extension, and one to represent all other extensions (again, assuming that only the single most probable extension is kept).
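A single expand-and-collapse step of the kind described above might be sketched as follows. This is only an illustration under stated assumptions: the marker `EXPANSION`, the function names, the sample input, and the use of overlapping-window counts as a stand-in for grouping probability are all hypothetical choices made for the example.

```python
from collections import Counter

EXPANSION = "*"  # hypothetical marker standing for "any other extension"

def count_groupings(symbols, length):
    """Count every run of `length` consecutive symbols in the exemplary input."""
    return Counter(
        tuple(symbols[i:i + length]) for i in range(len(symbols) - length + 1)
    )

def expand_and_collapse(prefix, alphabet, symbols, keep=1):
    """Extend `prefix` by each alphabet symbol, keep the `keep` most frequent
    extensions, and merge the rest into one prefix + EXPANSION entry."""
    counts = count_groupings(symbols, len(prefix) + 1)
    tentative = {prefix + (s,): counts[prefix + (s,)] for s in alphabet}
    ranked = sorted(tentative, key=tentative.get, reverse=True)
    kept = {g: tentative[g] for g in ranked[:keep]}
    kept[prefix + (EXPANSION,)] = sum(tentative[g] for g in ranked[keep:])
    return kept

exemplary = list("AAABAACAAABB")  # hypothetical exemplary input
entries = expand_and_collapse(("A",), "ABC", exemplary)
print(entries)  # {('A', 'A'): 5, ('A', '*'): 3}
```

With `keep=1`, the step yields exactly two entries for the expanded prefix: the most probable extension and a single entry covering all other extensions.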
This process of expanding groupings and collapsing groupings is repeated until the data structure has been filled to a predetermined capacity. However, even if the code book has been filled to capacity, note that the input may continue to be scanned so that higher probability entries within the input displace less probable current entries. The efficiency of this coding technique is limited only by the book size selected and the typicality of the exemplary input. Once the code book has been created, each book entry is assigned an entropy-type code, i.e., a code having a length inversely proportional to the entry's probability.
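The final code assignment step might be sketched as follows, using Huffman's construction as one example of an entropy-type code. The text does not require Huffman coding specifically, so that choice is an assumption; the function name and the entry weights are likewise hypothetical.

```python
import heapq
from itertools import count

def huffman_codes(weights):
    """Build a binary prefix code from a mapping of entry -> weight."""
    if len(weights) == 1:
        return {entry: "0" for entry in weights}
    tie = count()  # tie-breaker so the heap never compares the dicts
    heap = [(w, next(tie), {entry: ""}) for entry, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, low = heapq.heappop(heap)    # least probable subtree
        w1, _, high = heapq.heappop(heap)   # next least probable subtree
        merged = {e: "0" + c for e, c in low.items()}
        merged.update({e: "1" + c for e, c in high.items()})
        heapq.heappush(heap, (w0 + w1, next(tie), merged))
    return heap[0][2]

# Hypothetical code book entries with occurrence counts as weights.
weights = {("A", "A"): 5, ("A", "*"): 3, ("B",): 2, ("C",): 1}
codes = huffman_codes(weights)
for entry, code in codes.items():
    print(entry, code)
```

In this sketch the most frequent entry receives a one-bit code and the two least frequent entries receive three-bit codes, so that more probable entries get shorter codes.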