1. Field of the Invention
The present invention is directed to entropy encoding and decoding, and it particularly concerns modeling employed for such encoding and decoding.
2. Background Information
Data compression usually includes multiple phases, where initial phases are more dependent on the specific data source. An encoder to be employed on sequences of symbols that represent values of pixels in an image, for example, may take the form of FIG. 1's encoder 10. The symbols may also include framing information, and the data may accordingly be subjected to, say, a two-dimensional discrete cosine transform 12. Some difference operation 14 may then be performed to express each value as a difference from one that came before.
This higher-level processing produces a sequence of symbols in which higher-level, domain- or source-specific regularities have been re-expressed as simple, generic (quantitative) regularities. Despite the differential operations, there is usually some skew in the resultant output's symbol distribution, and it is at this point that the entropy coding 16 may be employed to compress the data toward that sequence's entropy value.
In some cases, some measure of redundancy will then be re-introduced by, say, error-correction coding 18 in order to protect against corruption in a noisy transmission channel 20. If so, the result will be subjected to error-correction decoding 22 at the other end of the channel 20, and entropy decoding 24 will re-expand the compressed data to the form that emerged from the difference operation 14. An accumulator operation 26 will reverse the difference operation 14, and another discrete cosine transform 28 will complete the task of reconstituting the image. Again, the actual pixel-value data may be accompanied by framing, quantization, and other metadata.
As was stated above, the purpose of entropy coding is to compress the message length toward the message's entropy value, to approach optimal encoding. Optimal encoding is quantified as the message entropy, i.e. as the minimum number of bits per message averaged over all the messages from a given source. In the case of a source with a finite number of M distinct messages, all equally probable, the entropy H (per message) is log2(M) bits; i.e., no encoding can do better than sending a number between 0 and M−1 to specify the index of a given message in the full list of M messages. (In the remainder of the specification, log2 x will be expressed simply as “log x.”)
More often, though, messages' probabilities are not equal. A common entropy-coding scenario is the one in which messages are sequences of symbols selected from an alphabet A of R symbols α1, α2, . . . αR, generated with probabilities p1, p2, . . . pR that are not in general equal. The n-character message entropy is then:
$$ H = n \sum_{i=1}^{R} p_i \log(1/p_i) \qquad (1) $$
This value is less than log M if the probabilities are not equal, so some savings can result when some messages are encoded in fewer bits than others. Taking advantage of this fact is the goal of entropy coding.
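As an illustration of equation (1), the following minimal sketch evaluates the n-character message entropy for an equiprobable and for a skewed four-symbol alphabet. The function name and the example probabilities are assumptions chosen for illustration only; they do not come from the specification.

```python
import math

def message_entropy_bits(probs, n):
    """Equation (1): n * sum(p_i * log2(1/p_i)), in bits.

    probs: symbol probabilities p_1..p_R (assumed to sum to 1).
    n:     number of symbols in the message.
    """
    return n * sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical four-symbol alphabets, ten-symbol messages.
uniform_bits = message_entropy_bits([0.25, 0.25, 0.25, 0.25], 10)
skewed_bits = message_entropy_bits([0.7, 0.1, 0.1, 0.1], 10)

print(uniform_bits)  # equiprobable case: 2 bits per symbol
print(skewed_bits)   # skewed case: fewer bits than the equiprobable case
```

The gap between the two values is the saving that an entropy coder tries to capture.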
The two types of general entropy-coding algorithms that are most popular currently are Huffman coding and arithmetic coding. The Huffman algorithm assigns to each symbol αi a unique bit string whose length is approximately log(1/pi) bits, rounded up or down to a whole number of bits. The up/down rounding choice of each log(1/pi) depends on all the pi's and is made by using the Huffman tree-construction algorithm. If all the symbol probabilities happen to be of the form 1/2^k, where k is a positive integer, the resultant encoding minimizes the average message length.
The principal weakness of the Huffman code is its sub-optimality in the case of more-general probabilities (those not of the form 1/2^k). Huffman coding is especially inefficient when one symbol has a probability very close to unity and would therefore need only a tiny fraction of one bit; since no symbol's code can be shorter than a single bit, the code length can exceed the entropy by a potentially very large ratio. While there are workarounds for the worst cases (such as run-length codes and the construction of multi-character symbols in accordance with, e.g., Tunstall coding), such workarounds either fall short of optimality or otherwise require too much computation or memory as they approach the theoretical entropy.
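The behavior described above can be checked with a small sketch. It builds Huffman code lengths with a standard heap-based tree construction and compares the average code length with the per-symbol entropy, once for probabilities of the form 1/2^k and once for a highly skewed two-symbol source. The function names and the example probabilities are illustrative assumptions, not part of the specification.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the Huffman code length (in bits) of each symbol."""
    # Heap entries: (subtree probability, tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol they contain.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths

def entropy_per_symbol(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def average_length(probs, lengths):
    return sum(p * l for p, l in zip(probs, lengths))

dyadic = [0.5, 0.25, 0.125, 0.125]   # all of the form 1/2^k
skewed = [0.99, 0.01]                # one symbol close to unity

dyadic_lengths = huffman_code_lengths(dyadic)   # [1, 2, 3, 3]
skewed_lengths = huffman_code_lengths(skewed)   # [1, 1]

print(average_length(dyadic, dyadic_lengths), entropy_per_symbol(dyadic))
print(average_length(skewed, skewed_lengths), entropy_per_symbol(skewed))
```

In the first case the average length equals the entropy; in the second, each symbol still costs a whole bit although the entropy is only about 0.08 bit per symbol.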
A second important weakness of the Huffman code is that its coding overhead increases, both in speed and memory usage, when the adaptive version of the algorithm is used to track varying symbol probabilities. For sufficiently variable sources, moreover, even the adaptive Huffman algorithm cannot build up statistics accurate enough to reach coding optimality over short input-symbol spans.
In contrast to Huffman coding, arithmetic coding does not have the single-bit-per-symbol lower bound. As a theoretical, albeit impractical, method, arithmetic coding goes back to Claude Shannon's seminal 1948 work. It is based on the idea that the cumulative message probability can be used to identify the message. Despite minor improvements over the decades, its fatal drawback was the requirement that its arithmetic precision be of the size of output data, i.e., divisions and multiplications could have to handle numbers thousands of bits long. It remained a textbook footnote and an academic curiosity until 1976, when an IBM researcher (J. Rissanen, “Generalised Kraft Inequality and Arithmetic Coding,” IBM J. Res. Dev. 20, 198-203, 1976) discovered a way to make the algorithm's arithmetic work within machine precision (e.g., 16, 32, or 64 bits) practically independently of the data size, with only a minor compression-ratio penalty for the truncated precision. (That technique retained only a log(n) dependency on the data size n if absolutely optimal compression was required.) Over the following decades, the algorithm evolved rapidly, chiefly through speed improvements (which are obtained in arrangements such as that of IBM's Q-coder with only a small additional loss of compression efficiency) and faster and more-flexible adaptive variants. By the mid-1990's the arithmetic coder had replaced the Huffman algorithm as the entropy coder of choice, especially in more-demanding applications. But arithmetic coding is like Huffman coding in that its performance suffers when source statistics change rapidly.
Theoretically, the slow-adaptability problem that these two popular entropy-encoding techniques share can be overcome by a relatively obscure compression technique known as “enumerative coding.” The roots of enumerative coding extend farther into the past than modern information theory, going back to the enumerative combinatorics of the Nineteenth and early Twentieth Centuries. And using combinatorial objects for ranking, as conventional enumerative encoding does, had actually been part of common computer-programming folklore for over a decade in 1966, when Lynch (T. J. Lynch, “Sequence Time Coding for Data Compression,” Proc. IEEE vol. 54, 1490-1491, October 1966) and, independently, Davisson (L. D. Davisson, “Comments on ‘Sequence Time Coding for Data Compression,’” Proc. IEEE vol. 54, 2010, December 1966) used the same number representation and formulas to encode “sequence times” for digitized data samples, i.e., presented what is now referred to as enumerative encoding.
Conceptually, enumerative encoding lists all messages that meet a given criterion and optimally encodes one such message as an integer representing the message's index/rank within that list. In words, an example would be, “Among the 1000-bit sequences that contain precisely forty-one ones (and the rest zeros), the sequence that this code represents is the one with whose pattern we associate index 371.” That is, the example encoding includes both an identification of the source sequence's symbol population (41 ones out of 1000 in the example) and an index (in that case, 371) representing the specific source sequence among all those that have the same symbol population.
Since the number of patterns for a given population can be quite large, it would not be practical to arrive at a significant-length sequence's pattern index by storing associations between indexes and patterns in a look-up table. Instead, one would ordinarily arrive at any given source pattern's index algorithmically, and the index-determining algorithm would typically be based on the value that the sequence represents. In accordance with one such indexing approach, for example, the prior example may alternatively be expressed in words as, “The sequence that this code represents is the 371st-lowest-valued 1000-bit sequence that contains precisely 41 ones,” and it would therefore be possible to determine the index algorithmically.
Consider the seven-bit sequence 1001010, for example, i.e., one of the sequences that has three ones out of seven bits. The task is to determine an index that uniquely specifies this sequence from among all that have the same population, i.e., from among all seven-bit sequences that have three ones and four zeros. In accordance with an indexing scheme in which indexes increase with the sequence's value and the more-significant bits are those to the left, the index can be computed by considering each one-valued bit in turn as follows. Since the example sequence's first bit is a one, we know that its value exceeds that of all same-population sequences in which all three ones are in the remaining six bits, so the index is at least as large as the number of combinations of three items chosen from six, i.e., 6!/(3!*3!), and we start out with that value. Out of all same-population sequences that similarly start with a one bit, the fact that the example sequence has a one in the fourth bit position indicates that its index exceeds those in which both remaining ones are somewhere in the last three bit positions, so the index is at least as large as the result of adding the number of such sequences to the just-mentioned number in which all three are in the last six positions. By following that reasoning, the index I can be determined in accordance with:
$$ I = \binom{6}{3} + \binom{3}{2} + \binom{1}{1} = 20 + 3 + 1 = 24. $$
I.e., the index can be determined by summing combinatorially determined add-on values.
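The index computation just described can be sketched as follows. This is a minimal illustration under the stated convention (indexes increase with the sequence's value, and the most-significant bit is leftmost); the function names are hypothetical, and the decoder is included only to show that the index identifies the sequence once the length and the number of ones are known.

```python
from math import comb

def enumerative_index(bits):
    """Sum one binomial add-on value per one-valued bit, left to right."""
    n = len(bits)
    ones_left = bits.count("1")      # ones not yet accounted for
    index = 0
    for pos, bit in enumerate(bits):
        if bit == "1":
            remaining = n - pos - 1  # bit positions to the right of this one
            # Count the same-population sequences that place all of the
            # still-unaccounted ones within the remaining positions.
            index += comb(remaining, ones_left)
            ones_left -= 1
    return index

def enumerative_decode(index, n, k):
    """Rebuild the n-bit sequence with k ones from its index."""
    bits = []
    for pos in range(n):
        remaining = n - pos - 1
        addon = comb(remaining, k)
        if k > 0 and index >= addon:
            bits.append("1")
            index -= addon
            k -= 1
        else:
            bits.append("0")
    return "".join(bits)

example_index = enumerative_index("1001010")
print(example_index)                          # 20 + 3 + 1 = 24
print(enumerative_decode(example_index, 7, 3))  # 1001010
```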
Now, that index requires five bits, and it would take three bits to specify the population value, so the resultant eight bits exceed the length of the (seven-bit) source sequence. But it is apparent that the comparison of the source-sequence length with the index length would be more favorable for a more-skewed population in a longer sequence. And the number of bits required for the “side information” that specifies the population increases only as the logarithm of the sequence length. Over a group of such sequences, moreover, that side information can itself be compressed. So the resultant code length approaches source entropy as the source-sequence length becomes large.
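To make the comparison concrete for a longer, more-skewed block, the sketch below counts the bits needed for the earlier 1000-bit, 41-ones example. It assumes a fixed-length index wide enough to address every same-population sequence and a fixed-length count field wide enough for any count from 0 to n; both choices, like the function names, are illustrative assumptions rather than requirements of the specification.

```python
from math import comb, log2

def enumerative_code_bits(n, k):
    """Return (index_bits, count_bits) for an n-bit block with k ones."""
    index_bits = (comb(n, k) - 1).bit_length()  # addresses indexes 0 .. C(n,k)-1
    count_bits = n.bit_length()                 # addresses counts 0 .. n
    return index_bits, count_bits

def empirical_entropy_bits(n, k):
    """n * H(k/n) in bits, taking the block's own ones-fraction as p."""
    if k in (0, n):
        return 0.0
    p = k / n
    return n * (p * log2(1 / p) + (1 - p) * log2(1 / (1 - p)))

index_bits, count_bits = enumerative_code_bits(1000, 41)
total_bits = index_bits + count_bits
entropy_bits = empirical_entropy_bits(1000, 41)

print(index_bits, count_bits, total_bits)
print(round(entropy_bits, 1))
```

Under these assumptions the index plus side information comes to far fewer than the 1000 source bits, and the index alone does not exceed the block's empirical entropy by more than one bit.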
The combinatorial values used as “add-on” terms in the index calculation can be expensive to compute, of course, but in practice they would usually be pre-computed once and then simply retrieved from a look-up table. And it is here that enumerative coding's theoretical advantage over, say, arithmetic coding is apparent. Just as combinatorial values are successively added to arrive at the conventional enumerative code, successive “weight” values are added together to produce an arithmetic code. And arithmetic coding's weights can be pre-computed and retrieved from a look-up table, as enumerative coding's combinatorial values can. In arithmetic coding, though, the values of such add-on terms are based on an assumption of the overall sequence's statistics, and the arithmetic code's length will approach the source sequence's theoretical entropy value only if statistics of the source sequence to be encoded are close to those assumed in computing the add-on terms. To the extent that source statistics vary, the look-up table's contents have to be recomputed if near-optimal compression is to be achieved, and this imposes a heavy computational burden if the source statistics vary rapidly. In contrast, enumerative coding's table-value computation is not based on any assumption about the sequence's overall statistics, so it can approach theoretical entropy without the computation expense of adapting those values to expected statistics.
Enumerative coding has nonetheless enjoyed little use as a practical tool. The reason why can be appreciated by again considering the example calculation above. The sequence length in that example was only seven, but the lengths required to make encoding useful are usually great enough to occupy many machine words. For such sequences, the partial sums in the calculation can potentially be that long, too. The calculation's addition steps therefore tend to involve expensive multiple-word-resolution additions. Also, the table sizes grow as N^3, where N is the maximum block size (in bits) to be encoded, yet large block sizes are preferable, because using smaller block sizes increases the expense of sending the population value.
Arithmetic coding once suffered from the same drawback, but the Rissanen approach mentioned above solved the problem. Basically, Rissanen employed add-on values that could be expressed as limited-precision floating-point numbers. For example, the resolution might be so limited that all of each add-on value's bits are zeros except the most-significant ones and that the length of the “mantissa” that contains all of the ones is short enough to fit in, say, half a machine word. Even if such an add-on value's fixed-point expression would be very long and that value is being added to a partial sum that potentially is nearly as long, the resolution of the machine operation used to implement that addition can be small, since the change if any in the partial sum occurs only in a few most-significant bits. Rissanen recognized that add-on values meeting such resolution limitations could result in a decodable output if the total of the symbol probabilities assumed in computing them is less than unity by a great enough difference and the values thus computed are rounded up to meet the resolution criterion. (The difference from unity required of the symbol-probability total depends on the desired resolution limit.)
Still, the best-compression settings of modern implementations require multiplications on the encoder and divisions on the decoder for each processed symbol, so they are slower than a static Huffman coder, especially on the decoder side. (The particular degree of the speed penalty depends on the processor.) By some evaluations, moreover, the arithmetic coder compresses even less effectively than the Huffman coder when its probability tables fail to keep up with the source probabilities or otherwise do not match them. So it would be desirable to find some type of limited-resolution version of enumerative encoding, i.e., one in which the add-on terms added together to arrive at the enumerative-encoding index could be rounded in such a fashion as to eliminate the need for high-resolution additions and to be expressible in floating-point formats so as to limit table-entry sizes.
For a long time, though, this did not appear possible; it turns out that the indexes that result from simply rounding the conventional path-count values to such low-resolution values are not always unique. But U.S. patent application Ser. No. 11/015,894, which was filed on Dec. 17, 2004, by Ratko V. Tomic for Fast, Practically Optimal Entropy Coding and is hereby incorporated by reference (“my previous application”), discloses how to select the add-on values in a way that both satisfies the short-mantissa requirement and produces a decodable result, and still achieves nearly the same degree of compression that the conventional binomial values do. Instead of using the “top-down” approach of rounding up the closed-form binomial-coefficient formula in accordance with N(n,k)=⌈C(n,k)⌉_R≡⌈n!/[(n−k)!k!]⌉_R, where R is some conveniently small integer and ⌈x⌉_R means rounding x up to a floating-point value whose mantissa can be expressed in R bits, I proposed to use a “bottom-up” approach, in which I modify the recursive binomial-coefficient formula C(n,k)=C(n−1,k)+C(n−1,k−1) to arrive at N(n,k)=⌈N(n−1,k)+N(n−1,k−1)⌉_R, where N(1,k)=C(1,k).
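The bottom-up recurrence can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the referenced application: add-on values are held as Python integers whose significant bits fit in R bits, the table is started from a row n = 0 (which yields N(1,k) = C(1,k)), and N(n,0) is fixed at 1. All function names are hypothetical.

```python
from itertools import combinations

def round_up(x, r):
    """Smallest integer >= x whose significant bits span at most r bits."""
    shift = max(x.bit_length() - r, 0)
    return ((x + (1 << shift) - 1) >> shift) << shift

def quantized_table(n_max, r):
    """table[n][k] = round_up(table[n-1][k] + table[n-1][k-1], r)."""
    table = [[1] + [0] * n_max]          # row n = 0: only k = 0 is nonzero
    for n in range(1, n_max + 1):
        prev = table[-1]
        row = [1] + [round_up(prev[k] + prev[k - 1], r)
                     for k in range(1, n_max + 1)]
        table.append(row)
    return table

def index_with_table(bits, table):
    """Enumerative index using table add-on values (leftmost bit first)."""
    ones_left = bits.count("1")
    index = 0
    for pos, bit in enumerate(bits):
        if bit == "1":
            index += table[len(bits) - pos - 1][ones_left]
            ones_left -= 1
    return index

# With a mantissa wide enough that no rounding occurs, the table holds the
# ordinary binomial coefficients and reproduces the earlier example index.
exact = quantized_table(7, 64)
print(exact[7][3], index_with_table("1001010", exact))   # 35 24

# With a 3-bit mantissa, every 12-bit sequence containing five ones still
# receives a distinct index.
coarse = quantized_table(12, 3)
sequences = ["".join("1" if i in ones else "0" for i in range(12))
             for ones in combinations(range(12), 5)]
indexes = {index_with_table(s, coarse) for s in sequences}
print(len(sequences), len(indexes))
```

Because each rounded entry is at least the sum of the two entries it is built from, the indexes of sequences whose leading one lies in different positions cannot collide, which is why the distinctness check succeeds in this sketch; the price is that the largest index can need slightly more bits than the exact binomial would.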
The amount of computational effort required by that entropy-coding approach for a given source message received by the composite encoder depends on the “modeling” that precedes it. The initial encoding steps shown above can be thought of as high-level types of modeling, whose applicability is limited to a narrow range of applications. There are also lower-level modeling operations, of which the drawing shows no examples, whose ranges of applicability tend to be broader.