Examples of entropy encoded data streams are compressed video data streams including that of the new “H264” video encoding standard (“ITU-T Recommendation H.264: Advanced video coding for generic audiovisual services”). Wiegand et al. provide a somewhat shorter summary of this specification in “An Overview of the H.264/AVC Video Coding Standard” (IEEE Trans. on Circuits and Systems for Video Technology, July 2003).
Most video compression schemes include some form of entropy encoding whereby ‘raw’ data symbols are replaced by a representation that reflects their probability of occurrence so that frequently occurring symbols are encoded with a representation with fewer bits, whilst infrequent symbols are encoded with longer representations. Shannon's theory states that the optimal number of bits for a symbol with probability, p, is −log(p)/log(2). For example, a symbol with a 1 in 3 chance of occurring is optimally represented by 1.585 bits.
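As a minimal sketch of the formula above (the function name optimal_bits is hypothetical and chosen only for illustration), the ideal code length can be computed directly from a symbol's probability:

```python
from math import log2


def optimal_bits(p):
    """Ideal code length, in bits, for a symbol of probability p (Shannon)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the range (0, 1]")
    # -log(p)/log(2) is the same quantity as -log2(p)
    return -log2(p)


if __name__ == "__main__":
    # A symbol with a 1 in 3 chance of occurring
    print(round(optimal_bits(1 / 3), 3))
```

A symbol with probability one half costs exactly one bit under this measure, and rarer symbols cost proportionally more.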
Many of the encoding schemes use Variable Length Coding (VLC) systems, similar to Huffman's, to do entropy encoding. Such schemes are generally very easy to encode and decode. However, because each code is always an integer number of bits in length, they generally fail to achieve the optimum described by Shannon.
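The gap between an integer-length code and Shannon's optimum can be illustrated with a small sketch. The three-symbol distribution below is an assumption invented for the example, not data from any video stream, and the function names are hypothetical. The code builds Huffman code lengths and compares the resulting average length with the entropy of the source.

```python
import heapq
from math import log2


def huffman_lengths(probs):
    """Return {symbol: code length in bits} for a dict of probabilities."""
    if len(probs) == 1:
        return {next(iter(probs)): 1}
    # Heap entries: (probability, tie-breaker, symbols in this subtree)
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol beneath them
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths


def entropy(probs):
    """Average of the Shannon optimum over all symbols, in bits."""
    return sum(-p * log2(p) for p in probs.values() if p > 0)


def average_length(probs, lengths):
    """Expected number of bits per symbol for the given code lengths."""
    return sum(p * lengths[s] for s, p in probs.items())


if __name__ == "__main__":
    probs = {"a": 0.9, "b": 0.05, "c": 0.05}  # illustrative distribution
    lengths = huffman_lengths(probs)
    print(lengths, average_length(probs, lengths), entropy(probs))
```

With this skewed distribution the most common symbol still costs a whole bit, so the average code length is roughly double the entropy. The penalty is largest when one symbol dominates.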
More recent alternatives to VLC schemes include Arithmetic Encoding (an introduction to which can be found in “Numerical Recipes in C”, Press et al ISBN 0-521-43108-5), and the virtually equivalent Range Encoding. Both of these encoding schemes are more advanced entropy encoding schemes which do get very close to Shannon's optimum by, in effect, allowing symbols to be represented with fractional numbers of bits. One drawback, however, is that they are much more complex to encode and decode than a VLC approach.
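The idea of fractional bits per symbol can be seen in a toy arithmetic coder. The sketch below is an illustration only: it uses exact rational arithmetic and a fixed probability, whereas practical coders (including CABAC) use finite-precision integer ranges and adaptive statistics. All function names are hypothetical, and the decoder is assumed to know the number of symbols in advance.

```python
from fractions import Fraction
from math import ceil


def encode_interval(symbols, p0):
    """Narrow [low, low + width) once per binary symbol; p0 = P(symbol == 0)."""
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        if s == 0:
            width = width * p0
        else:
            low = low + width * p0
            width = width * (1 - p0)
    return low, width


def shortest_code(low, width):
    """Find the smallest n such that some k / 2**n lies inside the interval."""
    n = 0
    while True:
        k = ceil(low * 2 ** n)
        if Fraction(k, 2 ** n) < low + width:
            return k, n
        n += 1


def decode(k, n, p0, count):
    """Recover `count` symbols from the n-bit code value k / 2**n."""
    value = Fraction(k, 2 ** n)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(count):
        split = low + width * p0
        if value < split:
            out.append(0)
            width = width * p0
        else:
            out.append(1)
            low = split
            width = width * (1 - p0)
    return out


if __name__ == "__main__":
    message = [0] * 9 + [1]           # nine likely symbols, one unlikely
    p0 = Fraction(9, 10)
    k, n = shortest_code(*encode_interval(message, p0))
    print(n, decode(k, n, p0, len(message)) == message)
```

In this example ten symbols fit in three bits, well under one bit per symbol, which no integer-length code can achieve. Note also that the decoder must resolve each symbol before it can narrow the interval for the next one, which hints at why decoding is inherently sequential.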
The new H264 standard, in particular its CABAC mode (see “Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard”, Marpe et al., IEEE Transactions on Circuits and Systems for Video Technology, July 2003), uses a form of arithmetic encoding. The H264 CABAC scheme makes the process even more challenging by:
a) Using the arithmetic encoder/decoder to encode only two symbols, i.e. either 0 or 1, rather than sets of values. It should be noted, however, that an arithmetic decoder that can decode from a choice of more than two symbols is also more expensive to construct. Typically, an encoder or decoder that directly handles N symbols will have a cost of O(N), whereas processing N symbols using a two symbol decoder (over multiple steps) will cost O(log(N)).
b) Updating the statistics (known in H264 as the context) used to perform the encoding/decoding after every encoded/decoded bit.
c) Maintaining numerous contexts which can be selected on a bit by bit basis.
d) Assembling the decoded arithmetic bits into symbols by using a “debinarisation” process which can include several steps.
In the case of the Inverse Discrete Cosine Transform (IDCT) parameters, this includes decoding a significance map, decoding sign bits for the non zero symbols, decoding unary string data for non-zero symbols, and decoding exponential Golomb data for the large non zero symbols. Each of these decoding steps selects the different contexts which are fed back to control the arithmetic decoder.
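For readers unfamiliar with the unary and exponential Golomb codes named above, the following is a generic sketch of both. It uses the common textbook bit conventions and hypothetical function names. It is not the exact CABAC binarisation, which combines these codes with a significance map, sign bits and context selection, and which may use different bit polarities.

```python
def unary_encode(n):
    """n ones followed by a terminating zero (n >= 0)."""
    return "1" * n + "0"


def unary_decode(bits, pos=0):
    """Return (value, position after the codeword)."""
    n = 0
    while bits[pos] == "1":
        n += 1
        pos += 1
    return n, pos + 1


def exp_golomb_encode(n):
    """Order-0 exp-Golomb: (len - 1) zeros, then n + 1 written in binary."""
    binary = bin(n + 1)[2:]
    return "0" * (len(binary) - 1) + binary


def exp_golomb_decode(bits, pos=0):
    """Return (value, position after the codeword)."""
    zeros = 0
    while bits[pos] == "0":
        zeros += 1
        pos += 1
    value = int(bits[pos:pos + zeros + 1], 2) - 1
    return value, pos + zeros + 1


if __name__ == "__main__":
    for n in (0, 1, 5, 20):
        print(n, unary_encode(n), exp_golomb_encode(n))
```

The unary code grows linearly with the value while the exp-Golomb code grows logarithmically, which is why schemes of this kind use unary for small values and switch to exp-Golomb for large ones. In either case a decoder that resolves one bit per clock needs as many clocks as the codeword has bits.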
By following these steps a very high compression ratio can be achieved. There is, however, a cost penalty: it becomes very difficult, if not impossible, for hardware to decode more than one bit per clock (for a clock rate, say, in the range of 100~200 MHz). Although some progress has been made in parallelising arithmetic encoding (assuming the context remains constant), none appears to have been made for the decoding (see “Arithmetic Encoding in Parallel”, Supol and Melichar). If each frame consisted of multiple ‘slices’ (i.e. portions of a frame), then it would be possible to decode each slice in parallel, but as there is no guarantee that the incoming video stream will have more than one slice per frame, this is not a suitable approach.
Furthermore, as stated in d) above, source symbols in the video stream generally consist of multiple bits (for example the source values may be signed 16-bit values) and so CABAC also employs VLC encoding schemes, such as unary and Golomb coding. When encoding the video data, the system must first convert each raw symbol value to the VLC binary encoded version (known in H264 as ‘binarisation’), which in turn is then compressed by the binary arithmetic encoder. The decoder essentially runs these steps in reverse to obtain the original data stream. This means that, in the worst case, even if the arithmetic decoder can decode one bit per clock, it may take many clock cycles to obtain the final symbol. For example, when processing the IDCT data, it takes of the order of 30 cycles to decode a symbol value of “64” using the CABAC process, assuming an arithmetic decode rate of one bit per clock, whilst a simple “+1” value would take four clocks to decode with CABAC. The decode costs associated with the larger values are offset by their very low probability and by the very high probability of zero values, which are decoded at much faster rates. Some examples of the number of bits needed to encode values for the IDCT coefficients (including the significance map), using the binarisation process, are given in the following table:
Symbol Value    Number of bits in “Binarised” representation
0               1 (or 0 when implied by the significance map)
+/−1            4
+/−2            5
+/−5            8
+/−10           13
+/−20           23
+/−50           29
+/−100          31
+/−200          33
+/−500          35
+/−1000         37
+/−2000         39
+/−5000         43
+/−10000        45
+/−20000        47
+/−50000        49
+/−65535        49
The decode timing is further complicated, as mentioned in d), by the fact that the context data used to control the arithmetic decoder is frequently chosen on a bit-by-bit basis, dependent on the value of the previously decoded bit. This means that if it is necessary to decode one bit per clock, then either the arithmetic decoding of a bit and the de-binarisation decision step based on the value of that bit must run in the same clock period, or some kind of speculative, parallel decoder must be constructed. The first may not be possible to achieve with today's technologies and the second is costly to implement. Therefore, it is likely that an H264 arithmetic decoder hardware solution running at 100~200 MHz will not achieve one bit per clock, which further compounds the timing problems.
Although a compressed video stream will have a well specified average data rate, for example 50 Mbit/s for a high definition video (i.e. 1920×1080 pixels @ 30 fps), the instantaneous rate of data in that video stream can vary enormously from frame to frame. Some frames, known as Intra-encoded frames (or I-frames), generally have a large proportion of the data, whilst Predicted-frames (P-frames) and Bidirectional-frames (B-frames) require far fewer bits as they borrow data from previously decoded frames. For example, in a given sample video stream encoded at 40 Mbit/s, each I-frame typically required around 3 Mbits, with P- and B-frames being around one half and one third that size respectively. It is not difficult to produce a hardware CABAC decoder (say, running at 100~200 MHz) that can decode a video stream that uses the maximum rate of 50 Mbit/s, but a conventional decoder will not produce that data at a steady per-frame rate.
If the processing costs of each video frame purely depended on the amount of entropy encoded data, then there would be no problem to solve. There are, however, parts of the processing which are more fixed in their processing rate. For example, the IDCT calculation, motion compensation, and de-ringing/de-blocking units will generally take time that depends on the number of pixels processed. Since these units are more constant in their processing times, we have a situation where, in effect, there is a relatively fixed-rate consumer process, i.e. the back-end video processor, being fed by a highly variable source producer, the entropy decoder.
For example, a hardware solution might perform the IDCT processing for an H264 macro-block consisting of 384 pixels, in ~400 clock cycles. The parameters for this data, as supplied by the entropy decoder, may have anywhere from 0 to 384 symbols depending on the complexity of the image in the macro-block region, and consequently take anywhere from zero to many thousands of clock cycles to be entropy decoded.
This leads to situations where, very frequently, one unit is stalled by the other due to the mismatch of processing rates. This, in turn, results in the overall processing rate, in terms of frames per second, dropping to the instantaneous minimum of the two, which could mean the system would fail to meet the required decoding frame rate. Looking again at the IDCT example above, if the entropy decoder could decode each symbol at the rate of one symbol per clock, there would be no problem.
As stated, it is impractical (if not impossible) to make the entropy decoder run faster, and making the consumer units run faster may be exceedingly expensive. The obvious and time-honoured tradition in these situations is to introduce a rate-smoothing FIFO between producer and consumer. This common practice is widely used in devices ranging from 3D graphics processing chips to hard-disk controllers.
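The effect of such a FIFO can be sketched with a simple timing model. Everything in it is an assumption made for illustration: the function name total_cycles, the alternating producer timings, and the rule that the producer cannot start an item until a FIFO slot is guaranteed to be free. The consumer time of 400 cycles echoes the IDCT example given earlier.

```python
def total_cycles(producer_cycles, consumer_cycles, fifo_depth):
    """Cycles until the consumer finishes the last item.

    producer_cycles: per-item decode times for the variable-rate producer
    consumer_cycles: fixed per-item time for the consumer
    fifo_depth:      number of items the FIFO can hold (>= 1)
    """
    n = len(producer_cycles)
    prod_done = [0] * n
    cons_start = [0] * n
    cons_done = [0] * n
    for i in range(n):
        prev_prod_done = prod_done[i - 1] if i > 0 else 0
        # A slot is free once the consumer has taken item i - fifo_depth
        slot_free = cons_start[i - fifo_depth] if i >= fifo_depth else 0
        prod_start = max(prev_prod_done, slot_free)
        prod_done[i] = prod_start + producer_cycles[i]
        prev_cons_done = cons_done[i - 1] if i > 0 else 0
        # The consumer waits for both the data and its own previous item
        cons_start[i] = max(prod_done[i], prev_cons_done)
        cons_done[i] = cons_start[i] + consumer_cycles
    return cons_done[-1] if n else 0


if __name__ == "__main__":
    # Producer alternates between trivial and expensive items (mean 400)
    producer = [0, 800, 0, 800, 0, 800]
    for depth in (1, 2, 8):
        print(depth, total_cycles(producer, 400, depth))
```

In this toy case the producer and consumer have the same average rate, yet a single-entry FIFO leaves each unit stalled by the other for part of the time. A deeper FIFO lets the producer run ahead during cheap items and removes most of those stalls.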
Although a simple FIFO will ‘solve’ the problem, there is the inconvenience that it may need to contain many frames' worth of data in order to effectively smooth the data rate. This would typically force the buffer to be in external memory which implies not only that it ties up a large amount of RAM but also that it consumes considerable bandwidth for writing and reading of the buffer. Earlier video standards have used Huffman/VLC encoding, which can easily be encoded and decoded at a symbol-per-clock rate, and so a fairly obvious choice is to use such an encoding scheme to recompress the data on input to the FIFO and then decompress the compressed data again on exit. In fact, in a video encoding/decoding chip that handles multiple standards, such VLC hardware will exist anyway. This rather straightforward approach appears to have also been described by Linzer and Leung (U.S. Pat. No. 6,927,710).
A related scheme is described by Schumann (US patent application 20040260739). In this method, rather than immediately ‘debinarising’ the data, the ‘binarised’ bits that are output from the CABAC's arithmetic decoder are fed into the FIFO and then a (second) de-binarisation unit is placed at the other end of the FIFO. (Note that a partial de-binarisation unit is still needed adjacent to the arithmetic decoder unit in order to select the correct contexts). The binarised data (i.e. the data still encoded in, for example, significance map+unary+Golomb form) is still a reasonably well compressed format. Unfortunately, a likely problem with this method is that it may require a very large window of bits and complicated hardware in order to allow each symbol to be decompressed in a single clock cycle. This can be seen from the IDCT binarisation cost table given previously.
Finally, also of interest is an examination of the typical values that are decoded from the CABAC unit in an H264 video stream. Of particular interest are the IDCT coefficients since these typically make up the vast bulk of the decoded data. The following table gives the values and probabilities of IDCT values as sampled from a typical H264, high definition, video stream. Listed alongside the probabilities are the optimum storage costs.
Residual Value    Probability (%)    Optimum # bits
0                 72.87%             0.46
+1                9.26%              3.43
−1                9.56%              3.39
+2                2.04%              5.61
−2                2.14%              5.55
+3                0.87%              6.85
−3                0.90%              6.79
Others            <0.5%              >7.7
With each block of IDCT values, many of the higher frequency terms are very likely to be zero. If the contiguous high frequency zeros are removed (by some, yet unspecified, means), the table becomes:
Residual Value    Probability (%)    Optimum # bits
0                 51.88%             0.95
+1                16.42%             2.61
−1                16.96%             2.56
+2                3.62%              4.79
−2                3.79%              4.72
+3                1.54%              6.02
−3                1.60%              5.96
Others            <0.8%
If the remaining zeros can be implied by some means (for example, via the use of Run-Level or perhaps significance map encoding), then the probabilities of the non-zero values become:
Residual Value    Probability (%)    Optimum # bits
+1                34.12%             1.6
−1                35.25%             1.5
+2                7.53%              3.7
−2                7.88%              3.7
+3                3.20%              5.0
−3                3.33%              4.9
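One way in which zeros can be implied rather than stored is Run-Level encoding, sketched below. The function names and the example block are hypothetical, and the sketch assumes the decoder knows the block length so that trailing zeros need not be stored at all.

```python
def run_level_encode(coeffs):
    """Convert coefficients to (run of preceding zeros, non-zero level) pairs."""
    pairs = []
    run = 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    # Any trailing zeros are dropped; they are implied by the block length
    return pairs


def run_level_decode(pairs, block_len):
    """Rebuild the coefficient list, padding with the implied trailing zeros."""
    out = []
    for run, level in pairs:
        out.extend([0] * run)
        out.append(level)
    out.extend([0] * (block_len - len(out)))
    return out


if __name__ == "__main__":
    block = [5, 0, 0, -1, 1, 0, 0, 0]
    pairs = run_level_encode(block)
    print(pairs, run_level_decode(pairs, len(block)) == block)
```

Only the non-zero levels remain as explicit values after this step, which is the situation the table above describes.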
If these very common values are encoded using a simple VLC entropy scheme, such as Huffman, which must use an integer number of bits to represent each symbol, then it should be apparent that the result will be far from optimal, since the best fit for the very common +/−1 values will be two bits, representing approximately a 25% increase in storage costs relative to the ideal.
As stated, arithmetic encoding is superior to the VLC/Huffman approaches in that it can represent symbols with fractional numbers of bits, leading to potentially greater levels of compression. The problem is that an adaptive scheme can be rather sequential to decode, which makes it difficult to achieve a peak rate of one entire (multi-bit) symbol per clock.
A rate smoothing FIFO can be introduced between the producer and consumer of the decoded data but, if that is represented in a naïve form, it can require a very large FIFO. A compression scheme can be applied to the data going in to the FIFO that will reduce the size required for the FIFO but it is desirable to increase the efficiency of the known compression schemes to further reduce the required size of the FIFO.