Entropy coding plays a key role in information theory. By definition, the entropy H(X) is the minimum rate at which a discrete source X with alphabet {x1, x2, ..., xN} can be losslessly encoded. The goal of entropy coding is then to define a code C that encodes the source alphabet at a rate approximately equal to the entropy. In principle, this is possible by using Variable Length Codes (VLC), like the well-known Huffman code. One important constraint of VLCs is the requirement of integer bit allocation, which means that each symbol is coded with an integer number of bits.
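The gap between the entropy and an integer-bit code can be illustrated with a minimal sketch. The three-symbol distribution and the helper names `entropy` and `huffman_lengths` below are assumptions chosen for illustration and are not taken from any cited reference.

```python
import heapq
import math


def entropy(probs):
    """Shannon entropy H(X) in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_lengths(probs):
    """Return the Huffman codeword length (in bits) of each symbol."""
    # Heap entries: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie_breaker = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (p1 + p2, tie_breaker, s1 + s2))
        tie_breaker += 1
    return lengths


# Hypothetical, strongly skewed source with three symbols.
probs = [0.9, 0.05, 0.05]
lengths = huffman_lengths(probs)
avg_len = sum(p * n for p, n in zip(probs, lengths))
h = entropy(probs)
```

For this assumed source, the Huffman code spends 1.1 bits per symbol on average while the entropy is roughly 0.57 bits per symbol, because even the most probable symbol must receive at least one whole bit.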
This constraint is overcome by Arithmetic Coding, a kind of entropy coding which assigns a code to a whole message, rather than to source symbols, so that each symbol of the message is actually encoded with a fractional number of bits, thus achieving a final rate which is closer to the entropy.
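The idea of coding a whole message rather than individual symbols can be sketched as an interval that is narrowed once per symbol. The sketch below is a toy model using floating-point numbers, whereas practical arithmetic coders use integer arithmetic with renormalization. The two-symbol source, the message, and the function name `narrow_interval` are assumptions made for illustration.

```python
import math


def narrow_interval(message, probs):
    """Narrow [low, low + width) once per symbol; return the final interval."""
    # Cumulative probability below each symbol, in a fixed symbol order.
    cum, acc = {}, 0.0
    for sym, p in probs.items():
        cum[sym] = acc
        acc += p
    low, width = 0.0, 1.0
    for sym in message:
        low += width * cum[sym]
        width *= probs[sym]
    return low, width


# Hypothetical source and message.
probs = {"a": 0.9, "b": 0.1}
message = "aaaaaaaaab"
low, width = narrow_interval(message, probs)
# ceil(-log2(width)) + 1 bits are enough to name a value inside the interval.
total_bits = math.ceil(-math.log2(width)) + 1
bits_per_symbol = total_bits / len(message)
```

Under these assumptions the ten-symbol message costs 6 bits in total, i.e., 0.6 bits per symbol, whereas any code with integer bit allocation would spend at least 10 bits.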
Context-Based Adaptive Binary Arithmetic Coding (CABAC) is one of the two entropy coding methods of the ITU-T/ISO/IEC standard for video coding, H.264/AVC (cf. ITU-T and ISO/IEC JTC 1, “Advanced Video Coding for Generic Audio-Visual Services”, ITU-T Rec. H.264 and ISO/IEC 14496-10 (MPEG-4 AVC), Version 11, March 2009, which is incorporated by reference). The CABAC method utilizes a context-sensitive, backward-adaptation mechanism for calculating the probabilities of the input symbols. Context modeling is applied to the binary sequences obtained by binarizing the syntax elements of the video data, such as block types, motion vectors, and quantized coefficients, using predefined mechanisms. Each bit is then coded with either an adaptive or a fixed probability model. Context values are used for appropriately adapting the probability models.
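Backward adaptation per context can be sketched as follows. The count-based estimator below is an illustrative stand-in and is not the table-driven probability state machine specified by H.264. The class name `AdaptiveBinModel`, the function `ideal_cost_bits`, and the test stream are assumptions.

```python
import math


class AdaptiveBinModel:
    """Count-based probability estimate for one context (illustrative stand-in)."""

    def __init__(self):
        self.counts = [1, 1]  # start from a uniform estimate: P(0) = P(1) = 0.5

    def p_one(self):
        return self.counts[1] / (self.counts[0] + self.counts[1])

    def update(self, bin_value):
        # Backward adaptation: only already-coded bins influence the estimate,
        # so encoder and decoder can keep identical models without side information.
        self.counts[bin_value] += 1


def ideal_cost_bits(bins_with_ctx):
    """Sum of -log2(p) over (context_index, bin) pairs, adapting per context."""
    models = {}
    total = 0.0
    for ctx_idx, bin_value in bins_with_ctx:
        model = models.setdefault(ctx_idx, AdaptiveBinModel())
        p = model.p_one() if bin_value == 1 else 1.0 - model.p_one()
        total += -math.log2(p)
        model.update(bin_value)
    return total, models


# Hypothetical stream: context 0 only sees ones, context 1 only sees zeros.
stream = [(0, 1), (1, 0)] * 20
cost, models = ideal_cost_bits(stream)
```

Because each context tracks its own statistics, the 40 bins of this assumed stream have an ideal cost of about 8.8 bits. A single shared model would see an even mix of zeros and ones and could not do much better than one bit per bin.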
FIG. 1 is a block diagram of a conventional H.264 decoder and shows the arrangement of the CABAC decoder within the H.264 decoder.
The input bit stream is received by the entropy decoder 110 in order to decode header information, motion vectors, and transform coefficients. The transform coefficients are reordered (block 120), and subjected to inverse quantization and inverse transform processing in blocks 130 and 135. The result is the prediction error signal, to which either one of the inter-prediction signal or the intra-prediction signal is added by means of the adder 140. The inter-prediction signal is obtained by the motion compensation block 150 on the basis of the motion vector information and the reference frames stored in block 160. The intra-prediction signal is computed by intra-prediction block 170. The output signal of the adder 140 is then fed through the deblocking filter 180 in order to obtain the reconstructed frame 190.
FIG. 2 is a detailed block diagram of the CABAC decoder, which is part of the entropy decoder 110 in FIG. 1 and performs the following three processes, namely context modeling, binary arithmetic decoding, and debinarization.
Context modeling is performed by the context modeler block 220, which determines which syntax element is currently to be decoded and finds the index of the context to be used for decoding the current bin, based on neighboring information and other parameters.
Binary arithmetic decoding is performed by the regular decoding engine 230, which receives the input bits, for instance by means of direct memory access (DMA), and processes them in three sub-stages to produce an output bin string, i.e., a sequence of binary digits. The three sub-stages are (i) buffer align & context lookup, (ii) decoding and renormalization, and (iii) next state and context update. Bins that are encoded without usage of an explicitly assigned model are decoded by the bypass decoding engine 240. The selection between the two decoding engines is performed by switching unit 210.
Debinarization, i.e., the inverse binarization stage, is performed by the debinarizer block 260, which converts bin strings to non-binary-valued syntax elements. Binary-valued syntax elements bypass the debinarizer by means of switching unit 250.
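Debinarization can be illustrated with two simple binarization schemes, unary and truncated unary, in which a value is represented by a run of 1-bins. This is a minimal sketch only. The function names are assumptions, and the sketch does not cover all binarization schemes used for H.264 syntax elements.

```python
def debinarize_unary(bins):
    """Unary: value N is coded as N one-bins followed by a terminating zero-bin."""
    value = 0
    for b in bins:
        if b == 0:
            return value
        value += 1
    raise ValueError("bin string ended before the terminating 0")


def debinarize_truncated_unary(bins, c_max):
    """Truncated unary: like unary, but the value c_max has no terminating zero."""
    if c_max == 0:
        return 0  # only one value is possible, so no bins are needed
    value = 0
    for b in bins:
        if b == 0:
            return value
        value += 1
        if value == c_max:
            return value  # the maximum is reached; no terminator follows
    raise ValueError("bin string ended before the value was complete")


# Example bin strings as they might be produced by the decoding engines.
unary_value = debinarize_unary([1, 1, 1, 0])
tu_value = debinarize_truncated_unary([1, 1, 1], c_max=3)
```

Here `unary_value` and `tu_value` are both 3. The truncated form saves the terminating bin whenever the largest permitted value occurs.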
CABAC provides a compression gain of approximately 19%, irrespective of the input stream. However, the complexity of the CABAC encoding process is far higher than that of table-driven entropy coding schemes such as Huffman coding. CABAC is also bit-serial, and its multi-bit parallelization is extremely difficult. Consequently, CABAC accounts for a large share of the total time required for H.264 decoding.
Most decoding processes, except for CABAC, may be parallelized or pipelined. Hence, CABAC becomes a bottleneck when H.264 decoding of HDTV video is performed on embedded systems.
Most implementations of CABAC decoding are done partly in hardware and partly in firmware, which is insufficient for real-time decoding of HDTV video on embedded systems. Therefore, there is a need for a dedicated and independent co-processor which, when sent an initiation signal, outputs a CABAC-decoded macroblock (hence taking the load off the host) into a dedicated FIFO, from where the macroblock data may be picked up by the host or video decoder for pixel decoding, until the end of a picture/slice is reached.
A hardware implementation of the CABAC decoding process is, for instance, known from the article by Chang Yuan-Teng (“A Novel Pipeline Architecture for H.264/AVC CABAC Decoder”, IEEE Asia Pacific Conference on Circuits and Systems, 2008, which is incorporated by reference) or the article by Y. Yi and I.-C. Park (“High-Speed H.264/AVC CABAC Decoding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 4, pp. 490-494, 2007, which is incorporated by reference). These conventional approaches rely on dividing the CABAC decoding process into three pipeline stages. This has the drawback of requiring a backward path in the pipeline stages due to the updating of the context memory, often leading to unnecessary stalls. Though the authors achieve real-time performance, this performance depends strongly on the input video stream. An input having many consecutive accesses to a single context memory location would probably suffer from multiple stalls, leading to a drop in performance.
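The input dependence of such a pipeline can be made concrete with a simplified cycle model. The model assumes a three-stage pipeline without forwarding, in which a bin may enter the first stage only once the previous update of its context has been written back. The function name `count_cycles` and the context sequences are hypothetical, and the model does not reproduce the timing of any of the cited designs.

```python
def count_cycles(ctx_sequence, depth=3):
    """Cycles needed to decode all bins in a `depth`-stage pipeline, no forwarding."""
    ready_at = {}   # context index -> first cycle at which its updated state is readable
    next_slot = 0   # earliest cycle at which the next bin may enter stage 1
    last_done = 0   # cycle count once the most recent bin has left the pipeline
    for ctx in ctx_sequence:
        # Stall until the context is no longer in flight in a later stage.
        issue = max(next_slot, ready_at.get(ctx, 0))
        ready_at[ctx] = issue + depth  # updated state is written in the last stage
        last_done = issue + depth
        next_slot = issue + 1
    return last_done


# Four bins using four different contexts: the pipeline never stalls.
cycles_distinct = count_cycles([0, 1, 2, 3])
# Four bins using the same context: every bin waits for the previous update.
cycles_same = count_cycles([0, 0, 0, 0])
```

In this model the four bins with distinct contexts finish in 6 cycles, while four consecutive bins on the same context take 12 cycles, which corresponds to the stall behavior described above.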
A hardware accelerator for CABAC decoding is also known from an article by Jian-Wen Chen, Cheng-Ru Chang, and Youn-Long Lin, “A Hardware Accelerator for Context-Based Adaptive Binary Arithmetic Decoding in H.264/AVC” in Proc. IEEE ISCAS, May 2005, vol. 5, pp. 4525-4528, which is incorporated by reference, and wherein decoding is controlled by an optimized finite state machine. This accelerator, however, is only capable of processing a maximum video resolution of 352×288 in real time and, therefore, may be unsuitable for HDTV applications. Moreover, the Binary Arithmetic Decoder (BAD), the block responsible for reading the compressed bitstream and managing the arithmetic decoding and renormalization processes, is part of the syntax element decoding block, which is not the most efficient implementation of the BAD block. Finally, the conventional accelerator uses two separate memories for storing IDCT coefficients in a ping-pong fashion, which implies the use of pipelining in the design, with its added costs. Also, by providing direct host access to its coefficient memories, the conventional CABAC accelerator needs to handle high-volume and frequent inter-block communication.
A pipeline-based architecture for CABAC decoding is also known from an article by Junhao Zheng, David Wu, Don Xie and Wen Gao, “A Novel Pipeline Design for H.264 CABAC Decoding” in Advances in Multimedia Information Processing – PCM 2007, vol. 4810/2007, pp. 559-568, which is incorporated by reference, and wherein an efficient finite state machine is developed to meet the requirements of pipeline control, and the critical path is optimized for timing. This approach, however, is capable of decoding the coefficient information only. Extending this approach to decode all syntax elements while maintaining the required throughput of 1 bin/cycle is not possible, for the technical reasons set out below.
First of all, the approach of Zheng et al. uses a “Context Register Bank” to store the contexts pertaining to coefficient decoding. This register bank is located inside the IP and relies on some kind of prefetching of contexts from the main context memory, which lies outside the IP. Coefficient contexts are just a fraction of the total number of contexts supported by the H.264 standard, and hence the register bank is much smaller in area than the main context memory. In order to decode all CABAC syntax elements, additional cycles have to be taken into account, namely, firstly, the cycles needed to fetch a chunk of required contexts from the main memory into the context register bank and, secondly, the cycles needed to write the updated contexts back into the main memory. Hence, there is a context switch every time. Taking the cycle latency due to this context switching into account, maintaining a throughput of 1 bin/cycle is not possible, even with pipelining.
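The effect of context switching on throughput can be estimated with simple arithmetic. The sketch below assumes that the fetch and write-back transfers are not overlapped with decoding and that the pipeline otherwise sustains 1 bin/cycle. The function name `effective_throughput` and all numeric figures are hypothetical and are not taken from Zheng et al.

```python
def effective_throughput(bins_per_chunk, fetch_cycles, writeback_cycles):
    """Bins per cycle when each chunk of bins pays one fetch and one write-back."""
    decode_cycles = bins_per_chunk  # ideal pipeline: 1 bin per cycle
    total_cycles = fetch_cycles + decode_cycles + writeback_cycles
    return bins_per_chunk / total_cycles


# Hypothetical figures for illustration only.
rate = effective_throughput(bins_per_chunk=16, fetch_cycles=4, writeback_cycles=4)
ideal = effective_throughput(bins_per_chunk=16, fetch_cycles=0, writeback_cycles=0)
```

With these assumed figures the rate drops from 1 bin/cycle to about 0.67 bins/cycle. Under this model, any nonzero transfer cost keeps the rate strictly below 1 bin/cycle.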
Secondly, the approach of Zheng et al. does not take the positioning of the neighbor macroblock memory and the state tables into account, which represents another significant factor affecting the throughput of the design. In fact, all the syntax elements except coefficient data require neighboring macroblock data for decoding. Hence, the effect of accessing the neighbor memory has not been accounted for by Zheng et al.