Decoding multiple video stream standards has become an important application for general purpose processors such as MIPS, x86, SPARC, and ARM platforms. However, the conventional instruction sets of many general purpose processors are not well suited to the bit-oriented variable length decode (VLD) process used to decode compressed video formats such as MPEG-1/2/4, H.263, and the Windows Media Format. Hardwired VLD blocks implemented entirely in core logic require repeated investment to adopt different standards and can limit the flexibility of writing video decoders in a high-level programming language.
Special properties of variable length codes (VLCs) are applied for decoding. By definition, a variable length code has the unique prefix property (i.e., no code word is a prefix of another code word), and that property is used for decoding. However, exploiting the unique prefix property alone may limit performance to bit-serial decoding. Alternatively, a direct table look-up can be used to perform the decoding operation. However, with VLC words of up to n bits in length, the table size grows exponentially (i.e., 2^n entries). Even in modern processors with high performance memory subsystems, the memory bandwidth consumed by such large tables is a serious performance detriment.
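The two approaches can be contrasted with a minimal sketch in C. The four-word prefix code, the table layout, and the names peek_bits, decode_serial, and decode_table are hypothetical and chosen only for illustration; they are not taken from any video standard. With a longest code word of n = 3 bits, the direct table needs 2^3 = 8 entries, and each additional bit of code word length doubles it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy prefix code (illustration only):
 *   "1" -> 0, "01" -> 1, "001" -> 2, "000" -> 3 */
#define MAX_CODE_LEN 3u /* n: longest code word, in bits */

typedef struct {
    uint8_t symbol; /* decoded symbol */
    uint8_t length; /* code word length in bits */
} VlcEntry;

/* Direct look-up table indexed by the next n bits: 2^n entries.
 * Short code words are replicated across every index they prefix. */
static const VlcEntry direct_table[1u << MAX_CODE_LEN] = {
    {3, 3},                        /* 000 */
    {2, 3},                        /* 001 */
    {1, 2}, {1, 2},                /* 01x */
    {0, 1}, {0, 1}, {0, 1}, {0, 1} /* 1xx */
};

/* Read `count` bits starting at bit position `pos`, MSB first.
 * Bits beyond the end of the buffer read as zero. */
static uint32_t peek_bits(const uint8_t *buf, size_t nbytes,
                          size_t pos, unsigned count)
{
    uint32_t v = 0;
    assert(count <= 32);
    for (unsigned i = 0; i < count; i++) {
        size_t p = pos + i;
        uint32_t bit = 0;
        if (p / 8 < nbytes)
            bit = (buf[p / 8] >> (7 - p % 8)) & 1u;
        v = (v << 1) | bit;
    }
    return v;
}

/* Bit-serial decode: one bit is examined per iteration, relying
 * only on the unique prefix property. */
static int decode_serial(const uint8_t *buf, size_t nbytes, size_t *pos)
{
    int zeros = 0;
    while (zeros < (int)MAX_CODE_LEN) {
        if (peek_bits(buf, nbytes, (*pos)++, 1))
            return zeros; /* "1" -> 0, "01" -> 1, "001" -> 2 */
        zeros++;
    }
    return 3; /* "000" */
}

/* Direct table decode: a single look-up, at the cost of 2^n entries. */
static int decode_table(const uint8_t *buf, size_t nbytes, size_t *pos)
{
    VlcEntry e = direct_table[peek_bits(buf, nbytes, *pos, MAX_CODE_LEN)];
    *pos += e.length;
    return e.symbol;
}
```

For the bit string 1 01 001 000 (stored as the bytes 0xA4 0x00), both routines return the symbols 0, 1, 2, 3 in turn and consume nine bits. The serial routine needs up to three dependent iterations per symbol, whereas the table routine needs one memory access into a table that doubles with each extra bit of code word length.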
Some general purpose processors and digital signal processors include instructions to determine the number of leading zeros or the number of leading ones in a word. The instructions can be used for VLD as well as for other mathematical computations such as the normalization of a data set. Combined with barrel-shift bitfield extraction instructions, the leading zero or one count instructions can improve the speed of variable length code decoding and reduce memory requirements. However, the speed provided by these instructions is less than the speed provided by specially designed hardware decoders. The disadvantages of specially designed hardware decoders, in turn, include the difficulty of providing a simple software abstraction level for high-level coding and the investment of redesigning the processor to accommodate newer video codecs.
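The primitives described above can be sketched in C as follows. The function names clz32, extract_bits, and normalize32 are hypothetical. The sketch assumes a 32-bit unsigned int and uses the GCC/Clang builtin __builtin_clz where available, which compilers typically map to a hardware leading zero count instruction; a portable loop is provided as a fallback.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 32-bit word; returns 32 for x == 0.
 * __builtin_clz is undefined for 0, so that case is handled here. */
static unsigned clz32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return x ? (unsigned)__builtin_clz(x) : 32u;
#else
    unsigned n = 0;
    if (x == 0)
        return 32u;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
#endif
}

/* Barrel-shift bitfield extraction: return `len` bits that begin
 * `start` bits below the most significant bit of `window`. */
static uint32_t extract_bits(uint32_t window, unsigned start, unsigned len)
{
    assert(len >= 1 && len <= 32 && start + len <= 32);
    return (window << start) >> (32 - len);
}

/* Normalization: shift the most significant set bit up to bit 31. */
static uint32_t normalize32(uint32_t x)
{
    return x ? x << clz32(x) : 0;
}
```

In a decoder, clz32 classifies the code word by its run of leading zeros in one operation, and extract_bits then pulls out the bits that follow the prefix, so no per-bit loop and no 2^n entry table is required.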
With conventional approaches, the leading zero or one count is used to load an intermediate value from a base address plus an offset derived from the count. The leading zero count is arithmetically scaled and added to the base address. The value loaded from the resulting address contains a code word length and a base pointer to another table. Software then combines the pointer with the variable length portion of the code (i.e., part of the extracted bitfield), in a way similar to segmented address computation, to obtain a new address. The new address is used to index the VLC table that contains the decoded information. One disadvantage of the conventional approach is the use of two levels of indirection through the memory hierarchy, together with the computational overhead of obtaining the addresses.
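The two-level scheme can be sketched in C as follows. The toy code (z leading zeros, a one, then a z-bit suffix), the table contents, and the names PrefixEntry, prefix_table, vlc_table, and decode_two_level are hypothetical and serve only to illustrate the address computation; array indexing stands in for the scaled base-plus-offset arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* First-level entry: code word length plus a base index into the
 * second-level table (standing in for the base pointer). */
typedef struct {
    uint8_t code_len; /* total code word length in bits */
    uint8_t base;     /* first second-level entry for this prefix */
} PrefixEntry;

/* Hypothetical toy code: z leading zeros, a '1', then a z-bit suffix. */
static const PrefixEntry prefix_table[3] = { {1, 0}, {3, 1}, {5, 3} };
static const char vlc_table[7] = { 'A', 'B', 'C', 'D', 'E', 'F', 'G' };

/* Leading zero count; assumes a 32-bit unsigned int for the builtin. */
static unsigned clz32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return x ? (unsigned)__builtin_clz(x) : 32u;
#else
    unsigned n = 0;
    if (x == 0)
        return 32u;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
#endif
}

/* Decode the code word at the top of `window` (bitstream MSB first).
 * Returns the decoded symbol and stores its length in *code_len. */
static char decode_two_level(uint32_t window, unsigned *code_len)
{
    unsigned z = clz32(window);
    assert(z < 3); /* the toy table covers only 0..2 leading zeros */

    PrefixEntry e = prefix_table[z]; /* first memory access */

    /* Bits after the prefix (z zeros and the terminating one). */
    unsigned suffix_len = e.code_len - (z + 1);
    uint32_t suffix =
        suffix_len ? (window << (z + 1)) >> (32 - suffix_len) : 0;

    *code_len = e.code_len;
    return vlc_table[e.base + suffix]; /* second, dependent access */
}
```

The second access cannot begin until the first has returned, because its address depends on the loaded base, and the shift, subtract, and add operations in between are likewise serial. That dependency chain is the overhead the paragraph above describes.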
Referring to FIG. 1, an example of a conventional processing scheme 10 for VLC decoding is shown. The figure illustrates one instant in an MPEG-4 video decode sequence using the conventional approach. The boxes 12 and 14 represent interactions with the memory subsystem, and the boxes 16, 18 and 20 represent software processing with special instructions. A conventional RISC processor with a leading zero count instruction uses 15 to 18 cycles to perform such a computation. Even in superscalar or very long instruction word (VLIW) implementations, the computation is difficult to complete in fewer clock cycles because most of the operations are serial in nature. Additionally, the uncertain latencies of the cached memory subsystem are encountered twice. In some architectures this problem is avoided by specifically locking the locations of the VLC table entries. In processors that do not have a leading zero count instruction, the number of cycles is higher because the leading zero count determination is implemented through nested conditional loops.
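For processors without a leading zero count instruction, the count may be computed in software along the following lines. This is a minimal sketch under assumed names (clz32_soft is hypothetical), shown here as a binary narrowing of the search range; actual implementations vary and may instead test one bit per iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Software leading zero count by successive conditional tests.
 * Each iteration tests whether the top `width` bits are all zero
 * and, if so, counts them and shifts them out. Returns 32 for 0. */
static unsigned clz32_soft(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32u;
    for (unsigned width = 16; width >= 1; width >>= 1) {
        if ((x >> (32 - width)) == 0) {
            n += width;
            x <<= width;
        }
    }
    assert(n <= 31); /* a nonzero word has at most 31 leading zeros */
    return n;
}
```

Each of the five tests depends on the shifted value produced by the previous one, so the sequence replaces a single instruction with a chain of dependent shift, compare, and branch operations before the first table access can even be issued.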
It would be desirable to implement variable length decoding in a general purpose processor in a manner that uses fewer than 15 cycles and is easily adaptable to various video stream decoding processes.