The H.264 video coding standard, jointly developed by ITU-T and MPEG, provides state-of-the-art video coding techniques to achieve high coding efficiency. It includes an enhanced intra-prediction technique. Intra-prediction exploits spatial redundancy within a picture: a macroblock or block is predicted from previously decoded adjacent pixels. The H.264 standard supports a rich set of prediction patterns for intra-prediction. These include nine prediction modes for 4 by 4 luminance blocks and four prediction modes for 16 by 16 blocks. This increased number of intra-prediction modes increases decoding complexity. Decoding 4 by 4 blocks involves many conditional branch statements for prediction mode selection within a macroblock. This is not suitable for very long instruction word (VLIW) processors, which perform best on parallel code without conditional branches.
FIG. 1 illustrates the nomenclature used to describe 4 by 4 intra prediction. The pixels labeled a to p are the pixels of the 4 by 4 block to be predicted. The pixels labeled A to D are the top neighbor pixels from the block above, and pixels E to H are the top right neighbor pixels. Pixels I to L are the left neighbors. Pixel M is the top left corner neighbor. Some or all of these neighbor pixels may not be available for a given 4 by 4 block depending upon its position in the frame. Each pixel can take a value from 0 to 255 and is represented by an 8-bit data word.
FIG. 2 schematically illustrates the 9 prediction modes for 4 by 4 blocks. The arrows represent the direction of prediction for each mode used to create a prediction buffer pred[16]. Modes 0 and 1 are the vertical and horizontal modes respectively, in which the top or left pixels are extrapolated. Mode 2 is a DC mode in which an average of the available top and left neighbor pixels is taken. In modes 3 to 8, predicted pixels are formed by a weighted average of neighbor pixels. For instance, pixel a for prediction mode 4 (diagonal down-right) is the rounded value of (I/4+M/2+A/4). A prediction mode is selected during encoding only if all required neighbor pixels are available. As a special case, if the top right neighbor pixels E to H are not available, the neighbor pixel D is replicated in pixels E to H for the purpose of prediction.
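The following is a minimal sketch in C of the simplest cases described above: the vertical and horizontal modes, the single pixel a of mode 4, and the replication of D into E to H. The struct `Neighbors4x4` and all function names are illustrative assumptions and do not come from the standard or from any particular decoder. The rounded value of (I/4+M/2+A/4) is assumed to be computed in integer arithmetic as (I+2M+A+2)>>2.

```c
#include <stdint.h>

/* Neighbor pixels of one 4 by 4 block, named as in FIG. 1.
   This layout is a hypothetical one chosen for the sketch. */
typedef struct {
    uint8_t top[8];   /* A..D followed by E..H */
    uint8_t left[4];  /* I..L */
    uint8_t corner;   /* M */
} Neighbors4x4;

/* Mode 0 (vertical): every row repeats the top neighbors A..D. */
static void pred_vertical(const Neighbors4x4 *n, uint8_t pred[16])
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[4 * y + x] = n->top[x];
}

/* Mode 1 (horizontal): every column repeats the left neighbors I..L. */
static void pred_horizontal(const Neighbors4x4 *n, uint8_t pred[16])
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[4 * y + x] = n->left[y];
}

/* Pixel a in mode 4 (diagonal down-right): rounded I/4 + M/2 + A/4. */
static uint8_t pred_ddr_pixel_a(const Neighbors4x4 *n)
{
    return (uint8_t)((n->left[0] + 2 * n->corner + n->top[0] + 2) >> 2);
}

/* Special case: when E..H are unavailable, D is replicated into them. */
static void replicate_top_right(Neighbors4x4 *n)
{
    for (int i = 4; i < 8; i++)
        n->top[i] = n->top[3];
}
```

With A=10, M=100 and I=50, `pred_ddr_pixel_a` evaluates (50+200+10+2)>>2.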
The encoder may use a 4 by 4 intra prediction mode at the 16 by 16 macroblock (MB) level. There are sixteen 4 by 4 luminance sub-blocks in a 16 pixel by 16 pixel macroblock. FIG. 3 illustrates the predetermined scanning order according to the standard. The encoding mechanism for 4 by 4 luminance prediction involves the following steps. The macroblock is scanned in this order, and for each 4 by 4 block a best prediction mode from 0 to 8 is selected. A prediction buffer is created according to the selected prediction mode. The intra predicted 4 by 4 block is subtracted from the 4 by 4 block to be encoded. The residual error data is encoded. The selected prediction modes for the sixteen 4 by 4 blocks are encoded separately.
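As an illustration of the scanning order, the sketch below maps a block index to the pixel offset of that block inside the macroblock. It assumes that FIG. 3 shows the usual H.264 order, in which blocks 0 to 3 fill the top left 8 by 8 quadrant (0 and 1 on its top row, 2 and 3 below), blocks 4 to 7 fill the top right quadrant, and so on. The function name `block_offset` is hypothetical.

```c
/* Pixel offset of 4 by 4 block idx (0..15) inside a 16 by 16 macroblock,
   assuming the scan order described in the lead-in.
   Bits 0 and 2 of idx select the column; bits 1 and 3 select the row. */
static void block_offset(int idx, int *x, int *y)
{
    *x = 4 * ((idx & 1) | ((idx >> 1) & 2));
    *y = 4 * (((idx >> 1) & 1) | ((idx >> 2) & 2));
}
```

Under this assumption, the blocks with x equal to 0 are 0, 2, 8 and 10, which are the four blocks on the left edge of the macroblock.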
Decoding a macroblock encoded with 4 by 4 intra prediction proceeds as follows. The macroblock is scanned in the order illustrated in FIG. 3. The decoder reads the separately encoded prediction mode and determines the availability of neighbor pixels. The decoder reads the neighbor pixels specified by the prediction mode and creates a prediction buffer pred[16]. The decoder separately decodes the residual error data, which are signed 16-bit values. The decoder adds each prediction buffer pixel to the residual error data of the corresponding pixel to obtain the reconstructed pixels. After this addition, the decoder saturates the reconstructed pixels to 8-bit values. The decoder writes the reconstructed pixel values into an output buffer. Some of these reconstructed pixels may act as neighbor pixels for following 4 by 4 blocks.
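The addition and saturation steps can be sketched as follows. This is a simplified illustration: it assumes the residual values have already been scaled to pixel units, so any bit shifting applied by the standard is left out. The names `clip_u8` and `reconstruct_4x4` and the `stride` parameter are assumptions made for the sketch.

```c
#include <stdint.h>

/* Saturate a signed intermediate value to the 8-bit range 0..255. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Add the 16-bit signed residual to the prediction buffer and write the
   saturated result into the output buffer. stride is the number of bytes
   between consecutive rows of the output buffer. */
static void reconstruct_4x4(const uint8_t pred[16], const int16_t resid[16],
                            uint8_t *out, int stride)
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            out[y * stride + x] = clip_u8(pred[4 * y + x] + resid[4 * y + x]);
}
```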
Table 1 shows the interface of the 4 by 4 intra prediction decoder function just described.
TABLE 1

Argument            Description
pred_ptr[16]        Array of prediction modes for 16 4 by 4 blocks. The values are the 9 enumerated prediction modes.
top_pixels[20]      Array of top and top right neighbor pixels. This array contains 20 pixels as neighbors of top 4 blocks of the MB.
left_pixels[16]     Array of left neighbor pixels. This array contains 16 pixels as neighbors of 4 left blocks of the MB.
corner_pixels[4]    Array of top left corner pixels. This array contains 4 pixels for 4 left blocks (0, 2, 8, 10) of the MB.
left_availbp        Bit pattern for availability information for left neighbors I to L. This is a 16 bit value; one bit per 4 by 4 block indicating whether the neighbor is available.
top_availbp         Bit pattern for availability information for top neighbors A to D. This is a 16 bit value; one bit per 4 by 4 block indicating whether the neighbor is available.
topright_availbp    Bit pattern for availability information for top right neighbors E to H. This is a 16 bit value; one bit per 4 by 4 block indicating whether the neighbor is available.
in_ptr              Pointer to input residual error data.
out_ptr             Pointer to output reconstruction buffer.
zigzag_tbl          Lookup table for pointer movement across 4 by 4 blocks within the MB as illustrated in FIG. 3.

The top_pixels, left_pixels and corner_pixels arrays contain the neighbor pixels of the current macroblock. If any neighbor pixel is not available, the array entry has a value of 128. The values in these arrays are updated while processing 4 by 4 blocks which are the neighbor pixels for subsequent blocks. For example: the reconstructed pixels d, h, l and p of block 0 are left neighbor pixels I, J, K and L for block 1; the reconstructed pixels m, n, o and p of block 0 are top neighbor pixels A, B, C and D for block 2; the reconstructed pixel p of block 0 is the top left corner neighbor M for block 3; and the reconstructed pixels m, n, o and p of block 1 are top right pixels E, F, G and H for block 2.
The DC prediction value in the DC mode is calculated as follows. If top and left neighbor pixels are available, then:

dc = (A+B+C+D+I+J+K+L+4)/8    (1)

If only top neighbor pixels are available, then:

dc = (A+B+C+D+2)/4    (2)

If only left neighbor pixels are available, then:

dc = (I+J+K+L+2)/4    (3)

If no neighboring pixels are available, then:

dc = 128    (4)

Note that because unavailable neighboring pixels in the pixel arrays have a value of 128, the DC calculations are identical for the first and last cases above.
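Equations (1) to (4) can be sketched in C as shown below. The function name `dc_pred` and the two availability flags are illustrative assumptions, and the divisions by 8 and 4 are assumed to be integer divisions, written here as right shifts.

```c
#include <stdint.h>

/* DC prediction value for one 4 by 4 block.
   top holds A..D, left holds I..L; the flags say which sides exist. */
static uint8_t dc_pred(const uint8_t top[4], const uint8_t left[4],
                       int top_avail, int left_avail)
{
    int sum_top  = top[0] + top[1] + top[2] + top[3];
    int sum_left = left[0] + left[1] + left[2] + left[3];

    if (top_avail && left_avail)
        return (uint8_t)((sum_top + sum_left + 4) >> 3);  /* equation (1) */
    if (top_avail)
        return (uint8_t)((sum_top + 2) >> 2);             /* equation (2) */
    if (left_avail)
        return (uint8_t)((sum_left + 2) >> 2);            /* equation (3) */
    return 128;                                           /* equation (4) */
}
```

When all eight neighbor entries hold 128, equation (1) gives (1024+4)>>3, which is 128, the same value as equation (4).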
The neighbor, prediction and reconstructed pixels are all 8-bit unsigned values from 0 to 255. The residual error data are 16-bit signed values. In the standard, adding the prediction buffer to the residual error involves bit shifting and saturation.
A typical implementation of the 4 by 4 intra prediction function on a VLIW processor such as the Texas Instruments TMS320C6400 would include the following optimization techniques. It would use packed data storage, placing four 8-bit values in each 32-bit register. It would use single instruction multiple data (SIMD) instructions such as DOTPU4, which calculates the dot product of four 8-bit values, for efficient calculation of prediction values. It would use registers instead of read/write memory for the prediction buffer.
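To make the packed data idea concrete, the sketch below emulates in portable C a dot product of four unsigned 8-bit values held in two 32-bit words, which is the operation the text attributes to DOTPU4. The functions `pack4_u8`, `dotpu4_emul` and `weighted3` are hypothetical helpers written for this illustration; they are not the processor instruction or a compiler intrinsic.

```c
#include <stdint.h>

/* Pack four 8-bit values into one 32-bit word, b0 in the low byte. */
static uint32_t pack4_u8(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 | ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* Portable emulation of a four-way unsigned 8-bit dot product:
   multiplies corresponding bytes of a and b and sums the products. */
static uint32_t dotpu4_emul(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += ((a >> (8 * i)) & 0xFFu) * ((b >> (8 * i)) & 0xFFu);
    return sum;
}

/* Rounded weighted average (p0 + 2*p1 + p2 + 2) >> 2 computed with one
   dot product against the packed weights (1, 2, 1, 0). */
static uint8_t weighted3(uint8_t p0, uint8_t p1, uint8_t p2)
{
    uint32_t pixels  = pack4_u8(p0, p1, p2, 0);
    uint32_t weights = pack4_u8(1, 2, 1, 0);
    return (uint8_t)((dotpu4_emul(pixels, weights) + 2) >> 2);
}
```

On a processor with such an instruction, the loop in `dotpu4_emul` would collapse to a single operation, which is why weighted averages of neighbor pixels map well onto it.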
Generally, maximum utilization of the processing units of a VLIW processor relies on software pipelining of a loop to achieve parallelism. Software pipelining is feasible only when the loop does not contain conditional branch statements. Software pipelining of the prior art intra prediction function is not feasible even with the above optimizations. This is because there is conditional branching based upon selection from among the 9 prediction modes. Thus the prior art techniques underutilize a VLIW processor.
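The branching structure described above can be pictured with the following simplified sketch. It is not taken from any actual decoder: only modes 0 to 2 are written out, modes 3 to 8 are replaced by a placeholder, and the neighbors of each block are assumed to be supplied in flat arrays of four bytes per block. All names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum { MODE_VERTICAL = 0, MODE_HORIZONTAL = 1, MODE_DC = 2 };

/* Build pred[16] for one block. The switch is the conditional branch
   that is evaluated once per 4 by 4 block. */
static void predict_block(int mode, const uint8_t top[4],
                          const uint8_t left[4], uint8_t pred[16])
{
    switch (mode) {
    case MODE_VERTICAL:
        for (int i = 0; i < 16; i++) pred[i] = top[i & 3];
        break;
    case MODE_HORIZONTAL:
        for (int i = 0; i < 16; i++) pred[i] = left[i >> 2];
        break;
    case MODE_DC: {
        int sum = 4;  /* rounding term for the division by 8 */
        for (int k = 0; k < 4; k++) sum += top[k] + left[k];
        memset(pred, sum >> 3, 16);
        break;
    }
    default:
        /* Placeholder only: modes 3 to 8 are omitted from this sketch. */
        memset(pred, 128, 16);
        break;
    }
}

/* Loop over the 16 blocks. Because the loop body branches on modes[blk],
   it has a different instruction path from one iteration to the next. */
static void predict_macroblock(const uint8_t modes[16], const uint8_t *top,
                               const uint8_t *left, uint8_t *pred)
{
    for (int blk = 0; blk < 16; blk++)
        predict_block(modes[blk], top + 4 * blk, left + 4 * blk,
                      pred + 16 * blk);
}
```

The per-iteration `switch` inside `predict_macroblock` is the kind of data-dependent control flow that prevents the loop from being software pipelined.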