Video compression involves encoding and decoding pixel information in macroblocks of 16×16 pixels. Emerging standards such as MPEG-4, H.264, and Windows Media provide a flexible tiling structure within a macroblock, allowing the use of 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4 sub-macroblock sizes. A Finite Impulse Response (FIR) filter (the de-blocking filter) is applied to every decoded macroblock edge to reduce blocking distortion resulting from the prediction and residual difference coding stages of the decoding process. The filter is applied on both 4×4 block and 16×16 macroblock boundaries, where up to three pixels on either side of the boundary may be updated using a five-tap filter. The filter coefficient set, or “strength”, is governed by a content-adaptive non-linear filtering scheme, which is implemented differently in different codecs. The Windows Media Video (WMV) decoder uses a scheme involving the boundary strength across block boundaries, whereas H.264 (MPEG-4 Part 10) uses the pixel gradient across block boundaries.
The de-blocking filter has two benefits. First, block edges are smoothed, improving the appearance of decoded images (particularly at higher compression ratios). Second, in the encoder the filtered macroblock is used for motion-compensated prediction of further frames, resulting in a smaller residual after prediction.
The 2D adaptive filter is applied to both vertical and horizontal edges of the 4×4 sub-macroblocks in a macroblock, in the following order: vertical edges first, then horizontal edges. Each filtering operation may affect up to three pixels on either side of the boundary. In 4×4 pixel sub-macroblocks there are four pixels on either side of a vertical or horizontal boundary in adjacent blocks p and q (p0, p1, p2, p3 and q0, q1, q2, q3). Depending on the coding modes of the neighboring blocks and the gradient of image samples across the boundary, several outcomes are possible, ranging from (a) no pixels are filtered to (b) p0, p1, p2, q0, q1, q2 are filtered to produce output pixels P0, P1, P2, Q0, Q1 and Q2.
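The p/q sample naming can be illustrated with a minimal sketch. It assumes the frame is a list of rows and that index 0 is the sample nearest the boundary on each side; the helper name `edge_samples` and its arguments are hypothetical and are not taken from any codec.

```python
def edge_samples(frame, row, col, vertical):
    """Gather the four samples on each side of a block boundary.

    Assumptions (illustrative only): `frame` is a list of rows, and
    p[0] / q[0] are the samples nearest the boundary on each side.
    For a vertical edge the boundary lies between columns col-1 and col;
    for a horizontal edge it lies between rows row-1 and row.
    """
    if vertical:
        p = [frame[row][col - 1 - i] for i in range(4)]  # p0..p3, moving left
        q = [frame[row][col + i] for i in range(4)]      # q0..q3, moving right
    else:
        p = [frame[row - 1 - i][col] for i in range(4)]  # p0..p3, moving up
        q = [frame[row + i][col] for i in range(4)]      # q0..q3, moving down
    return p, q


# An 8x8 test frame in which each sample encodes its own position.
frame = [[r * 10 + c for c in range(8)] for r in range(8)]
p_vertical, q_vertical = edge_samples(frame, 2, 4, vertical=True)
p_horizontal, q_horizontal = edge_samples(frame, 4, 2, vertical=False)
```

Only p0, p1, p2 and q0, q1, q2 can be modified by the filter; the fourth sample on each side is available as filter input but is never written.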
The choice of filtering outcome depends on the boundary block strength parameter and on the gradient of image samples across the boundary. The boundary strength parameter Bs is chosen according to the following rules:
| Condition | Bs | Pixels filtered |
|---|---|---|
| p or q is intra coded and boundary is a macroblock boundary | Bs = 4 (strongest filtering) | P0, P1, P2, Q0, Q1, Q2 |
| p or q is intra coded and boundary is not a macroblock boundary | Bs = 3 | P0, P1, Q0, Q1 |
| neither p nor q is intra coded; p or q contain coded coefficients | Bs = 2 | P0, P1, Q0, Q1 |
| neither p nor q is intra coded; neither p nor q contain coded coefficients; p and q have different reference frames or a different number of reference frames or different motion vector values | Bs = 1 | P0, P1, Q0, Q1 |
| neither p nor q is intra coded; neither p nor q contain coded coefficients; p and q have same reference frame and identical motion vectors | Bs = 0 (no filtering) | none |
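The rules above can be expressed as a short decision function. This is a sketch only: the function name, the boolean argument names, and the `FILTERED_PIXELS` table are hypothetical, and a real decoder would derive these flags from the bitstream rather than receive them as arguments.

```python
# Output pixels that may be updated for each boundary strength,
# transcribed from the rules listed above.
FILTERED_PIXELS = {
    4: ("P0", "P1", "P2", "Q0", "Q1", "Q2"),
    3: ("P0", "P1", "Q0", "Q1"),
    2: ("P0", "P1", "Q0", "Q1"),
    1: ("P0", "P1", "Q0", "Q1"),
    0: (),
}


def boundary_strength(p_intra, q_intra, on_macroblock_boundary,
                      p_has_coeffs, q_has_coeffs,
                      same_reference_frames, same_motion_vectors):
    """Return Bs (0..4) for the boundary between blocks p and q."""
    if p_intra or q_intra:
        # Intra-coded neighbors get the strongest filtering,
        # strongest of all on a macroblock boundary.
        return 4 if on_macroblock_boundary else 3
    if p_has_coeffs or q_has_coeffs:
        return 2
    if not (same_reference_frames and same_motion_vectors):
        return 1
    return 0  # identical prediction on both sides: no filtering


bs_intra_mb_edge = boundary_strength(True, False, True, False, False, True, True)
bs_intra_inner = boundary_strength(False, True, False, True, True, True, True)
bs_coded = boundary_strength(False, False, True, True, False, True, True)
bs_motion = boundary_strength(False, False, False, False, False, True, False)
bs_none = boundary_strength(False, False, False, False, False, True, True)
```

The checks are ordered from strongest to weakest so that the first matching rule wins, which mirrors the order in which the rules are listed.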
The filter sample-level decision (ap==[1,0] for the left side of the filter, and aq==[1,0] for the right side of the filter) depends on the pixel gradient across block boundaries. The purpose of that decision is to “switch off” the filter when there is a significant change (gradient) across the block boundary, or to filter very strongly when there is a very small change (gradient) across the block boundary, which is likely to be due to image blocking effects. For example, if the pixel gradient across an edge is below a certain slice threshold (ap/aq=1), then a five-tap filter (a strong filter) is applied to filter P0; if not (ap/aq=0), then a three-tap filter (a weak filter) is applied. In a single compute unit processor, the selection between the two filters is made using if/else (jump) instructions: the sequencer must jump over the second filter instruction stream if the first one is selected, or jump over the first one if the second one is selected. These jump (if/else) instructions are acceptable in a single compute unit processor but not in a multi-compute unit processor such as a single instruction multiple data (SIMD) processor.
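On a single compute unit, the selection described above might look like the following sketch. The function name `filter_p0` and the threshold name `beta` are hypothetical. The gradient test and the tap coefficients are modeled on the H.264 luma filter for Bs = 4 and should be read as illustrative assumptions, not as the exact filter of any particular decoder.

```python
def filter_p0(p, q, beta):
    """Compute output pixel P0 with an if/else between two filters.

    p and q hold the samples on each side of the boundary, with p[0]
    and q[0] nearest the edge. `beta` stands in for the slice
    threshold (assumed name).
    """
    # Sample-level decision: ap = 1 when the gradient on the p side is small.
    ap = 1 if abs(p[2] - p[0]) < beta else 0

    if ap == 1:
        # Five-tap (strong) filter over p2, p1, p0, q0, q1.
        return (p[2] + 2 * p[1] + 2 * p[0] + 2 * q[0] + q[1] + 4) >> 3
    else:
        # Three-tap (weak) filter over p1, p0, q1.
        return (2 * p[1] + p[0] + q[1] + 2) >> 2


smooth_side = filter_p0([100, 100, 100, 100], [108, 108, 108, 108], beta=4)
steep_side = filter_p0([100, 104, 120, 120], [108, 108, 108, 108], beta=4)
```

The two branches are separate instruction streams, so the path taken depends on the data. This is the property that a single broadcast instruction stream cannot accommodate.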
Since a SIMD processor can solve similar problems in parallel on different sets of local data, it can be characterized as n times faster than a single compute unit processor, where n is the number of compute units in the SIMD. However, this benefit is available only for sequential types of problems such as FIR, FFT, DCT, and IDCT. The need for SIMD type processing for non-sequential instruction streams is increasing as image size increases.
However, in such multiple compute unit processors, a single sequencer broadcasts a single instruction stream that drives each of the compute units on a different local data set, e.g. the pixel gradient at block boundaries. The conduct of each compute unit may therefore differ (jump or not jump, and to where), depending upon the effect of the common instruction on the individualized local data, and the sequencer cannot make a jump/not-jump decision that satisfies all the compute units. Consequently, the high speed and efficiency of SIMD processors have not been applied to the family of non-sequential instructions, e.g. conditional (if/else, jump) types of problems.
In the current generation of vector SIMD processors, this problem can be solved by deriving from a sequence of instructions a generic instruction having an index section and a compute section, and broadcasting that generic instruction to the multiple compute units. In each compute unit, the index section is applied to localized data stored in that unit to select one of a plurality of stored local parameter sets, and the selected parameters are then applied to the local data according to the compute section to produce that compute unit's localized solution to the generic instruction.
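A software model of this idea is sketched below. Everything in it is an assumption made for illustration: the names `PARAMETER_SETS`, `index_section`, `compute_section` and `broadcast`, the threshold `beta`, and the tap values (modeled on the H.264 Bs = 4 luma filter for P0). The Python loop only stands in for hardware lanes that would execute in lockstep.

```python
# Parameter sets stored locally in every compute unit. Each entry is
# (taps over [p2, p1, p0, q0, q1], rounding offset, right shift).
PARAMETER_SETS = (
    ((0, 2, 1, 0, 1), 2, 2),  # index 0: three-tap (weak) filter
    ((1, 2, 2, 2, 1), 4, 3),  # index 1: five-tap (strong) filter
)


def index_section(local_data, beta):
    """Index section: derive a parameter-set index from local data."""
    p, _q = local_data
    return int(abs(p[2] - p[0]) < beta)


def compute_section(local_data, params):
    """Compute section: the same multiply-accumulate in every unit."""
    p, q = local_data
    taps, rounding, shift = params
    samples = (p[2], p[1], p[0], q[0], q[1])
    return (sum(t * s for t, s in zip(taps, samples)) + rounding) >> shift


def broadcast(lanes, beta):
    """Apply one generic instruction to every compute unit's local data."""
    return [
        compute_section(local, PARAMETER_SETS[index_section(local, beta)])
        for local in lanes
    ]


lanes = [
    ([100, 100, 100, 100], [108, 108, 108, 108]),  # small gradient
    ([100, 104, 120, 120], [108, 108, 108, 108]),  # large gradient
]
results = broadcast(lanes, beta=4)
```

Every lane runs the same two steps with no jump. Only the selected coefficients differ, with the weak filter expressed as a five-entry tap vector whose unused taps are zero.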