Video compression involves encoding and decoding pixel information in macroblocks of 16×16 pixels. Newer standards such as MPEG-4, H.264, and Windows Media provide a flexible tiling structure within a macroblock, allowing sub-macroblock sizes of 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4. A de-blocking filter is applied to every decoded macroblock edge to reduce the blocking distortion that results from the prediction and residual difference coding stages of the decoding process. The filter is applied on both 4×4 block and 16×16 macroblock boundaries, where up to three pixels on either side of the boundary may be updated using a five-tap filter. The filter coefficients, or “strength,” are governed by a content-adaptive non-linear filtering scheme, which can be realized in a number of ways. The Windows Media Video (WMV) decoder uses one protocol, based on the boundary strength across block boundaries. H.264 (MPEG-4 Part 10) uses the pixel gradient across block boundaries.
In H.264 the de-blocking filter is applied after the inverse transform, both in the encoder (before reconstructing and storing the macroblock for future predictions) and in the decoder (before reconstructing and displaying the macroblock). The filter has two benefits. First, block edges are smoothed, improving the appearance of decoded images, particularly at higher compression ratios. Second, in the encoder the filtered macroblock is used for motion-compensated prediction of further frames, resulting in a smaller residual after prediction.
Three levels of adaptive filtering (slice, edge, and sample) are applied to the vertical and horizontal edges of the 4×4 sub-macroblocks in a macroblock, in the following order: vertical edges first, then horizontal edges. Each filtering operation affects up to three pixels on either side of the boundary. For 4×4-pixel sub-macroblocks there are four pixels on either side of a vertical or horizontal boundary between adjacent blocks p and q (p0, p1, p2, p3 and q0, q1, q2, q3). Depending on the coding modes of the neighboring blocks and the gradient of image samples across the boundary, several outcomes are possible, ranging from (a) no pixels being filtered to (b) p0, p1, p2, q0, q1, and q2 all being filtered to produce output pixels P0, P1, P2, Q0, Q1, and Q2.
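The sample naming above can be illustrated with a minimal sketch. It assumes the usual convention, not spelled out in the text, that index 0 is the sample nearest the boundary, so a row crossing a vertical edge reads p3 p2 p1 p0 | q0 q1 q2 q3. The function name split_edge is hypothetical.

```python
def split_edge(row):
    """Split 8 samples straddling a vertical edge into p and q lists.

    Assumed layout (an assumption, not stated in the text):
        p3 p2 p1 p0 | q0 q1 q2 q3
    so index 0 on each side is the sample nearest the boundary.
    """
    if len(row) != 8:
        raise ValueError("expected 8 samples: 4 on each side of the edge")
    p = list(reversed(row[:4]))  # p[0] is p0, next to the edge
    q = list(row[4:])            # q[0] is q0, next to the edge
    return p, q


# Example: one row of pixels across the boundary between blocks p and q.
p, q = split_edge([10, 11, 12, 13, 20, 21, 22, 23])
```

With this layout, "up to three pixels on either side" corresponds to the entries p[0..2] and q[0..2]; p[3] and q[3] are read by the filter but never modified.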
The choice of filtering outcome depends on the boundary block strength (edge level) parameter and on the gradient of image samples across the boundary (sample level). The boundary strength parameter Bs is chosen according to the following rules:
Bs = 4 (strongest filtering): p or q is intra coded and the boundary is a macroblock boundary. Pixels filtered: P0, P1, P2, Q0, Q1, Q2.
Bs = 3: p or q is intra coded and the boundary is not a macroblock boundary. Pixels filtered: P0, P1, Q0, Q1.
Bs = 2: neither p nor q is intra coded; p or q contain coded coefficients. Pixels filtered: P0, P1, Q0, Q1.
Bs = 1: neither p nor q is intra coded; neither p nor q contain coded coefficients; p and q have different reference frames or a different number of reference frames or different motion vector values. Pixels filtered: P0, P1, Q0, Q1.
Bs = 0 (no filtering): neither p nor q is intra coded; neither p nor q contain coded coefficients; p and q have the same reference frame and identical motion vectors.
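The rules above can be read as a priority chain, which the following sketch encodes. The function name and its boolean parameters are hypothetical stand-ins for information a real decoder would derive from the bitstream. The reference-frame conditions (different frames, or a different number of frames) are collapsed into a single same_refs flag for brevity.

```python
def boundary_strength(p_intra, q_intra, on_macroblock_edge,
                      p_coded, q_coded, same_refs, same_mvs):
    """Return Bs (0..4) for the edge between blocks p and q.

    All parameters are illustrative booleans (assumptions):
      p_intra / q_intra   - block is intra coded
      on_macroblock_edge  - edge lies on a macroblock boundary
      p_coded / q_coded   - block contains coded coefficients
      same_refs           - same reference frames, same number of them
      same_mvs            - identical motion vector values
    """
    if p_intra or q_intra:
        return 4 if on_macroblock_edge else 3
    if p_coded or q_coded:
        return 2
    if not same_refs or not same_mvs:
        return 1
    return 0  # no filtering


# Pixels that may be updated for each Bs, as listed in the rules.
FILTERED_PIXELS = {
    4: ("P0", "P1", "P2", "Q0", "Q1", "Q2"),
    3: ("P0", "P1", "Q0", "Q1"),
    2: ("P0", "P1", "Q0", "Q1"),
    1: ("P0", "P1", "Q0", "Q1"),
    0: (),
}
```

Ordering the tests from strongest to weakest means each later rule can assume the earlier conditions failed, which matches the "neither p nor q is intra coded" wording of the lower-strength rows.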
The filter is “stronger” at places where there is likely to be significant blocking distortion, such as the boundary of an intra coded macroblock or a boundary between blocks that contain coded coefficients.
The sample-level filter decision (ap = 1 or 0 for the left side of the filter, and aq = 1 or 0 for the right side) depends on the pixel gradient across block boundaries. The purpose of this decision is to “switch off” the filter when there is a significant change (gradient) across the block boundary, or to filter very strongly when there is a very small change (gradient) across the block boundary, which is likely to be due to the image blocking effect. For example, if the pixel gradient across an edge is below a certain slice threshold (ap/aq = 1), a five-tap (strong) filter is applied to produce P0; if not (ap/aq = 0), a three-tap (weak) filter is applied. In slow single-compute-unit processors, the selection of which filter to apply is made using if/else (jump) instructions: the sequencer must jump over the second filter's instruction stream if the first filter is selected, or jump over the first if the second is selected. Such jump (if/else) instructions are acceptable in slower single-compute-unit processors, but not in fast (deeply pipelined) single-compute-unit processors or in multi-compute-unit processors such as single instruction, multiple data (SIMD) processors.
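The if/else selection described above might look like the following sketch for the P0 output. The text gives no coefficients, so the tap weights are an assumption taken from the commonly cited H.264 luma equations for Bs = 4. The threshold name beta and the gradient test |p2 - p0| < beta are also simplifying assumptions.

```python
def filter_p0(p, q, beta):
    """Return the filtered P0 using a branch on the sample-level flag ap.

    p = [p0, p1, p2, p3] and q = [q0, q1, q2, q3], index 0 nearest the edge.
    beta is an assumed slice-level threshold; the tap weights below are
    assumptions, not values given in the text.
    """
    p0, p1, p2, _ = p
    q0, q1, _, _ = q

    # Sample-level decision: 1 when the gradient on the p side is small.
    ap = 1 if abs(p2 - p0) < beta else 0

    if ap == 1:
        # Five-tap (strong) filter over p2, p1, p0, q0, q1.
        return (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
    else:
        # Three-tap (weak) filter over p1, p0, q1.
        return (2 * p1 + p0 + q1 + 2) >> 2


flat = filter_p0([100, 100, 100, 100], [108, 108, 108, 108], beta=4)
steep = filter_p0([100, 90, 80, 70], [108, 108, 108, 108], beta=4)
```

Only one of the two arithmetic paths executes for any given edge sample, which is exactly the data-dependent jump that the paragraph identifies as costly on deeply pipelined or multi-compute-unit hardware.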
Since a SIMD processor can solve similar problems in parallel on different sets of local data, it can be characterized as n times faster than a single-compute-unit processor, where n is the number of compute units in the SIMD processor. However, this benefit is available only for sequential types of problems such as FIR, FFT, DCT, and IDCT. The need for SIMD-type processing of non-sequential instruction streams is increasing as image size increases.
However, in such multiple-compute-unit processors a single sequencer broadcasts a single instruction stream that drives each of the compute units on a different local data set, e.g. the pixel gradient at block boundaries. The conduct of each compute unit may therefore differ (jump or not jump, and to where), depending upon the effect of the common instruction on the individualized local data, and the sequencer cannot make a jump/no-jump decision that satisfies all the compute units. As a result, the high speed and efficiency of SIMD processors have not been applied to the family of non-sequential instructions, e.g. conditional (if/else, jump) types of problems.
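The conflict can be illustrated with a small model in which each compute unit ("lane") evaluates the same condition on its own pixels. This is only a sketch of the problem. The lane data, the threshold beta, and the gradient test |p2 - p0| < beta are all illustrative assumptions.

```python
def sample_flag(p, beta):
    """Per-lane decision flag: 1 if the assumed gradient test passes."""
    return 1 if abs(p[2] - p[0]) < beta else 0


def lane_flags(lanes, beta):
    """Evaluate the same broadcast condition on each lane's local data."""
    return [sample_flag(p, beta) for p in lanes]


def single_jump_possible(flags):
    """One shared jump decision works only if every lane agrees."""
    return len(set(flags)) <= 1


# Four lanes, each holding the p-side samples [p0, p1, p2, p3] of a
# different edge position (made-up values).
lanes = [
    [100, 100, 100, 100],
    [100, 90, 80, 70],
    [50, 50, 51, 52],
    [30, 40, 50, 60],
]
flags = lane_flags(lanes, beta=4)
agree = single_jump_possible(flags)
```

In this example two lanes want the strong-filter path and two want the weak-filter path, so no single jump taken by the sequencer is correct for all of them.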