The video compression algorithm MPEG-2 requires the calculation of the variance of macroblocks of pixels of a digitized picture. FIG. 1 is a high level diagram of an MPEG-2 compression system, according to the prior art. A description of the MPEG-2 functional blocks (1-11) is provided below.
Block (1) Frame Ordering. This block includes one or more field memories outputting the fields (pictures) in the order required by the MPEG standard. For example, if the input sequence is I B B P B B P, etc., the ordered output sequence will be I P B B P B B, etc. I is a field and/or a semifield (Intra-picture) containing temporal redundancy. P is a field and/or a semifield (Predicted-picture) from which the temporal redundancy with respect to the preceding I or P (coded/decoded) picture has been eliminated. B is a field and/or a semifield (Bidirectionally predicted-picture) from which the temporal redundancy with respect to the preceding I and successive P (or preceding P and successive P) pictures has been eliminated. In either case, the I and P pictures must be considered already coded/decoded.
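The reordering described for block (1) can be sketched as follows. This is a minimal illustration, not the circuit of FIG. 1: the function name `display_to_coding_order` and the convention that each picture is a string whose first character is its type (I, P or B) are assumptions made for the example.

```python
def display_to_coding_order(frames):
    """Move each reference picture (I or P) ahead of the B pictures
    that precede it in display order."""
    coded = []
    pending_b = []  # B pictures waiting for their following reference picture
    for frame in frames:
        if frame[0] == "B":
            pending_b.append(frame)
        else:
            # Emit the I or P picture first, then the B pictures it closes.
            coded.append(frame)
            coded.extend(pending_b)
            pending_b = []
    # B pictures with no following reference are appended unchanged.
    coded.extend(pending_b)
    return coded


print(" ".join(display_to_coding_order("I B B P B B P".split())))
```

With the input sequence of the example above, the sketch yields the order I P B B P B B.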
Each frame buffer in a 4:2:0 format occupies the following memory space:
Standard PAL
720 × 576 × 8 for luminance (Y) = 3,317,760 bits
360 × 288 × 8 for chrominance (U) = 829,440 bits
360 × 288 × 8 for chrominance (V) = 829,440 bits
Total Y + U + V = 4,976,640 bits

Standard NTSC
720 × 480 × 8 for luminance (Y) = 2,764,800 bits
360 × 240 × 8 for chrominance (U) = 691,200 bits
360 × 240 × 8 for chrominance (V) = 691,200 bits
Total Y + U + V = 4,147,200 bits
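The totals above follow from the 4:2:0 format, in which each chrominance component has half the luminance resolution in both directions. A short sketch of the arithmetic, with the hypothetical helper name `frame_bits_420`:

```python
def frame_bits_420(width, height, bits_per_sample=8):
    """Bits occupied by one 4:2:0 frame: full-size Y plus two quarter-size chroma planes."""
    luma = width * height * bits_per_sample
    chroma = (width // 2) * (height // 2) * bits_per_sample  # one of U or V
    return luma + 2 * chroma


print(frame_bits_420(720, 576))  # PAL
print(frame_bits_420(720, 480))  # NTSC
```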
Block (2) Motion Estimator. This block removes the temporal redundancy from P and B pictures.
Block (3) Discrete Cosine Transform (DCT). This block implements a discrete cosine transform according to the MPEG-2 standard. The I picture and the error pictures P and B are divided into macroblocks of 16 by 16 pixels. Each macroblock is in turn divided into four blocks of 8 by 8 pixels upon which the discrete cosine transform is performed.
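The subdivision and the transform may be illustrated by the sketch below. It uses the direct textbook form of the orthonormal 8 by 8 DCT-II, not an optimized one. The function names and the ordering of the four blocks (top-left, top-right, bottom-left, bottom-right) are assumptions made for the example.

```python
import math


def split_macroblock(mb):
    """Split a 16x16 macroblock (16 rows of 16 samples) into four 8x8 blocks.
    Assumed order: top-left, top-right, bottom-left, bottom-right."""
    return [[row[c:c + 8] for row in mb[r:r + 8]]
            for r in (0, 8) for c in (0, 8)]


def dct_8x8(block):
    """Direct (unoptimized) orthonormal 2-D DCT-II of an 8x8 block."""
    def scale(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            acc = 0.0
            for x in range(8):
                for y in range(8):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / 16)
                            * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * scale(u) * scale(v) * acc
    return out
```

For a flat block, all the energy ends up in the single coefficient at position (0, 0), which is why the subsequent quantization can discard most of the remaining coefficients.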
Block (4) Quantizer (Q). An 8 by 8 block resulting from the DCT processing is divided by a quantizing matrix. This matrix may in general change from one macroblock of a picture to another, reducing the amplitude of the DCT coefficients to a greater or lesser extent. In such a case, the tendency is to lose the information associated with the highest frequencies, which are less visible to human sight. The result is rearranged and sent to the successive block.
Block (5) Variable Length Coding (VLC). The coded words output by the quantizer tend to contain several null coefficients followed by non-null values. The null values preceding the first non-null value are counted and the result provides the first portion of a word, the second portion of which is the non-null coefficient. Some of these "pairs" tend to assume values more probable than others. The more probable values are coded with relatively short words (2-4 bits), while the less probable values are coded with longer words. Statistically, the number of output bits is reduced compared to the case in which such compressing methods are not implemented.
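The formation of the "pairs" described above can be sketched as follows. The sketch only builds the (run of nulls, non-null coefficient) pairs and does not assign the variable length code words themselves. The function name is hypothetical, and trailing null coefficients are simply dropped here, whereas an actual coder signals the end of the block explicitly.

```python
def run_level_pairs(coeffs):
    """Turn a sequence of quantized coefficients into (run, level) pairs,
    where run counts the null values preceding each non-null level."""
    pairs = []
    run = 0
    for value in coeffs:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs


print(run_level_pairs([12, 0, 0, -3, 0, 1, 0, 0]))
```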
Block (6) Multiplexer and Buffer. The data generated by the variable length coder for each macroblock, the quantizing matrices, the motion vectors and other syntactic elements are assembled together to construct the final syntax according to the MPEG-2 standard. The stream produced is stored in a memory buffer whose maximum size is set by the MPEG-2 standard and cannot be exceeded. The quantizing block Q ensures that this limit is respected by making the quantization of the 8 by 8 DCT blocks more or less drastic, depending on how close the buffer is to its filling limit.
Block (7) Inverse Variable Length Coder (I-VLC). The functions of the VLC block specified above are performed in a reverse order by this block.
Block (8) Inverse Quantizer (I-Q). The words output by the I-VLC block are reordered in 8 by 8 blocks, each being multiplied by the same quantizing matrix that was used for its coding.
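The relation between the quantizer of block (4) and the inverse quantizer of block (8) may be illustrated by the simplified sketch below. It assumes plain element-wise division with truncation toward zero and element-wise multiplication. The actual MPEG-2 quantization rules are more elaborate, and the function names are hypothetical.

```python
def quantize(block, qmatrix):
    """Divide each coefficient by the matching matrix entry, truncating toward zero."""
    return [[int(c / q) for c, q in zip(brow, qrow)]
            for brow, qrow in zip(block, qmatrix)]


def dequantize(levels, qmatrix):
    """Multiply each quantized level by the same matrix entry used for coding."""
    return [[level * q for level, q in zip(lrow, qrow)]
            for lrow, qrow in zip(levels, qmatrix)]


# Small 2x2 excerpt for readability; the same code applies to 8x8 blocks.
coeffs = [[100, 7], [-9, 3]]
matrix = [[8, 16], [16, 32]]
levels = quantize(coeffs, matrix)
print(levels, dequantize(levels, matrix))
```

The reconstructed values differ from the original coefficients, and the small coefficients weighted by large matrix entries vanish altogether. This is the loss of information introduced by the quantizer.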
Block (9) Inverse DCT (I-DCT). The DCT function is inverted and applied to each 8 by 8 block produced by the inverse quantizing process.
Block (10) Motion Compensation and Frames Storing (Frames Store). The output of the 8 by 8 I-DCT block may be either of the following: the decoded I field (or semifield), which must be stored for removing the temporal redundancy from successive P and B pictures; or the prediction error P and B field (or semifield), which must be added to the information previously removed during the motion estimation phase. For a P picture, the resulting sum is used during the motion estimation of successive P pictures and B pictures. In both cases, the decoded I and/or P pictures are stored in field memories distinct from those used for frame ordering in block (1).
Block (11) Display Unit. This unit converts the fields from the format 4:2:0 to the format 4:2:2, and generates the interlaced format for the subsequent display of the image.
The functional arrangement of the above-identified blocks according to the architecture implementing the coder is depicted in FIG. 2. The block (1) of frame ordering, the block (10) for storing the reconstructed P and I pictures, and the block (6) for storage of the bit stream produced by the MPEG-2 coding are commonly integrated into dedicated external memories. The integrated circuit accesses these dedicated external memories through a single interface suitably controlled by an integrated memory controller.
The video core block comprises the pre-processing unit which converts the received pictures from the format 4:2:2 to the format 4:2:0 by filtering and subsampling the chrominance. The post-processing block implements the inverse function during the phase of decoding and displaying. The coding blocks (3, 4, 5) and the decoding blocks (7, 8, 9) are integrated within the Encoding & Decoding Core.
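The conversion from 4:2:2 to 4:2:0 halves the vertical resolution of each chrominance component. The sketch below is a minimal illustration that assumes a two-tap averaging filter with rounding. The filter actually used by the pre-processing unit is not specified here, and the function name is hypothetical.

```python
def chroma_422_to_420(chroma):
    """Halve the vertical chroma resolution by averaging each pair of rows
    (assumed two-tap filter, rounded to the nearest integer)."""
    return [[(a + b + 1) // 2 for a, b in zip(chroma[r], chroma[r + 1])]
            for r in range(0, len(chroma) - 1, 2)]


print(chroma_422_to_420([[10, 20], [30, 40], [50, 60], [70, 81]]))
```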
The system controller coordinates the processing that is performed within the device. The system controller also calculates the quantization matrix to be used in the block (4) as a function of the state of filling of the buffer described above for block (6), and of the variance of the source picture macroblock, upstream of the motion estimation processing. Such a system controller may be implemented, as depicted in FIG. 3, by a processor that executes its supervising functions in software. Such a block may integrate a 32 bit core CPU of the Reduced Instruction Set Computer (RISC) type. In addition, the block may integrate the subsystems needed for communicating with the external memory and control buses. A block containing the code to be executed may also be integrated.
In general, the coding algorithm for a frame of M rows by N columns, decomposed into macroblocks of R rows by S columns, requires that M/R and N/S be integers, not necessarily equal to each other. It also requires the calculation of T variances (T being a positive integer) per macroblock. Each variance is calculated over a block of size H (a submultiple of R) by K (a submultiple of S), extracted from the macroblock according to a preestablished scheme in which a pixel may belong to more than one block, subject to the condition: T=2*(R/H)*(S/K).
Overall, for each frame it is necessary to perform (M*N)/(R*S)*T variance calculations, each over H*K pixels. For example, for the MPEG-2 algorithm R=S=16, H=K=T=8, M=576, and N=720 (for PAL format pictures). A frame therefore contains 1,620 macroblocks, and 12,960 distinct variances are required, each based on 64 pixels. Such processing is so burdensome that it would require almost the full power of the CPU. In contrast, if such processing is assigned to a hardware accelerator (the VEE block of FIG. 3), a significant saving in the computing power required of the CPU is possible, at the expense of the limited silicon area needed to implement a specifically optimized accelerator.
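The count of variance calculations per frame can be checked with the short sketch below, which applies the condition T=2*(R/H)*(S/K) and the expression (M*N)/(R*S)*T. The function name is hypothetical.

```python
def variances_per_frame(M, N, R, S, H, K):
    """Return (number of variances per frame, pixels per variance)."""
    T = 2 * (R // H) * (S // K)        # variances per macroblock
    macroblocks = (M // R) * (N // S)  # macroblocks per frame
    return macroblocks * T, H * K


print(variances_per_frame(576, 720, 16, 16, 8, 8))  # PAL
```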
The Variance Concept. The variances to be calculated are derived according to the ensuing description. As described above, the luminance component of a digital picture, represented as a matrix of M rows and N columns of pixels divided in R by S macroblocks, which in turn are divided in T subsets or blocks, each of H by K size, is assumed to be known. Each pixel may be coded with 8 bits. For each subset, the variance is defined as follows:

$$\mathrm{Var} = \frac{1}{H K} \sum_{i=0}^{H-1} \sum_{j=0}^{K-1} \left( p_{i,j} - \mu \right)^2, \qquad \mu = \frac{1}{H K} \sum_{i=0}^{H-1} \sum_{j=0}^{K-1} p_{i,j}$$

where p_{i,j} is the pixel at row i and column j of the subset.
For PAL format images, M and N are 576 and 720, respectively, and R=S=16, H=K=T=8 according to the standard MPEG-2. Each frame is therefore divided in macroblocks of 16 by 16 pixels, starting from the upper left corner, as depicted in FIG. 4. The macroblocks are further subdivided in four adjacent 8 by 8 blocks. For each macroblock, eight variances must be calculated. Each is the variance of an 8 by 8 subset (block) derived from the macroblock in the manner that will be described later. The above mathematical formula for calculating the variance reduces to the following simplified expression, which is applied to each subset. The pixels are input from left to right and from the upper row or line to the line below it, although this order is irrelevant to the calculation. Variances are calculated as follows:

$$\mathrm{Var} = \frac{1}{64} \sum_{n=0}^{63} p_n^2 - \left( \frac{1}{64} \sum_{n=0}^{63} p_n \right)^2$$

where p_n is the n-th of the 64 pixels of the subset.
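The variance of a block can be obtained in a single pass as the mean of the squared pixels minus the square of the mean pixel, which is algebraically equal to the mean squared deviation from the mean. The sketch below computes both forms so that they can be compared. The function names are hypothetical, and floating point arithmetic is assumed in place of whatever fixed-point format a hardware implementation would adopt.

```python
def block_variance(pixels):
    """Single-pass form: mean of squares minus square of the mean.
    Only two running totals are needed, so the pixel order is irrelevant."""
    n = len(pixels)
    total = sum(pixels)
    total_sq = sum(p * p for p in pixels)
    return total_sq / n - (total / n) ** 2


def block_variance_two_pass(pixels):
    """Definition form: mean squared deviation from the mean."""
    n = len(pixels)
    mean = sum(pixels) / n
    return sum((p - mean) ** 2 for p in pixels) / n


block = [0] * 32 + [2] * 32  # 64 pixels of an 8 by 8 subset
print(block_variance(block), block_variance_two_pass(block))
```

The single-pass form lends itself to a streaming implementation because it needs no second pass over the pixels once the mean is known.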
Referring to FIG. 5, the eight subsets (b0, b1, b2, b3, b4, b5, b6, b7) are obtained in the following manner: the first four subsets are the four blocks forming the macroblock; the other four subsets are obtained by separating the even numbered lines or rows of each macroblock from the odd numbered ones. These are the odd field and even field components of each macroblock. Once the variances of the blocks have been calculated, the so-called activity of the macroblock, defined as max(v0, v1, v2, v3, v4, v5, v6, v7)+1, is calculated, wherein v0, . . . , v7 are the eight calculated variances. The activity of each macroblock contributes to the calculation of the quantizing matrix used in the quantizer Q of FIG. 1.
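The extraction of the eight subsets and the calculation of the activity may be sketched as follows. The exact numbering of the subsets in FIG. 5 is not reproduced here: the order used below (four adjacent blocks first, then the left and right halves of the even rows, then those of the odd rows) is an assumption, as are the function names.

```python
def macroblock_subsets(mb):
    """Return eight 64-pixel lists from a 16x16 macroblock (16 rows of 16 samples).
    Assumed order: four adjacent 8x8 blocks, then even-row and odd-row blocks."""
    def block(rows, left):
        return [p for r in rows for p in mb[r][left:left + 8]]

    frame_rows = [range(0, 8), range(8, 16)]        # upper and lower halves
    field_rows = [range(0, 16, 2), range(1, 16, 2)]  # even and odd rows
    return [block(rows, left)
            for rows in frame_rows + field_rows
            for left in (0, 8)]


def variance(pixels):
    n = len(pixels)
    return sum(p * p for p in pixels) / n - (sum(pixels) / n) ** 2


def activity(mb):
    """Activity as defined above: the maximum of the eight variances, plus one."""
    return max(variance(b) for b in macroblock_subsets(mb)) + 1


# Macroblock whose even rows are 0 and odd rows are 2.
mb = [[0] * 16 if r % 2 == 0 else [2] * 16 for r in range(16)]
print([variance(b) for b in macroblock_subsets(mb)], activity(mb))
```

In this example the four adjacent blocks mix both row values and have a non-null variance, whereas the four blocks built from rows of a single parity are flat. Each pixel belongs to exactly two subsets, consistent with T=2*(R/H)*(S/K)=8.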
Standard Implementation of the Variance Estimator. The known implementation of the variance estimator, depicted in FIG. 6, includes eight distinct calculation blocks. Each block calculates a single variance, in parallel with the others. Each of the eight parallel branches includes a filter/demultiplexer that catches only the pixels belonging to the respective block whose variance must be calculated, in addition to the variance calculating circuit itself. This architecture has the drawback of replicating, for each variance to be calculated, parts that are functionally identical to each other. For example, the variance calculator always implements the same computation formula. Also, the filter/demultiplexers all perform the same function, though with different parameters (each catching a selected subset of pixels).
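A software model of this eight-branch arrangement is sketched below to make the replication visible. It is not the circuit of FIG. 6. Each branch is modeled as a membership test (the filter/demultiplexer) followed by the same pair of accumulators, and the assignment of subsets to branches is an assumption made for the example.

```python
def make_filters():
    """Eight predicates (row, col) -> bool, one per branch.
    All have the same structure and differ only in their parameters."""
    filters = []
    for top in (0, 8):          # four adjacent 8x8 blocks
        for left in (0, 8):
            filters.append(lambda r, c, t=top, l=left:
                           t <= r < t + 8 and l <= c < l + 8)
    for parity in (0, 1):       # even-row and odd-row blocks
        for left in (0, 8):
            filters.append(lambda r, c, q=parity, l=left:
                           r % 2 == q and l <= c < l + 8)
    return filters


def parallel_variances(mb):
    """Feed the 16x16 macroblock in raster order to eight identical branches."""
    filters = make_filters()
    sums = [0] * 8
    squares = [0] * 8
    for r in range(16):
        for c in range(16):
            p = mb[r][c]
            for k, accept in enumerate(filters):
                if accept(r, c):         # filter/demultiplexer of branch k
                    sums[k] += p         # identical accumulation in every branch
                    squares[k] += p * p
    return [sq / 64 - (s / 64) ** 2 for s, sq in zip(sums, squares)]
```

Every branch repeats the same accumulation and the same final formula. Only the membership test changes from one branch to the next, which is the redundancy noted above.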