The present invention relates to the field of video processing, and more particularly, to an algorithm and an architecture of a motion estimator for implementing video coders compliant with the MPEG-2 standard.
The concept of motion estimation is that a set of pixels of a field of a picture may appear in the subsequent picture at a position obtained by translating the position it had in the preceding one. These displacements of objects may expose to the video camera parts that were not visible before, and may be accompanied by changes of shape, e.g., zooming. The family of algorithms that identify and associate these portions of images is generally referred to as motion estimation. Such an association allows the difference image to be calculated by removing the redundant temporal information, which makes the subsequent process of compression by transformation, quantization and entropy coding more effective.
A typical example of motion estimation is found in the MPEG-2 standard. A typical block diagram of an MPEG-2 video coder is depicted in FIG. 1. Such a system is made up of the following functional blocks:
1) Field Ordinator. This block is composed of one or several field memories outputting the fields in the coding order required by the MPEG standard. For example, if the input sequence is I B B P B B P etc., the output order will be I P B B P B B . . . I is the Intra-coded picture, and is a field and/or a semi-field that still contains its temporal redundancy. P is the Predicted-picture, and is a field and/or semi-field from which the temporal redundancy with respect to the preceding I or P picture (previously coded/decoded) has been removed. B is the Bidirectionally predicted-picture, and is a field and/or a semi-field whose temporal redundancy with respect to the preceding I and subsequent P (or preceding P and subsequent P) pictures has been removed. In both cases, the I and P pictures must be considered as already coded/decoded.
Each frame buffer in the format 4:2:0 occupies the following memory space:
2) Motion Estimator. This block removes the temporal redundancy from the P and B pictures.
3) DCT. This block implements the discrete cosine transform according to the MPEG-2 standard. The I picture and the error pictures P and B are divided into 8*8 blocks of the Y, U, V components, on which the DCT transform is performed.
4) Quantizer (Q). An 8*8 block resulting from the DCT transform is then divided by a quantizing matrix to reduce the magnitude of the DCT coefficients. The information associated with the highest frequencies, which are less visible to human sight, tends to be removed. The result is reordered and sent to the successive block.
5) Variable Length Coding (VLC). The coded words output from the quantizer tend to contain a large number of null coefficients, followed by non-null values. The null values preceding the first non-null value are counted. The count figure comprises the first portion of a coded word, the second portion of which represents the non-null coefficient. Some of these paired values tend to be more probable than others. The most probable ones are coded with relatively short words (composed of 2, 3 or 4 bits), while the least probable are coded with longer words. Statistically, the number of output bits is lower than it would be if such a method were not implemented.
6) Multiplexer and Buffer. Data generated by the variable length coder, the quantizing matrices, the motion vectors, and other syntactic elements are assembled for constructing the final syntax considered by the MPEG-2 standard. The resulting bitstream is stored in a memory buffer, the limit size of which is defined by the MPEG-2 standard and cannot be overfilled. The quantizer block Q supports the buffer limit by making the division of the DCT 8*8 blocks dependent upon the filling limit of a memory buffer of the system. The quantizer block Q also supports the buffer limit by making the division of the DCT 8*8 blocks dependent upon the energy of the 8*8 source block taken upstream of the motion estimation and DCT transformation process.
7) Inverse Variable Length Coding (I-VLC). The variable length coding functions specified above are executed in an inverse order.
8) Inverse Quantization (IQ). The words output by the I-VLC block are reordered in the 8*8 block structure, which is multiplied by the same quantizing matrix that was used for its preceding coding.
9) Inverse DCT (I-DCT). The DCT transform function is inverted and applied to the 8*8 block output by the inverse quantization process. A change is made from the domain of spatial frequencies to the pixel domain.
10) Motion Compensation and Storage. At the output of the I-DCT block, one of the following two kinds of picture may be present. The first is a decoded I picture or semipicture, which must be stored in a respective memory buffer for removing the temporal redundancy with respect to subsequent P and B pictures. The second is a decoded prediction error picture or semipicture P or B, which must be summed to the information removed previously during the motion estimation phase. In the case of a P picture, the resulting sum is stored in a dedicated memory buffer and is used during the motion estimation process for the successive P pictures and B pictures. These field memories are generally distinct from the field memories used for re-arranging the blocks.
11) Display Unit. This unit converts the pictures from the format 4:2:0 to the format 4:2:2, and generates the interlaced format for displaying the images.
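As an illustration of blocks 1) and 5) above, the following minimal Python sketch reorders pictures from the input order to the coding order and forms the (zero-run, value) pairs on which the variable length coder operates. The function names, the picture labels such as "B1", and the handling of trailing B pictures and trailing zeros are assumptions made for this example; they are not taken from the MPEG-2 syntax.

```python
def to_coding_order(input_order):
    """Reorder pictures so each B picture follows both of its reference pictures.

    B pictures are held back until the next I or P picture has been emitted.
    """
    coded, pending_b = [], []
    for picture in input_order:
        if picture.startswith("B"):
            pending_b.append(picture)
        else:
            coded.append(picture)      # I or P picture goes out first
            coded.extend(pending_b)    # then the B pictures that were waiting
            pending_b = []
    coded.extend(pending_b)            # assumption: flush B pictures left at the end
    return coded


def run_level_pairs(coefficients):
    """Pair each non-null coefficient with the count of null values preceding it."""
    pairs, run = [], 0
    for value in coefficients:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs                       # assumption: trailing null values are dropped


coding_order = to_coding_order(["I0", "B1", "B2", "P3", "B4", "B5", "P6"])
pairs = run_level_pairs([5, 0, 0, 3, 0, -1, 0, 0, 0])
```

With the input sequence I B B P B B P, the first function yields I P B B P B B, as in the example given for block 1). A real coder would then map each pair to a code word of a length related to its probability.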
An architecture implementing the above-described coder is shown in FIG. 2a. A distinctive feature is that the field rearrangement block (1), the block (10) for storing the already reconstructed P and I pictures, and the block (6) for storing the bitstream produced by the MPEG-2 coding are integrated in memory devices external to the integrated circuit comprising the core of the coder. The decoder accesses the external memory device through a single interface, suitably managed by an integrated controller.
Furthermore, the preprocessing block converts the received images from the format 4:2:2 to the format 4:2:0 by filtering and sub-sampling the chrominance. The post-processing block implements a reverse function during the decoding and displaying phase of the images. The coding phase also employs the decoding for generating the reference pictures to make operative the motion estimation. For example, the first I picture is coded, decoded, stored (as described in block description 10), and is used for calculating the prediction error that will be used to code the subsequent P and B pictures. The play-back phase of the data stream previously generated by the coding process uses only the inverse functional blocks (I-VLC, I-Q, I-DCT, etc.), never the direct functional blocks. From this perspective, the coding and the decoding implemented for the subsequent displaying of the images are nonconcurrent processes within the integrated architecture.
A description of the non-exhaustive search motion estimator is provided in the following paragraphs by considering two fields of an image. The following description also applies to semifields of the image. The two fields are Q1 at the instant t, and the subsequent field Q2 at the instant t+T, where T is the field period (1/25 sec. for the PAL standard, 1/30 sec. for the NTSC standard). Q1 and Q2 comprise luminance and chrominance components. Assume that motion estimation is applied only to the most information rich component, such as the luminance. The luminance is represented as a matrix of N lines and M columns. Q1 and Q2 are then divided into portions called macroblocks, each macroblock having R lines and S columns.
The results of the divisions N/R and M/S must be two integer numbers, not necessarily equal to each other. MB2(i,j) is a macroblock defined as the reference macroblock belonging to the field Q2, whose first pixel, in its top left part, is at the intersection between the i-th line and the j-th column. The pair (i,j) is characterized by the fact that i and j are integer multiples of R and S, respectively.
FIG. 2b shows how the reference macroblock is positioned in the Q2 picture. The horizontal dashed-lines with arrows indicate the scanning order for identifying the various macroblocks on Q2. MB2(i,j) is projected on the field Q1 to obtain MB1(i,j). A search window is defined on Q1 having its center at (i,j), and comprises macroblocks MBk(e,f), where k is the macroblock index. The k-th macroblock is identified by the coordinates (e,f), such that:
−p ≤ (e − i) ≤ +p and −q ≤ (f − j) ≤ +q
with e and f being integer numbers.
Each of these macroblocks is said to be a predictor of MB2(i,j). For example, if p=8 and q=16, the number of predictors is (2p+1)*(2q+1)=561. For each predictor, the norm L1 with respect to the reference macroblock is calculated. Such a norm is equal to the sum of the absolute values of the differences between co-located pixels belonging to MB2(i,j) and to MBk(e,f). R*S values contribute to each sum, the result of which is called distortion. Therefore, (2p+1)*(2q+1) values of distortion are obtained, from among which the minimum value is chosen. This identifies a prevailing position (e*,f*).
The motion estimation process is not yet terminated, because in the vicinity of the prevailing position a grid of pixels is created by interpolating those that form Q1. For example, Q1 comprises:
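The exhaustive search described above can be sketched in Python as follows. This is a minimal illustration rather than the architecture of FIG. 3a: fields are plain lists of lists, the function names are chosen for the example, and predictors that would fall partly outside Q1 are simply skipped, which is an assumption the text does not state.

```python
def block(field, top, left, R, S):
    """Return the R-line by S-column macroblock whose first pixel is (top, left)."""
    return [row[left:left + S] for row in field[top:top + R]]


def distortion_l1(a, b):
    """Norm L1: sum of the absolute differences between co-located pixels."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def full_search(q1, q2, i, j, R, S, p, q):
    """Test every predictor with -p <= e-i <= +p and -q <= f-j <= +q.

    Returns (minimum distortion, e*, f*), or None if no predictor fits in Q1.
    """
    reference = block(q2, i, j, R, S)
    best = None
    for e in range(i - p, i + p + 1):
        for f in range(j - q, j + q + 1):
            # Assumption: skip predictors not entirely inside the field Q1.
            if e < 0 or f < 0 or e + R > len(q1) or f + S > len(q1[0]):
                continue
            d = distortion_l1(reference, block(q1, e, f, R, S))
            if best is None or d < best[0]:
                best = (d, e, f)
    return best


# Synthetic 8x8 fields: Q2 is Q1 moved one line down and one column left.
q1 = [[3 * r * r + 5 * c * c + r * c for c in range(8)] for r in range(8)]
q2 = [[q1[r - 1][c + 1] if r >= 1 and c + 1 < 8 else 0 for c in range(8)]
      for r in range(8)]
d_min, e_star, f_star = full_search(q1, q2, i=4, j=4, R=2, S=2, p=2, q=2)
motion_vector = (e_star - 4, f_star - 4)
```

In this synthetic case the reference macroblock at (4,4) is matched exactly by the predictor at (3,5), so the minimum distortion is zero and the difference between the two positions is (−1,+1).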
. . .  . . .  . . .  . . .  . . .  . . .
p31    p32    p33    p34    p35    . . .
p41    p42    p43    p44    p45    . . .
. . .  . . .  . . .  . . .  . . .  . . .
After interpolation, the following is obtained:
The above noted algorithm is applied in the vicinity of the prevailing position by assuming, for example, p=q=1. In such a case, the number of predictors is equal to 8, and they are formed by pixels that are interpolated by starting from the pixels of Q1. The predictor with minimum distortion with respect to MB2(i,j) is then identified. The predictor most similar to MB2(i,j) is identified by the coordinates of the prevailing predictor through the above noted two steps of the algorithm.
The first step tests only whole-pixel positions, while the second step tests the sub-pixel positions. The vector comprising difference components between the position of the prevailing predictor and of MB2(i,j) is defined as the motion vector. This describes how MB2(i,j) derives from a translation of a macroblock similar to it in the preceding field. Other measurements may be used for establishing if two macroblocks are similar to each other. For example, the sum of the quadratic values of the differences (norm L2) is one such measurement. Furthermore, the sub-pixel search window may be wider than that specified in the above example. All this further increases the complexity of the motion estimator.
In the example described above, the number of executed operations per pixel is equal to 561+8=569, wherein each operation includes a difference between two pixels, plus an absolute value identification, plus an accumulation of the calculated result with that of the preceding pair of pixels of the same macroblock. This means that, for identifying the optimum predictor, 569*R*S parallel operators are required at the pixel frequency of 13.5 MHz. By assuming R=S=16, as defined by the MPEG-2 standard, the number of operators required is 569*16*16=145,664.
Each operator may function on a time division basis on pixels belonging to different predictors. Therefore, if each operator operates at a frequency of 4*13.5=54 MHz, the number of operators required would be 145,664/4=36,416. A high level block diagram of a prior art motion estimator based on an exhaustive search technique is shown in FIG. 3a. The DEMUX block conveys the data provided by the field memory to the operators, and the MIN block operates on the set of distortion values.
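The second, sub-pixel step can be sketched as follows. The sketch assumes half-pixel positions obtained by rounded bilinear averaging of the neighbouring pixels of Q1, which is the interpolation MPEG-2 uses for half-pixel prediction. The function names, the half-pixel coordinate convention (coordinates multiplied by two) and the choice of keeping the whole-pixel position when no neighbour has a lower distortion are assumptions of this example. The tested positions are assumed to lie inside the field.

```python
def half_pel_sample(field, y2, x2):
    """Pixel at position (y2/2, x2/2), by rounded bilinear averaging."""
    y0, x0 = y2 // 2, x2 // 2
    y1, x1 = y0 + (y2 % 2), x0 + (x2 % 2)   # equal to y0, x0 on whole positions
    total = field[y0][x0] + field[y0][x1] + field[y1][x0] + field[y1][x1]
    return (total + 2) // 4


def distortion_half_pel(field, reference, y2, x2):
    """Norm L1 between the reference macroblock and the predictor at (y2/2, x2/2)."""
    return sum(abs(reference[r][c] - half_pel_sample(field, y2 + 2 * r, x2 + 2 * c))
               for r in range(len(reference)) for c in range(len(reference[0])))


def refine_half_pel(field, reference, e_star, f_star):
    """Test the 8 half-pixel neighbours of the prevailing position (e*, f*).

    Returns (distortion, y2, x2) with the position in half-pixel units.
    """
    cy, cx = 2 * e_star, 2 * f_star
    best = (distortion_half_pel(field, reference, cy, cx), cy, cx)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            d = distortion_half_pel(field, reference, cy + dy, cx + dx)
            if d < best[0]:
                best = (d, cy + dy, cx + dx)
    return best


# Synthetic field: pixel value = 100*line + 10*column.
field_q1 = [[100 * r + 10 * c for c in range(6)] for r in range(4)]
# Reference macroblock lying half a pixel to the right of position (1, 1).
reference_mb = [[115, 125], [215, 225]]
refined = refine_half_pel(field_q1, reference_mb, e_star=1, f_star=1)
```

Here the whole-pixel position (1,1) has a distortion of 20, while the half-pixel position (1, 1.5), written (2,3) in half-pixel units, matches the reference exactly.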
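The arithmetic of the preceding two paragraphs can be checked with a few lines of Python. The function and parameter names are chosen for this example only; the figures are those given in the text.

```python
def exhaustive_search_cost(p, q, sub_pixel_predictors, R, S, time_division):
    """Return (whole-pixel predictors, operations per pixel,
    parallel operators at pixel rate, operators with time division)."""
    whole_pixel_predictors = (2 * p + 1) * (2 * q + 1)
    operations_per_pixel = whole_pixel_predictors + sub_pixel_predictors
    operators = operations_per_pixel * R * S
    return (whole_pixel_predictors, operations_per_pixel,
            operators, operators // time_division)


# p=8, q=16, 8 sub-pixel predictors, 16x16 macroblocks,
# operators clocked at 4 times the 13.5 MHz pixel frequency.
cost = exhaustive_search_cost(p=8, q=16, sub_pixel_predictors=8,
                              R=16, S=16, time_division=4)
```

The result reproduces the values 561, 569, 145,664 and 36,416 quoted above.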
The object of the present invention is to reduce the complexity of a motion estimator, as used in an MPEG-2 video coder. As an illustration of an efficient implementation of the method and architecture of the motion estimator of the present invention, a coder for the MPEG-2 standard is considered.
Using the motion estimator of the invention, it is possible, for example, to use only 6.5 operations per pixel to find the best predictor of the portion of picture currently being subjected to motion estimation. This is applicable to an SPML compressed video sequence of either the PAL or NTSC type. In contrast, the best result that may be obtained with a motion estimator of the prior art would imply the execution of 569 operations per pixel. This is in addition to the drawback of requiring a more complex architecture.
The method of the invention implies a slight loss of quality of the reconstructed video images for the same compression ratio. However, such a degradation of the images is practically undetectable to human sight because the artifacts are distributed in regions of the images having a substantial motion content, the details of which practically pass unnoticed by the viewer.