The invention relates to the field of electronics, and, more particularly, to a video encoder.
Motion estimation is based on the observation that a set of pixels of a certain field of a picture may be found in a position of the successive picture that is obtained by translating the preceding picture. The transpositions of objects may, however, expose to the video camera parts of the picture that were not visible before, as well as changes of their shape, e.g., zooming, etc.
The family of algorithms suitable to identify and associate these portions of images is generally referred to as motion estimation. Such an association permits calculation of a portion of the difference image by removing the redundant temporal information, which makes the subsequent compression by DCT, quantization and entropy coding more effective.
The MPEG-2 standard provides an example of the method as discussed above. A typical block diagram of a video MPEG-2 coder is depicted in FIG. 1. Such a system is made up of the following functional blocks.
1) Field ordinator. This block is formed of one or several field memories outputting the fields in the coding order required by the MPEG standard. For example, if the input sequence is I B B P B B P etc., the output order will be I P B B P B B . . . , where the I, P and B fields are described below.
I (Intra coded picture) is a field and/or a semifield containing temporal redundance.
P (Predicted picture) is a field and/or semifield from which the temporal redundance with respect to the preceding I or P fields (previously co/decoded) has been removed.
B (Bidirectionally predicted picture) is a field and/or a semifield whose temporal redundance with respect to the preceding I and subsequent P fields (or preceding P and successive P fields) has been removed. In both cases the I and P pictures must be considered as already co/decoded.
Each frame buffer in the format 4:2:0 occupies the following memory space:
2) Motion estimator. This block removes the temporal redundance from the P and B pictures.
3) DCT. This block implements the discrete-cosine transform (DCT) according to the MPEG-2 standard. The I picture and the error pictures P and B are divided in 8*8 blocks of pixels Y, U, V on which the DCT transform is performed.
4) Quantizer Q. An 8*8 block resulting from the DCT transform is then divided by a quantizing matrix to reduce the magnitude of the DCT coefficients. The information associated with the highest frequencies less visible to human sight tends to be removed. The result is reordered and sent to the successive block.
5) Variable Length Coding (VLC). The codification words output from the quantizer tend to contain a large number of null coefficients followed by nonnull values. The null values preceding the first nonnull value are counted, and the count figure forms the first portion of a codification word, the second portion of which represents the nonnull coefficient.
These paired values tend to assume values more probable than others. The most probable ones are coded with relatively short words composed of 2, 3 or 4 bits while the least probable are coded with longer words. Statistically, the number of output bits is less than in the case that such methods are not implemented.
6) Multiplexer and Buffer. Data generated by the variable length coder, the quantizing matrices, the motion vectors and other syntactic elements are assembled to construct the final syntax defined by the MPEG-2 standard. The resulting bit stream is stored in a memory buffer, the limit size of which is defined by the MPEG-2 standard and cannot be overfilled. The quantizer block Q enforces the limit by adjusting the division of the DCT 8*8 blocks depending on the size of such a memory buffer, and on the energy of the 8*8 source block taken upstream of the motion estimation and DCT transform process.
7) Inverse Variable Length Coding (I-VLC). The variable length coding functions specified above are executed in an inverse order.
8) Inverse Quantization (IQ). The words output by the I-VLC block are reordered in the 8*8 block structure, which is multiplied by the same quantizing matrix used for its preceding coding.
9) Inverse DCT (I-DCT). The DCT transform function is inverted and applied to the 8*8 block output by the inverse quantization process. This permits changing from spatial frequency domain to the pixel domain.
10) Motion Compensation and Storage. At the output of the I-DCT block, one of the following may be present: a decoded I picture (or semipicture), which is to be stored in a respective memory buffer for removing the temporal redundance with respect thereto from subsequent P and B pictures; or a decoded prediction error picture (semipicture) P or B, which must be summed to the information removed previously during the motion estimation phase. In case of a P picture, such a resulting sum, stored in a dedicated memory buffer, is used during the motion estimation process for the successive P pictures and B pictures. These field memories are generally distinct from the field memories that are used for re-arranging the blocks.
11) Display Unit. This unit converts the pictures from the format 4:2:0 to the format 4:2:2, and generates the interlaced format for displaying the images. The functional blocks depicted in FIG. 1 are provided in an architecture implementing the above described coder, as shown in FIG. 2a. The field ordinator block, the motion compensation and storage block for storing the already reconstructed P and I pictures, and the multiplexer and buffer blocks for storing the bitstream produced by the MPEG-2 coding are integrated in memory devices external to the integrated circuit forming the core of the coder. The decoder accesses the external memory (DRAM) through a single interface managed by an integrated controller.
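The reordering performed by the field ordinator (block 1 above) can be illustrated with a minimal Python sketch. It assumes a simple policy in which each B picture is held back until the next I or P reference has been emitted; the function name coding_order is hypothetical and is not taken from the MPEG-2 standard.

```python
def coding_order(display_order):
    """Reorder picture types from input (display) order to coding order.

    Assumed policy: a B picture needs its future reference, so it is
    held until the next I or P picture has been emitted.
    """
    out = []
    pending_b = []  # B pictures waiting for their future reference
    for idx, ptype in enumerate(display_order):
        if ptype == "B":
            pending_b.append((idx, ptype))
        else:
            # An I or P picture is a reference: emit it first, then
            # the B pictures that were waiting for it.
            out.append((idx, ptype))
            out.extend(pending_b)
            pending_b = []
    # Trailing B pictures without a later reference are emitted last
    # in this sketch.
    out.extend(pending_b)
    return out


seq = ["I", "B", "B", "P", "B", "B", "P"]
print(" ".join(ptype for _, ptype in coding_order(seq)))  # I P B B P B B
```

For the input sequence I B B P B B P given in the description of block 1, the sketch yields I P B B P B B, matching the output order stated there.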
Moreover, the preprocessing block converts the received images from the format 4:2:2 to the format 4:2:0 by filtering and subsampling the chrominance. The post-processing block implements a reverse function during the decoding and displaying phase of the images. The coding phase also uses a decoding step for generating the reference pictures to make the motion estimation operative. For example, the first I picture is coded, then it is decoded and stored as described above. The first I picture is used for calculating the prediction error that will be used to code the subsequent P and B pictures.
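As an illustration of the chrominance subsampling performed by the preprocessing block, the following Python sketch halves the vertical resolution of one chrominance plane, which is the difference between the 4:2:2 and 4:2:0 formats. The two-tap vertical average used as the filter is an assumption made for brevity, as is the function name chroma_422_to_420; the text does not specify the filter, and a practical coder for interlaced material would typically filter each field separately.

```python
def chroma_422_to_420(plane):
    """Halve the vertical resolution of one chroma plane (list of rows).

    Assumed filter: rounded average of each pair of adjacent lines.
    """
    if len(plane) % 2:
        raise ValueError("expected an even number of chroma lines")
    out = []
    for top, bottom in zip(plane[0::2], plane[1::2]):
        # Two-tap vertical average with rounding (assumption).
        out.append([(a + b + 1) // 2 for a, b in zip(top, bottom)])
    return out


cb = [[100, 102], [104, 106], [50, 60], [70, 80]]
print(chroma_422_to_420(cb))  # [[102, 104], [60, 70]]
```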
The playback phase of the data stream previously generated by the coding process uses only the inverse functional blocks (I-VLC, I-Q, I-DCT, etc.) and not the direct functional blocks. In other words, the coding and the decoding implemented for the subsequent displaying of the images are nonconcurrent processes within the integrated architecture. The purpose of the motion estimation algorithm is to predict images/semifields in a sequence. These predictions are obtained as a composition of whole pixel blocks referred to as predictors, which originate from preceding or future images/semifields.
The MPEG-2 standard includes three types of pictures/semifields:
I pictures (Intra coded picture) are pictures that are not submitted to motion estimation. They contain temporal redundancy and are fundamental for the picture coding of the other two types.
P pictures (predicted picture) are the pictures whose temporal redundancy has been removed through the motion estimation with respect to the I or P pictures preceding them.
B pictures (Bidirectionally predicted picture) are the pictures whose temporal redundancy has been removed through the motion estimation with respect to the I and P pictures preceding them and/or those that are to follow.
According to an exhaustive-search motion estimator, the process is as follows. P field or semifield: Two fields of a picture are considered: Q1 at the instant t and the subsequent field Q2 at the instant t+(kp)*T. The same applies to the semifields. The term kp is a constant dependent on the number of B fields existing between the preceding I and the subsequent P (or between two P), and T is the field period of 1/25 sec. for the PAL standard and 1/30 sec. for the NTSC standard. Q1 and Q2 are formed by luminance and chrominance components. Assume that the motion estimation is applied only to the most energetic and, therefore, information-richest component, namely the luminance. The luminance can be represented as a matrix of N lines and M columns. Q1 and Q2 are divided into portions called macroblocks, each having R lines and S columns.
The results of the divisions N/R and M/S must be two integer numbers, but not necessarily equal to each other. MB2(i,j) is a macroblock defined as the reference macroblock belonging to the field Q2 whose first pixel in the top left part thereof is at the intersection between the i-th line and the j-th column. The pair (i,j) is characterized by the fact that i and j are integer multiples of R and S, respectively.
FIG. 2b shows how the reference macroblock is positioned on the Q2 picture, while the horizontal dashed arrows indicate the scanning order used for identifying the various macroblocks on Q2. MB2(i,j) may be projected on the Q1 field to obtain MB1(i,j). A search window is defined on Q1 having its center at (i,j) and includes the macroblocks MBk[e,f], where k is the macroblock index. The k-th macroblock is identified by the coordinates (e,f), such that −p ≤ (e−i) ≤ +p, and −q ≤ (f−j) ≤ +q. The indexes e and f are integer numbers.
Each macroblock is a possible predictor of MB2(i,j). The various motion estimation algorithms differ from each other depending on the way the predictors are searched and selected in the search window. The predictor that minimizes a certain cost function is chosen among the whole set of possible predictors. Such a function may vary according to the selected motion estimation algorithm. For example, in the case of the MPEG-2 standard, the predictor that minimizes the L1 norm with respect to the reference macroblock is searched. This norm is equal to the sum of the absolute values of the differences between corresponding pixels belonging to MB2(i,j) and to MBk(e,f), respectively. R*S values contribute to each sum, the resulting value of which is called distortion.
The predictor most similar to MB2(i,j) is now identified by the coordinates of the prevailing predictor at the end of the motion estimation step. The vector formed by the components resulting from the difference between the position of the prevailing predictor and MB2(i,j) is referred to as the motion vector. This describes how MB2(i,j) is derived from a translation of a similar macroblock in the preceding field.
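The exhaustive search described above can be sketched in Python as follows. The sketch uses the symbols of the text (Q1, Q2, i, j, R, S, p, q) and the L1 norm as the distortion. The function names sad and full_search, the policy of skipping candidates that fall outside the field, and the tiny test fields are assumptions made for illustration only.

```python
def sad(ref, cand, i, j, e, f, R, S):
    """L1 norm between the R x S block of ref at (i, j) and of cand at (e, f)."""
    return sum(abs(ref[i + r][j + s] - cand[e + r][f + s])
               for r in range(R) for s in range(S))


def full_search(q1, q2, i, j, R, S, p, q):
    """Return (motion_vector, distortion) for the reference macroblock MB2(i, j).

    Every candidate (e, f) with -p <= e-i <= +p and -q <= f-j <= +q
    is tested on Q1; the one with the lowest distortion prevails.
    """
    n, m = len(q1), len(q1[0])
    best = None
    for e in range(i - p, i + p + 1):
        for f in range(j - q, j + q + 1):
            if e < 0 or f < 0 or e + R > n or f + S > m:
                continue  # outside the field: skipped (assumed policy)
            d = sad(q2, q1, i, j, e, f, R, S)
            if best is None or d < best[1]:
                # Motion vector = predictor position minus MB2 position.
                best = ((e - i, f - j), d)
    return best


def field_with_patch(top, left, size=6):
    """Hypothetical test field: zeros with one distinctive 2 x 2 patch."""
    field = [[0] * size for _ in range(size)]
    field[top][left], field[top][left + 1] = 9, 8
    field[top + 1][left], field[top + 1][left + 1] = 7, 6
    return field


q1 = field_with_patch(1, 2)  # object position in the preceding field
q2 = field_with_patch(2, 2)  # same object, one line lower, in Q2
print(full_search(q1, q2, i=2, j=2, R=2, S=2, p=2, q=2))  # ((-1, 0), 0)
```

The cost of this approach grows with the window area, since (2p+1)*(2q+1) candidates are tested per macroblock and each test sums R*S absolute differences.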
B field or semifield. Three fields of a picture are considered: QPn−1 at the instant t, QBkB at the instant t+(kB)*T and QPn at the instant t+(kp)*T. The variables kp and kB are dependent on the preselected number of B fields. The same may be applied also to semifields. T is the field period, which is 1/25 sec. for the PAL standard and 1/30 sec. for the NTSC standard. QPn−1, QBkB and QPn are formed by luminance and chrominance components. Assume that the motion estimation is applied only to the most energetic and, therefore, information-richest component, namely the luminance, which is represented as a matrix of N lines and M columns. QPn−1, QBkB and QPn are divided into portions called macroblocks, each having R lines and S columns.
The results of the divisions N/R and M/S must be two integer numbers, but not necessarily equal. Assume MB2(i,j) is a macroblock defined as the reference macroblock belonging to the field QBkB. The first pixel of this macroblock is in the top left part thereof at the intersection between the i-th line and the j-th column. The pair (i,j) is characterized by the fact that i and j are integer multiples of R and S, respectively.
Assume that MB1(i,j) is obtained by projecting MB2(i,j) on the QPn−1 field, and that MB3(i,j) is obtained by projecting MB2(i,j) on the QPn field. A search window is defined on QPn−1 with its center at (i,j), and includes the macroblocks MB1k[e,f]. On QPn a similar search window, whose dimension may be different but is in any case predefined, is made up by MB3k[e,f]. The variable k is the macroblock index. The k-th macroblock on the QPn−1 field is identified by the coordinates (e,f), such that −p1 ≤ (e−i) ≤ +p1, and −q1 ≤ (f−j) ≤ +q1. The k-th macroblock on the QPn field is identified by the coordinates (e,f) such that −p3 ≤ (e−i) ≤ +p3, and −q3 ≤ (f−j) ≤ +q3. The indexes e and f are integer numbers.
Each of these macroblocks is a possible predictor of MB2(i,j). Consequently, there are in this case two types of predictors for MB2(i,j). The first type are those obtained on the field (I or P) that temporally precedes the one containing the block to be estimated. These are referred to as forward predictors. The second type are those obtained on the field (I or P) that temporally follows the one containing the block to be estimated. These are referred to as backward predictors.
Among the two sets of possible predictors, which depend on the type of motion estimation algorithm in use, two predictors are selected: one backward predictor and one forward predictor. Each minimizes a certain cost function of the hardware implementation. This function may vary depending on the type of motion estimation that is chosen. For example, if the MPEG-2 standard is chosen, the predictor that minimizes the L1 norm with respect to the reference macroblock is searched. Such a norm is equal to the sum of the absolute values of the differences between corresponding pixels belonging to MB2(i,j) and to MB1k(e,f) or MB3k(e,f), respectively. R*S values contribute to each sum, the result of which is called distortion.
A certain number of forward distortion values are obtained, among which the lowest is chosen, thereby identifying a prevailing position (ef, ff) on the field QPn−1. Likewise, a certain number of backward distortion values are obtained, among which the lowest is again chosen, thereby identifying a new prevailing position (eb, fb) on the QPn field. Moreover, the distortion value between MB2(i,j) and a theoretical macroblock obtained by linear interpolation of the two prevailing predictors is calculated.
MB2(i,j) may be estimated by using three types of macroblocks: the forward predictor (ef, ff), the backward predictor (eb, fb) or an average of both. The vectors formed by the difference components between the position of the prevailing predictor(s) and MB2(i,j) are defined as the motion vectors. They describe how MB2(i,j) derives from a translation of a macroblock similar to it in the preceding field and/or in the successive field.
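The choice among the three types of macroblocks for a B field can be sketched in Python as follows. The sketch assumes that the two prevailing predictors have already been found by a search on the preceding and successive fields, that the interpolated macroblock is a rounded pixel-wise average, and that the candidate with the lowest L1 distortion is retained. The function names and sample values are hypothetical.

```python
def l1(a, b):
    """L1 norm (distortion) between two macroblocks of equal size."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def interpolate(fwd, bwd):
    """Pixel-wise average of the two predictors (rounding is an assumption)."""
    return [[(x + y + 1) // 2 for x, y in zip(rf, rb)]
            for rf, rb in zip(fwd, bwd)]


def choose_prediction(mb, fwd, bwd):
    """Return (mode, distortions) for the reference macroblock mb.

    Assumed decision rule: keep the candidate with the lowest distortion.
    """
    candidates = {
        "forward": fwd,
        "backward": bwd,
        "interpolated": interpolate(fwd, bwd),
    }
    distortions = {name: l1(mb, blk) for name, blk in candidates.items()}
    mode = min(distortions, key=distortions.get)
    return mode, distortions


mb = [[10, 20], [30, 40]]
fwd = [[8, 18], [28, 38]]   # prevailing forward predictor
bwd = [[12, 22], [32, 42]]  # prevailing backward predictor
print(choose_prediction(mb, fwd, bwd))
# ('interpolated', {'forward': 8, 'backward': 8, 'interpolated': 0})
```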
An exhaustive search motion estimator and a hierarchical recursive motion estimator are described in European Patent Application No. 98830163.6, which is assigned to the assignee of the present invention. The known motion estimators perform a motion estimate by considering in a linear way the pictures from the first of a certain sequence to the last of the sequence. During the sequence, changes of scene occur at some points. The motion estimation process carried out by known estimators attempts to predict the picture with respect to the preceding and/or the successive picture even though they now belong to different scenes and are therefore largely uncorrelated to the picture under estimation. As a result, the predicted picture inevitably will contain blocks of pixels that belong to a different scene. This is a significant drawback, and undesirably reduces the performance that may be obtained from known motion estimators when changes of scene occur during a sequence of video images.
The above described drawbacks and disadvantages of the known motion estimators are overcome by the present invention by providing a motion estimation algorithm for coding systems of video pictures in which reference fields (intra coded pictures) are generated and stored to be used for dynamically calculating the prediction error in coding successive P and/or B pictures. P pictures are the predicted pictures and B pictures are the bidirectionally predicted pictures. The temporal redundance is removed through the motion estimation with respect to preceding I or P pictures for P pictures, or preceding and/or successive I or P pictures for B pictures.
The method of the invention is based on identifying an occurred change of a scene by monitoring certain parameters on a certain number of pictures temporally in sequence. An assertion signal of an occurred change of the scene is used to limit the use of the forward motion estimation for images preceding the change of scene. The assertion signal of an occurred change of the scene is also used to limit the backward estimation for pictures following the change of scene between two Intra pictures.
The method for motion estimation from successive picture fields for coding systems of video pictures includes decoding of pictures by utilizing stored reference pictures (Intra coded pictures) used to dynamically calculate a prediction error for coding successive pictures (Predicted picture and Bidirectionally predicted pictures). The temporal redundance is removed through a motion estimation with respect to preceding I or P pictures (Predicted pictures) or preceding them and/or following them (Bidirectionally predicted picture). This is carried out by macroblocks of pixels in which the pictures are divided.
The method preferably includes the steps of defining and calculating a smoothness index of a motion field of each picture by analyzing the motion vectors for all the macroblocks of a subdivision of the picture, except the peripheral macroblocks. The method further includes calculating the average smoothness index for a certain preestablished number of last processed pictures, monitoring the number of macroblocks belonging to the same picture that are coded as intra macroblocks, and identifying a change of scene from a combination of two conditions: the number of macroblocks belonging to the same picture coded as intra being higher than a certain threshold number, and/or the variation of the smoothness index between two successive pictures compared to a top and a bottom threshold.
These thresholds are based on the average value of the smoothness index of the preestablished number of last processed pictures. The method preferably further includes the step of using the identification of a change of scene to command the prediction computation of the pictures using only a forward motion estimation for pictures preceding the change of scene, and using only a backward motion estimation for pictures subsequent to the change of scene.
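The detection steps summarized above can be sketched in Python as follows. Several elements are assumptions introduced only for illustration, because this section does not define them: the formula of the smoothness index (here, the summed L1 difference between each non-peripheral macroblock's motion vector and those of its four neighbors), the derivation of the top and bottom thresholds as plus and minus a ratio of the running average, the "or" combination of the two conditions, and all names and default values.

```python
from collections import deque


def smoothness_index(mv_field):
    """Assumed index: summed L1 difference between each interior
    macroblock's motion vector and its four neighbours.
    Peripheral macroblocks are not evaluated."""
    rows, cols = len(mv_field), len(mv_field[0])
    total = 0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            vy, vx = mv_field[r][c]
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = mv_field[r + dr][c + dc]
                total += abs(vy - ny) + abs(vx - nx)
    return total


class SceneChangeDetector:
    """Hypothetical detector combining the intra count and the index variation."""

    def __init__(self, history=4, intra_threshold=10, ratio=2.0):
        self.recent = deque(maxlen=history)  # indexes of last processed pictures
        self.intra_threshold = intra_threshold
        self.ratio = ratio

    def update(self, mv_field, intra_count):
        index = smoothness_index(mv_field)
        changed = intra_count > self.intra_threshold
        if self.recent:
            # Floor of 1 avoids zero-width thresholds (assumption).
            average = max(sum(self.recent) / len(self.recent), 1)
            variation = index - self.recent[-1]
            top, bottom = self.ratio * average, -self.ratio * average
            changed = changed or variation > top or variation < bottom
        self.recent.append(index)
        return changed


def allowed_prediction(position, change_position):
    """Pictures before the change keep forward estimation only,
    pictures from the change onward keep backward estimation only."""
    return "forward" if position < change_position else "backward"


smooth = [[(1, 0)] * 4 for _ in range(4)]  # uniform motion field
rough = [row[:] for row in smooth]
rough[1][1] = (5, 5)                       # one discordant interior vector

detector = SceneChangeDetector(history=3, intra_threshold=5)
print(detector.update(smooth, intra_count=0))  # False
print(detector.update(smooth, intra_count=1))  # False
print(detector.update(rough, intra_count=1))   # True
```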