The present invention relates to transmission of digitized pictures, and, more particularly, to a picture compression technique based on motion estimation algorithms, such as those implemented in MPEG2 video encoders.
World TV standards differ from one another in picture size and the way pictures are transmitted, e.g., frame rate, frequency band, etc. Usually, a motion picture is recorded by a camera at 24 photograms/second. In order for the film to be transmitted with the PAL or SECAM system, which requires a speed of 25 pictures/second, the film is slightly accelerated. This results in minor sound and motion distortions which are not perceived by a TV viewer. However, when the film must be transmitted with the NTSC TV standard, which requires a frame rate of 30 pictures/second, such an acceleration might produce distortions that would be perceived by the TV viewer. In this case, it is necessary to implement a 3:2 pulldown technique which transforms a film recorded at 24 photograms/second to a TV sequence of 30 pictures/second.
As shown in FIG. 1, a motion picture is recorded by a common photogram, that is, each picture is acquired as a whole in one instant. In contrast, a television picture is acquired in two distinct instants. A first instant is acquisition of the even lines, which make up the even semifield (top field) of the picture. A second instant is acquisition of the odd lines, which make up the odd semifield (bottom field). The sum of these two semifields makes up the whole picture, which is also referred to as a video frame. In view of the fact that some time passes between the acquisition of the first and second semifields, there may be relative movements among the objects focused by the video camera. Consequently, an object may occupy slightly different positions in the two fields.
Even a photogram may be divided in two fields. The two fields are the even lines forming the top field and the odd lines forming the bottom field. Since the picture is acquired in a single instant, the focused objects will occupy the same position in both fields. The 3:2 pulldown conversion method from 24 to 30 photograms/second includes transforming a sequence of 4 photograms of the film in a sequence of 5 TV frames through the duplication of some fields according to the scheme illustrated in FIG. 2. In FIG. 2, Top i and Bot i indicate the top field (even lines) and the bottom field (odd lines), respectively, of the photogram i, where i=1,2,3,4. This repetition of fields causes artifacts that do not disturb the TV viewer because the frames repeat themselves at 33 ms intervals.
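The field duplication described above can be sketched as follows. This is only an illustration: the scheme of FIG. 2 is not reproduced in the text, so the field pattern below is an assumed, commonly used top-field-first cadence, and the name `pulldown_3_2` is hypothetical.

```python
def pulldown_3_2(photograms):
    """Expand each group of 4 film photograms into 5 interlaced TV frames.

    Each photogram is a (top_field, bottom_field) pair. The assumed pattern is
    (Top1, Bot1), (Top1, Bot2), (Top2, Bot3), (Top3, Bot3), (Top4, Bot4):
    the second frame repeats Top1 and the fourth frame repeats Bot3.
    """
    if len(photograms) % 4 != 0:
        raise ValueError("expected a multiple of 4 photograms")
    frames = []
    for g in range(0, len(photograms), 4):
        (t1, b1), (t2, b2), (t3, b3), (t4, b4) = photograms[g:g + 4]
        frames += [(t1, b1), (t1, b2), (t2, b3), (t3, b3), (t4, b4)]
    return frames


film = [(f"Top{i}", f"Bot{i}") for i in range(1, 5)]
tv = pulldown_3_2(film)
for frame in tv:
    print(frame)
```

With this assumed pattern, two of the ten transmitted fields in each group are copies of a field of the preceding frame, which is the redundancy a coder could avoid encoding.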
Since the fundamental of picture compression methods is to reduce the amount of information that must be transmitted or recorded, the encoding of repetitive fields could be avoided by coding 5 pictures with the same number of bits that would be required by 4 pictures upon detecting whether the sequence contains a 3:2 pulldown transformation. Hence, it will be a function of the decoder (the receiving station) to reconstruct the sequence of 5 pictures according to the above described scheme. For the same quality of the pictures the compression ratio may be increased, or for the same number of bits available for coding the pictures it is possible to improve their quality.
A reliable detection of the 3:2 pulldown may also permit a further improvement of the coding quality for compression algorithms that use motion estimators based on the correlation of generated motion fields. However, the field repetition according to a 3:2 pulldown scheme may cause inconsistencies in the global motion of the pictures. When a field is repeated, the motion estimator senses this as a stop of the motion (even if fictitious) of the picture's objects, thus verifying the convergence process of the generated vectors.
A further beneficial consequence of the detection of the 3:2 pulldown is that its detection will render useless all the predictions permitted by the encoding method, e.g., field prediction, dual prime prediction and frame prediction, etc. Detection of the 3:2 pulldown is also an advantage for the methods that take into consideration the lack of movement among the picture fields, such as the frame prediction methods.
To detect the 3:2 pulldown, an analysis of the motion fields generated by the coding system for an acceptable motion estimation among the fields becomes essential. The basic concept of motion estimation is the following. A set of pixels of a field of a picture may be placed in a position of the subsequent picture obtained by translating the preceding one. These transpositions of objects may expose to the video camera parts that were not visible before, as well as changes of their shape, e.g., zooming.
The family of algorithms suitable to identify and associate these portions of images is generally referred to as motion estimation. Such an association permits calculation of a difference image. This removes the redundant temporal information making more effective the subsequent process of compression by discrete cosine transform (DCT), quantization and entropic coding.
Such a method is found in the MPEG2 standard. Systems of motion estimation as well as the architecture of the present invention are equally useful and are readily applicable to systems for manipulating digitized pictures operating according to a standard different from the MPEG2 standard. A typical block diagram of a video MPEG2 decoder is depicted in FIG. 3. Such a system is made up of the following functional blocks.
Field ordinator. This block is composed of one or several field memories outputting the fields in the coding order required by the MPEG standard. For example, if the input sequence is I B B P B B P etc., the output order will be I P B B P B B . . . .
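The reordering performed by this block may be illustrated with a minimal sketch. It assumes that every B picture is held back until the next I or P picture has been output; the function name `to_coding_order` is hypothetical, and the handling of trailing B pictures is a simplification.

```python
def to_coding_order(display_order):
    """Reorder picture types from display order to coding order."""
    coded, pending_b = [], []
    for pic in display_order:
        if pic.startswith("B"):
            pending_b.append(pic)      # wait for the next reference picture
        else:
            coded.append(pic)          # I or P picture is coded first
            coded.extend(pending_b)    # then the B pictures that depend on it
            pending_b = []
    coded.extend(pending_b)            # simplification: flush leftover B pictures
    return coded


print(to_coding_order(["I", "B", "B", "P", "B", "B", "P"]))
```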
The intra coded picture I is a field and/or a semifield containing temporal redundancy. The predicted-picture P is a field and/or semifield from which the temporal redundancy with respect to the preceding I or P (previously co-decoded) picture has been removed. The bidirectionally predicted picture B is a field and/or a semifield whose temporal redundancy with respect to the preceding I and subsequent P (or preceding P and successive P) picture field has been removed. In both cases, the I and P pictures must be considered as already co/decoded.
Each frame buffer in the format 4:2:0 occupies the following memory space:
Motion Estimator. This block removes the temporal redundancy from the P and B pictures.
DCT. This block implements the discrete cosine transform according to the MPEG-2 standard. The I picture and the error pictures P and B are divided into 8*8 blocks of pixels Y, U, V on which the DCT transform is performed.
Quantizer Q. An 8*8 block resulting from the DCT transform is divided by a quantizing matrix to reduce the magnitude of the DCT coefficients. The information associated to the highest frequencies, less visible to human sight, tends to be removed. The result is reordered and sent to the successive block.
Variable Length Coding (VLC). The codification words output from the quantizer tend to contain a large number of null coefficients, followed by non-null values. The null values preceding the first non-null value are counted, and the count forms the first portion of a codification word; the second portion represents the non-null coefficient.
These paired values tend to assume values more probable than others. The most probable ones are coded with relatively short words (composed of 2, 3 or 4 bits) while the least probable are coded with longer words. Statistically, the number of output bits is less than when these methods are not implemented.
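The pairing of a zero count with the following non-null coefficient can be sketched as below. This shows only the formation of the pairs, not the assignment of variable length words; the name `run_level_pairs` and the sample coefficients are illustrative assumptions.

```python
def run_level_pairs(coefficients):
    """Convert reordered quantized coefficients into (zero_run, value) pairs."""
    pairs, run = [], 0
    for c in coefficients:
        if c == 0:
            run += 1                 # count null values before the next non-null one
        else:
            pairs.append((run, c))   # first portion: count, second portion: coefficient
            run = 0
    # Trailing zeros produce no pair; MPEG-2 marks them with an end-of-block code.
    return pairs


print(run_level_pairs([12, 0, 0, -3, 1, 0, 0, 0, 2, 0, 0]))
```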
Multiplexer and Buffer. Data generated by the variable length coder, the quantizing matrices, the motion vectors and other syntactic elements are assembled for constructing the final syntax processed by the MPEG-2 standard. The resulting bitstream is stored in a memory buffer, the limit size of which is defined by the MPEG-2 standard and cannot be overfilled. The quantizer block Q supports such a limit by making the division of the DCT 8*8 blocks dependent on how far the memory buffer of the system is filled, and on the energy of the 8*8 source block taken upstream of the motion estimation and DCT transform process.
Inverse Variable Length Coding (I-VLC). The variable length coding functions specified above are executed in an inverse order.
Inverse Quantization (IQ). The words output by the I-VLC block are reordered in the 8*8 block structure, which is multiplied by the same quantizing matrix that was used for its preceding coding.
Inverse DCT (I-DCT). The DCT transform function is inverted and applied to the 8*8 block output by the inverse quantization process. This permits passing from the domain of spatial frequencies to the pixel domain.
Motion Compensation and Storage. At the output of the I-DCT block the following may alternatively be present. A decoded I picture (or semipicture) that must be stored in a respective memory buffer for removing the temporal redundancy with respect thereto from subsequent P and B pictures. A decoded prediction error picture (semipicture) P or B must be summed to the information removed previously during the motion estimation phase. In case of a P picture, such a resulting sum is stored in a dedicated memory buffer and is used during the motion estimation process for the successive P pictures and B pictures. These field memories are generally distinct from the field memories used for re-arranging the blocks.
Display Unit. This unit converts the pictures from the format 4:2:0 to the format 4:2:2 and generates the interlaced format for displaying the images. The arrangement of the functional blocks depicted in FIG. 4 is an architecture implementing the above-described MPEG-2 decoder shown in FIG. 3. A distinctive feature is that the field rearrangement block, the motion compensation and storage block for storing the already reconstructed P and I pictures, and the multiplexer and buffer block for storing the bitstream produced by the MPEG-2 coding are integrated in memory devices external to the integrated circuit of the decoder. These memory devices are accessed through a single interface managed by an integrated controller.
Moreover, the preprocessing block converts the received images from the format 4:2:2 to the format 4:2:0 by filtering and subsampling the chrominance components. The post-processing block implements a reverse function during the decoding and displaying phase of the images. The coding process employs also a decoding step for generating the reference pictures to make operative the motion estimation. For example, the first I picture is coded, then decoded, stored and used for calculating the prediction error that will be used to code the subsequent P and B pictures.
The play-back process of the bit stream previously generated by the coding process uses only the inverse functional blocks (I-VLC, I-Q, I-DCT, etc.) and not the direct functional blocks. From this point of view, the coding and the decoding implemented for the subsequent displaying of the images are nonconcurrent processes within the integrated architecture. The scope of motion estimation algorithms is to predict pictures/semipictures in a sequence obtaining the composition of whole blocks of pixels, referred to as predictors, originating from preceding and/or future pictures or semifields.
According to the MPEG-2 standard, there are three types of pictures (fields) or semifields. The intra coded picture I is a field and/or a semifield containing temporal redundancy. The predicted-picture P is a field and/or semifield from which the temporal redundancy with respect to the preceding I or P (previously co/decoded) has been removed. The bidirectionally predicted-picture B is a field and/or a semifield whose temporal redundancy with respect to the preceding I and subsequent P (or preceding P and successive P) has been removed. In both cases, the I and P pictures must be considered as already co/decoded.
The P field or semifield will now be discussed in greater detail. Two fields of a picture will now be considered: Q1 at the instant t and the subsequent field Q2 at the instant t+(kp)*T. The following discussion also applies to the semifields. T is the field period and kp is a constant dependent on the number of B fields existing between the preceding I and the subsequent P (or between two P). The field period T is 1/25 sec. for the PAL standard, and 1/30 sec. for the NTSC standard. Q1 and Q2 are formed by luminance and chrominance components. It is assumed that the motion estimation is applied only to the most energetic and, therefore, most information-rich component, namely the luminance component, represented as a matrix of N lines and M columns. Q1 and Q2 are divided into portions called macroblocks, each of R lines and S columns.
The results of the divisions N/R and M/S must be two integer numbers, not necessarily equal to each other. Let MB2 (i,j) be a macroblock defined as the reference macroblock belonging to the field Q2 and whose first pixel, in the top left part thereof, is at the intersection between the i-th line and the j-th column. The pair (i,j) is characterized in that i and j are integer multiples of R and S, respectively.
FIG. 5 shows how the reference macroblock is positioned on the Q2 picture while the horizontal dash line arrows indicate the scanning order used to identify the various macroblocks on Q1. MB2 (i,j) is projected on the Q1 field to obtain MB1 (i,j). A search window is defined on Q1 having its center at (i,j) and is composed of the macroblocks MBk[e,f], where k is the macroblock index. The k-th macroblock is identified by the coordinates (e,f), with e and f being an integer number, such that:
−p ≤ (e − i) ≤ +p and −q ≤ (f − j) ≤ +q
Each of the macroblocks is a possible predictor of MB2 (i,j). Motion estimation algorithms differ from one another in the way the predictors are searched and selected inside the search window. The predictor that minimizes a certain cost function is chosen among the possible predictors.
This function may vary according to the motion estimation algorithm selected. For example, in the MPEG2 standard, the predictor that minimizes the L1 norm with respect to the reference macroblock is searched. The norm is equal to the sum of the absolute values of the differences between corresponding pixels belonging to MB2 (i,j) and MBk (e,f). Each sum has R*S contributing values, and the result is referred to as distortion.
The predictor most similar to MB2 (i,j) is identified by the coordinates of the prevailing predictor following the motion estimation. The vector formed by the difference between the position of the prevailing predictor and MB2 (i,j) is the motion vector. The motion vector describes how MB2 (i,j) originates from a shift of a similar macroblock inside the preceding field.
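The search described above may be sketched as an exhaustive scan of the window. This is only one possible search strategy and is not asserted to be the one used by any particular coder; the function names and the synthetic test fields are assumptions made for illustration.

```python
def block(field, i, j, R, S):
    """Extract the R*S macroblock whose top-left pixel is at line i, column j."""
    return [row[j:j + S] for row in field[i:i + R]]


def l1_norm(mb_a, mb_b):
    """Sum of absolute differences between corresponding pixels (distortion)."""
    return sum(abs(a - b) for ra, rb in zip(mb_a, mb_b) for a, b in zip(ra, rb))


def full_search(q1, q2, i, j, R, S, p, q):
    """Return (motion_vector, distortion) of the prevailing predictor of MB2(i,j)."""
    N, M = len(q1), len(q1[0])
    reference = block(q2, i, j, R, S)
    best = None
    for e in range(i - p, i + p + 1):        # -p <= (e - i) <= +p
        for f in range(j - q, j + q + 1):    # -q <= (f - j) <= +q
            if e < 0 or f < 0 or e + R > N or f + S > M:
                continue                     # candidate lies outside the field
            d = l1_norm(reference, block(q1, e, f, R, S))
            if best is None or d < best[1]:
                best = ((e - i, f - j), d)
    return best


# Synthetic 8*8 fields: Q2 is Q1 translated by 1 line and 2 columns.
q1 = [[(r * 17 + c * 31) % 251 for c in range(8)] for r in range(8)]
q2 = [[q1[r - 1][c - 2] if r >= 1 and c >= 2 else 0 for c in range(8)]
      for r in range(8)]
motion_vector, distortion = full_search(q1, q2, i=4, j=4, R=4, S=4, p=2, q=2)
print(motion_vector, distortion)
```

In this example the prevailing predictor lies one line above and two columns to the left of the reference position, so the motion vector is the difference between the two positions and the distortion is zero.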
The B field or semifield will now be discussed in greater detail. Three picture fields are considered: QPn−1 at the instant t, QBkB at the instant t+(kB)*T and QPn at the instant t+(kp)*T. The following discussion also applies to the semifields. The variables kp and kB are dependent on the number of B fields (or semifields) selected in advance. T is the field period, which is 1/25 sec. for the PAL standard and 1/30 sec. for the NTSC standard. QPn−1, QBkB and QPn are formed by luminance and chrominance components. The motion estimation is applied only to the most energetic and, therefore, most information-rich component, namely the luminance component, representable as a matrix of N lines and M columns. QPn−1, QBkB and QPn are divided into portions called macroblocks, each of R lines and S columns.
The results of the divisions N/R and M/S must be two integer numbers, not necessarily equal. MB2 (i,j) is a macroblock defined as the reference macroblock belonging to the field Q2 and whose first pixel, in the top left part thereof, is at the intersection between the i-th line and the j-th column. The pair (i,j) is characterized by the fact that i and j are integer multiples of R and S, respectively. MB2 (i,j) is projected on the QPn−1 field to obtain MB1 (i,j), and is projected on the QPn field to obtain MB3 (i,j).
A search window is defined on QPn−1 with its center at (i,j), and is composed of the macroblocks MB1k[e,f]. A similar search window is defined on QPn whose dimension may also be different, or in any case predefined, and is composed of the macroblocks MB3k[e,f], where k is the macroblock index. The k-th macroblock on the QPn−1 field is identified by the coordinates (e,f), such that:
−p1 ≤ (e − i) ≤ +p1 and −q1 ≤ (f − j) ≤ +q1
The k-th macroblock on the QPn field is identified by the coordinates (e,f) such that:
−p3 ≤ (e − i) ≤ +p3 and −q3 ≤ (f − j) ≤ +q3
The indexes e and f are integer numbers. Each of the macroblocks is a predictor of MB2 (i,j). In this case, there are two types of predictors for MB2 (i,j). A first type is obtained from the field that temporally precedes the one containing the block to be estimated (I or P). This first type is referred to as a forward predictor. A second type is obtained from the field that temporally follows the one containing the block to be estimated (I or P). This second type is referred to as a backward predictor.
From the two sets of possible predictors, one forward predictor and one backward predictor are selected, each minimizing a certain cost function of the hardware implementation. The predictors are selected depending on the type of motion estimation algorithm in use. This cost function may vary depending on the type of motion estimation selected. For example, in the MPEG2 standard the predictor that minimizes the L1 norm with respect to the reference macroblock is searched. Such a norm is equal to the sum of the absolute values of the differences between corresponding pixels belonging to MB2 (i,j) and MB1k(e,f) or MB3k(e,f). Each sum has R*S contributing values. The result is called distortion.
Hence, a certain number of forward distortion values are obtained, among which the lowest is chosen. This identifies a prevailing position (ef, ff) in the field QPn−1. A certain number of backward distortion values are also obtained, among which the minimum value is chosen, identifying a new prevailing position (eb, fb) on the QPn field. Moreover, the distortion value between MB2 (i,j) and a theoretical macroblock obtained by linear interpolation of the two prevailing predictors is calculated. Therefore, MB2 (i,j) may be estimated using only three types of macroblocks: the forward predictor (ef, ff), the backward predictor (eb, fb), or the average of both. The vectors formed by the difference components between the position of each prevailing predictor and that of MB2 (i,j) are defined as the motion vectors, and describe how MB2 (i,j) derives from a translation of a macroblock similar to it in the preceding and/or successive field.
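The choice among the three types of macroblocks may be sketched as a comparison of three distortion values. The rounding used for the interpolated macroblock and the names below are assumptions; the text only states that the two prevailing predictors are linearly interpolated.

```python
def l1_norm(mb_a, mb_b):
    """Sum of absolute differences between corresponding pixels."""
    return sum(abs(a - b) for ra, rb in zip(mb_a, mb_b) for a, b in zip(ra, rb))


def choose_b_prediction(mb2, forward, backward):
    """Pick the predictor type with the lowest distortion for a B macroblock."""
    # Assumed interpolation: pixel-wise average with rounding to nearest integer.
    interpolated = [[(x + y + 1) // 2 for x, y in zip(rf, rb)]
                    for rf, rb in zip(forward, backward)]
    distortions = {
        "forward": l1_norm(mb2, forward),
        "backward": l1_norm(mb2, backward),
        "interpolated": l1_norm(mb2, interpolated),
    }
    mode = min(distortions, key=distortions.get)
    return mode, distortions[mode]


mb2 = [[10, 20], [30, 40]]
forward = [[8, 18], [28, 38]]     # prevailing forward predictor (slightly darker)
backward = [[12, 22], [32, 42]]   # prevailing backward predictor (slightly brighter)
mode, distortion = choose_b_prediction(mb2, forward, backward)
print(mode, distortion)
```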
An object of the present invention is to improve the efficiency of picture compression methods based on motion estimation algorithms by recognizing the presence of a picture sequence transformation referred to as 3:2 pulldown.
From the above discussion, it is evident how the ability of recognizing the presence of a 3:2 pulldown may have a significant impact on the quality of the images, as well as on the efficiency of the compression process. It is essential to detect a 3:2 pulldown process in a continuous and simultaneous mode in relation to the other processes that take place in the coder.
There may be cases in which normal TV sequences are inserted into camera filmed sequences. For example, a TV broadcast film is regularly interrupted by commercials. Therefore, it is necessary to be able to immediately change from a film mode to a normal mode so that fields are not lost. Lost fields may cause an inevitable loss of information.
According to the present invention, the compression method of coding the digital data of a sequence of pictures comprises the steps of recognizing a 3:2 pulldown conversion of a certain number of film photograms in a sequence of a larger number of TV frames by duplicating some pictures and eliminating, as a consequence of the recognition, the redundancy due to such a picture duplication. The compression method is commonly based on a motion estimation procedure among successive pictures to remove the temporal redundancy of data.
The algorithm of the present invention is able to recognize the existence of repeated fields of a picture with respect to those of a preceding picture, and verifies that such repetitions are in a certain sequence. That is, the sequence corresponds to that of a 3:2 pulldown transformation. It is therefore a pre-requisite for the effectiveness of the method of the invention that the motion estimation algorithm implemented in the coder establish the motion between a frame and its successive frame in the sequence.
As already explained, the motion estimation algorithm searches, for each reference macroblock of the current picture, a predictor macroblock within a search window on a preceding frame or on a following frame relative to the current frame. The best predictor is the one that minimizes a certain cost function (norm L1).
The algorithm for recognizing a 3:2 pulldown of the present invention exploits information made available by the system of motion estimation, the prevailing macroblock predictor and the cost function values. Since these data must be calculated during a coding phase, the determination of the 3:2 pulldown according to the method of the invention implies a relatively low impact on the overall complexity of the implementing hardware.
In a compression process the motion estimation between successive pictures of a sequence takes place by searching for each reference macroblock of a current picture a predictor macroblock within a search window on a picture temporally preceding or following the current one. By selecting the predictor that minimizes a certain cost function L1, the detection of a 3:2 pulldown in a flow of digital coding data of a sequence of pictures takes place through the following operations:
defining a reference macroblock of R*S pixels, half positioned on the top field and half positioned on the bottom field of a picture, each half including R*S/2 pixels;
searching for each macroblock of the current picture (i), with the exclusion of perimeter macroblocks, within the search window, and searching separately for the top half (Topi) and bottom half (Bottomi) for a macroblock of similar dimensions which would better predict (minimize the norm L1) with respect to the common half (Topi−1, Bottomi−1) on the temporally preceding picture;
comparing the norm (L1Top) associated to the prevailing predictor searched on the temporally preceding picture (i−1) with the norm (L1Top0) obtained by using, as a predictor for the R*S/2 macroblock relative to the top half (Topi) of the current picture, the macroblock R*S/2 in a common position on the preceding picture (Topi−1), and incrementing a first counter (number_0_Top) if the latter norm (L1Top0) is less than the first one (L1Top);
carrying out the same operation of comparison between norms (L1bottom, L1bottom0) relative to the bottom halves, and incrementing a second counter (number_0_Bottom);
verifying if the number contained in the first and/or second counter is higher than a certain threshold (a*number of macroblocks, with a < 1), and the occurrence of a repetition of one and/or the other half of the current picture with respect to the preceding one; and
processing the data of acknowledgment or disacknowledgement of a repetition for a certain number of successive pictures for eventually detecting a coincidence with the repetition pattern of a 3:2 pulldown, thus confirming the detection of a 3:2 pulldown.
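The counting, thresholding and pattern-matching operations listed above may be sketched as follows. Several details are assumptions made only for illustration: the value a=0.8 (the text only requires a < 1), the period-5 repetition pattern (which depends on the scheme of FIG. 2, not reproduced here), and all function names other than the counters number_0_Top and number_0_Bottom.

```python
# Assumed cadence over 5 TV frames: which field repeats that of the preceding frame.
PATTERN = ["none", "top", "none", "bottom", "none"]


def count_zero_wins(best_norms, zero_norms):
    """Count macroblock halves whose co-located predictor has the lower L1 norm."""
    return sum(1 for best, zero in zip(best_norms, zero_norms) if zero < best)


def classify(number_0_top, number_0_bottom, n_macroblocks, a=0.8):
    """Decide which half of the current picture repeats the preceding one."""
    threshold = a * n_macroblocks
    top = number_0_top > threshold
    bottom = number_0_bottom > threshold
    if top and bottom:
        return "both"
    if top:
        return "top"
    if bottom:
        return "bottom"
    return "none"


def is_pulldown(history):
    """Check whether the last 5 decisions match any rotation of the assumed cadence."""
    n = len(PATTERN)
    if len(history) < n:
        return False
    recent = history[-n:]
    return any(recent == PATTERN[s:] + PATTERN[:s] for s in range(n))


# Per-picture counter values for 100 non-perimeter macroblocks (illustrative data).
counters = [(4, 2), (93, 5), (1, 3), (6, 95), (2, 2)]
history = [classify(top, bottom, 100) for top, bottom in counters]
print(history, is_pulldown(history))
```

Matching against every rotation of the pattern allows the detection to lock on regardless of where in the 5-frame cycle the analysis starts, and a single mismatching picture, such as one from an inserted normal TV sequence, makes the check fail immediately.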