Conventional video compression techniques, such as MPEG-1, MPEG-2, H.261, H.262 and H.263, use both spatial and temporal compression or "encoding". A detailed discussion of spatial and temporal encoding may be found in B. HASKELL, A. PURI & A. NETRAVALI, DIGITAL VIDEO: AN INTRODUCTION TO MPEG-2, ch. 6.4, 6.5, and 7 (1997). For example, in MPEG-2, a to-be-compressed, i.e., "to-be-encoded," picture (herein, "picture" means frame or field as per MPEG parlance) is divided into macroblocks. Each macroblock includes an array of I×J luminance blocks and an array of K×L total blocks (i.e., including chrominance blocks), where each luminance or chrominance block has N×M pixels. Macroblocks may be spatially only encoded or both temporally and spatially encoded. Spatial encoding includes, for each luminance and chrominance block of the macroblock, discrete cosine transforming the pixels of the block, quantizing the block of transform coefficients, (zigzag or alternate) scanning each quantized block of coefficients, zero run length encoding the scanned values into run-level pairs and variable length coding each run-level pair.
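By way of illustration, the spatial encoding steps for a single 8×8 block can be sketched as follows. This is a simplified sketch only: the uniform quantizer step and the omission of the standard MPEG-2 quantizer matrices and variable length code tables are assumptions made for brevity, not part of the standard.

```python
# Illustrative sketch of MPEG-2-style spatial encoding of one 8x8 block:
# DCT, (simplified uniform) quantization, zigzag scan, run-level pairing.
import numpy as np

N = 8

def dct_matrix(n=N):
    # Orthonormal DCT-II basis matrix.
    C = np.zeros((n, n))
    for k in range(n):
        for i in range(n):
            C[k, i] = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    C[0, :] *= 1 / np.sqrt(n)
    C[1:, :] *= np.sqrt(2 / n)
    return C

# Zigzag scan order: coefficients ordered along anti-diagonals.
ZIGZAG = sorted(((r, c) for r in range(N) for c in range(N)),
                key=lambda rc: (rc[0] + rc[1],
                                rc[1] if (rc[0] + rc[1]) % 2 else rc[0]))

def spatially_encode(block, qstep=16):
    C = dct_matrix()
    coeffs = C @ block @ C.T                  # 2-D DCT of the pixel block
    q = np.round(coeffs / qstep).astype(int)  # simplified uniform quantizer
    scanned = [q[r, c] for r, c in ZIGZAG]    # zigzag scan
    # Zero run-length encode the scanned values into (run, level) pairs.
    pairs, run = [], 0
    for v in scanned:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs  # each pair would then be variable length coded
```

A flat block reduces to a single DC run-level pair, illustrating why spatial encoding compresses smooth image regions well.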
Temporal encoding typically involves finding a prediction macroblock for each to-be-encoded macroblock. The prediction macroblock is subtracted from the to-be-encoded macroblock to form a prediction error macroblock. The individual blocks of the prediction error macroblock are then spatially encoded.
Each prediction macroblock originates in a picture other than the to-be-encoded picture, called a "reference picture." A single prediction macroblock may be used to "predict" a to-be-encoded macroblock, or multiple prediction macroblocks, each originating in a different reference picture, may be interpolated, and the interpolated prediction macroblock may be used to "predict" the to-be-encoded macroblock. (Preferably, the reference pictures themselves are first encoded and then decompressed or "decoded." The prediction macroblocks used in encoding are selected from the "reconstructed pictures" produced by the decoding process.) Reference pictures temporally precede or succeed the to-be-encoded picture in the order of presentation or display. To be more precise, three kinds of encoded pictures may be produced, namely, intra pictures or I pictures, predicted pictures or P pictures, and bidirectionally predicted pictures or B pictures. I pictures contain spatially only encoded macroblocks but no temporally encoded macroblocks. P and B pictures can contain both spatially only encoded macroblocks and spatially and temporally encoded macroblocks. In P pictures, the reference pictures used to predict and temporally encode the spatially and temporally encoded macroblocks only precede the encoded P picture. In B pictures, the reference pictures can both precede and succeed the encoded B picture.
MPEG-2 supports several different prediction modes, which can be selected for each to-be-encoded macroblock based on the types of predictions that are permissible in that particular type of picture. Of the available prediction modes, two that are used to encode frame pictures are described below. According to a "frame prediction mode," a macroblock of a to-be-encoded frame picture is predicted by a frame prediction macroblock formed from one or more reference frames. For example, in the case of a forward only predicted macroblock, the prediction macroblock is formed from a designated preceding reference frame. In the case of a backward only predicted macroblock, the prediction macroblock is formed from a designated succeeding reference frame. In the case of a bidirectionally predicted macroblock, the prediction macroblock is interpolated from a first prediction macroblock formed from the designated preceding reference frame and a second prediction macroblock formed from the designated succeeding reference frame.
According to a "field prediction mode for frames," a macroblock of a to-be-encoded frame picture is divided into to-be-encoded top and bottom field macroblocks. A field prediction macroblock is separately obtained for each of the to-be-encoded top and bottom field macroblocks. Each field prediction macroblock is selected from top and bottom designated reference fields. The particular fields designated as reference fields depend on whether the to-be-encoded field macroblock is in the first displayed field of a P picture, the second displayed field of a P picture, or either field of a B picture. Other well known prediction modes applicable to to-be-encoded field pictures include dual prime, field prediction of field pictures and 16×8 prediction. See B. HASKELL, A. PURI & A. NETRAVALI, DIGITAL VIDEO: AN INTRODUCTION TO MPEG-2, ch. 7.2 (1997). For the sake of brevity, these modes are not described herein.
Prediction macroblocks often are not at the same relative spatial position (i.e., the same pixel row and column) in the reference picture as the to-be-encoded macroblock's spatial position in the to-be-encoded picture. Rather, a presumption is made that each prediction macroblock represents a similar portion of the image as the to-be-encoded macroblock, which image portion may have moved spatially between the reference picture and the to-be-encoded picture. As such, each prediction macroblock is associated with a motion vector, indicating a spatial displacement from the prediction macroblock's original spatial position in the reference picture to the spatial position corresponding to the to-be-encoded macroblock. This process of displacing one or more prediction macroblocks using a motion vector is referred to as motion compensation.
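Motion compensation and the formation of the prediction error macroblock can be sketched as follows. The function names and the (dy, dx) vector convention are illustrative assumptions, not MPEG-2 syntax.

```python
import numpy as np

def motion_compensate(reference, row, col, mv, size=16):
    # Fetch the prediction macroblock from the reference picture at the
    # to-be-encoded macroblock's position displaced by the motion vector.
    dy, dx = mv
    return reference[row + dy: row + dy + size,
                     col + dx: col + dx + size]

def prediction_error(picture, reference, row, col, mv, size=16):
    # Subtract the motion-compensated prediction from the to-be-encoded
    # macroblock; the residual is what is then spatially encoded.
    target = picture[row: row + size, col: col + size]
    pred = motion_compensate(reference, row, col, mv, size)
    return target.astype(np.int16) - pred.astype(np.int16)
```

If the motion vector exactly tracks the image motion, the prediction error macroblock is all zeros and costs very few bits to encode.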
In motion compensated temporal encoding, the best prediction macroblock(s) for each to-be-encoded macroblock is generally not known ahead of time. Rather, a presumption is made that the best matching prediction macroblock is contained in a search window of pixels of the reference picture around the spatial coordinates of the to-be-encoded macroblock (if such a prediction macroblock exists at all). Given a macroblock of size I×J pixels, and a search range of ±H pixels horizontally and ±V pixels vertically, the search window is of size (I+2H)×(J+2V). A block matching technique may be used, whereby multiple possible prediction macroblock candidates at different spatial displacements (i.e., with different motion vectors) are extracted from the search window and compared to the to-be-encoded macroblock. The best matching prediction macroblock candidate may be selected, and its spatial displacement is recorded as the motion vector associated with the selected prediction macroblock. The operation by which a prediction macroblock is selected, and its associated motion vector is determined, is referred to as motion estimation.
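A minimal exhaustive block-matching search might look like the following sketch. The sum-of-absolute-differences (SAD) matching criterion and the ±h, ±v parameterization are common illustrative choices, not requirements of the standard.

```python
import numpy as np

def full_search(target_mb, reference, row, col, h=8, v=8):
    # Exhaustive block matching over a +/-h by +/-v search window centered
    # on the target macroblock's coordinates, using SAD as the criterion.
    size = target_mb.shape[0]
    best_sad, best_mv = None, (0, 0)
    for dy in range(-v, v + 1):
        for dx in range(-h, h + 1):
            r, c = row + dy, col + dx
            if r < 0 or c < 0 or r + size > reference.shape[0] \
                    or c + size > reference.shape[1]:
                continue  # candidate falls outside the reference picture
            cand = reference[r: r + size, c: c + size]
            sad = np.abs(target_mb.astype(int) - cand.astype(int)).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

The returned displacement is the motion vector; the candidate it identifies is the selected prediction macroblock.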
Block matching in motion estimation requires identifying the appropriate search window for each to-be-encoded macroblock (that can possibly be temporally encoded). Then multiple candidate macroblocks of pixels must be extracted from each search window and compared to the to-be-encoded macroblock. According to MPEG-2 chrominance format 4:2:0, each macroblock includes a 2×2 arrangement of four (8×8 pixel) luminance blocks (illustratively, block matching is performed only on the luminance blocks). If each to-be-encoded picture is a CIF format picture (352×288 pixels for NTSC frames and 352×144 for NTSC fields), then the number of to-be-encoded macroblocks is 396 for frame pictures and 198 for each field picture. According to MPEG-2, the search range can be as high as ±128 pixels in each direction. Furthermore, consider that MPEG-2 often provides a choice in selecting reference pictures for a to-be-encoded picture (i.e., a field-frame choice or a forward only, backward only or bidirectional interpolated choice). In short, the number of potential candidate prediction macroblocks is very high. An exhaustive comparison of all prediction macroblock candidates to the to-be-encoded macroblock may therefore be too processing intensive for real-time encoding. Nevertheless, an exhaustive search can provide better memory access efficiency due to the overlap in pixels between the prediction macroblock candidates compared against a given to-be-encoded macroblock. For example, consider that a retrieved prediction macroblock candidate of 16×16 pixels includes a sub-array of 15×16 pixels of the prediction macroblock candidate to its immediate right or left (and, of course, a sub-array of 16×15 pixels of the prediction macroblock candidate immediately above or below). Thus, only the missing 16×1 column of pixels need be retrieved to form the next left or right prediction macroblock candidate (or the missing 1×16 row of pixels need be retrieved to form the next above or below prediction macroblock candidate).
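This incremental retrieval can be sketched as follows (the helper name and array layout are illustrative assumptions).

```python
import numpy as np

def slide_right(candidate, reference, row, col, size=16):
    # Given the candidate whose top-left corner is at (row, col), form
    # the candidate one pixel to the right by dropping the leftmost
    # column and fetching only the missing size x 1 column of pixels
    # from the reference picture.
    new_col = reference[row: row + size, col + size: col + size + 1]
    return np.hstack([candidate[:, 1:], new_col])
```

Only size pixels are fetched per step instead of size × size, which is the memory access advantage of scanning candidates in raster order.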
According to another technique, a hierarchical or telescopic search is performed, in which fewer than all possible choices are examined. These techniques, while computationally less demanding, are more likely to fail to obtain the optimal or best matching prediction macroblock candidate. As a result, more bits are needed to encode the to-be-encoded macroblock in order to maintain the same quality than in the case where the best matching macroblock is obtained, or, if the number of bits per picture is fixed, the quality of the compressed picture will be degraded. Note also that the memory access efficiency is lower for the hierarchical search since, by definition, the amount of overlap in pixels between prediction macroblock candidates will be lower.
Other techniques have been suggested in M. Ghanbari, The Cross-Search Algorithm for Motion Estimation, IEEE TRANS. ON COMM., vol. 38, no. 7, pp. 950-953, July 1990; B. Liu and A. Zaccarin, New Fast Algorithms for the Estimation of Block Motion Vectors, IEEE TRANS. ON CIR. & SYS. FOR VIDEO TECH., vol. 3, no. 2, pp. 148-157, April 1993; and P. Anandan, A Computational Framework and an Algorithm for the Measurement of Visual Motion, INT'L J. COMP. VISION, no. 2, pp. 283-310 (1989). The techniques described in the first two references do not work well with typical memory architectures which store the reference or to-be-encoded picture data. The latter reference is not well suited to block based motion estimation and does not describe a computationally efficient technique.
The above-identified patent application, incorporated herein by reference, teaches an alternative motion estimation technique, which is illustrated in FIG. 1. According to this technique, multiple reduced resolution versions of the to-be-encoded frame and reference frames are generated. For example, 1/64, 1/16 and 1/4 resolution versions of the original to-be-encoded and reference pictures may be formed. A first stage motion estimation search ME0 is then performed on the 1/64 resolution version of the to-be-encoded frame. The first stage motion estimation search ME0 includes five searches for identifying five prediction macroblocks in the forward prediction direction for each to-be-encoded macroblock of the to-be-encoded frame. The five searches include: (1) searching the reference frame for frame prediction macroblock candidates, (2) searching the top reference field for top field prediction macroblock candidates for the to-be-encoded top field macroblocks, (3) searching the top reference field for top field prediction macroblock candidates for the to-be-encoded bottom field macroblocks, (4) searching the bottom reference field for bottom field prediction macroblock candidates for the to-be-encoded top field macroblocks, and (5) searching the bottom reference field for bottom field prediction macroblock candidates for the to-be-encoded bottom field macroblocks. If backward prediction is permitted, the first stage motion estimation search includes five additional searches for identifying prediction macroblocks in the backward prediction direction (i.e., identifying prediction macroblock candidates in succeeding reference pictures). In this first stage, each search window is centered at the same spatial coordinates as the to-be-encoded macroblock for which the block matching is performed, and thus, the initial starting point of the search is a (0,0) spatial displacement or motion vector. A motion vector is obtained for each identified prediction macroblock candidate by virtue of the searches.
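The reduced resolution versions can be produced by repeated 2×2 decimation. Averaging is one common decimation choice assumed here for illustration; the technique does not mandate a particular filter.

```python
import numpy as np

def reduce2(picture):
    # Halve each dimension by 2x2 averaging. Applying this once, twice
    # or three times to the original picture yields the 1/4, 1/16 and
    # 1/64 resolution versions used by stages ME2, ME1 and ME0,
    # respectively.
    h, w = picture.shape
    p = picture[:h - h % 2, :w - w % 2].astype(float)
    return (p[0::2, 0::2] + p[0::2, 1::2]
            + p[1::2, 0::2] + p[1::2, 1::2]) / 4.0
```

Because each level has one quarter the pixels of the previous one, block matching at the coarse levels is correspondingly cheap.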
A similar second stage motion estimation search ME1 is then performed on the 1/16 resolution version of the to-be-encoded frame. Like the first stage motion estimation search ME0, the second stage motion estimation search uses the (0,0) motion vector as the initial starting point for each search window.
The motion vectors identified in the first motion estimation search stage ME0 are then scaled by 4, and the motion vectors obtained in the second motion estimation search stage ME1 are scaled by 2. A third stage motion estimation search ME2 is then performed on the 1/4 resolution version of the to-be-encoded frame. However, unlike the first and second motion estimation search stages ME0 and ME1, the third motion estimation search stage ME2 uses the scaled motion vectors of the first and second motion estimation search stages ME0 and ME1 as initial starting points. In other words, the search window for each search on each macroblock is centered about a respective prediction macroblock identified by one of the motion vectors determined in the first or second motion estimation search stages ME0 and ME1. Thus, in the third motion estimation search stage ME2, ten searches (one for each of the five searches using the results from stage one and one for each of the five searches using the results from stage two), or twenty searches if both forward and backward prediction are permissible, are performed to produce ten (or twenty) motion vectors for each to-be-encoded macroblock.
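The scaling step reflects the change in sampling grid between stages: a one-pixel displacement at 1/64 resolution spans four pixels at 1/4 resolution, and a one-pixel displacement at 1/16 resolution spans two. A sketch (the function name is an illustrative assumption):

```python
def scale_mvs(mvs, factor):
    # A motion vector found at a coarser resolution indexes a grid whose
    # samples are `factor` times farther apart at the finer resolution,
    # so both components are multiplied by the resolution ratio before
    # seeding the next stage's search windows.
    return [(dy * factor, dx * factor) for dy, dx in mvs]

# ME0 vectors (found at 1/64 resolution) are scaled by 4 and ME1
# vectors (found at 1/16 resolution) by 2 to seed the ME2 searches
# performed at 1/4 resolution.
```

Because each seed already points near the true match, the ME2 search windows can be kept small.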
After performing the third motion estimation search stage ME2, a decision is made for each to-be-encoded macroblock, on a macroblock-by-macroblock basis, as to which parity reference field should be used to predict the to-be-encoded top field and which parity reference field should be used to predict the to-be-encoded bottom field. This decision is referred to as a "motion vertical field select" decision as per the MPEG-2 syntax. As a result of this decision, four motion vectors are discarded for each to-be-encoded macroblock (or, in the case that backward prediction is permitted, eight motion vectors are discarded). In particular, the two (four) motion vectors obtained from the searches seeded by the first and second stages ME0 and ME1 having the parity not selected for the top field, and the two (four) such motion vectors having the parity not selected for the bottom field, of the to-be-encoded macroblock, are discarded.
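The parity decision itself can be sketched as a comparison of matching errors. The SAD criterion and function shape here are illustrative assumptions; in the bitstream the outcome is signalled by the motion_vertical_field_select flag.

```python
def select_parity(sad_from_top_ref, sad_from_bottom_ref):
    # For one to-be-encoded field macroblock, keep the reference-field
    # parity whose best prediction gave the lower matching error; all
    # motion vectors of the other parity are then discarded.
    return 'top' if sad_from_top_ref <= sad_from_bottom_ref else 'bottom'
```

The decision is made independently for the top field and the bottom field of each macroblock, which is why four (or eight) of the ten (or twenty) vectors are discarded.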
The remaining six (or twelve) motion vectors are then scaled by two. A fourth motion estimation search stage ME3 is then performed on the original resolution to-be-encoded picture using the scaled motion vectors as starting points. This produces six (or twelve) motion vectors, each corresponding to a respective prediction macroblock. The best matching prediction macroblock is then selected. In so selecting, a field/frame prediction decision is made, and a forward only, backward only or interpolated prediction decision may be made. It is also possible to make the field/frame decision, the forward only, backward only or interpolated prediction decision, or both types of decisions, before the ME3 stage.
Because the third motion estimation search stage ME2 uses the results of the first and second stage ME0 and ME1 searches as initial starting points, it is possible to search a smaller search window in the third motion estimation search stage ME2. Furthermore, an exhaustive search of the smaller search window can be performed to ensure that an optimal search is performed. Likewise, the fourth motion estimation search stage ME3 uses the motion vectors obtained in the third motion estimation search stage ME2 and therefore can exhaustively search a smaller window. As a result, computations are reduced yet near optimal results are achieved.
Thus, this motion estimation technique dramatically reduces the number of computations yet provides near optimal motion estimation.
It is an object to further improve on the projection motion estimation technique to further reduce computation requirements without a substantial effect on picture quality or bit rate.