The present invention relates generally to digital video compression, and, more particularly, to a motion estimation method and search engine for a digital video encoder that is simpler, faster, and less expensive than the presently available technology permits, and that permits concurrent motion estimation using multiple prediction modes.
Many different compression algorithms have been developed in the past for digitally encoding video and audio information (hereinafter referred to generically as “digital video data stream”) in order to minimize the bandwidth required to transmit this digital video data stream for a given picture quality. Several multimedia specification committees have established and proposed standards for encoding/compressing and decoding/decompressing audio and video information. The most widely accepted international standards have been proposed by the Moving Picture Experts Group (MPEG), and are generally referred to as the MPEG-1 and MPEG-2 standards. Officially, the MPEG-1 standard is specified in the ISO/IEC 11172-2 standard specification document, which is herein incorporated by reference, and the MPEG-2 standard is specified in the ISO/IEC 13818-2 standard specification document, which is also herein incorporated by reference. These MPEG standards for moving picture compression are used in a variety of current video playback products, including digital versatile (or video) disk (DVD) players, multimedia PCs having DVD playback capability, and satellite broadcast digital video. More recently, the Advanced Television Systems Committee (ATSC) announced that the MPEG-2 standard will be used as the standard for Digital HDTV transmission over terrestrial and cable television networks. The ATSC published the Guide to the Use of the ATSC Digital Television Standard on Oct. 4, 1995, and this publication is also herein incorporated by reference.
In general, in accordance with the MPEG standards, the audio and video data comprising a multimedia data stream (or “bit stream”) are encoded/compressed in an intelligent manner using a compression technique generally known as “motion coding”. More particularly, rather than transmitting each video frame in its entirety, MPEG uses motion estimation for only those parts of sequential pictures that vary due to motion, where possible. In general, the picture elements or “pixels” of a picture are specified relative to those of a previously transmitted reference or “anchor” picture using differential or “residual” video, as well as so-called “motion vectors” that specify the location of a 16-by-16 array of pixels or “macroblock” within the current picture relative to its original location within the anchor picture. Three main types of video frames or pictures are specified by MPEG, namely, I-type, P-type, and B-type pictures.
An I-type picture is coded using only the information contained in that picture, and hence, is referred to as an “intra-coded” or simply, “intra” picture.
A P-type picture is coded/compressed using motion compensated prediction (or “motion estimation”) based upon information from a past reference (or “anchor”) picture (either I-type or P-type), and hence, is referred to as a “predictive” or “predicted” picture.
A B-type picture is coded/compressed using motion compensated prediction (or “motion estimation”) based upon information from either a past or a future reference picture (either I-type or P-type), or both, and hence, is referred to as a “bidirectional” picture. B-type pictures are usually inserted between I-type or P-type pictures, or combinations of either.
The term “intra picture” is used herein to refer to I-type pictures, and the term “non-intra picture” is used herein to refer to both P-type and B-type pictures. It should be mentioned that although the frame rate of the video data represented by an MPEG bit stream is constant, the amount of data required to represent each frame can be different, e.g., so that one frame of video data (e.g., 1/30 of a second of playback time) can be represented by x bytes of encoded data, while another frame of video data can be represented by only a fraction (e.g., 5%) of x bytes of encoded data. Since the frame update rate is constant during playback, the data rate is variable.
In general, the encoding of an MPEG video data stream requires a number of steps. The first of these steps consists of partitioning each picture into macroblocks. Next, in theory, each macroblock of each “non-intra” picture in the MPEG video data stream is compared with all possible 16-by-16 pixel arrays located within specified vertical and horizontal search ranges of the current macroblock's corresponding location in the anchor picture(s). The MPEG picture and macroblock structure is diagrammatically illustrated in FIG. 1.
The aforementioned search or “motion estimation” procedure, for a given prediction mode, results in a motion vector(s) that corresponds to the position of the closest-matching macroblock (according to a specified matching criterion) in the anchor picture(s) within the specified search range. Once the prediction mode and motion vector(s) have been determined, the pixel values of the closest-matching macroblock are subtracted from the corresponding pixels of the current macroblock, and the resulting 16-by-16 array of differential pixels is then divided into 8-by-8 “blocks,” on each of which is performed a discrete cosine transform (DCT), the resulting coefficients of which are each quantized and Huffman-encoded (as are the prediction type, motion vectors, and other information pertaining to the macroblock) to generate the MPEG bit stream. If no adequate macroblock match is detected in the anchor picture, or if the current picture is an intra, or “I-” picture, the above procedures are performed on the actual pixels of the current macroblock (i.e., no difference is taken with respect to pixels in any other picture), and the macroblock is designated an “intra” macroblock.
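The subtraction and block-partitioning step can be illustrated with a minimal Python sketch. The function names `residual` and `split_into_blocks` and the synthetic pixel values are assumptions made purely for illustration; the DCT, quantization, and Huffman-encoding stages are deliberately omitted.

```python
def residual(current, match):
    """Subtract the best-matching macroblock from the current one, pixel by pixel."""
    return [[c - m for c, m in zip(cur_row, match_row)]
            for cur_row, match_row in zip(current, match)]


def split_into_blocks(macroblock, size=8):
    """Partition a macroblock into size-by-size blocks in raster order."""
    blocks = []
    for top in range(0, len(macroblock), size):
        for left in range(0, len(macroblock[0]), size):
            blocks.append([row[left:left + size]
                           for row in macroblock[top:top + size]])
    return blocks


# Synthetic 16-by-16 data (hypothetical): the match is one grey level brighter.
current = [[x + y for x in range(16)] for y in range(16)]
match = [[x + y + 1 for x in range(16)] for y in range(16)]

diff = residual(current, match)
blocks = split_into_blocks(diff)   # four 8-by-8 blocks, each awaiting a DCT
```

For an intra macroblock, the same partitioning would presumably be applied to `current` directly, with no subtraction.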
For all MPEG-2 prediction modes, the fundamental technique of motion estimation consists of comparing the current macroblock with a given 16-by-16 pixel array in the anchor picture, estimating the quality of the match according to the specified metric, and repeating this procedure for every such 16-by-16 pixel array located within the search range. The hardware or software apparatus that performs this search is usually termed the “search engine,” and there exist a number of well-known criteria for determining the quality of the match. Among the best-known criteria are the Minimum Absolute Error (MAE), in which the metric consists of the sum of the absolute values of the differences of each of the 256 pixels in the macroblock with the corresponding pixel in the matching anchor picture macroblock; and the Minimum Square Error (MSE), in which the metric consists of the sum of the squares of the above pixel differences. In either case, the match having the smallest value of the corresponding sum is selected as the best match within the specified search range, and its horizontal and vertical positions relative to the current macroblock therefore constitute the motion vector. If the resulting minimum sum is nevertheless deemed too large, a suitable match does not exist for the current macroblock, and it is coded as an intra macroblock. For the purposes of the present invention, either of the above two criteria, or any other suitable criterion, may be used.
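The two matching criteria can be written out as a short Python sketch. The function names and the tiny 2-by-2 arrays are illustrative assumptions only; a real macroblock would contribute 256 terms to each sum.

```python
def sum_abs_error(current, candidate):
    """MAE metric: sum of absolute pixel differences between two equal-sized arrays."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(current, candidate)
               for c, r in zip(cur_row, ref_row))


def sum_sq_error(current, candidate):
    """MSE metric: sum of squared pixel differences between two equal-sized arrays."""
    return sum((c - r) ** 2
               for cur_row, ref_row in zip(current, candidate)
               for c, r in zip(cur_row, ref_row))


# Hypothetical 2-by-2 arrays standing in for 16-by-16 macroblocks.
cur = [[10, 12], [14, 16]]
ref = [[11, 12], [10, 19]]

mae = sum_abs_error(cur, ref)   # 1 + 0 + 4 + 3 = 8
mse = sum_sq_error(cur, ref)    # 1 + 0 + 16 + 9 = 26
```

The squared metric penalizes a few large differences more heavily than many small ones, which is one reason the two criteria can rank candidate matches differently.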
In accordance with the MPEG-2 standard, any of a number of so-called “prediction modes” may be used for each individual macroblock that is encoded; the optimum prediction mode depends both on the type of picture being encoded and on the characteristics of the portion of the picture in which the given macroblock being encoded is located. Currently known methods of motion coding allow the use of different prediction modes, but generally require one prediction mode to be specified for a given macroblock before an actual motion estimation is performed. Although such a determination can often be made based upon prior knowledge of the picture or image source characteristics, there are many cases where the optimum prediction mode cannot be known unless more than one motion estimation is performed for the macroblock in question. Since motion estimation usually consists of an exhaustive search procedure in which all 256 pixels of two corresponding macroblocks are compared, and which is repeated for a large number of macroblocks, the latter is not a practical option.
Computation of the motion vector(s) for a given macroblock is typically performed by means of an exhaustive search procedure. The current macroblock in question is “compared” with a macroblock-sized pixel array within the anchor picture that is offset by an amount less than specified vertical and horizontal distances, called the “search ranges,” and an “error” value is computed for this particular “match” of the macroblock using a specified criterion, or “metric,” that gives a measure of how large the error is. This is done for every possible combination of vertical and horizontal offset values within the respective search ranges, and the offset pair that yields the smallest error according to the chosen metric is selected as the motion vector for the current macroblock relative to the anchor picture. Clearly, this procedure is very computationally intensive.
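The exhaustive search described above can be sketched in a few lines of Python. Everything named here (`full_search`, `block_at`, `sad`, the inclusive treatment of the search-range bounds, the skipping of candidates that fall outside the picture, and the small synthetic picture) is an assumption chosen for illustration rather than a description of any particular search engine.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-sized pixel arrays."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def block_at(picture, top, left, height, width):
    """Extract a height-by-width pixel array whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in picture[top:top + height]]


def full_search(current_block, anchor, top, left, v_range, h_range):
    """Try every offset pair in the search ranges; return (error, (dy, dx))."""
    height, width = len(current_block), len(current_block[0])
    best = None
    for dy in range(-v_range, v_range + 1):        # bounds inclusive (assumed)
        for dx in range(-h_range, h_range + 1):
            y, x = top + dy, left + dx
            if (y < 0 or x < 0 or y + height > len(anchor)
                    or x + width > len(anchor[0])):
                continue                            # candidate leaves the picture
            err = sad(current_block, block_at(anchor, y, x, height, width))
            if best is None or err < best[0]:
                best = (err, (dy, dx))
    return best


# Synthetic 8-by-8 anchor picture; the 4-by-4 "macroblock" is hypothetical.
anchor = [[10 * y + x for x in range(8)] for y in range(8)]
current_block = block_at(anchor, 3, 4, 4, 4)       # content that moved by (1, 2)
result = full_search(current_block, anchor, top=2, left=2, v_range=2, h_range=2)
```

With search ranges of ±2 this evaluates up to 25 candidates of 16 pixels each; for a 16-by-16 macroblock and realistic ranges, the count of pixel comparisons grows into the hundreds of thousands per macroblock, which illustrates why the procedure is so costly.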
Based on the above and foregoing, it can be appreciated that there presently exists a need in the art for a motion estimation method that overcomes the disadvantages and shortcomings of the presently available technology. The present invention fulfills this need in the art by performing motion coding of an uncompressed digital video sequence in such a manner that the prediction mode for each individual macroblock is determined as part of the motion estimation process, along with the actual motion vector(s), and need not be specified in advance; only the type of picture currently being coded need be known. Since the latter must be determined at a higher level of video coding than the macroblock layer, this method makes possible a much more efficient, as well as optimal, degree of video compression than would otherwise be possible using conventional methods of motion estimation. Further, the present invention provides a novel scheme for concurrently searching for the optimum macroblock match within the appropriate anchor picture according to each of a plurality of motion prediction modes during the same search operation for the given macroblock, without the need for a separate search to be performed on the same macroblock for each such mode. Since this search procedure is the single most complex and expensive aspect of motion estimation, in both time and hardware, such a method as the present invention will clearly result in a more efficient video image coding and compression than would otherwise be possible given the aforementioned practical limitations of the presently available technology.
Although the present invention was primarily motivated by the specific requirements of the ATSC standard, it can nevertheless be used with any digital video transmission or storage system that employs a video compression scheme, such as MPEG, in which motion coding with multiple prediction modes is used.
The present invention encompasses a method for motion coding an uncompressed digital video data stream such as an MPEG-2 digital video data stream. The method includes the steps of comparing pixels of a first pixel array in a picture currently being coded with pixels of a plurality of second pixel arrays in at least one reference picture and concurrently performing motion estimation for each of a plurality of different prediction modes in order to determine which of the prediction modes is an optimum prediction mode, determining which of the second pixel arrays constitutes a best match with respect to the first pixel array for the optimum prediction mode, and, generating a motion vector for the first pixel array in response to the determining step. The method is implemented in a device such as a motion estimation search system of a digital video encoder. In one embodiment, the method and device are capable of concurrently performing motion estimation in each of the six different possible prediction modes specified by the MPEG-2 standard.
The present invention also encompasses a method for motion coding a digital video data stream comprised of a sequence of pictures having top and bottom fields which includes the steps of comparing pixels of a first portion (e.g., 16-by-8 portion) of a current macroblock (e.g., a 16-by-16 macroblock) of the top field of a current picture with pixels of each of a plurality of correspondingly-sized portions of a macroblock of a top field of an anchor picture in accordance with a prescribed search metric, and producing a first error metric for each comparison; comparing pixels of the first portion (e.g., 16-by-8 portion) of the current macroblock of the top field of the current picture with pixels of each of the plurality of correspondingly-sized portions of a macroblock of a bottom field of the anchor picture in accordance with the prescribed search metric, and producing a second error metric for each comparison; comparing pixels of a second portion (e.g., 16-by-8 portion) of a current macroblock (e.g., a 16-by-16 macroblock) of the bottom field of the current picture with pixels of each of the plurality of correspondingly-sized portions of the macroblock of the top field of the anchor picture in accordance with the prescribed search metric, and producing a third error metric for each comparison; comparing pixels of the second portion (e.g., a 16-by-8 portion) of the current macroblock of the bottom field of the current picture with pixels of each of the plurality of correspondingly-sized portions of the macroblock of the bottom field of the anchor picture in accordance with the prescribed search metric, and producing a fourth error metric for each comparison; summing the first and fourth error metrics to produce a first composite error metric; summing the second and third error metrics to produce a second composite error metric; and, determining which of the first, second, third, and fourth error metrics, and first and second composite error metrics has the lowest value, and selecting one of a plurality of possible motion estimation prediction modes on the basis of such determination. Preferably and advantageously, all of the comparing steps are performed concurrently, and both of the summing steps are performed concurrently. The plurality of possible motion estimation prediction modes can include frame and field prediction modes for frame pictures in accordance with the MPEG-2 standard.
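One way to picture the final selection step for frame pictures is the following Python sketch, which starts from error metrics that have already been computed for each candidate offset. The function name, the `field_bias` parameter, the reading of the two composite sums as frame-prediction candidates, and the rule that prefers frame prediction on a tie are all assumptions made for illustration; the text above states only that the mode is selected from the lowest of the six values.

```python
def select_frame_picture_mode(metrics, field_bias=0):
    """Pick frame or field prediction from per-offset error metrics.

    metrics maps an offset (dy, dx) to (e1, e2, e3, e4):
      e1: top-field portion vs. anchor top field
      e2: top-field portion vs. anchor bottom field
      e3: bottom-field portion vs. anchor top field
      e4: bottom-field portion vs. anchor bottom field
    """
    frame_candidates, top_candidates, bottom_candidates = [], [], []
    for offset, (e1, e2, e3, e4) in metrics.items():
        frame_candidates.append((e1 + e4, offset))   # first composite metric
        frame_candidates.append((e2 + e3, offset))   # second composite metric
        top_candidates.append((min(e1, e2), offset))
        bottom_candidates.append((min(e3, e4), offset))

    frame_error, frame_offset = min(frame_candidates)
    top_error, top_offset = min(top_candidates)
    bottom_error, bottom_offset = min(bottom_candidates)
    field_error = top_error + bottom_error

    # Assumed rule: frame prediction needs one vector, so it wins ties;
    # field_bias is a hypothetical allowance for coding a second vector.
    if frame_error <= field_error + field_bias:
        return ("frame", frame_error, [frame_offset])
    return ("field", field_error, [top_offset, bottom_offset])


# Hypothetical metric values for two candidate offsets.
uniform_motion = {(0, 0): (40, 90, 85, 35), (0, 1): (60, 20, 25, 70)}
split_motion = {(0, 0): (10, 90, 85, 80), (0, 1): (60, 70, 75, 12)}

choice_a = select_frame_picture_mode(uniform_motion)
choice_b = select_frame_picture_mode(split_motion)
```

In the first data set both fields match best at the same offset, so a single frame vector suffices; in the second the two fields prefer different offsets and field prediction gives the lower total error. Because all six values come from the same pass over the offsets, no second search is needed to make this comparison.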
The present invention also encompasses a method for motion coding a digital video data stream comprised of a sequence of pictures, in which the method includes the steps of comparing pixels of a first portion (e.g., 16-by-8 portion) of a top half of a current macroblock (e.g., a 16-by-16 macroblock) of a current picture with pixels of each of a plurality of correspondingly-sized portions of a macroblock of a top field of an anchor picture in accordance with a prescribed search metric, and producing a first error metric for each comparison; comparing pixels of the first portion (e.g., 16-by-8 portion) of the top half of the current macroblock of the current picture with pixels of each of the plurality of correspondingly-sized portions of a macroblock of a bottom field of the anchor picture in accordance with the prescribed search metric, and producing a second error metric for each comparison; comparing pixels of a second portion (e.g., 16-by-8 portion) of a bottom half of a current macroblock (e.g., a 16-by-16 macroblock) of the current picture with pixels of each of the plurality of correspondingly-sized portions of the macroblock of the top field of the anchor picture in accordance with the prescribed search metric, and producing a third error metric for each comparison; comparing pixels of the second portion (e.g., a 16-by-8 portion) of the bottom half of the current macroblock of the current picture with pixels of each of the plurality of correspondingly-sized portions of the macroblock of the bottom field of the anchor picture in accordance with the prescribed search metric, and producing a fourth error metric for each comparison; summing the first and third error metrics to produce a first composite error metric; summing the second and fourth error metrics to produce a second composite error metric; and, determining which of the first, second, third, and fourth error metrics, and first and second composite error metrics has the lowest value, and selecting one of a plurality of possible motion estimation prediction modes on the basis of such determination. Preferably and advantageously, all of the comparing steps are performed concurrently, and both of the summing steps are performed concurrently. The plurality of possible motion estimation prediction modes can include field and 16×8 prediction modes for field pictures in accordance with the MPEG-2 standard.
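The field-picture variant can be sketched as a single search loop in which every candidate offset updates all six running minima at once, so that no separate search is run per prediction mode. This is a software analogy only: the names (`field_picture_search`, `window`, the dictionary keys), the inclusive search bounds, the placement of the lower half directly below the upper half at each offset, and the 4-by-4 stand-in for a 16-by-16 macroblock are assumptions for illustration.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-sized pixel arrays."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def window(field, top, left, height, width):
    """Extract a height-by-width pixel array from one anchor field."""
    return [row[left:left + width] for row in field[top:top + height]]


def field_picture_search(cur_mb, anchor_top, anchor_bottom, top, left, search):
    """One pass over the offsets; returns name -> (error, (dy, dx))."""
    half, width = len(cur_mb) // 2, len(cur_mb[0])
    upper, lower = cur_mb[:half], cur_mb[half:]
    best = {}
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if (y < 0 or x < 0 or y + 2 * half > len(anchor_top)
                    or x + width > len(anchor_top[0])):
                continue                       # candidate leaves the picture
            e1 = sad(upper, window(anchor_top, y, x, half, width))
            e2 = sad(upper, window(anchor_bottom, y, x, half, width))
            e3 = sad(lower, window(anchor_top, y + half, x, half, width))
            e4 = sad(lower, window(anchor_bottom, y + half, x, half, width))
            candidates = {
                "upper_vs_top": e1, "upper_vs_bottom": e2,
                "lower_vs_top": e3, "lower_vs_bottom": e4,
                "field_from_top": e1 + e3,     # first composite metric
                "field_from_bottom": e2 + e4,  # second composite metric
            }
            for name, err in candidates.items():
                if name not in best or err < best[name][0]:
                    best[name] = (err, (dy, dx))
    return best


# Synthetic 8-by-8 anchor fields and a 4-by-4 stand-in macroblock (hypothetical).
anchor_top = [[10 * y + x for x in range(8)] for y in range(8)]
anchor_bottom = [[200 + x + y for x in range(8)] for y in range(8)]
cur_mb = window(anchor_top, 3, 3, 4, 4)        # content moved by (1, 1)
best = field_picture_search(cur_mb, anchor_top, anchor_bottom, 2, 2, 1)
```

The four half-block minima would feed a 16×8 prediction decision and the two composite minima a field prediction decision; in hardware the four comparisons could presumably run in parallel, whereas this sketch merely interleaves them within one loop.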
The present invention also encompasses a method for motion coding a digital video data stream comprised of a sequence of pictures having top and bottom fields which includes the steps of comparing pixels of a first portion (e.g., 16-by-8 portion) of a current macroblock (e.g., a 16-by-16 macroblock) of the top field of a current picture with pixels of each of a plurality of correspondingly-sized portions of a macroblock of a top field of an anchor picture in accordance with a prescribed search metric, and producing a first error metric for each comparison; comparing pixels of the first portion (e.g., 16-by-8 portion) of the current macroblock of the top field of the current picture with pixels of each of the plurality of correspondingly-sized portions of a macroblock of a bottom field of the anchor picture in accordance with the prescribed search metric, and producing a second error metric for each comparison; comparing pixels of a second portion (e.g., 16-by-8 portion) of a current macroblock (e.g., a 16-by-16 macroblock) of the bottom field of the current picture with pixels of each of the plurality of correspondingly-sized portions of the macroblock of the top field of the anchor picture in accordance with the prescribed search metric, and producing a third error metric for each comparison; comparing pixels of the second portion (e.g., a 16-by-8 portion) of the current macroblock of the bottom field of the current picture with pixels of each of the plurality of correspondingly-sized portions of the macroblock of the bottom field of the anchor picture in accordance with the prescribed search metric, and producing a fourth error metric for each comparison; producing first, second, third, and fourth motion vectors on the basis of the first, second, third, and fourth error metrics, respectively; and, examining the first, second, third, and fourth motion vectors to determine whether a prescribed relationship between them is present, and, if so, selecting a frame picture dual-prime motion estimation prediction mode. Preferably and advantageously, all of the comparing steps are performed concurrently.
The present invention also encompasses a method for motion coding a digital video data stream comprised of a sequence of pictures, in which the method includes the steps of comparing pixels of a first portion (e.g., 16-by-8 portion) of a top half of a current macroblock (e.g., a 16-by-16 macroblock) of a current picture with pixels of each of a plurality of correspondingly-sized portions of a macroblock of a top field of an anchor picture in accordance with a prescribed search metric, and producing a first error metric for each comparison; comparing pixels of the first portion (e.g., 16-by-8 portion) of the top half of the current macroblock of the current picture with pixels of each of the plurality of correspondingly-sized portions of a macroblock of a bottom field of the anchor picture in accordance with the prescribed search metric, and producing a second error metric for each comparison; comparing pixels of a second portion (e.g., 16-by-8 portion) of a bottom half of a current macroblock (e.g., a 16-by-16 macroblock) of the current picture with pixels of each of the plurality of correspondingly-sized portions of the macroblock of the top field of the anchor picture in accordance with the prescribed search metric, and producing a third error metric for each comparison; comparing pixels of the second portion (e.g., a 16-by-8 portion) of the bottom half of the current macroblock of the current picture with pixels of each of the plurality of correspondingly-sized portions of the macroblock of the bottom field of the anchor picture in accordance with the prescribed search metric, and producing a fourth error metric for each comparison; summing the first and third error metrics to produce a first composite error metric; summing the second and fourth error metrics to produce a second composite error metric; producing first and second motion vectors on the basis of the first and second composite error metrics, respectively; and, examining the first and second motion vectors to determine whether a prescribed relationship between them is present, and if so, selecting a field picture dual-prime motion estimation prediction mode. Preferably and advantageously, all of the comparing steps are performed concurrently, both of the summing steps are performed concurrently, and both of the producing steps are performed concurrently.
The present invention further encompasses a device such as a motion estimation search system for a digital video encoder that concurrently implements any of the above-described methods of the present invention in any combination thereof.