The present invention relates generally to methods and apparatus for motion estimation for video image processing, and in particular, improved methods and apparatus for determining motion vectors between video image pictures with a hierarchical motion estimation technique using block-matching and integral projection data.
Advancements in digital technology have produced a number of digital video applications. Digital video is currently used in digital and high definition TV, camcorders, videoconferencing, computer imaging, and high-quality video tape recorders. Uncompressed digital video signals constitute a huge amount of data and therefore require a large amount of bandwidth and memory to store and transmit. Many digital video systems, therefore, reduce the amount of digital video data by employing data compression techniques that are optimized for particular applications. Digital compression devices are commonly referred to as “encoders”; devices that perform decompression are referred to as “decoders”. Devices that perform both encoding and decoding are referred to as “codecs”.
In the interest of standardizing methods for motion picture video compression, the Moving Picture Experts Group (MPEG) issued a number of standards. MPEG-1 is a compression algorithm intended for video devices having intermediate data rates. MPEG-2 is a compression algorithm for devices using higher data rates, such as digital high-definition TV (HDTV), direct broadcast satellite systems (DBSS), cable TV (CATV), and serial storage media such as digital video tape recorders (VTR). Digital Video (DV) format is another format used widely in consumer video products, such as digital camcorders. The DV format is further explained in the SD Specifications of Consumer-Use Digital VCRs dated December 1994.
A video sequence is composed of a series of still pictures taken at closely spaced intervals in time that are sequentially displayed to provide the illusion of continuous motion. Each picture may be described as a two-dimensional array of samples, or “pixels”. Each pixel describes a specific location in the picture in terms of brightness and hue. Each horizontal line of pixels in the two-dimensional picture is called a raster line. Pictures may be comprised of a single frame or two fields.
When sampling or displaying a frame of video, the video frame may be “interlaced” or “progressive.” Progressive video consists of frames in which the raster lines are sequential in time, as shown in FIG. 1A. The MPEG-1 standard allows only progressive frames. Alternatively, each frame may be divided into two interlaced fields, as shown in FIG. 1B. Each field has half the lines in the full frame and the fields are interleaved such that alternate lines in the frame belong to alternate fields. In an interlaced frame composed of two fields, one field is referred to as the “top” field, while the other is called the “bottom” field. The MPEG-2 standard allows both progressive and interlaced video.
One of the ways MPEG applications achieve data compression is to take advantage of the redundancy between neighboring pictures of a video sequence. Since neighboring pictures tend to contain similar information, describing the difference between neighboring pictures typically requires less data than describing the new picture. If there is no motion between neighboring pictures, for example, coding the difference (zero) requires less data than recoding the entire new picture.
An MPEG video sequence is comprised of one or more groups of pictures, each group of which is composed of one or more pictures of type I-, P-, or B-. Intra-coded pictures, or “I-pictures,” are coded independently without reference to any other pictures. Predictive-coded pictures, or “P-pictures,” use information from preceding reference pictures, while bidirectionally predictive-coded pictures, or “B-pictures,” may use information from preceding or upcoming pictures, both, or neither.
Motion estimation is the process of estimating the displacement of a portion of an image between neighboring pictures. For example, a moving soccer ball will appear in different locations in adjacent pictures. Displacement is described as the motion vectors that give the best match between a specified region, e.g., the ball, in the current picture and the corresponding displaced region in a preceding or upcoming reference picture. The difference between the specified region in the current picture and the corresponding displaced region in the reference picture is referred to as “residue”.
In general, two known types of motion estimation methods used to estimate the motion vectors are pixel-recursive algorithms and block-matching algorithms. Pixel-recursive techniques predict the displacement of each pixel iteratively from corresponding pixels in neighboring frames. Block-matching algorithms, on the other hand, estimate the displacement between frames on a block-by-block basis and choose vectors that minimize the difference.
In conventional block-matching processes, the current image to be encoded is divided into equal-sized blocks of pixel information. In MPEG-1 and MPEG-2 video compression standards, for example, the pixels are grouped into “macroblocks,” each consisting of a 16×16 sample array of luminance samples together with one 8×8 block of samples for each of the two chrominance components. The 16×16 array of luminance samples further comprises four 8×8 blocks that are typically used as input blocks to the compression models.
FIG. 2 illustrates one iteration of a conventional block-matching process. Current picture 220 is shown divided into blocks. Each block can be any size; however, in an MPEG device, for example, current picture 220 would typically be divided into 16×16 macroblocks. To code current picture 220, each block in current picture 220 is coded in terms of its difference from a block in a previous picture 210 or upcoming picture 230. In each iteration of a block-matching process, current block 200 is compared with similar-sized “candidate” blocks within search range 215 of preceding picture 210 or search range 235 of upcoming picture 230. The candidate block of the preceding or upcoming picture that is determined to have the smallest difference with respect to current block 200 is selected as the reference block, shown in FIG. 2 as reference block 250. The motion vectors and residues between reference block 250 and current block 200 are computed and coded. Current picture 220 can be restored during decompression using the coding for each block of reference picture 210 as well as motion vectors and residues for each block of current picture 220. The motion vectors associated with the preceding reference picture are called forward motion vectors, whereas those associated with the upcoming reference picture are called backward motion vectors.
The difference between blocks may be calculated using any one of several known criteria; however, most methods generally minimize error or maximize correlation. Because most correlation techniques are computationally intensive, error-calculating methods are more commonly used. Examples of error-calculating measures include mean square error (MSE), mean absolute distortion (MAD), and sum of absolute distortions (SAD). These criteria are described in Joan L. Mitchell et al., MPEG Video Compression Standard, International Thomson Publishing (1997), pp. 284-86.
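The three error measures named above can be illustrated with a minimal Python sketch. The function names and the 2×2 sample blocks are hypothetical and chosen only for illustration; blocks are assumed to be equal-sized lists of rows of integer sample values.

```python
def sad(block_a, block_b):
    """Sum of absolute distortions: sum of |a - b| over all sample pairs."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def mad(block_a, block_b):
    """Mean absolute distortion: SAD divided by the number of samples."""
    count = len(block_a) * len(block_a[0])
    return sad(block_a, block_b) / count


def mse(block_a, block_b):
    """Mean square error: mean of (a - b)^2 over all sample pairs."""
    count = len(block_a) * len(block_a[0])
    total = sum((a - b) ** 2
                for row_a, row_b in zip(block_a, block_b)
                for a, b in zip(row_a, row_b))
    return total / count


# Hypothetical 2x2 blocks; the sample differences are -1, 0, 1 and -4.
current = [[10, 12], [14, 16]]
candidate = [[11, 12], [13, 20]]

print(sad(current, candidate))  # 6
print(mad(current, candidate))  # 1.5
print(mse(current, candidate))  # 4.5
```

SAD and MAD rank candidates identically for a fixed block size, since they differ only by a constant divisor; MSE penalizes large individual differences more heavily.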
A block-matching algorithm that compares the current block to every candidate block within the search range is called a “full search”. Larger search areas generally produce a more accurate displacement vector; however, the computational complexity of a full search is proportional to the size of the search area, and a full search is too slow for some applications. A full search block-matching algorithm applied to a macroblock of 16×16 pixels over a search range of ±N pixels with one-pixel accuracy, for example, requires (2×N+1)² block comparisons. For N=16, 1089 16×16 block comparisons are required. Because each block comparison requires 16×16, or 256, calculations, this method is computationally intensive and operationally very slow. Techniques that simply reduce the size of the search area, however, run a greater risk of failing to find the optimal matching block.
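The cost of a full search can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than a reference implementation: SAD is used as the matching criterion, pictures are lists of rows of integers, the motion vector is taken as the offset from the current block position to the matching block in the reference picture, and all names (`full_search`, `block_sad`, the synthetic pictures) are hypothetical.

```python
def block_sad(cur, ref, cx, cy, rx, ry, size):
    """SAD between the size x size block of cur at (cx, cy) and of ref at (rx, ry)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
    return total


def full_search(cur, ref, cx, cy, size, n):
    """Compare the current block with every candidate within +/-n pixels.

    Returns ((cost, dx, dy), comparisons). Candidates that would fall
    outside the reference picture are skipped.
    """
    height, width = len(ref), len(ref[0])
    best = None
    comparisons = 0
    for dy in range(-n, n + 1):
        for dx in range(-n, n + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            comparisons += 1
            cost = block_sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best, comparisons


# Synthetic 16x16 reference picture in which every sample value is unique.
ref = [[16 * y + x for x in range(16)] for y in range(16)]

# Current picture: the 4x4 block at (6, 6) is a copy of the reference
# block displaced by (dx, dy) = (2, -1); the rest is left at zero.
cur = [[0] * 16 for _ in range(16)]
for y in range(4):
    for x in range(4):
        cur[6 + y][6 + x] = ref[6 - 1 + y][6 + 2 + x]

best, comparisons = full_search(cur, ref, cx=6, cy=6, size=4, n=3)
print(best)         # (0, 2, -1)
print(comparisons)  # 49, i.e. (2*3 + 1)**2

# Operation count for the case in the text: N = 16, 16x16 macroblock.
n_text = 16
block_comparisons = (2 * n_text + 1) ** 2
print(block_comparisons)            # 1089
print(block_comparisons * 16 * 16)  # 278784 sample differences per macroblock
```

The final two figures restate the arithmetic in the paragraph above: 1089 block comparisons of 256 calculations each, for a single macroblock and a single reference picture.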
As a result, there has been much emphasis on producing fast algorithms for finding the matching block within a wide search range. Several of these techniques are described in Mitchell et al., pp. 301-11. Most fast search techniques gain speed by computing the displacement only for a sparse sampling of the full search area. The 2-D logarithmic search, for example, reduces the number of computations by computing the MSE for sparsely-spaced candidates, and then successively searching the closer spaced candidates surrounding the best candidate found in the previous iteration. In a conjugate direction search, the algorithm searches in a horizontal direction until a minimum distortion is found. Then, proceeding from that point, the algorithm searches in a vertical direction until a minimum is found. Both of these methods are faster than a full search but frequently fail to locate the optimal matching block.
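The idea of evaluating sparsely spaced candidates and then refining around the best one can be sketched as follows. This is a generic coarse-to-fine search written for illustration, not the exact 2-D logarithmic search described in Mitchell et al.; the function name, the nine-point pattern, the step-halving rule, and the synthetic cost function are all assumptions.

```python
def coarse_to_fine_search(cost, start_step):
    """Evaluate the 8 neighbors at distance `step` around the best
    displacement found so far, move to the best of them, halve the
    step, and repeat until the step drops below one pixel.

    `cost` is any callable mapping a displacement (dx, dy) to a
    matching error. Returns (best displacement, its cost, evaluations).
    """
    best = (0, 0)
    best_cost = cost(0, 0)
    evaluations = 1
    step = start_step
    while step >= 1:
        cx, cy = best  # center is fixed for this iteration
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue  # center was already evaluated
                c = cost(cx + dx, cy + dy)
                evaluations += 1
                if c < best_cost:
                    best_cost = c
                    best = (cx + dx, cy + dy)
        step //= 2
    return best, best_cost, evaluations


def bowl(dx, dy):
    """Synthetic error surface with a single minimum at (5, -3)."""
    return (dx - 5) ** 2 + (dy + 3) ** 2


result = coarse_to_fine_search(bowl, start_step=4)
print(result)  # ((5, -3), 0, 25)
```

With a starting step of 4 the search can reach displacements of up to ±7 pixels using 25 evaluations, where a full search over the same range would need 225. The example succeeds because the synthetic error surface has a single minimum; on a surface with several local minima the same procedure can settle on a non-optimal candidate, which is the weakness noted in the text.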
Another method for reducing the amount of computation in a full search is to calculate the displacement between blocks using integral projection data rather than directly using spatial domain pixel information. An integral projection of pixel information is a one-dimensional array of sums of image pixel values along a horizontal or vertical direction. Using two 1-D horizontal and vertical projection arrays rather than the 2-dimensional array of pixel information in a block-matching algorithm significantly reduces the number of computations of each block-matching. This technique is described in a paper by I. H. Lee and R. H. Park entitled “Fast Block Matching Algorithms Using Integral Projections,” Proc. Tencon '87 Conf., 1987, pp. 590-594.
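A minimal Python sketch of block comparison using integral projections follows. It assumes that the two projection arrays are the per-row sums and per-column sums of a block, and that the match error is the sum of the absolute differences of both arrays; the text does not specify how the two projections are combined, so that choice, like the function names, is an assumption made for illustration.

```python
def row_sums(block):
    """One sum per raster line of the block (sums taken along each row)."""
    return [sum(row) for row in block]


def column_sums(block):
    """One sum per column of the block (sums taken down each column)."""
    return [sum(col) for col in zip(*block)]


def projection_sad(block_a, block_b):
    """Compare two blocks using only their 1-D projection arrays."""
    rows = sum(abs(p - q)
               for p, q in zip(row_sums(block_a), row_sums(block_b)))
    cols = sum(abs(p - q)
               for p, q in zip(column_sums(block_a), column_sums(block_b)))
    return rows + cols


# Hypothetical 2x2 blocks.
current = [[10, 12], [14, 16]]
candidate = [[11, 12], [13, 20]]

print(row_sums(current))                   # [22, 30]
print(column_sums(current))                # [24, 28]
print(projection_sad(current, candidate))  # 8

# For a 16x16 block, a pixel-domain comparison needs 256 differences,
# while comparing precomputed projections needs 16 + 16 = 32.
print(16 * 16, 16 + 16)  # 256 32
```

The saving applies to each block comparison once the projections are available; computing the projections themselves is an additional cost that can be shared across overlapping candidates.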
Other methods for overcoming the disadvantages of a full search have employed hierarchical search techniques. In a first stage, for example, a coarse search is performed over a reasonably large area. In successive stages of a conventional hierarchical search, the size of the search area is reduced. One example of a three-step hierarchical search is described in H. M. Jong et al., “Parallel Architectures for 3-Step Hierarchical Search Block-Matching Algorithm,” IEEE Trans. On Circuits and Systems for Video Technology, Vol. 4, August 1994, pp. 407-416. The hierarchical search described in Jong et al. is inadequate for some applications because the coarse search does not utilize all of the pixel information and thus may form an incorrect starting point for the finer search. Another type of hierarchical search is disclosed in U.S. patent application No. 09/093,307, to Chang et al., filed on Jun. 9, 1998, entitled “Hierarchical Motion Estimation Process and System Using Block-Matching and Integral Projection” (“Chang I”), the contents of which are hereby expressly incorporated by reference.
Fast motion estimation techniques are particularly useful when converting from one digital video format to another. Digital video is stored in encoded, compressed form. When converting from one format to another using conventional devices, the digital video must first be decompressed and decoded to its original pixel form and then subsequently encoded and compressed for storage or transmission in the new format. Conversion techniques requiring that digital video be fully decoded are very time-consuming.
The present invention provides improved methods and apparatus for performing motion estimation using a multi-tiered search technique that minimizes the number of operations while maintaining the quality of the motion vector. In addition, the present invention provides methods and apparatus for motion estimation that allow digital video data conversion from one format to a second format without full reduction to pixel data, thereby greatly reducing the time required for data format conversion.
Methods, systems, apparatus, and computer program products consistent with the present invention obtain a motion vector between first and second pictures of video image data in a video sequence. Each picture includes a plurality of macroblocks. First and second motion vectors based on the first and second macroblocks are determined using a first and second search method. A third motion vector is estimated using the neighboring first and second macroblocks. A fourth motion vector for a fourth macroblock is estimated using a third search method, and a fifth motion vector for a fifth macroblock neighboring both the first and fourth macroblocks is estimated based on the first and fourth motion vectors. A sixth motion vector for a sixth macroblock is determined using a fourth search method, and a seventh motion vector for a seventh macroblock is estimated based on the sixth motion vector and at least one of the first, second and fourth motion vectors. An eighth motion vector for an eighth macroblock neighboring the sixth macroblock and at least one of the first, second and fourth macroblocks is estimated based on the sixth motion vector and at least one of the first, second and fourth motion vectors. A ninth motion vector is estimated based on at least two of the first eight motion vectors. In methods and systems consistent with the present invention, the macroblock is encoded in one of several ways depending on thresholds. For example, the minimum of the first motion vector, the second motion vector, or an average of the first and second motion vectors may be encoded or a macroblock may be encoded independently of other pictures.