The present invention relates generally to methods and apparatus for motion estimation for video image processing and, in particular, is directed to improved methods and apparatus for determining motion vectors between frames of video images using a sparse search block-matching motion estimation technique and integral projection data.
Advancements in digital technology have produced a number of digital video applications. Digital video is currently used in digital and high definition TV, videoconferencing, computer imaging, and high-quality video tape recorders. Uncompressed digital video signals constitute a huge amount of data and therefore require a large amount of memory to store and a large amount of bandwidth to transmit. Many digital video systems, therefore, reduce the amount of digital video data by employing data compression techniques that are optimized for particular applications. Digital compression devices are commonly referred to as “encoders”; devices that perform decompression are referred to as “decoders”. Devices that perform both encoding and decoding are referred to as “codecs”.
In the interest of standardizing methods for motion picture video compression, the Moving Picture Experts Group (MPEG) issued a number of standards for digital video processing. MPEG-1 addresses digital audio and video coding and is commonly used by video devices needing intermediate data rates. MPEG-2 is used with devices using higher data rates, such as direct broadcast satellite systems.
Motion picture video sequences consist of a series of still pictures or “frames” that are sequentially displayed to provide the illusion of continuous motion. Each frame may be described as a two-dimensional array of picture elements, or “pixels”. Each pixel describes a particular point in the picture in terms of brightness and hue. Pixel information can be represented in digital form, or encoded, and transmitted digitally.
One way to compress video data is to take advantage of the redundancy between neighboring frames of a video sequence. Since neighboring frames tend to contain similar information, describing the difference between frames typically requires less data than describing the new frame. If there is no motion between frames, for example, coding the difference (zero) requires less data than recoding the entire frame.
Motion estimation is the process of estimating the displacement between neighboring frames. Displacement is described as the motion vectors that give the best match between a specified region in the current frame and the corresponding displaced region in a previous or subsequent reference frame. The difference between the specified region in the current frame and the corresponding displaced region in the reference frame is referred to as “residue”.
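The relationship between a motion vector and its residue can be illustrated with a minimal sketch. The function name `residue`, the list-of-lists frame layout, and the sample values are assumptions chosen for illustration only and are not taken from any particular encoder.

```python
def residue(current_block, reference, top, left, mv):
    """Return the residue: current block minus the displaced reference region.

    current_block -- 2-D list holding the block being coded
    reference     -- 2-D list holding the whole reference frame
    top, left     -- position of the block within the current frame
    mv            -- (dy, dx) displacement into the reference frame (assumed layout)
    """
    dy, dx = mv
    return [
        [current_block[y][x] - reference[top + dy + y][left + dx + x]
         for x in range(len(current_block[0]))]
        for y in range(len(current_block))
    ]


# Hypothetical 4x4 reference frame with a small pattern at row 1, column 1.
reference = [
    [0, 0, 0, 0],
    [0, 5, 6, 0],
    [0, 7, 8, 0],
    [0, 0, 0, 0],
]
# The same pattern appears at row 0, column 0 of the current frame.
current_block = [[5, 6], [7, 8]]

matched = residue(current_block, reference, 0, 0, (1, 1))
unmatched = residue(current_block, reference, 0, 0, (0, 0))
```

With the displacement (1, 1) the residue is all zeros, so only the vector needs to be coded; with a zero displacement the residue retains most of the block's energy.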
In general, there are two known types of motion estimation methods used to estimate the motion vectors: pixel-recursive algorithms and block-matching algorithms. Pixel-recursive techniques predict the displacement of each pixel iteratively from corresponding pixels in neighboring frames. Block-matching algorithms, on the other hand, estimate the displacement between frames on a block-by-block basis and choose vectors that minimize the difference.
In conventional block-matching processes, the current image to be encoded is divided into equal-sized blocks of pixel information. In MPEG video compression standards, the pixels are grouped into “macroblocks” consisting of a 16×16 sample array of luminance samples together with one 8×8 block of samples for each of the two chrominance components. The 16×16 array of luminance samples further comprises four 8×8 blocks that are typically used as input blocks to the compression models.
FIG. 1 illustrates one iteration of a conventional block-matching process. Current frame 120 is shown divided into blocks. Each block can be any size; in an MPEG device, for example, current frame 120 would typically be divided into 16×16-sized macroblocks. To code current frame 120, each block in current frame 120 is coded in terms of its difference from a block in a previous frame 110 or upcoming frame 130. In each iteration of a block-matching process, current block 100 is compared with similar-sized “candidate” blocks within search range 115 of preceding frame 110 or search range 135 of upcoming frame 130. The candidate block of the preceding or upcoming frame that is determined to have the smallest difference with respect to current block 100 is selected as the reference block, shown in FIG. 1 as reference block 150. The motion vectors and residues between reference block 150 and current block 100 are computed and coded. Current frame 120 can be restored during decompression using the coding for each block of reference frame 110 as well as motion vectors and residues for each block of current frame 120.
The difference between blocks may be calculated using any one of several known criteria; most methods, however, either minimize error or maximize correlation. Because most correlation techniques are computationally intensive, error-calculating methods are more commonly used. Examples of error-calculating measures include mean square error (MSE), mean absolute distortion (MAD), and sum of absolute distortions (SAD). These criteria are described in Joan L. Mitchell et al., MPEG Video Compression Standard, International Thomson Publishing (1997), pp. 284–86. SAD is a commonly used matching criterion. SAD is defined as:

$$\mathrm{SAD}(i,j) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} \left| r(x,y) - s(x+i,\, y+j) \right|$$

where block size is M×N, r(x,y) is the current block and s(x+i,y+j) is the candidate block within a search area 115 in the reference frame. The motion vector is the value (i,j) that results in the minimum value for SAD(i,j).
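The SAD criterion defined above may be sketched as follows. This is an illustrative sketch only: the function name `sad`, the list-of-lists layout indexed as `r[x][y]`, and the assumption that the search area is indexed from its own corner (so that i and j are non-negative) are not part of the source.

```python
def sad(r, s, i, j):
    """Sum of absolute differences between block r and the candidate block
    of search area s displaced by (i, j).

    r -- current block, indexed r[x][y] with x in 0..M-1 and y in 0..N-1
    s -- search area, indexed s[x][y]; assumed large enough that
         s[x + i][y + j] exists for every (x, y) in the block
    """
    m, n = len(r), len(r[0])
    return sum(abs(r[x][y] - s[x + i][y + j])
               for x in range(m)
               for y in range(n))


# Hypothetical 2x2 block and 3x3 search area for illustration.
r = [[1, 2],
     [3, 4]]
s = [[9, 9, 9],
     [9, 1, 2],
     [9, 3, 4]]

exact_match = sad(r, s, 1, 1)   # candidate identical to r
poor_match = sad(r, s, 0, 0)    # candidate mostly 9s
```

The displacement with the smaller SAD, here (1, 1), would be taken as the motion vector.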
A block-matching algorithm that compares the current block to every candidate block within the search range is called a “full search”. Larger search areas generally produce a more accurate displacement vector; however, the computational complexity of a full search is proportional to the size of the search range and is too slow for some applications. A full search block-matching algorithm applied on a macroblock of size 16×16 pixels over a search range of ±N pixels with one-pixel accuracy, for example, requires (2×N+1)² block comparisons. For N=16, 1089 16×16 block comparisons are required. Because each block comparison requires 16×16, or 256, calculations, this method is computationally intensive and operationally very slow. Techniques that simply reduce the size of the search area, however, run a greater risk of failing to find the optimal matching block.
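The cost of a full search can be made concrete with a short sketch that counts block comparisons. The names `full_search` and `search_range`, and the assumption that the entire search window lies inside the reference frame, are illustrative choices rather than details from the source.

```python
def sad(r, frame, top, left):
    """SAD between block r and the same-sized region of frame at (top, left)."""
    m, n = len(r), len(r[0])
    return sum(abs(r[x][y] - frame[top + x][left + y])
               for x in range(m)
               for y in range(n))


def full_search(r, frame, top, left, search_range):
    """Compare r against every candidate within +/- search_range pixels.

    Assumes every candidate lies inside the frame (no boundary handling).
    Returns (motion_vector, minimum_sad, number_of_block_comparisons).
    """
    best = None
    comparisons = 0
    for i in range(-search_range, search_range + 1):
        for j in range(-search_range, search_range + 1):
            cost = sad(r, frame, top + i, left + j)
            comparisons += 1
            if best is None or cost < best[0]:
                best = (cost, (i, j))
    return best[1], best[0], comparisons


# Hypothetical 4x4 reference frame with distinct pixel values.
frame = [[10 * x + y for y in range(4)] for x in range(4)]
# Current block located at (1, 1); its content matches the frame at (2, 1).
current_block = [[21, 22], [31, 32]]

vector, cost, comparisons = full_search(current_block, frame, 1, 1, 1)

# Comparison count for the 16x16 macroblock example with N = 16.
macroblock_comparisons = (2 * 16 + 1) ** 2
pixel_operations = macroblock_comparisons * 16 * 16
```

For the macroblock example, the 1089 comparisons at 256 calculations each amount to 278,784 pixel differences for a single macroblock.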
As a result, there has been much emphasis on producing fast algorithms for finding the matching block within a wide search range. Several of these techniques are described in Mitchell et al., pp. 301–11. Most fast search techniques gain speed by computing the displacement only for a sparse sampling of the full search area. The 2-D logarithmic search, for example, reduces the number of computations by computing the MSE for successive blocks moving in the direction of minimum distortion. In a conjugate direction search, the algorithm searches in a horizontal direction until a minimum distortion is found. Then, proceeding from that point, the algorithm searches in a vertical direction until a minimum is found. Both of these methods are faster than a full search but frequently fail to locate the optimal matching block.
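The conjugate direction search described above may be sketched as two successive one-dimensional walks. This is a simplified sketch under stated assumptions: the names `line_search` and `conjugate_direction_search`, the one-pixel step, and the synthetic cost function standing in for a block-distortion measure are all hypothetical.

```python
def line_search(cost, pos, axis, limit):
    """Step one pixel at a time along one axis (0 = horizontal, 1 = vertical)
    while the cost keeps decreasing. Returns (position, cost, evaluations)."""
    best = cost(pos)
    evaluations = 1
    for direction in (-1, 1):
        while True:
            nxt = list(pos)
            nxt[axis] += direction
            nxt = tuple(nxt)
            if abs(nxt[axis]) > limit:
                break                      # stay inside the search range
            c = cost(nxt)
            evaluations += 1
            if c >= best:
                break                      # no further improvement this way
            pos, best = nxt, c
    return pos, best, evaluations


def conjugate_direction_search(cost, limit):
    """Search horizontally to a minimum, then vertically from that point."""
    pos, _, n_h = line_search(cost, (0, 0), 0, limit)
    pos, best, n_v = line_search(cost, pos, 1, limit)
    return pos, best, n_h + n_v


# Synthetic distortion surface with a single minimum at displacement (3, -2).
def toy_cost(p):
    return abs(p[0] - 3) + abs(p[1] + 2)


position, distortion, evaluations = conjugate_direction_search(toy_cost, 4)
full_search_evaluations = (2 * 4 + 1) ** 2
```

On this single-minimum surface the walk reaches the best displacement with far fewer evaluations than the 81 a full search over ±4 pixels would need. On a real distortion surface with several local minima, the same walk can stop at a non-optimal candidate, which is the weakness noted above.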
Another method for reducing the amount of computation in a full search is to calculate the displacement between blocks using integral projection data rather than directly using spatial domain pixel information. An integral projection of pixel information is a one-dimensional array of sums of image pixel values along a horizontal or vertical direction. Using two 1-D horizontal and vertical projection arrays rather than the 2-D array of pixel information in a block-matching algorithm significantly reduces the number of computations required for each block comparison. This technique is described in a paper by I. H. Lee and R. H. Park entitled “A Fast Block Matching Algorithm Using Integral Projections,” Proc. Tencon '87 Conf., 1987, pp. 590–594.
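A minimal sketch of block comparison using integral projections follows. The names `projections` and `projection_sad`, and the choice of summing absolute differences over both projection arrays, are assumptions made for illustration and are not asserted to be the method of the cited paper.

```python
def projections(block):
    """Return (row_sums, column_sums) for a 2-D block.

    row_sums[k]    -- sum of the pixels along row k
    column_sums[k] -- sum of the pixels along column k
    """
    row_sums = [sum(row) for row in block]
    column_sums = [sum(col) for col in zip(*block)]
    return row_sums, column_sums


def projection_sad(a, b):
    """Sum of absolute differences between the projections of blocks a and b."""
    rows_a, cols_a = projections(a)
    rows_b, cols_b = projections(b)
    return (sum(abs(p - q) for p, q in zip(rows_a, rows_b))
            + sum(abs(p - q) for p, q in zip(cols_a, cols_b)))


# Hypothetical 2x2 blocks for illustration.
a = [[1, 2],
     [3, 4]]
b = [[1, 2],
     [3, 5]]
rows_a, cols_a = projections(a)
difference = projection_sad(a, b)

# Differences per comparison for a 16x16 block: projections versus pixels.
projection_terms = 16 + 16
pixel_terms = 16 * 16

# Distinct blocks can share identical projections.
diag = [[1, 0], [0, 1]]
anti = [[0, 1], [1, 0]]
ambiguous = projection_sad(diag, anti)
```

For a 16×16 block, each comparison involves 32 projection differences instead of 256 pixel differences, once the projections themselves are available. The last example shows the trade-off: two different blocks can have identical projections, so the projection measure is an approximation of the pixel-domain difference.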
Fast motion estimation techniques are particularly useful when converting from one digital video format to another. Digital video is stored in encoded, compressed form. When converting from one format to another using conventional devices, the digital video must first be decompressed and decoded to its original pixel form and then subsequently encoded and compressed for storage or transmission in the new format. Conversion techniques requiring that digital video be fully decoded are very time-consuming.
The present invention provides improved methods and apparatus for the motion estimation process by performing a fast search that minimizes the number of block comparisons while maintaining the quality of the motion vector. In addition, the present invention provides methods and apparatus for motion estimation using the fast search process of the present invention and integral projection to further minimize the number of computational operations. The present invention further provides methods and apparatus for fast motion estimation using integral projection that allow digital video data conversion from one format to a second format without full decoding to pixel data, thereby greatly reducing the time required for data format conversion.