MPEG-2 is a conventional standard for digital video compression. MPEG-2 is based upon interframe compression. The theory behind interframe compression is that in most video scenes, the background remains relatively stable while action takes place in the foreground. Thus, even when the background moves, most of the video information from frame to frame is redundant.
The MPEG video compression algorithm employs two basic techniques, namely, block-based motion compensation for the reduction of temporal redundancy, and transform domain (DCT) coding for the reduction of spatial redundancy. The motion compensation technique is employed in the forward (causal) and backward (non-causal) direction. The remaining signal (prediction error) is coded using the transform-based technique. The motion predictors, called motion vectors, are transmitted together with the spatial information.
To understand temporal redundancy reduction, it is necessary to understand an MPEG video stream. There are three types of picture frames in an MPEG-2 video stream, namely, I frames (also referred to as “Intra” frames or reference frames), P (predicted) frames, and B (bi-directional interpolated) frames. The relationship between the frames is shown in FIG. 1.
To clarify terminology used herein, MPEG-2 refers to a “picture” as either a frame or a field. Therefore, a coded representation of a picture may be reconstructed to a frame or a field. During the encoding process, the encoder may code a frame as one frame picture or two field pictures. If the frame is encoded as field pictures, the two fields are coded independently of each other. That is, the two fields are coded as if they were two different pictures, wherein each picture has one-half of the vertical size of a frame. The discussion below refers to pictures and frames interchangeably.
MPEG-2 starts the compression process by creating an I frame or reference frame. An I frame contains a complete frame of video, and I frames are placed every 10 or 15 frames. Only a small portion of each frame that falls between the I frames differs from the adjacent frames. Only these differences are captured, compressed and stored. I frames provide entry points into a video file to allow for random access. I frames can only be moderately compressed.
P frames are encoded with reference to a past frame, which can be either an I or P frame. Generally, P frames are used as a reference to future P frames. P frames are highly compressed.
B frames are encoded with reference to both a past and a future frame. B frames are the most highly compressed of the three types of frames. B frames are never used as references. There is no limit to the number of B frames allowed between two references, or to the number of frames between two I frames.
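The frame-type relationships above can be sketched in code. The snippet below assumes a common display-order pattern (an I frame every 12 frames with two B frames between anchors); the MPEG-2 standard itself does not fix these spacings, so the pattern is illustrative only.

```python
# Illustrative group of pictures (GOP) in display order, using the
# common I-B-B-P spacing (an assumption, not mandated by MPEG-2).
gop = list("IBBPBBPBBPBB")

def references(gop, idx):
    """Return the display-order indices a frame is predicted from.

    I frames have no references; P frames reference the previous
    anchor (I or P); B frames reference the surrounding anchors.
    A trailing B frame would normally reference the next GOP's I
    frame; here it falls back to its past anchor only.
    """
    kind = gop[idx]
    if kind == "I":
        return []                           # intra-coded, no references
    anchors = [i for i, k in enumerate(gop) if k in "IP"]
    past = max(i for i in anchors if i < idx)
    if kind == "P":
        return [past]                       # forward prediction only
    future = min((i for i in anchors if i > idx), default=None)
    return [past, future] if future is not None else [past]

for i, k in enumerate(gop):
    print(i, k, references(gop, i))
```

Note how every B frame depends on two anchors while never serving as a reference itself, which is why B frames can be compressed most aggressively.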
Motion compensation prediction assumes that the current picture can be locally modeled as a translation of a picture at some previous time. According to the MPEG standard, the reference picture is divided into a grid of 16×16 pixel squares called macroblocks. Each subsequent picture is also divided into these same macroblocks. A computer then searches for an exact, or near exact, match between a reference picture macroblock and those in succeeding pictures. When a match is found, the computer transmits only the difference through a “vector movement code” or “motion vector.” Stated simply, the motion vector indicates where the macroblock moved from its original position. The macroblocks that did not change are ignored. Thus, only the non-zero motion vectors are subsequently “coded.” Accordingly, the amount of data that is actually compressed and stored is significantly reduced.
The MPEG syntax specifies how to represent motion information for each macroblock, but does not specify how the motion vectors must be computed. Many conventional motion vector computation schemes use block-matching. In block-matching, the motion vector is obtained by minimizing a cost function which measures the mismatch between the reference block and the current block. One widely used cost function is the absolute difference (AE), defined as:

AE(dx, dy) = Σ(i=0 to 15) Σ(j=0 to 15) |f(i, j) − g(i − dx, j − dy)|

wherein f(i,j) represents a macroblock of 16×16 pixels from the current picture, and g(i,j) represents the same macroblock from a reference picture. The reference macroblock is displaced by a vector (dx, dy), representing the search location.
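The AE cost function above can be written directly in code. This is a minimal sketch using nested lists as a toy stand-in for real luminance data; it assumes the displaced macroblock lies entirely inside the reference picture (a real encoder must clip or pad at picture borders).

```python
N = 16  # macroblock size used by MPEG

def ae(f, g, dx, dy):
    """Absolute-difference cost for one 16x16 macroblock.

    f: current picture, g: reference picture (both 2-D pixel arrays);
    (dx, dy) is the candidate displacement. Assumes the displaced
    block stays inside g, so no border handling is done here.
    """
    total = 0
    for i in range(N):
        for j in range(N):
            total += abs(f[i][j] - g[i - dx][j - dy])
    return total
```

At the true motion vector the displaced reference block matches the current block and the cost reaches its minimum; block-matching searches simply probe (dx, dy) candidates for the smallest AE.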
To find the best matching macroblock which produces the minimum mismatch error, the AE is calculated at several locations in the search range. The conceptually simplest, but most computationally intensive, search method is known as the “full search” or “exhaustive search.” This search evaluates the AE at every possible pixel location in the search area. Less computationally complex algorithms may also be used. One conventional algorithm is the Three-Step-Search (TSS). This algorithm first evaluates the AE at the center and eight surrounding locations of a 32×32 search area. The location that produces the smallest AE then becomes the center of the next stage, and the search range is reduced by half. This sequence is repeated three times. Because the TSS evaluates only a sparse subset of the candidate locations, it does not always locate the best matching macroblock.
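The three-stage procedure described above can be sketched as follows. The step sizes 8, 4, 2 are an assumption matching a 32×32 search window; the cost callable stands in for the AE evaluation over one macroblock.

```python
def three_step_search(cost, step=8):
    """Three-Step-Search sketch: probe the centre and its eight
    neighbours at the current step size, recentre on the cheapest
    candidate, halve the step, and repeat for three rounds
    (steps 8, 4, 2 with the default, i.e. a 32x32 window).

    cost(dx, dy) -> mismatch value (e.g. the AE of that displacement).
    Returns the (dx, dy) with the smallest cost found.
    """
    cx, cy = 0, 0
    while step >= 2:
        candidates = [(cx + a * step, cy + b * step)
                      for a in (-1, 0, 1) for b in (-1, 0, 1)]
        cx, cy = min(candidates, key=lambda p: cost(*p))
        step //= 2
    return cx, cy
```

Only 9 + 8 + 8 cost evaluations are needed instead of one per pixel of the window, which is exactly why the TSS is cheaper than the full search but can settle on a local rather than global minimum.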
After motion compensation is completed, spatial redundancy reduction is performed by applying the DCT to the prediction error and quantizing the resulting coefficients. Entropy coding is then performed on the quantized DCT coefficients.
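The spatial step can be illustrated with a naive 8×8 DCT followed by quantization. This is a deliberately simplified sketch: real MPEG-2 uses per-frequency quantizer matrices and fast transform implementations, whereas the flat step size q below is an assumption for clarity.

```python
import math

def dct_2d(block):
    """Naive 8x8 type-II DCT; block is an 8x8 list of pixel values
    (in MPEG-2 these would be prediction-error samples)."""
    n = 8
    def c(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = c(u) * c(v) * s
    return out

def quantize(coeffs, q=16):
    """Round each coefficient to the nearest multiple of q (a flat
    quantizer; MPEG-2 actually weights each frequency differently)."""
    return [[round(v / q) for v in row] for row in coeffs]
```

For a smooth block, the energy concentrates in the low-frequency coefficients, so quantization drives most high-frequency coefficients to zero; the entropy coder then represents those long runs of zeros very compactly.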
As discussed above, conventional schemes for finding motion vectors during motion compensation are either very computationally intensive (e.g., full search) or suffer from accuracy problems (e.g., Three-Step-Search). Accordingly, there is a need for a motion estimation algorithm which is less computationally intensive than a full search, but which does not suffer from accuracy problems. The present invention fulfills such a need.