The invention relates to video coding, and specifically, to an improved method for performing motion estimation in video coding applications.
Full-motion video displays based upon analog video signals have long been available in the form of television. With recent advances in computer processing capabilities and affordability, full-motion video displays based upon digital video signals are becoming more widely available. Digital video systems can provide significant improvements over conventional analog video systems in creating, modifying, transmitting, storing, and playing full-motion video sequences.
Digital video displays include large numbers of image frames that are played or rendered successively at frequencies of between 30 and 75 Hz. Each image frame is a still image formed from an array of pixels based on the display resolution of a particular system. As examples, VHS-based systems have display resolutions of 320×480 pixels, NTSC-based systems have display resolutions of 720×486 pixels, and high-definition television (HDTV) systems under development have display resolutions of 1360×1024 pixels.
The amounts of raw digital information included in video sequences are massive. Storage and transmission of these amounts of video information are infeasible with conventional personal computer equipment. Consider, for example, a digitized form of a relatively low resolution VHS image format having a 320×480 pixel resolution. A full-length motion picture of two hours in duration at this resolution corresponds to 100 gigabytes of digital video information. By comparison, conventional compact optical disks have capacities of about 0.6 gigabytes, magnetic hard disks have capacities of 1-2 gigabytes, and compact optical disks under development have capacities of up to 8 gigabytes.
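The 100 gigabyte figure can be checked with a short calculation. The sketch below assumes 24-bit color (3 bytes per pixel) and 30 frames per second; neither value is stated in the text, so both are illustrative assumptions.

```python
# Assumed parameters (not stated in the text): 24-bit color and 30 frames/s.
WIDTH, HEIGHT = 320, 480
BYTES_PER_PIXEL = 3
FRAMES_PER_SECOND = 30
DURATION_SECONDS = 2 * 60 * 60  # two hours


def raw_video_bytes(width, height, bytes_per_pixel, fps, seconds):
    """Return the uncompressed size in bytes of a video sequence."""
    return width * height * bytes_per_pixel * fps * seconds


total = raw_video_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL,
                        FRAMES_PER_SECOND, DURATION_SECONDS)
print(f"{total / 1e9:.1f} GB")  # prints "99.5 GB"
```

Under these assumptions the raw sequence is roughly 100 gigabytes, consistent with the figure given above.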
To address the limitations in storing or transmitting such massive amounts of digital video information, various video compression standards or processes have been established, including MPEG-1, MPEG-2, and H.26X. These video compression techniques utilize similarities between successive image frames, referred to as temporal or interframe correlation, to provide interframe compression in which motion data and error signals are used to encode changes between frames.
In addition, the conventional video compression techniques utilize similarities within image frames, referred to as spatial or intraframe correlation, to provide intraframe compression in which the image samples within an image frame are compressed. Intraframe compression is based upon conventional processes for compressing still images, such as discrete cosine transform (DCT) encoding. This type of coding is sometimes referred to as “texture” or “transform” coding. A “texture” generally refers to a two-dimensional array of image sample values, such as an array of chrominance and luminance values or an array of alpha (opacity) values. The term “transform” in this context refers to how the image samples are transformed into spatial frequency components during the coding process. This use of the term “transform” should be distinguished from a geometric transform used to estimate scene changes in some interframe compression methods.
Interframe compression typically utilizes motion estimation and compensation to encode scene changes between frames. Motion estimation is a process for estimating the motion of image samples (e.g., pixels) between frames. Using motion estimation, the encoder attempts to match blocks of pixels in one frame with corresponding pixels in another frame. After the most similar block is found in a given search area, the change in position of the pixel locations of the corresponding pixels is approximated and represented as motion data, such as a motion vector. Motion compensation is a process for determining a predicted image and computing the error between the predicted image and the original image. Using motion compensation, the encoder applies the motion data to an image and computes a predicted image. The difference between the predicted image and the input image is called the error signal. Since the error signal is just an array of values representing the difference between image sample values, it can be compressed using the same texture coding method as used for intraframe coding of image samples.
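As an illustration of the block matching described above, the following is a minimal sketch of an exhaustive search. It assumes frames are lists of rows of integer sample values and uses the sum of absolute differences (SAD) as the similarity measure; the function names and the choice of SAD are assumptions made for the example, not details taken from any particular standard.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total


def full_search(source, target, x, y, size, search_range):
    """Return the motion vector (dx, dy) with the smallest SAD, and that SAD."""
    height, width = len(target), len(target[0])
    best_vector, best_error = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            tx, ty = x + dx, y + dy
            # Skip candidate blocks that fall outside the target frame.
            if tx < 0 or ty < 0 or tx + size > width or ty + size > height:
                continue
            error = sad(source, x, y, target, tx, ty, size)
            if best_error is None or error < best_error:
                best_vector, best_error = (dx, dy), error
    return best_vector, best_error


def make_frame(px, py):
    """Build an 8x8 test frame containing a bright 2x2 patch at (px, py)."""
    frame = [[0] * 8 for _ in range(8)]
    for dy in range(2):
        for dx in range(2):
            frame[py + dy][px + dx] = 200
    return frame


# The patch moves 2 pixels right and 1 pixel down between the two frames.
source = make_frame(2, 2)
target = make_frame(4, 3)
vector, error = full_search(source, target, 2, 2, size=2, search_range=3)
print(vector, error)  # prints "(2, 1) 0"
```

The search visits every candidate position in the window, which is simple but costly; this cost is why the order and extent of the search matter for encoder speed.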
Although differing in specific implementations, the MPEG-1, MPEG-2, and H.26X video compression standards are similar in a number of respects. The following description of the MPEG-2 video compression standard is generally applicable to the others.
MPEG-2 provides interframe compression and intraframe compression based upon square blocks or arrays of pixels in video images. A video image is divided into image sample blocks called macroblocks having dimensions of 16×16 pixels. In MPEG-2, a macroblock comprises four luminance blocks (each block is 8×8 samples of luminance (Y)) and two chrominance blocks (one 8×8 sample block each for Cb and Cr).
In MPEG-2, interframe coding is performed on macroblocks. An MPEG-2 encoder performs motion estimation and compensation to compute motion vectors and block error signals. For each block MN in an image frame N, a search is performed across the image of a next successive video frame N+1 or immediately preceding image frame N−1 (i.e., bi-directionally) to identify the most similar respective blocks MN+1 or MN−1. The location of the most similar block relative to the block MN is encoded with a motion vector (DX,DY). The motion vector is then used to compute a block of predicted sample values. These predicted sample values are compared with block MN to determine the block error signal. The error signal is compressed using a texture coding method such as discrete cosine transform (DCT) encoding.
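The relationship between the motion vector, the predicted block, and the block error signal can be sketched as follows. The function names and the tiny 4×4 frames are hypothetical and chosen only for illustration; a real encoder operates on macroblocks and then passes the error block to the texture coder.

```python
def predict_block(reference, x, y, motion_vector, size):
    """Copy the block that the motion vector points to in the reference frame."""
    dx, dy = motion_vector
    return [[reference[y + dy + r][x + dx + c] for c in range(size)]
            for r in range(size)]


def block_error(frame, x, y, predicted, size):
    """Error signal: the original block minus the predicted block."""
    return [[frame[y + r][x + c] - predicted[r][c] for c in range(size)]
            for r in range(size)]


def reconstruct(predicted, error):
    """Decoder side: add the error signal back onto the prediction."""
    return [[p + e for p, e in zip(p_row, e_row)]
            for p_row, e_row in zip(predicted, error)]


reference = [[10, 20, 30, 40],
             [50, 60, 70, 80],
             [90, 100, 110, 120],
             [130, 140, 150, 160]]
# The 2x2 block at (0, 0) matches the reference block one pixel to the right,
# except for one sample that changed by 1.
current = [[20, 31, 0, 0],
           [60, 70, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]

predicted = predict_block(reference, 0, 0, (1, 0), 2)
error = block_error(current, 0, 0, predicted, 2)
print(predicted)  # prints "[[20, 30], [60, 70]]"
print(error)      # prints "[[0, 1], [0, 0]]"
```

Because the error block is mostly zeros when the match is good, it compresses far better than the original samples would.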
Object-based video coding techniques have been proposed as an improvement to the conventional frame-based coding standards. In object-based coding, arbitrary shaped image features are separated from the frames in the video sequence using a method called “segmentation.” The video objects or “segments” are coded independently. Object-based coding can improve the compression rate because it increases the interframe correlation between video objects in successive frames. It is also advantageous for a variety of applications that require access to and tracking of objects in a video sequence.
In the object-based video coding methods proposed for the MPEG-4 standard, the shape, motion and texture of video objects are coded independently. The shape of an object is represented by a binary or alpha mask that defines the boundary of the arbitrary shaped object in a video frame. The motion of an object is similar to the motion data of MPEG-2, except that it applies to an arbitrary-shaped image of the object that has been segmented from a rectangular frame. Motion estimation and compensation are performed on blocks of a “video object plane” rather than the entire frame. The video object plane is the name for the shaped image of an object in a single frame.
The texture of a video object is the image sample information in a video object plane that falls within the object's shape. Texture coding of an object's image samples and error signals is performed using texture coding methods similar to those used in frame-based coding. For example, a segmented image can be fitted into a bounding rectangle formed of macroblocks. The rectangular image formed by the bounding rectangle can be compressed just like a rectangular frame, except that transparent macroblocks need not be coded. Partially transparent blocks are coded after filling in the portions of the block that fall outside the object's shape boundary with sample values in a technique called “padding.”
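The idea of padding can be sketched with a deliberately simple rule. The example below assumes a binary mask (1 inside the object's shape, 0 outside) and fills the outside samples with the mean of the inside samples; this fill rule is an assumption chosen for illustration and is not necessarily the padding procedure a given standard specifies.

```python
def pad_block(block, mask):
    """Fill samples outside the shape (mask == 0) with the mean of those inside."""
    rows, cols = len(block), len(block[0])
    inside = [block[r][c] for r in range(rows) for c in range(cols) if mask[r][c]]
    if not inside:
        # Fully transparent block: there is nothing to pad from.
        return [row[:] for row in block]
    fill = round(sum(inside) / len(inside))
    return [[block[r][c] if mask[r][c] else fill for c in range(cols)]
            for r in range(rows)]


block = [[100, 0],
         [120, 0]]
mask = [[1, 0],
        [1, 0]]
print(pad_block(block, mask))  # prints "[[100, 110], [120, 110]]"
```

Filling the outside samples with values close to the inside samples avoids an artificial sharp edge at the shape boundary, which would otherwise cost bits in the transform coder.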
In both frame-based and object-based video coding, the process of motion estimation is one of the most important parts of the coding system in terms of both the speed of the encoding process and the quality of the video. Both the H.263 and MPEG-4 coding standards perform motion estimation on macroblocks. The goal of the motion estimation process is to find the macroblock in a reference picture that results in the smallest error signal after motion compensation. By minimizing the error signal, the encoder attempts to minimize the number of bits needed to code the macroblock. However, in addition to coding the error signal, the encoder must also code the macroblock header and motion vectors. While minimizing the error signal may minimize the number of bits needed to encode the error signal, it does not necessarily result in the most efficient coding of the overall macroblock.
The invention provides an improved method for performing motion estimation. One aspect of the invention is a method for performing motion estimation that improves the coding efficiency by using a measure of the combined motion and error data to select the motion parameters for a block (e.g., the motion vector). These modified search criteria take into account the overhead associated with coding the motion parameters for a block as well as the error signal.
An encoder implementation uses the measure of the combined motion and error signal data as the search criteria for finding a matching block of pixels in the motion estimation process. Using a block matching scheme, the encoder searches for a matching block in a target frame for a source block in a source frame. The objective of the search is to find a block of pixels in the target frame that minimizes the combined motion and error signal coding overhead for the source block. By using these modified search criteria, the encoder can achieve better coding efficiency.
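One way to picture a combined measure is as a weighted sum of the block error and an estimate of the bits needed to code the motion vector. The sketch below is only an illustration: the bit estimate in `motion_vector_bits`, the `weight` parameter, and the use of a predicted vector for differential coding are all assumptions, not the specific measure defined in the detailed description.

```python
def motion_vector_bits(vector, predictor):
    """Hypothetical estimate of the bits needed to code a vector differentially."""
    bits = 0
    for component, predicted in zip(vector, predictor):
        diff = abs(component - predicted)
        bits += 1 + 2 * diff.bit_length()  # grows with the size of the difference
    return bits


def combined_cost(error, vector, predictor, weight=4):
    """Block error plus weighted motion vector overhead (weight is an assumption)."""
    return error + weight * motion_vector_bits(vector, predictor)


def pick_best(candidates, predictor, weight=4):
    """candidates is a list of (vector, error) pairs; return the cheapest overall."""
    return min(candidates,
               key=lambda c: combined_cost(c[1], c[0], predictor, weight))


candidates = [((0, 0), 120), ((7, -6), 100)]
by_error_only = min(candidates, key=lambda c: c[1])
by_combined = pick_best(candidates, predictor=(0, 0))
print(by_error_only)  # prints "((7, -6), 100)"
print(by_combined)    # prints "((0, 0), 120)"
```

In this example the candidate with the smaller error loses once the cost of coding its larger motion vector is counted, which is the situation the combined measure is meant to catch.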
Another aspect of the invention is a method for performing pixel block matching that improves encoding speed by selecting a more efficient search path for the matching process. In particular, this method arranges the search order used in the block matching process so that pixels that are closer to a desired starting point (e.g., a predicted point) are searched before pixels located farther from the desired starting point.
An implementation designed for the MPEG-4 coding standard uses this approach to shift the search order of blocks in a target frame so that blocks closer to a desired starting point are searched first. In coding standards like MPEG-4, the need arises to optimize the search path because they have restrictions that limit the motion vector size, which, in turn, leads to a less than optimal search starting point. The starting point of the search is not optimal because it is derived from motion vectors that are limited in size. Shifting the search order corrects for the less than optimal starting point.
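The effect of reordering the search can be sketched by sorting the candidate positions of a search window by their distance from the desired starting point. The function below is a hypothetical illustration; the window shape, the squared Euclidean distance, and the parameter names are assumptions, and the detailed description may arrange the order differently.

```python
def search_order(center, search_range, start):
    """List window positions around `center`, nearest to `start` first."""
    cx, cy = center
    sx, sy = start
    candidates = [(x, y)
                  for y in range(cy - search_range, cy + search_range + 1)
                  for x in range(cx - search_range, cx + search_range + 1)]
    # Sort by squared distance from the desired starting point.
    return sorted(candidates, key=lambda p: (p[0] - sx) ** 2 + (p[1] - sy) ** 2)


# The window is centered on (0, 0), but the desired starting point is (1, 1).
order = search_order(center=(0, 0), search_range=1, start=(1, 1))
print(order[0], order[-1])  # prints "(1, 1) (-1, -1)"
```

The set of positions searched is unchanged; only the order differs, so a good match near the desired starting point is found earlier.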
Finally, another aspect of the invention is a method for block matching that uses a search path and search criteria that reduce the amount of searching needed to compute the motion parameters for a block of pixels. An implementation of this method uses the modified search criteria outlined above along with a spiral search path. Based on the attributes of the modified search criteria, the encoder can determine whether it has found a target block that minimizes the modified search criteria without searching all target blocks in the search area. Thus, this approach improves the performance of the encoder by speeding up the search in the block matching process.
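A minimal sketch of how an outward search can stop early is given below. It assumes the search proceeds ring by ring around the starting point (an approximation of a spiral path), that the combined cost is the block error plus a weighted motion vector bit estimate, and that this bit estimate never decreases as the rings move outward. Under those assumptions, once the motion overhead alone on the next ring is at least the best combined cost found so far, no farther candidate can win. The names `mv_bits`, `ring`, `spiral_search`, and `weight` are hypothetical.

```python
def mv_bits(dx, dy):
    """Hypothetical bit estimate for an offset (dx, dy) from the starting point."""
    return (1 + 2 * abs(dx).bit_length()) + (1 + 2 * abs(dy).bit_length())


def ring(radius):
    """Offsets whose larger absolute component equals `radius`."""
    if radius == 0:
        return [(0, 0)]
    return [(dx, dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if max(abs(dx), abs(dy)) == radius]


def spiral_search(error_at, max_radius, weight=4):
    """Search outward; error_at(dx, dy) returns the non-negative block error."""
    best_offset, best_cost, evaluated = None, None, 0
    for radius in range(max_radius + 1):
        # Cheapest possible motion vector on this ring: one component is
        # `radius`, the other is 0.
        min_bits = 2 + 2 * radius.bit_length()
        if best_cost is not None and weight * min_bits >= best_cost:
            break  # no candidate on this ring or beyond can beat the best cost
        for dx, dy in ring(radius):
            cost = error_at(dx, dy) + weight * mv_bits(dx, dy)
            evaluated += 1
            if best_cost is None or cost < best_cost:
                best_offset, best_cost = (dx, dy), cost
    return best_offset, best_cost, evaluated


# Toy error surface with a perfect match one pixel to the right of the start.
result = spiral_search(lambda dx, dy: 10 * (abs(dx - 1) + abs(dy)), max_radius=7)
print(result)  # prints "((1, 0), 16, 9)"
```

In this toy case the search evaluates 9 candidates instead of the 225 positions in the full window, because the bound on the motion overhead rules out every ring beyond the first.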
Further advantages and features will be apparent from the following detailed description and accompanying drawings.