1. Field of the Invention
The present invention relates to the field of video compression, and in particular to motion vector estimation.
2. Description of the Related Art
In video processing, video image compression is often necessary to overcome the bandwidth constraints of the transmission medium interposed between the video transmitter and the video receiver. For example, a typical video source might transmit a 320×240 pixel image at a rate of approximately 30 frames per second, using 12 bits per pixel. Based on these figures, it can be appreciated that an uncompressed video signal requires a transmission rate on the order of tens of megabits per second (approximately 27.6 megabits per second for the figures above), and larger images require correspondingly more. In contrast, a medium such as a conventional analog telephone line has a channel capacity of only approximately 28.8 kilobits per second.
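The bandwidth gap described above can be checked with a short calculation. The following Python sketch is illustrative only; the function and variable names are assumptions introduced here, and the figures are those of the example above.

```python
def raw_bitrate_bps(width, height, frames_per_second, bits_per_pixel):
    """Uncompressed video bit rate in bits per second."""
    return width * height * frames_per_second * bits_per_pixel


# Figures from the example: 320x240 pixels, 30 frames/s, 12 bits/pixel.
video_bps = raw_bitrate_bps(320, 240, 30, 12)

# Approximate capacity of a conventional analog telephone line.
phone_line_bps = 28_800

# Factor by which the raw signal exceeds the channel capacity.
required_ratio = video_bps / phone_line_bps

print(video_bps)       # 27648000
print(required_ratio)  # 960.0
```

Under these assumptions, the raw signal exceeds the channel capacity by a factor of roughly one thousand.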
Generally, two techniques are used for video compression: intraframe compression and interframe compression. Intraframe compression takes advantage of the redundancy of the information within one frame. Interframe compression encodes the differences between frames, such as the relative displacement of image content from one frame to the next. Examples of intraframe compression include run-length encoding, Fast Fourier Transform, Discrete Cosine Transform, Discrete Walsh Transform and fractal compression.
Interframe compression takes advantage of the correlation between frames, and is most advantageous when there is little frame-to-frame movement of the image. Interframe compression is especially useful in an application such as video conferencing, where motion tends to be slow and may only involve the movement of the mouth and neck of the person speaking, for example. One of the most common interframe compression techniques is motion estimation. This method assumes that a frame denoted the reference frame (F1) is followed in time by a subsequent frame denoted the search frame (F2). The search frame F2 is subdivided into blocks. The reference frame F1 is subdivided into search areas. Each F1 search area corresponds to an F2 block. Search areas may overlap. According to convention, the absolute coordinates of the reference and search frames have their (0, 0) reference point at the upper left corner of the frame.
The motion estimation model assumes that the image in F2 is a translation of the image in F1, and that the image content does not change or that such a change is compensated for by a well-known technique known as "residual compensation." Because F2 is a translation of F1, an F2 block must be located somewhere within a search area in F1. Given some a priori knowledge of the image velocity, the encoder designer would select an appropriate size for the search area. For example, a slow-moving image would require a search area smaller in size than a fast-moving image because the frame-to-frame image motion would cover a shorter distance.
The motion estimation technique results in the generation of a set of motion vectors. Each motion vector is associated with an F2 block. The motion vector represents the relative coordinates in the F1 search area at which the image of a particular F2 block may be found. That is, the motion vector specifies the relative translation that must be performed from the F2 block coordinates to find the F1 block that contains the corresponding F2 image. For example, if the motion vector for the F2 block located at F2 coordinates (7, 20) is (3, -4), then the image in the F1 search area that corresponds to the F2 block is found by moving three pixels to the right and four pixels up (the negative direction is up in standard video notation) in the F1 frame. Accordingly, the corresponding F1 block is located at (10, 16).
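The coordinate arithmetic of this example can be expressed in a few lines. The Python sketch below is illustrative only; the function name is an assumption introduced here, and coordinates are (x, y) pairs with the origin at the upper left corner of the frame and y increasing downward.

```python
def f1_block_position(f2_block_xy, motion_vector):
    """Translate F2 block coordinates by a motion vector to locate the F1 block.

    Both arguments are (x, y) pairs; negative y points up the frame.
    """
    x, y = f2_block_xy
    dx, dy = motion_vector
    return (x + dx, y + dy)


# Example from the text: block at (7, 20) with motion vector (3, -4).
print(f1_block_position((7, 20), (3, -4)))  # (10, 16)
```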
As is known in the art, the motion vector for each F2 block is calculated by correlating the F2 block with the block's corresponding F1 search area. For example, the F2 block may be scanned over the search area pixel-by-pixel. In that case, the F2 block is overlaid at an initial position within the F1 search area. An F1-F2 correlation error between the pixel intensity values of the F2 block and the overlaid F1 search area is then calculated. The error measure may be the mean absolute error or the mean square error, for example. A wide variety of other error measures may, of course, be employed. The F2 block may then be moved one pixel horizontally or vertically within the F1 search area. The error at that position is then calculated. This process continues until the error between the F2 block and the F1 search area has been calculated for every position within the search area. The minimum error over all positions represents the highest correlation between the F2 block and the F1 search area. The (x, y) translation of the F1 position associated with the minimum error is selected as the motion vector v for the corresponding F2 block. The term "motion vector" may generally refer to any (x, y) translation vector within the search area. Herein, however, the motion vector v will be referred to interchangeably as the "minimum-error motion vector," or just the "motion vector." The meaning of the term "motion vector" will be clear from the context herein.
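The pixel-by-pixel scan described above may be sketched as follows. This Python sketch is illustrative only and is not taken from any particular standard or encoder; the function names, the representation of frames as lists of rows of pixel intensities, and the choice of mean absolute error as the error measure are assumptions made for the example.

```python
def mean_abs_error(f1, f2_block, top, left):
    """Mean absolute error between f2_block and the F1 region at (top, left)."""
    rows, cols = len(f2_block), len(f2_block[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            total += abs(f1[top + r][left + c] - f2_block[r][c])
    return total / (rows * cols)


def full_search(f1, f2_block, block_x, block_y, search_range):
    """Scan every position within +/-search_range of the F2 block position.

    Returns the minimum-error motion vector (dx, dy) and its error.
    Positions that would fall outside the F1 frame are skipped.
    """
    height, width = len(f1), len(f1[0])
    rows, cols = len(f2_block), len(f2_block[0])
    best_vector, best_error = None, float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            left, top = block_x + dx, block_y + dy
            if left < 0 or top < 0 or left + cols > width or top + rows > height:
                continue
            error = mean_abs_error(f1, f2_block, top, left)
            if error < best_error:
                best_vector, best_error = (dx, dy), error
    return best_vector, best_error


# Synthetic 8x8 reference frame in which every 2x2 patch is distinct.
f1 = [[r * 8 + c for c in range(8)] for r in range(8)]

# An F2 block at F2 coordinates (3, 4) whose image sits at (5, 3) in F1.
f2_block = [row[5:7] for row in f1[3:5]]

vector, error = full_search(f1, f2_block, block_x=3, block_y=4, search_range=2)
print(vector, error)  # (2, -1) 0.0
```

The exhaustive scan evaluates every candidate position, so its cost grows with the square of the search range; this is the baseline against which faster search schemes are usually compared.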
The procedure to generate the minimum-error motion vector is repeated for the next F2 block until motion vectors for each F2 block have been calculated. One common motion estimation scheme that follows this procedure is the ISO/IEC 11172-2 MPEG (Moving Picture Experts Group) standard. After reading this disclosure, those skilled in the art will recognize that the present invention applies not only to motion vectors formed according to the pixel-by-pixel scan described above, but may be extended to motion vectors generated by any scheme.
The motion vectors are used as follows. The encoder at the transmitter first transmits the reference frame F1. This of course consumes a large amount of time and bandwidth. However, subsequent images may be represented by motion vectors. The motion vectors for the next frame, F2, are then transmitted. Motion vectors for subsequent frames with respect to F1 may then be transmitted over the transmission medium. At some point, however, the source image may change entirely or undergo a large movement outside the search area boundaries, which would require the entire image to be transmitted. After that point, subsequent images may be represented by motion vectors as before.
Unfortunately, even after undergoing compression through motion vector estimation, the encoded image bandwidth may still exceed channel capacity. For example, a 14.4 kilobit per second telephone modem typically may have only 9600 bits per second available for video information. If ten frames are transmitted each second, only 960 bits per frame remain to transfer the motion vectors for an entire frame. Each motion vector must represent the translation of the F2 block within the search area. A typical search area runs ±8 pixels in both the x and y directions, which is the equivalent of 16 pixel positions in each direction. Representing one of 16 positions requires four bits per direction. Thus, eight bits per motion vector are required. A typical frame size is on the order of 320×240 pixels, which can be divided into 1200 blocks of 8×8 pixels (a typical block size). Thus, the number of bits required to transmit one motion vector-encoded frame is 1200 motion vectors per frame × 8 bits per motion vector = 9600 bits per frame. Comparing this number to the 960 bits available to transfer the motion vectors for a frame reveals that further compression is required.
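The bit budget in this example can be reproduced with the following Python sketch. It is illustrative only; the variable names are assumptions introduced here, and the count of 16 positions per direction follows the text above.

```python
# Channel budget: 9600 bits/s for video, shared by 10 frames each second.
video_bits_per_second = 9600
frames_per_second = 10
available_bits_per_frame = video_bits_per_second // frames_per_second

# Motion vector size: 16 positions per direction, as counted in the text.
positions_per_direction = 16
bits_per_direction = (positions_per_direction - 1).bit_length()
bits_per_vector = 2 * bits_per_direction  # x component plus y component

# Number of 8x8 blocks in a 320x240 frame, one motion vector per block.
blocks_per_frame = (320 // 8) * (240 // 8)

required_bits_per_frame = blocks_per_frame * bits_per_vector
shortfall_factor = required_bits_per_frame / available_bits_per_frame

print(available_bits_per_frame)  # 960
print(bits_per_vector)           # 8
print(blocks_per_frame)          # 1200
print(required_bits_per_frame)   # 9600
print(shortfall_factor)          # 10.0
```

Under these assumptions, the motion vectors alone need about ten times the bits available per frame.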
It can thus be appreciated that a need exists for a secondary compression technique to compress motion vector information to an acceptable bandwidth.