In recent years, it has become increasingly desirable and practical to communicate digital video information (a sequence of digital images) from one point to another. Indeed, transmission of video over computer networks, such as the World Wide Web portion of the Internet, from one computer to another is common in digital television set-top boxes, DSS, HDTV decoders, DVD players, video conferencing, Internet video, and other such applications. Since a single frame of video can consist of thousands or even hundreds of thousands of bits of information, it can take a considerable amount of time to transmit a sequence of frames from one point to another.
To reduce transmission costs, computers and other devices that transmit and receive digital video data generally include a video compression system. The video compression system typically includes an encoder at the transmitting end for compressing digital video data from its raw form and a corresponding decoder at the receiving end for decompressing the compressed data.
Video compression typically takes advantage of the redundancy within and between sequential frames of video data to reduce the amount of data ultimately needed to represent the sequence. The DPCM/DCT (Differential Pulse-Code Modulation/Discrete Cosine Transform) hybrid coding technique has proved to be the most effective and successful for video compression. All current international standards, namely ITU H.261 and H.263 and ISO MPEG-1 and MPEG-2, have adopted this coding structure. In a hybrid video coder, predictive coding is used to reduce the temporal redundancy, and the DCT is applied to the prediction error signal to reduce the remaining spatial redundancy.
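As a rough illustration of the hybrid structure described above, the following Python sketch forms a prediction error block and applies an 8×8 DCT to it. The function names, the orthonormal scaling of the transform, and the omission of quantization and entropy coding are assumptions made for brevity; they are not taken from any of the standards named above.

```python
import math

N = 8  # DCT block size used in hybrid coders


def dct_8x8(block):
    """Orthonormal 2-D DCT-II of an 8x8 block (list of 8 lists of 8 numbers)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    def basis(k, i):
        return math.cos((2 * i + 1) * k * math.pi / (2 * N))

    return [
        [
            scale(u) * scale(v) * sum(
                block[i][j] * basis(u, i) * basis(v, j)
                for i in range(N)
                for j in range(N)
            )
            for v in range(N)
        ]
        for u in range(N)
    ]


def encode_block(current, prediction):
    """Predictive step followed by the transform step (hypothetical helper)."""
    # Temporal redundancy: keep only what the prediction failed to explain.
    error = [[c - p for c, p in zip(cur_row, pred_row)]
             for cur_row, pred_row in zip(current, prediction)]
    # Spatial redundancy: concentrate the error energy in few coefficients.
    return dct_8x8(error)


# A block that differs from its prediction by a constant offset of 2.
current = [[130] * N for _ in range(N)]
prediction = [[128] * N for _ in range(N)]
coeffs = encode_block(current, prediction)
```

For a constant prediction error such as this one, only the DC coefficient `coeffs[0][0]` differs noticeably from zero, which suggests why a good prediction leaves little data to encode.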
Motion estimation methods can be classified into two categories, namely block-matching and pel-recursive methods (see H. G. Musmann, P. Hirsch, and H. J. Grallert, “Advances in picture coding,” Proc. IEEE, pp. 523-548, April 1985, and M. Orchard, “A comparison of techniques for estimating block motion in image sequence coding,” Proc. SPIE Visual Commun. and Image Processing, pp. 248-258, 1989). Because hybrid video coders are block-based and block-matching methods are much less complex to implement than pel-recursive methods, only block matching has been considered for current practical video compression systems.
In hybrid coding, a video frame to be encoded is partitioned into non-overlapping rectangular or, most commonly, square blocks of pixels. The DCT domain operations are based on block sizes of 8×8 pixels. Motion compensation operates on macroblocks of 16×16 pixels. For each of these macroblocks, a reference frame is searched within a predetermined search window for the best matching macroblock according to a predetermined matching error criterion. The matched macroblock is then used to predict the current macroblock, and the prediction error macroblock is further processed and transmitted to the decoder. The relative shifts in the horizontal and vertical directions of the reference macroblock with respect to the original macroblock are grouped and referred to as the motion vector of the original macroblock, which is also transmitted to the decoder. The main aim of motion estimation is to predict a macroblock such that the difference macroblock, obtained by taking the difference between the current and reference macroblocks, produces the lowest number of bits in DCT encoding.
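The matching step for a single candidate can be sketched as follows. The text above does not name a specific matching error criterion, so the sum of absolute differences (SAD) used here is an assumption, as are the function names, the representation of frames as lists of rows, and the synthetic test frames.

```python
def sad(cur, ref, x, y, dx, dy, n=16):
    """Sum of absolute differences between the n x n macroblock of `cur`
    whose top-left corner is (x, y) and the block of `ref` displaced by
    the candidate motion vector (dx, dy)."""
    return sum(
        abs(cur[y + j][x + i] - ref[y + dy + j][x + dx + i])
        for j in range(n)
        for i in range(n)
    )


def prediction_error(cur, ref, x, y, dx, dy, n=16):
    """Difference macroblock: current macroblock minus the matched
    reference macroblock (hypothetical helper)."""
    return [
        [cur[y + j][x + i] - ref[y + dy + j][x + dx + i] for i in range(n)]
        for j in range(n)
    ]


# Synthetic 48x48 frames: the current frame is the reference shifted so that
# the macroblock at (16, 16) matches the reference at offset (3, -2).
SIZE = 48
ref = [[(7 * x + 13 * y) % 256 for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[(y - 2) % SIZE][(x + 3) % SIZE] for x in range(SIZE)]
       for y in range(SIZE)]

motion_vector = (3, -2)
cost = sad(cur, ref, 16, 16, *motion_vector)
error_block = prediction_error(cur, ref, 16, 16, *motion_vector)
```

With the correct motion vector the matching error is zero and the difference macroblock is empty, whereas the zero vector `(0, 0)` leaves a nonzero error for every pixel of this test pattern.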
The most straightforward method to search for the motion vector is the brute-force, global full-search (FS) method. In the FS method, every candidate location in the search window is evaluated to find the best match. Although this method produces the best motion vector according to the predetermined matching criterion, it is usually too complex to implement for real-time applications at a reasonable cost. To this end, various less complex methods have been proposed and studied to reduce the complexity of evaluating the match error at each search location, to reduce the number of search locations, or both.
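A minimal sketch of the FS method is given below, assuming a SAD matching criterion and a square search window of ±`search_range` pixels; neither is specified in the text above, and the function names and test frames are illustrative only.

```python
def sad(cur, ref, x, y, dx, dy, n=16):
    """Sum of absolute differences for candidate motion vector (dx, dy)."""
    return sum(
        abs(cur[y + j][x + i] - ref[y + dy + j][x + dx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, x, y, search_range=7, n=16):
    """Evaluate every candidate in the window and return (best_mv, best_cost)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates whose reference block would leave the frame.
            if not (0 <= x + dx <= width - n and 0 <= y + dy <= height - n):
                continue
            cost = sad(cur, ref, x, y, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Synthetic frames in which the macroblock at (16, 16) moved by (3, -2).
SIZE = 48
ref = [[(7 * x + 13 * y) % 256 for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[(y - 2) % SIZE][(x + 3) % SIZE] for x in range(SIZE)]
       for y in range(SIZE)]

best_mv, best_cost = full_search(cur, ref, 16, 16, search_range=7)
```

The cost of this approach is visible in the loop bounds: a window of ±7 pixels already requires (2·7+1)² = 225 block comparisons of 256 pixels each for every macroblock.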
One of the most efficient current motion estimation techniques uses a two-stage approach. In the first stage, a local search is made around a prospective candidate (see Junavit Chalidabhongse and C. C. Jay Kuo, “Fast Motion Vector Estimation using Multi-Resolution-Spatio-Temporal Correlations,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 3, June 1997). The prospective candidate is chosen from the spatio-temporal neighborhood of the current macroblock (16×16 pixels). If the distortion measurement at any step is less than a predefined threshold, the corresponding motion vector is selected as the motion vector of the current macroblock. This local search is allowed to run for a predefined number of steps. If no favorable motion vector is obtained after all of these steps, the second stage executes an FS centered on the (0,0) motion vector. Unlike the local search, FS has no intermediate stopping criteria. It calculates the distortion measurement for all motion vectors in the search area and selects the motion vector corresponding to the lowest distortion.
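The control flow of such a two-stage search might look like the sketch below. It is not the method of the cited paper: the SAD distortion measure, the five-point cross pattern used for each local step, and all names and parameter defaults are assumptions chosen to keep the example short.

```python
def sad(cur, ref, x, y, dx, dy, n=16):
    """Sum of absolute differences for candidate motion vector (dx, dy)."""
    return sum(abs(cur[y + j][x + i] - ref[y + dy + j][x + dx + i])
               for j in range(n) for i in range(n))


def two_stage_search(cur, ref, x, y, predictor, threshold,
                     max_steps=4, search_range=7, n=16):
    """Return (motion_vector, distortion, stage), stage being 'local' or 'full'."""
    height, width = len(ref), len(ref[0])

    def cost(mv):
        dx, dy = mv
        inside = (abs(dx) <= search_range and abs(dy) <= search_range
                  and 0 <= x + dx <= width - n and 0 <= y + dy <= height - n)
        return sad(cur, ref, x, y, dx, dy, n) if inside else float("inf")

    # Stage 1: local search around the predicted vector, with early exit
    # as soon as the distortion drops below the fixed threshold.
    center = predictor
    for _ in range(max_steps):
        cx, cy = center
        candidates = [center, (cx + 1, cy), (cx - 1, cy),
                      (cx, cy + 1), (cx, cy - 1)]
        center = min(candidates, key=cost)
        if cost(center) < threshold:
            return center, cost(center), "local"

    # Stage 2: full search centered on (0, 0), with no stopping criterion.
    window = [(dx, dy)
              for dy in range(-search_range, search_range + 1)
              for dx in range(-search_range, search_range + 1)]
    best = min(window, key=cost)
    return best, cost(best), "full"


# Synthetic frames in which the macroblock at (16, 16) moved by (3, -2).
SIZE = 48
ref = [[(7 * x + 13 * y) % 256 for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[(y - 2) % SIZE][(x + 3) % SIZE] for x in range(SIZE)]
       for y in range(SIZE)]

near = two_stage_search(cur, ref, 16, 16, predictor=(3, -1), threshold=100)
far = two_stage_search(cur, ref, 16, 16, predictor=(-6, 6), threshold=100)
loose = two_stage_search(cur, ref, 16, 16, predictor=(0, 0), threshold=10**9)
```

Here `near` finishes in the local stage because its predictor is one step from the true vector, while `far` exhausts its local steps and falls back to FS. The `loose` call uses a very large threshold and therefore stops after the first local step with a nonzero distortion, which hints at how strongly the outcome depends on the threshold value.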
The problem with this approach is the selection of a single, fixed predefined threshold that serves as a reasonable stopping criterion for the local search across all macroblocks. If the selected predefined threshold is relatively high, the motion estimation search process can stop prematurely, selecting a non-optimal motion vector. This can result in a higher variance for the difference macroblock than for the original macroblock, and the encoder will be forced to use intra coding (intra frames/blocks are coded without prediction information) for the current macroblock. This can lead to lower quality for Constant Bit Rate (CBR) encoding, or it can result in a lower compression ratio for Variable Bit Rate (VBR) encoding. If the selected predefined threshold value is relatively low, the local search process may not be able to satisfy the stopping criterion even with an optimal or near-optimal motion vector. This can again lead to selecting the motion vector through FS, which in turn can considerably increase search time. In reality, the appropriate threshold varies from one macroblock to another. Therefore, choosing a fixed threshold can affect the quality of compression and encoder performance.