This invention relates generally to motion estimation in video imaging systems and, more particularly, to a method of reducing the search time required to estimate a change in position between images in consecutive video frames.
A video information format provides visual information suitable for driving a television screen or for storage on video tape. Generally, video data is organized hierarchically. A video sequence is divided into groups of frames, and each group is composed of a series of single frames. Each frame is roughly equivalent to a still picture, with the still pictures being updated often enough to simulate continuous motion. A frame is further divided into slices, or horizontal sections, which aids the design of error-resilient systems. Each slice is coded independently so that errors do not propagate across slices. A slice consists of macroblocks. In the H.26P and MPEG-X standards, a macroblock is made up of 16×16 luma pixels and a corresponding set of chroma pixels, depending on the video format. A macroblock always contains an integer number of blocks, with the 8×8 pixel matrix being the smallest coding unit.
Video compression is a critical component of any application that requires transmission or storage of video data. Compression techniques compensate for motion by reusing stored information from previous frames (temporal redundancy). Compression also occurs by transforming data from the spatial domain to the frequency domain. Hybrid digital video compression, which exploits temporal redundancy by motion compensation and spatial redundancy by a transform such as the Discrete Cosine Transform (DCT), has been adopted as the basis of the H.26P and MPEG-X international standards. FIG. 1 is a block diagram of a video encoding system (prior art).
Motion estimation is critical in such video compression systems to reduce the flow of transmitted data. Motion estimation is performed over two frames, the current frame to be encoded and the previously coded frame, also called the reference frame, to find matching video data between the two frames. In practice, video compression, including motion estimation, is carried out macroblock by macroblock to facilitate hardware and software implementations. Motion estimation is performed for each macroblock using a 16×16 matrix of luma pixels. The motion estimation results are then applied to all the blocks in the macroblock. For a macroblock in the current frame, the best matching area in the previous frame is used as the prediction data for the current macroblock. The prediction error, the residue left after subtracting the prediction from the macroblock data, is thereby stripped of temporal redundancy. Temporal redundancy refers to the part of the current frame data that can be predicted from the previous frame. Removing this redundancy by subtracting the prediction values eliminates the need to encode the repeated part of the data.
After temporal redundancy removal, further compression is achieved by removing the correlation among neighboring prediction error pixels, which eliminates the spatial redundancy inherent in video data. This is accomplished by the Discrete Cosine Transform (DCT), which yields a compact representation in the frequency domain, followed by quantization of the transformed coefficients to preserve only the significant ones. DCT and quantization are performed for each 8×8 block independently.
The goal of motion estimation, for each macroblock, is to find a 16×16 data area in the previous frame which best represents the current macroblock. There are a variety of motion estimation criteria for the data matching. Ultimately, the criterion that yields the smallest bitstream after video source coding achieves the best compression performance. In practice, motion estimation is a process separate from the source coding of quantized coefficients. That is, motion estimation is performed without checking the final bitstream size, which avoids excessive computational complexity. Furthermore, only luma data is used for motion estimation within each macroblock, and the result is applied to both luma and chroma coding. Handling just luma pixels simplifies the procedure, and the human visual system is more sensitive to luminance changes than to color changes. Though motion estimation criteria have been investigated in different domains, including the frequency domain, an effective and widely adopted criterion is the sum of absolute differences (SAD). SAD has been found to relate motion estimation accurately to coding efficiency. It is computationally straightforward and much faster to compute than, for example, the minimum mean square error measure.
For the macroblock at position (x, y), the SAD between the current macroblock and a 16×16 block in the previous frame offset by (vx, vy) is

$$\mathrm{SAD}(\mathit{vx}, \mathit{vy}) = \sum_{i=0}^{15} \sum_{j=0}^{15} \left| p(x+i,\ y+j) - q(x+i+\mathit{vx},\ y+j+\mathit{vy}) \right|$$

where p(x+i, y+j) is a pixel value in the current macroblock of the current frame, and q(x+i+vx, y+j+vy) is a pixel value in the previous frame, in a 16×16 block offset by (vx, vy) from the current macroblock. The summation indices i and j cover the area of the macroblock. If SAD(vx, vy) is the minimum in the pre-specified search range, then (vx, vy) is the motion vector for the macroblock.
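The SAD cost described above can be illustrated with a minimal sketch. The function name `sad`, the row-major frame layout (`frame[y][x]`), and the synthetic test frames are assumptions made for illustration only and are not part of the described method.

```python
def sad(cur, ref, x, y, vx, vy, size=16):
    """Sum of absolute differences between the block at (x, y) in the
    current frame and the block offset by (vx, vy) in the previous frame.

    Frames are lists of rows, so a pixel at column x, row y is frame[y][x]
    (an assumed layout).
    """
    total = 0
    for j in range(size):          # rows of the macroblock
        for i in range(size):      # columns of the macroblock
            p = cur[y + j][x + i]
            q = ref[y + j + vy][x + i + vx]
            total += abs(p - q)
    return total


# Synthetic frames: `cur` is built so that the content of `ref` located
# at offset (+2, +1) appears at the same position in `cur`.
W, H = 32, 32
ref = [[(7 * c + 13 * r) % 256 for c in range(W)] for r in range(H)]
cur = [[ref[(r + 1) % H][(c + 2) % W] for c in range(W)] for r in range(H)]

cost_at_true_offset = sad(cur, ref, 8, 8, 2, 1)   # perfect match
cost_at_zero_offset = sad(cur, ref, 8, 8, 0, 0)   # mismatched block
```

In this sketch the cost at the true offset is zero because the two blocks are identical, while the zero-offset cost is positive.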
The motion estimation search range (M, N) is the maximum magnitude of (vx, vy), defining a window of data in the previous frame containing macroblock-sized matrices to be compared with the current macroblock. To be accurate, the search window must be large enough to represent the motion. On the other hand, the search range must be limited for practical purposes because of the high computational complexity of motion estimation. FIG. 2 is a drawing illustrating the spatial relationship between the current macroblock in the current frame and the search window in the previous frame (prior art). If the motion vector range is defined to be (M, N), then the search window size is (16+2M, 16+2N). For TV or movie sequences, the motion vector range needs to be large enough to accommodate various types of motion content. For video conferencing and videophone applications, the search range can be smaller. Therefore, the choice of search range depends on both the application and the available technology.
Given a motion estimation search range, the computational requirement is greatly affected by the exact method of covering the search window to obtain motion vectors. An exhaustive search technique, the full motion estimation search, covers all the candidate blocks in the search window to find the best match. In this case, (2M+1)×(2N+1) calculations of the cost function are required to obtain the motion vector for each macroblock. This computation cost is prohibitive for software implementations.
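A full search over a range (M, N) might be sketched as follows. The names `full_search` and `sad`, the frame layout, the boundary check, and the pseudo-random test frame are illustrative assumptions, not details taken from the prior art being described.

```python
import random


def sad(cur, ref, x, y, vx, vy, size=16):
    """SAD between the current block at (x, y) and the reference block
    offset by (vx, vy); frames are indexed frame[row][column]."""
    return sum(
        abs(cur[y + j][x + i] - ref[y + j + vy][x + i + vx])
        for j in range(size)
        for i in range(size)
    )


def full_search(cur, ref, x, y, M, N, size=16):
    """Evaluate every offset with |vx| <= M and |vy| <= N.

    Returns (motion_vector, minimum_sad, number_of_sad_evaluations).
    Candidates that fall outside the reference frame are skipped.
    """
    height, width = len(ref), len(ref[0])
    best_cost, best_mv, evaluations = None, None, 0
    for vy in range(-N, N + 1):
        for vx in range(-M, M + 1):
            inside = (0 <= x + vx <= width - size and
                      0 <= y + vy <= height - size)
            if not inside:
                continue
            cost = sad(cur, ref, x, y, vx, vy, size)
            evaluations += 1
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (vx, vy)
    return best_mv, best_cost, evaluations


# Pseudo-random reference frame; the current frame shows the same content
# displaced so that the true motion vector is (3, -2).
rng = random.Random(0)
W = H = 64
ref = [[rng.randrange(256) for _ in range(W)] for _ in range(H)]
cur = [[ref[(r - 2) % H][(c + 3) % W] for c in range(W)] for r in range(H)]

mv, cost, n = full_search(cur, ref, 24, 24, 15, 15)
# With M = N = 15 and every candidate inside the frame,
# n equals (2*15 + 1) * (2*15 + 1) = 961.
```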
Different schemes have been used to reduce the computation cost, such as telescopic search and step search. The fixed-step search method offers a good balance of effectiveness and complexity, and has been widely used in motion estimation. FIG. 3 is a flow chart illustrating the fixed-step motion estimation method (prior art).
The fixed-step motion estimation method uses a fixed number of steps inside a search window to find the best match, with a smaller-scale search window for each successive step. The method starts with 9 points uniformly distributed in the valid area of the target search window in the previous frame. Each of the 9 points represents the upper-left corner of a macroblock-sized area, or matrix of luma pixels. The matrix with the minimum SAD in the current step is used as the starting point (the new center of the search grid) for the next step. The next step is performed in the same way, with half the distance between search points, or matrices. If the space between search points in the previous step is w, then it is w/2 for the current step. This procedure is continued until the last step, in which all 9 search points are adjacent and no more zooming-in is possible for integer pixels. FIG. 4 is an example illustrating the last 3 steps in the step search method of FIG. 3 (prior art). The example covers a motion vector range of (-7, +7) in both the horizontal and vertical directions. If the motion vector range is (-15, +15), then 4 steps are needed.
For a motion vector range of (-15, +15), the number of cost function calculations is 9+8+8+8=33. That is, 9 SAD calculations are made in the first step. Since the matrix with the lowest SAD becomes the center of the second search step, and its SAD is already known, only 8 new SAD calculations are needed in that step. Likewise, only 8 SAD calculations are needed in each of steps 3 and 4. Compared to the number of calculations required to check the SAD of every 16×16 matrix in the search window (full search), the number of calculations required in the fixed-step method is small. A full search takes (2×15+1)×(2×15+1)=961 cost function calculations. Compared with a full search, the computation is reduced to about 3.4% (33/961).
However, the fixed-step method is not responsive to the accuracy of the initial starting matrix in the previous frame, or to the accuracy of the motion estimates made for neighboring macroblocks. For example, if there is no motion in a macroblock between the previous frame and the current frame, then the starting matrix in the search is likely to have the lowest SAD. In this situation, it is wasteful to perform all 33 computations, as described above. On the other hand, when there is a great deal of motion between frames, the fixed-step estimation method described above may be unable to find the matrix with the best SAD despite making 33 computations. For example, the matrix with the best SAD may be outside the initial search area defined by a 32×32 matrix. In this situation it would be better if the original search area were defined by a 48×48 matrix. Permanently setting the initial fixed-step search window to accommodate a 48×48 matrix is possible; however, every estimation would then require 5 steps, or 41 computations (9+8+8+8+8), which is wasteful in the average, low-motion scenario.
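The fixed-step search and its count of SAD evaluations can be sketched as below. This is a minimal illustration under stated assumptions: the names `step_search` and `sad`, the choice of zero motion as the initial center, the boundary check, and the test frame are not taken from FIG. 3 and are hypothetical.

```python
import random


def sad(cur, ref, x, y, vx, vy, size=16):
    """SAD between the current block at (x, y) and the reference block
    offset by (vx, vy); frames are indexed frame[row][column]."""
    return sum(
        abs(cur[y + j][x + i] - ref[y + j + vy][x + i + vx])
        for j in range(size)
        for i in range(size)
    )


def step_search(cur, ref, x, y, steps=4, size=16):
    """Fixed-step search: a 3x3 grid of candidates whose spacing is
    halved after each step, down to a spacing of one pixel.

    Returns (motion_vector, minimum_sad, number_of_sad_evaluations).
    """
    height, width = len(ref), len(ref[0])
    cx, cy = 0, 0                 # center of the grid (assumed start)
    best_cost = None              # SAD at the current center, once known
    evaluations = 0
    spacing = 2 ** (steps - 1)    # 8 for a range of (-15, +15)
    while spacing >= 1:
        best = (best_cost, cx, cy)
        for dy in (-spacing, 0, spacing):
            for dx in (-spacing, 0, spacing):
                if dx == 0 and dy == 0 and best_cost is not None:
                    continue      # center SAD is known from the last step
                vx, vy = cx + dx, cy + dy
                if not (0 <= x + vx <= width - size and
                        0 <= y + vy <= height - size):
                    continue      # candidate outside the valid area
                cost = sad(cur, ref, x, y, vx, vy, size)
                evaluations += 1
                if best[0] is None or cost < best[0]:
                    best = (cost, vx, vy)
        best_cost, cx, cy = best
        spacing //= 2
    return (cx, cy), best_cost, evaluations


# A static scene: the current frame equals the previous frame.
rng = random.Random(0)
frame = [[rng.randrange(256) for _ in range(64)] for _ in range(64)]
mv, cost, n = step_search(frame, frame, 24, 24, steps=4)
# Four steps with every candidate inside the frame: n is 9 + 8 + 8 + 8 = 33.
```

In this static example the zero vector already gives a SAD of zero in the first step, yet all 33 evaluations are still performed.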
It would be advantageous if a method was available to reduce the search window size, reducing the number of computations needed, for use with the fixed-step method of motion estimation when there is only a slight motion in the image represented in 2 successive video frames.
It would be advantageous if a method was available to increase the window size for use with fixed-step motion estimation when there is dramatic motion in the image represented in 2 successive frames.
It would be advantageous if a method was available for use with the fixed-step method of motion estimation that adjusted the search window size in response to the average change in position, as calculated from the estimation of motion in neighboring macroblocks.
Accordingly, a method for efficiently estimating the change in position of an image represented by a matrix of luma pixel data in a series of blocks in the current frame, from corresponding block-sized matrices of luma pixel data in the previous frame, is provided. The method applies to a digital video system compression format where a video sequence is represented in a series of frames, including a previous frame followed by a current frame, all separated by a predetermined time interval. The frames are divided into a plurality of blocks with predetermined positions, with each block having a size to include a predetermined matrix of luma pixels. The method comprises the steps of:
a) selecting a first block in the current frame;
b) selecting a block-sized matrix of luma pixels in the previous frame as an initial candidate matrix corresponding to the first block in the current frame;
c) providing a short term average comparison of luma pixel data between frames, derived from previous block position change estimates;
d) calculating a search window size, centered about the candidate matrix, in response to the short term average comparison of luma pixel data provided in Step c); and
e) comparing the luma pixel data from a plurality of block-sized matrices of luma pixels uniformly distributed inside the search window, to the luma pixel data of the first block in the current frame, to select a new candidate matrix having luma pixel data most similar to the luma pixel data of the first block in the current frame, whereby the size of the search window varies with the history of motion between frames.
Additional steps are included, following Step e), of:
f) reducing the spacing between the plurality of block-sized matrices located inside the search window after each iteration of Step e);
g) repeating Steps e)-f) until the spacing between the plurality of block-sized matrices matches the size of a user-defined minimum spacing, to select a final candidate matrix in the final iteration of Step e); and
h) comparing luma pixel data of the final candidate matrix selected in the final iteration of Step e) to the luma pixel data of the first block in the current frame, to calculate a final comparison of luma pixel data, whereby the difference in block position between the final candidate matrix and the first block provides a vector describing motion between frames.
After the final candidate matrix is found, the short term average comparison is updated with the final comparison of luma pixel data calculated in Step h). The search window size is also calculated with the use of a long term average of search window sizes. The long term search window size is updated in response to the calculation of a new short term average comparison of luma pixel data. The long term average search window size is also calculated with a long term average comparison of luma pixel data, which is updated after Step h).
Typically, the luma pixel data is compared through a calculation of the sum of absolute differences (SAD) of luma pixel data. In Step e), the block-sized matrix with the smallest SAD is selected as the candidate matrix in the next iteration of Steps e)-f), and Step h) includes a calculation of the minimum SAD (SAD_min) as the final comparison of luma pixel data. Further, SAD information is used to create both the short term (SAD_ave) and long term (SAD_aveLT) average comparisons of luma pixel data. The total number of macroblocks in a frame is also used to calculate SAD_ave, and the total number of macroblocks from several frames is used to calculate SAD_aveLT.
The method includes defining the search window size in terms of the number of iterations (ME_step) of Steps e)-f) required until the spacing between block-sized matrices is the minimum spacing. Then, Step e) includes initially distributing the plurality of block-sized matrices compared in the search window in response to the value of ME_step, and Step f) includes halving the search window size every iteration.
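The relationship between the iteration count and the search window can be sketched numerically. The helper name `search_schedule`, the default minimum spacing of one pixel, and the assumption of nine candidates per iteration with the center SAD reused are illustrative choices drawn from the fixed-step example in the background, not a definitive statement of the method.

```python
def search_schedule(me_step, min_spacing=1):
    """For a given number of iterations, return:
      - the spacing between candidate matrices at each iteration,
      - the largest displacement reachable along one axis,
      - the number of SAD evaluations (9 in the first iteration,
        8 in each later one, assuming a 3x3 grid with a reused center).
    """
    if me_step < 1:
        raise ValueError("me_step must be at least 1")
    # Spacing starts at min_spacing * 2**(me_step - 1) and halves each time.
    spacings = [min_spacing * 2 ** k for k in range(me_step - 1, -1, -1)]
    max_displacement = sum(spacings)
    sad_evaluations = 9 + 8 * (me_step - 1)
    return spacings, max_displacement, sad_evaluations


for me_step in (1, 2, 3, 4, 5):
    spacings, reach, cost = search_schedule(me_step)
    print(me_step, spacings, reach, cost)
```

Under these assumptions, three iterations reach displacements of up to 7 pixels, four iterations reach 15 pixels at 33 evaluations, and five iterations cost 41 evaluations, which agrees with the figures quoted in the background. A smaller ME_step therefore saves 8 SAD evaluations per omitted iteration.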