1. Field of the Invention
The invention relates generally to the field of processor chips and specifically to the field of single-instruction multiple-data (SIMD) processors. More particularly, the present invention relates to motion estimation in a SIMD processing system.
2. Description of the Background Art
Motion estimation is a basic bandwidth compression method used in video-coding systems. It is used by the MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and H.264 video compression standards. Block matching is typically based on the Sum-of-Absolute-Differences (SAD) between a reference block of 16 by 16 luma pixels and a candidate block of 16 by 16 pixels, because SAD is easier to implement than the Mean-Square-Error (MSE). SAD subtracts all corresponding pixel values, takes the absolute value of each difference, and then sums all 256 values. The lower the value, the better the match; zero represents a perfect match. Motion estimation is done by testing different candidate positions to see which one best matches the reference block.
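The SAD computation and the exhaustive testing of candidate positions described above can be sketched in a few lines. This is a minimal scalar illustration only, not an implementation taken from any standard or product; the function names `sad`, `block_at`, and `full_search` and the small test frame are hypothetical, and a 4 by 4 block is used in place of 16 by 16 to keep the example short.

```python
def sad(ref, cand):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(r - c)
               for ref_row, cand_row in zip(ref, cand)
               for r, c in zip(ref_row, cand_row))


def block_at(frame, top, left, size):
    """Return the size-by-size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def full_search(ref, frame):
    """Test every candidate position; return (best_sad, top, left)."""
    size = len(ref)
    best = None
    for top in range(len(frame) - size + 1):
        for left in range(len(frame[0]) - size + 1):
            score = sad(ref, block_at(frame, top, left, size))
            # Lower SAD means a better match; keep the lowest seen so far.
            if best is None or score < best[0]:
                best = (score, top, left)
    return best


# Hypothetical 12-by-12 frame with a deterministic pixel pattern.
frame = [[(31 * y + 17 * x) % 256 for x in range(12)] for y in range(12)]
ref = block_at(frame, 3, 5, 4)      # reference block copied from the frame
best_sad, best_top, best_left = full_search(ref, frame)
print(best_sad, best_top, best_left)
```

Because the reference block is copied directly from the frame, the search ends with a SAD of zero at a position whose block is identical to the reference.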
Besides video encoding, motion estimation is also used in other applications, including video stabilization in digital camcorders, stitching of multiple digital shots together, and Automatic Target Recognition (ATR) in military applications.
Block matching using SAD of 16 by 16 blocks is by far the most processing-intensive task in video compression. Current systems use dedicated hardware blocks with different levels of parallelism to calculate SAD, because the SAD processing requirement exceeds the processing power of the fastest RISC or DSP processors. For example, calculating the SAD for a full search window of +/−32 pixels in both the horizontal and vertical dimensions requires about 152 billion operations for 30 frames per second at CCIR-601 resolution, which has 1620 such reference blocks per frame.
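The text gives only the total and the number of reference blocks, not the breakdown. One set of assumptions that reproduces a figure of this size is sketched below: a 720 by 576 frame (which yields 1620 blocks of 16 by 16), 64 candidate positions per axis for the +/−32 window, three operations per pixel (subtract, absolute value, accumulate), and a total read as operations per second. All of these constants are assumptions made for illustration.

```python
# Assumed breakdown; the source states only the total and the block count.
FRAME_WIDTH, FRAME_HEIGHT = 720, 576   # assumed frame size; yields 1620 blocks
BLOCK_SIZE = 16                        # 16-by-16 luma blocks, as in the text
CANDIDATES_PER_AXIS = 64               # assumed positions per axis for +/-32
OPS_PER_PIXEL = 3                      # assumed: subtract, abs, accumulate
FRAMES_PER_SECOND = 30                 # as in the text

blocks_per_frame = (FRAME_WIDTH // BLOCK_SIZE) * (FRAME_HEIGHT // BLOCK_SIZE)
candidates_per_block = CANDIDATES_PER_AXIS ** 2
ops_per_sad = BLOCK_SIZE * BLOCK_SIZE * OPS_PER_PIXEL
ops_per_second = (blocks_per_frame * candidates_per_block
                  * ops_per_sad * FRAMES_PER_SECOND)

print(blocks_per_frame)                                  # 1620
print(f"{ops_per_second / 1e9:.1f} billion operations")  # 152.9 billion operations
```

Under these assumptions the product is 1620 × 4096 × 768 × 30, or roughly 152.9 billion, which is consistent with the figure quoted above.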
Therefore, most high-quality video encoding chips have dedicated hardware blocks that calculate a list of motion vectors indicating the best-match position for each luma block in a video frame. Even in this case, smaller search areas and hierarchical search at lower resolution are sometimes used to lower processing requirements. For example, first every fourth pixel position is searched for the best possible match. Then only the neighborhood of that best match is searched. This cuts down processing by a factor of 16.
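The two-stage search described above can be sketched as follows. This is an illustrative scalar version under stated assumptions, not the method of any particular chip: the coarse stage visits every fourth position in both dimensions (one sixteenth of the candidates), and the refinement stage is assumed to cover the positions within three pixels of the coarse winner. The names `sad` and `two_stage_search` and the test frame are hypothetical.

```python
def sad(ref, frame, top, left):
    """SAD between ref and the same-sized block of frame at (top, left)."""
    size = len(ref)
    return sum(abs(ref[i][j] - frame[top + i][left + j])
               for i in range(size) for j in range(size))


def two_stage_search(ref, frame, step=4):
    """Coarse search on a step-spaced grid, then refine around the winner.

    Returns ((best_sad, top, left), number_of_candidates_evaluated).
    """
    size = len(ref)
    max_top = len(frame) - size
    max_left = len(frame[0]) - size
    evaluated = 0

    # Stage 1: test only every step-th position in each dimension.
    best = None
    for top in range(0, max_top + 1, step):
        for left in range(0, max_left + 1, step):
            score = sad(ref, frame, top, left)
            evaluated += 1
            if best is None or score < best[0]:
                best = (score, top, left)

    # Stage 2: test the positions the coarse grid skipped around the winner.
    _, c_top, c_left = best
    for top in range(max(0, c_top - step + 1),
                     min(max_top, c_top + step - 1) + 1):
        for left in range(max(0, c_left - step + 1),
                          min(max_left, c_left + step - 1) + 1):
            if (top, left) == (c_top, c_left):
                continue  # already evaluated in stage 1
            score = sad(ref, frame, top, left)
            evaluated += 1
            if score < best[0]:
                best = (score, top, left)
    return best, evaluated


# Hypothetical 20-by-20 frame; the reference block sits on the coarse grid.
frame = [[(31 * y + 17 * x) % 256 for x in range(20)] for y in range(20)]
ref = [row[4:8] for row in frame[8:12]]
(best_sad, best_top, best_left), evaluated = two_stage_search(ref, frame)
print(best_sad, evaluated)
```

In this small example a full search would test 17 × 17 = 289 positions, while the two-stage search tests at most 25 + 48 = 73. The saving comes at a cost: if the coarse stage picks the wrong neighborhood, the refinement stage cannot recover the true best match.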
The problem with such dedicated hardware blocks is that they lack the flexibility of a software solution. They also require large amounts of data to be “shipped” to the dedicated motion-estimation hardware and the results to be read back by a processor.
SIMD and VLIW processors that exist today also perform motion estimation, but only over reduced search areas chosen on the basis of certain assumptions. One such approach searches only the neighborhood of expected locations, which are predicted from neighboring blocks that have already been processed. Such processors can calculate SAD values for 8, 16, or 32 pixels in each clock cycle. Also, the bookkeeping of X-Y locations and best-match values is performed as scalar operations, thereby further reducing the efficiency of a software implementation, because most of the parallel hardware execution units stay idle during these scalar operations.
Reduced search areas and imperfect search results do not produce incorrect output, but they reduce compression efficiency and thus the resulting video quality. New video coding techniques also require sub-pixel block matching down to ⅛-pixel resolution. This further increases the complexity of both hardware and software solutions.