Multimedia, which combines content types such as text, audio, and video, provides an outstanding business and revenue opportunity for network operators. The availability of higher bandwidth and the use of packet-switched Internet Protocol (IP) technology have made it possible to transmit richer content that includes various combinations of text, voice, still and animated graphics, photos, video clips, and music. To capitalize on this market potential, network operators must meet customers' expectations regarding quality and reliability. Transcoding of media at the server level is crucial for rendering multimedia applications in today's heterogeneous networks composed of mobile terminals, cell phones, computers, and other electronic devices. The adaptation and transcoding of media must be performed at the service provider level because individual devices are often resource constrained and are rarely capable of adapting the media themselves. This is an important problem for service providers, as they will have to face very steep traffic growth in the next few years, growth that far exceeds the speedup one can obtain from new hardware alone. A brute-force approach of increasing the number of servers is not sufficient. Moreover, an increase in the number of servers leads to proportional increases in power consumption, heat dissipation, and space requirements. Another way to improve system performance and handle the large growth in traffic is to devise smart techniques for video coding, which forms an important and resource-intensive phase of multimedia adaptation.
Motion compensated video coding processes scenes consisting of blocks, where each block consists of a number of pixels. Essentially all modern video codecs use motion compensated coding, in which frames are encoded relative to a number of preceding frames to exploit temporal dependencies and achieve better compression. The most computationally intensive phase of motion compensated video coding is the motion estimation phase. This is performed through a motion estimation algorithm that estimates the displacements of the scene's objects from one frame to the next. These estimates are used to create a synthetic frame in which the scene is deformed to match the estimated motion of objects. That synthetic frame is used as a predictor for the current frame, which is differentially encoded. Such motion estimation algorithms account for a very large part of the encoder's runtime, increasingly so as resolution grows, making them a natural target for optimization.
A considerable amount of effort has been directed towards the problem of block-based motion estimation, a simplification of the general problem in which the prediction frame is constructed from small rectangular regions copied from reference frames. A discussion of block-based motion estimation is provided next. For the explanation provided in this document we assume that the basic blocks are 16×16 pixels. Note that the same concepts are applicable to blocks of different sizes. The objective of the system is to produce a predicted frame for the current frame being encoded. This predicted frame is built from a given reference frame and is then used to differentially encode the current frame. For each 16×16 block in the current frame, the system looks for the best matching block in the reference frame. The search examines a number of blocks (not necessarily aligned on 16×16 boundaries) in the reference frame and selects the block that minimizes the difference with the current block. The motion vector, a key element in the motion estimation process, is simply the offset of the best matching block (in the reference frame) relative to the current block's position (in the current frame). The best matching block is then copied into the compensated frame, or predicted frame, at the current block's position. After this process, the predicted frame is the best approximation (according to the chosen metric measuring the difference between image blocks) one can build from the reference frame, given that only block copies are allowed. The compensated frame is used as the predictor to differentially encode the current frame.
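The block-matching procedure described above can be illustrated with a minimal sketch. It assumes an exhaustive search over a small window and uses the sum of absolute differences (SAD) as the metric. The function names, the `search` window size, and the representation of frames as lists of pixel rows are illustrative assumptions, not part of the described system.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """SAD between the n x n block of `cur` at (cx, cy) and of `ref` at (rx, ry)."""
    total = 0
    for dy in range(n):
        cur_row = cur[cy + dy]
        ref_row = ref[ry + dy]
        for dx in range(n):
            total += abs(cur_row[cx + dx] - ref_row[rx + dx])
    return total


def best_motion_vector(cur, ref, cx, cy, n=16, search=7):
    """Exhaustively test every offset within +/- `search` pixels (assumed window).

    Returns ((mx, my), cost): the offset of the best matching reference block
    relative to the current block's position, and its SAD.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for my in range(-search, search + 1):
        for mx in range(-search, search + 1):
            rx, ry = cx + mx, cy + my
            # Candidate blocks need not be aligned, but must lie inside the frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, (mx, my))
    return best[1], best[0]


def predict_frame(cur, ref, n=16, search=7):
    """Build the predicted (compensated) frame using block copies only."""
    height, width = len(cur), len(cur[0])
    pred = [[0] * width for _ in range(height)]
    vectors = {}
    for cy in range(0, height - n + 1, n):
        for cx in range(0, width - n + 1, n):
            (mx, my), _ = best_motion_vector(cur, ref, cx, cy, n, search)
            vectors[(cx, cy)] = (mx, my)
            # Copy the best matching block to the current block's position.
            for dy in range(n):
                pred[cy + dy][cx:cx + n] = ref[cy + my + dy][cx + mx:cx + mx + n]
    return pred, vectors
```

Even in this toy form, nearly all of the work is spent inside `sad`, which is called once per candidate offset for every block. This is why the cost of the metric dominates the cost of motion estimation.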
A brief discussion of selected prior art references is presented below. Research has taken a number of different directions. S. Borman, M. Robertson, R. L. Stevenson, “Block Matching Sub-Pixel Motion Estimation from Noisy, Undersampled Frames”, SPIE Visual Communications and Image Processing Conference, 1999, presents an empirical study of the effects of noise and sampling error on SAD, MSE, and NCF. The paper, W. Li, E. Salari, “Successive Elimination Algorithm for Motion Estimation”, IEEE Transactions on Image Processing, Volume 4, Issue 1, January 1995, pages 105-107, explores the properties of SAD and MSE for devising a dynamic-programming-like method for fast motion estimation. The authors focus on an algorithm that does not require an exhaustive search of the solution space and discuss how properties of existing metrics can be used; they do not propose any new metric. F. Tombari, S. Mattoccia, L. Di Stefano, “Template Matching Based on Lp Norm Using Sufficient Conditions with Incremental Approximation”, IEEE International Conference on Video and Signal Based Surveillance, November 2006, page 20, extends the work of Li and Salari. The paper uses a similar dynamic-programming approach to compute a fast version of a metric.
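To make concrete the kind of metric property that such elimination methods exploit, the sketch below uses a well-known lower bound on SAD: the absolute difference between the pixel sums of two blocks never exceeds their SAD. This is an illustrative reading of the successive elimination idea and not a reproduction of the cited algorithm. In particular, the block sums are recomputed naively here, whereas a practical implementation would obtain them incrementally. All names are hypothetical.

```python
def block_sum(block):
    """Sum of all pixels in a block given as a list of rows."""
    return sum(sum(row) for row in block)


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def elimination_search(cur_block, candidates):
    """Find the candidate with the smallest SAD, skipping provably worse ones.

    `candidates` is a list of (label, block) pairs. Returns
    (best_label, best_sad, evaluated), where `evaluated` counts the full
    SAD computations actually performed.
    """
    cur_sum = block_sum(cur_block)
    best_label, best_sad, evaluated = None, None, 0
    for label, cand in candidates:
        # |sum(cur) - sum(cand)| <= SAD(cur, cand), so this is a lower bound.
        bound = abs(cur_sum - block_sum(cand))
        if best_sad is not None and bound >= best_sad:
            continue  # cannot beat the current best; skip the full SAD
        evaluated += 1
        cost = sad(cur_block, cand)
        if best_sad is None or cost < best_sad:
            best_label, best_sad = label, cost
    return best_label, best_sad, evaluated
```

The result is identical to that of an exhaustive search, because a candidate is skipped only when its lower bound already rules it out.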
U. Koc and K. J. R. Liu, “Interpolation-free Subpixel Motion Estimation Technique in DCT Domain”, IEEE Transactions on Circuits and Systems for Video Technology, Volume 8, Issue 4, August 1998, pages 460-487, focuses on subpixel-level estimation and tries to avoid subpixel interpolation in the space domain by using techniques in the DCT domain that are at least as complex as the techniques used in the space domain. The metric is extended appropriately for handling the shift to the DCT domain. S. Lee, S.-Ik Chae, “Two-step Motion Estimation Algorithm using Low Resolution Quantization”, International Conference on Image Processing, Volume 3, September 1996, pages 795-798, focuses on motion estimation techniques. This paper presents a “fail fast” approach to SAD matching. The image is first quantized so that the precision of each pixel is reduced, for example from 8 bits per pixel to 4 bits per pixel. A first function compares the two blocks using the reduced-precision version. If the results are acceptable, it proceeds to a full-precision metric. Although the research is presented with a hardware implementation in mind, it does not consider the effective utilization of a Single Instruction Multiple Data (SIMD) instruction set that includes SAD when the processor running the code provides such a facility. An important aspect of this invention is to reduce the time required to compute the metric by using such performance-optimizing SIMD instruction sets, which are provided in commercial processors available on the market today.
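The “fail fast” idea described above can be sketched as follows. This is a minimal illustration, not the cited hardware design: the acceptance test is assumed to be a simple threshold on the reduced-precision SAD, and `coarse_threshold`, `drop_bits`, and the function names are hypothetical.

```python
def quantize(block, drop_bits=4):
    """Reduce pixel precision by dropping the low-order bits (8 -> 4 bits by default)."""
    return [[p >> drop_bits for p in row] for row in block]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def two_step_sad(cur_block, cand_block, coarse_threshold, drop_bits=4):
    """Return the full-precision SAD, or None if the coarse test rejects the candidate.

    Step 1 compares reduced-precision versions of the blocks. Step 2 runs
    only when the coarse result is acceptable (assumed: <= coarse_threshold).
    """
    coarse = sad(quantize(cur_block, drop_bits), quantize(cand_block, drop_bits))
    if coarse > coarse_threshold:
        return None  # fail fast: skip the full-precision comparison
    return sad(cur_block, cand_block)
```

In a real encoder the quantized frames would be computed once and reused across all candidates. They are recomputed on every call here only to keep the sketch short.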
The research reported in C.-K. Cheung, L.-M. Po, “A Hierarchical Block Motion Estimation Algorithm using Partial Distortion Measure”, International Conference on Image Processing, Volume 3, October 1997, pages 606-609, uses pixel sampling on a regular grid, which is strictly equivalent to ordinary sub-sampling. They compute SAD/MSE using ½ or ¼ of the pixels (either in a quincunx pattern, or one in two columns and one in two rows). Each candidate block is first checked using a SAD computed on the ¼ grid. If it is among the n best candidates, it is kept for the next round, in which a ½ grid density is used. Of the n candidates retained from the previous round, the m best are kept and thoroughly checked with a full SAD. Unfortunately, the approach proposed by Cheung and Po cannot effectively utilize SIMD-type parallel operations.
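The three-round filtering described above can be sketched as follows. The assignment of sampling patterns to rounds (one in two columns and rows for the ¼ grid, quincunx for the ½ grid) is an assumption made for illustration, as are the function names and the representation of candidates as (label, block) pairs.

```python
def partial_sad(a, b, keep):
    """SAD restricted to the pixel positions (x, y) for which keep(x, y) is true."""
    return sum(abs(a[y][x] - b[y][x])
               for y in range(len(a))
               for x in range(len(a[0]))
               if keep(x, y))


def quarter_grid(x, y):
    """One in two columns and one in two rows: 1/4 of the pixels."""
    return x % 2 == 0 and y % 2 == 0


def quincunx(x, y):
    """Checkerboard pattern: 1/2 of the pixels."""
    return (x + y) % 2 == 0


def full_grid(x, y):
    """Every pixel: the ordinary full SAD."""
    return True


def hierarchical_match(cur_block, candidates, n, m):
    """Keep the n best on the 1/4 grid, then the m best on the 1/2 grid,
    then return the label of the best survivor under the full SAD.

    `candidates` must be a non-empty list of (label, block) pairs.
    """
    survivors = list(candidates)
    for keep, count in ((quarter_grid, n), (quincunx, m), (full_grid, 1)):
        survivors.sort(key=lambda cand: partial_sad(cur_block, cand[1], keep))
        survivors = survivors[:count]
    return survivors[0][0]
```

The per-pixel `keep` test in `partial_sad` reads scattered pixels rather than contiguous runs, which hints at why such sampling is awkward to map onto SIMD instructions that operate on adjacent bytes.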
The research reported in Y.-L. Chan, W.-C. Siu, “New Adaptive Pixel Decimation for Block Motion Vector Estimation”, IEEE Transactions on Circuits and Systems for Video Technology, Volume 6, Issue 1, February 1996, pages 113-118, is similar to the paper by Cheung and Po. However, Chan and Siu use different sampling patterns: regular patterns, excluding quincunx. They consider patterns of density ¼ and 1/9 (one pixel in each 2×2 or 3×3 group), and they are not concerned with sub-pixel estimation.
Thus, various forms of the metric measuring the difference between image blocks, referred to as “the metric” in the following discussion, have been used in existing codecs for block comparison. Irrespective of the exact metric used, its computation is expensive.
Therefore, there is a need in the industry for an improved and effective method and system for fast computation of the metric measuring the difference between image blocks.