High quality video compression is a key enabling technology for digital multimedia transmissions. To illustrate the level of compression required for 2 hours of high quality video, consider the following example. The raw data storage requirements for uncompressed CCIR-601, sampling ratio 4:2:2, serial digital video are approximately 20 megabytes per second. For a 120 minute movie, 144 Gigabytes of storage space are required to account for the video alone, neglecting the space needed for audio. Currently, even a DVD (Digital Versatile Disk) can store only 4.7 Gigabytes of data. A compression ratio of approximately 40:1 is required to fit the video data for a feature film as well as the audio and sub-titles on a single sided disk.
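The arithmetic behind these figures can be checked with a short sketch. The constant names are illustrative, and decimal units (1 Gigabyte = 10^9 bytes) are an assumption; the 20 megabytes per second rate, the 120 minute duration, the 4.7 Gigabyte capacity and the 40:1 ratio are taken from the text above.

```python
# Rough storage arithmetic for the example above (decimal units assumed).
RAW_RATE_BYTES_PER_S = 20e6      # approx. uncompressed CCIR-601 4:2:2 rate
MOVIE_SECONDS = 120 * 60         # 120 minute movie
DVD_CAPACITY_BYTES = 4.7e9       # single sided disk
TARGET_RATIO = 40                # approximate ratio quoted in the text

raw_bytes = RAW_RATE_BYTES_PER_S * MOVIE_SECONDS
video_only_ratio = raw_bytes / DVD_CAPACITY_BYTES
video_bytes_at_target = raw_bytes / TARGET_RATIO

print(f"raw video: {raw_bytes / 1e9:.0f} GB")                        # 144 GB
print(f"ratio if video filled the disk: {video_only_ratio:.1f}:1")   # 30.6:1
print(f"video size at 40:1: {video_bytes_at_target / 1e9:.1f} GB")   # 3.6 GB
```

Filling the whole disk with video would need roughly 30.6:1. At 40:1 the video occupies about 3.6 Gigabytes, which leaves about 1.1 Gigabytes for audio and sub-titles.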
Generally, video sequences contain a significant amount of statistical and subjective redundancy within and between frames. The ultimate goal of video source coding is a bit-rate reduction for storage and transmission. This is done by exploiting both statistical and subjective redundancies to encode a “minimum set” of information using entropy coding techniques. This results in a coded, compressed version of the source video data. The performance of a video coding scheme depends on the redundancy contained in the image data as well as on the compression techniques used for coding. With practical coding schemes, a trade-off exists between coding performance (high compression with sufficient quality) and implementation complexity. A common data reduction scheme is based upon the principle that temporal and spatial redundancy in motion pictures makes up the majority of the perceived visual information. By comparing a current frame with a successive frame and removing as much similar information as possible, data storage and transfer requirements for the motion picture media are reduced. Most changes between the target and reference image can be approximated as a translation of small image regions. Therefore, a key technique called motion compensated prediction is used.
Successive video frames may contain similar objects in different positions. Motion estimation examines the movement of similar objects in successive images to obtain vectors representing the estimated motion. Motion compensation uses the idea of object movement to obtain greater data compression. In interframe coding, motion estimation and compensation have become powerful techniques to eliminate the temporal redundancy caused by the high correlation between consecutive frames. Therefore, motion compensated prediction is widely used in high efficiency video codecs (e.g. MPEG video coding standards) as a prediction technique for temporal Differential Pulse Code Modulation (DPCM) coding.
Conceptually, motion compensation is based on the estimation of the motion of objects between video frames. If approximately all the elements in a video scene are spatially displaced, the motion between frames can be described by a limited number of motion parameters (i.e. by motion vectors for the translatory motion of points in objects). The best prediction of a pixel in the current frame is then a motion compensated prediction from a previously coded frame. The difference between this prediction and the current frame is known as the prediction error. Usually, prediction errors and motion vectors are transmitted to the receiver. However, encoding a prediction error and motion vector for each coded image pixel is generally neither desirable nor necessary. Since the spatial correlation between motion vectors is often high, it is sometimes assumed that one motion vector is representative of the motion of a “block” of pixels. Accordingly, images are usually separated into disjoint blocks. Each block contains numerous adjacent pixels within a region (e.g. 16×16 macroblock in MPEG standards), and a single motion vector is estimated, coded and transmitted for the respective block. The displacement of the macroblock in the vertical and horizontal directions is called a motion vector. After reducing the temporal redundancies between frames through motion compensation, differential coding is used to reduce the total bit requirement by transmitting the difference between the motion vectors of consecutive frames. In essence, only the prediction error images, the difference between original images and motion compensated prediction images, are encoded. Because of the prediction based on a previously coded frame, the correlation between pixels in the motion compensated error images is generally lower than the correlation between pixels in single frames. This error may be coded in a transform domain (e.g. DCT domain).
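The relationship between a motion vector, the motion compensated prediction and the prediction error can be illustrated with a minimal sketch. The function names, the tiny 6×6 frames and the 2×2 block size are hypothetical choices made for readability; they are not taken from any coding standard.

```python
def predict_block(ref, top, left, mv, size):
    """Copy the size x size block of `ref` displaced by mv = (dy, dx)."""
    dy, dx = mv
    return [[ref[top + dy + r][left + dx + c] for c in range(size)]
            for r in range(size)]

def residual(cur_block, pred_block):
    """Prediction error: current block minus motion compensated prediction."""
    return [[c - p for c, p in zip(cr, pr)]
            for cr, pr in zip(cur_block, pred_block)]

def reconstruct(pred_block, res):
    """Decoder side: prediction plus received prediction error."""
    return [[p + e for p, e in zip(pr, er)]
            for pr, er in zip(pred_block, res)]

# Reference frame with distinct pixel values 0..35.
ref = [[r * 6 + c for c in range(6)] for r in range(6)]
# Current frame: the content moved one pixel to the right.
cur = [[row[0]] + row[:-1] for row in ref]
cur[3][3] += 2  # small change that translation alone cannot explain

top, left, size = 2, 2, 2
mv = (0, -1)  # the block came from one pixel to the left in `ref`
cur_block = [row[left:left + size] for row in cur[top:top + size]]
pred = predict_block(ref, top, left, mv, size)
res = residual(cur_block, pred)

print(res)                                  # [[0, 0], [0, 2]]
print(reconstruct(pred, res) == cur_block)  # True
```

Only the motion vector and the mostly zero error block need to be sent for this block; the receiver recovers the block by adding the error to its own prediction.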
After the transform, only a few high frequency coefficients remain in the frequency domain. After the quantization process, the high frequencies require only a small number of bits for representation. Run-length encoding and Huffman encoding are used to encode the transform data into its final form.
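The transform, quantization and run-length steps can be sketched as follows. This is a simplified illustration under stated assumptions: a single uniform quantizer step, a 4×4 block, and a (zero run, value) pair format with an "EOB" end-of-block marker are all choices made here, not details from the text, and the Huffman stage is omitted.

```python
import math

def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block (direct O(n^4) form)."""
    n = len(block)
    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

def quantize(coeffs, step):
    """Uniform quantization: small coefficients collapse to zero."""
    return [[int(round(c / step)) for c in row] for row in coeffs]

def zigzag(block):
    """Scan the block along anti-diagonals, low frequencies first."""
    n = len(block)
    order = sorted(((r, c) for r in range(n) for c in range(n)),
                   key=lambda rc: (rc[0] + rc[1],
                                   rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))
    return [block[r][c] for r, c in order]

def run_length(seq):
    """Encode as (zero_run, value) pairs; trailing zeros become 'EOB'."""
    pairs, zeros = [], 0
    for v in seq:
        if v == 0:
            zeros += 1
        else:
            pairs.append((zeros, v))
            zeros = 0
    pairs.append("EOB")
    return pairs

def encode_block(block, step):
    return run_length(zigzag(quantize(dct_2d(block), step)))

smooth = [[52, 54, 56, 58],
          [53, 55, 57, 59],
          [54, 56, 58, 60],
          [55, 57, 59, 61]]
print(encode_block(smooth, step=4))
```

For a smooth block such as this one, most quantized high frequency coefficients are zero, so the run-length output is much shorter than the 16 input samples.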
In real video scenes, motion within a scene includes a complex combination of translation and rotation. Such combined translatory and rotational motion is difficult to estimate and may require large amounts of processing. However, translatory motion is easily estimated and has been used successfully for motion compensated coding. Therefore, prior art motion estimation algorithms make the following assumptions: objects move in translation in a plane that is parallel to the camera plane (i.e., the effects of camera zoom and object rotations are not considered); illumination is spatially and temporally uniform; and occlusion of one object by another, as well as uncovered background, is neglected.
There are two mainstream motion estimation algorithms, the Pel-Recursive Algorithm (PRA) and the Block-Matching Algorithm (BMA). The PRA iteratively refines the motion estimate for individual pixels using gradient methods. The BMA assumes that all pixels within a block have the same motion. The BMA estimates motion on the basis of rectangular blocks and produces one motion vector for each block. The PRA involves more computational complexity and less regularity than the BMA, which normally makes it slow and unstable. The BMA is therefore more suitable for practical use because of its regularity and simplicity.
The BMA divides each frame into non-overlapping blocks, each consisting of luminance and chrominance blocks. Motion estimation is only performed on the luminance block for coding efficiency. Each luminance block in the current frame is then matched against candidate blocks in a search area of the subsequent frame. These candidate blocks are merely displaced versions of the original block. Upon finding the best candidate block (lowest distortion, i.e., most matched), the displacement is recorded as a motion vector. In an interframe coder, the prediction formed from the reference frame is subtracted from the input frame to obtain the prediction error. Consequently, the motion vector and the resulting error can be transmitted instead of the original luminance block. Therefore, the interframe coder using the BMA removes interframe redundancy and achieves data compression. In the receiver, the decoder builds the frame difference signal from the received data and adds it to the reconstructed reference frames. The more accurate the prediction, the smaller the error signal and hence the smaller the transmission bit rate.
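The block matching search described above can be sketched as an exhaustive (full) search. The sum of absolute differences (SAD) is assumed here as the distortion measure, since the text does not name one; the function names, the frame called `ref` (whichever frame is being searched), the 8×8 frames and the search range are likewise illustrative.

```python
def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences between the current block and the
    candidate block displaced by (dy, dx) in the searched frame."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(cur[top + r][left + c]
                         - ref[top + dy + r][left + dx + c])
    return total

def full_search(cur, ref, top, left, size, search_range):
    """Try every displacement in the search area; keep the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best_mv = (0, 0)
    best_cost = sad(cur, ref, top, left, 0, 0, size)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(cur, ref, top, left, dy, dx, size)
            if cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost

# A 2x2 object on a black background, at different positions in each frame.
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
obj = [[50, 60], [70, 80]]
for r in range(2):
    for c in range(2):
        ref[2 + r][2 + c] = obj[r][c]   # object at rows 2-3, cols 2-3
        cur[3 + r][4 + c] = obj[r][c]   # object at rows 3-4, cols 4-5

mv, cost = full_search(cur, ref, top=3, left=4, size=2, search_range=2)
print(mv, cost)  # (-1, -2) 0
```

The cost of this approach grows with the square of the search range, which is why the regularity of the BMA matters in practice: every block runs the same simple loop.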
However, the feedback loop for temporal prediction in traditional non-overlapped block video compression implementations requires conversion of images from the spatial domain to the transform domain (e.g. the frequency domain in the DCT transform). Due to the principles of the lossy transform, information will be lost in the reconstructed frame, which increases the prediction error and the blocking effect. The blocking effect is the visual representation of images as blocks of pixels. When smaller pixel blocks are used, the blocking effect is less perceivable than when larger pixel blocks are used. The input information for the lossy transform is obtained from the motion estimation during the interframe compression of subsequent frames. Therefore, the accuracy of the motion compensation procedure directly affects the reconstructed video quality. Poor motion estimation accuracy will lead to poor video compression quality and a highly perceivable blocking effect.
Therefore, it is desirable to introduce a system which provides an improved motion estimation technique while preserving the quality of original video.
A system according to invention principles addresses these deficiencies and associated problems.