1. Field
This disclosure is directed to a method and an apparatus for encoding video data.
2. Description of the Related Art
Video formats supporting various frame rates exist today. The following formats are currently the most prevalent, listed in order by their supported frames per second (fps): 24 (film native), 25 (PAL), 30 (typically interlaced video), and 60 (High Definition (HD), e.g., 720p). Although these frame rates are suitable for most applications, to reach the low bandwidth required for mobile handset video communications, frame rates are sometimes dropped to rates as low as 15, 10, 7.5, or 3 fps. Although these low rates allow low-end devices with lower computational capabilities to display some video, the resulting video quality suffers from “jerkiness” (i.e., having a slide show effect), rather than being smooth in motion. Also, the frames dropped often do not correctly track the amount of motion in the video. For example, fewer frames should be dropped during “high-motion” video content portions such as those occurring in sporting events, while more frames may be dropped during “low-motion” video content segments such as those occurring in talk shows. Video compression is content dependent, and it would be desirable to be able to analyze and incorporate motion and texture characteristics in the sequence to be coded so as to improve video compression efficiency.
Frame Rate Up Conversion (FRUC) is a process of using video interpolation at the video decoder to increase the frame rate of the reconstructed video. In FRUC, interpolated frames are created using received frames as references. Current systems implementing FRUC frame interpolation (the resulting frames being hereinafter “interpolated frames”) include approaches based on motion compensated interpolation and on the processing of transmitted motion vectors. FRUC is also used in converting between various video formats. For example, in Telecine and Inverse Telecine applications, which are film-to-videotape transfer techniques that rectify the respective color frame rate differences between film and video, progressive video (24 frames/second) is converted to NTSC interlaced video (29.97 frames/second).
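By way of illustration only, the frame rate arithmetic of the Telecine example can be sketched with the conventional 2:3 pulldown cadence. The cadence is not named in the text above and is assumed here; the function name `pulldown_2_3` and the frame labels are hypothetical.

```python
from fractions import Fraction

def pulldown_2_3(film_frames):
    """Spread progressive film frames over interlaced fields in a 2:3 cadence."""
    fields = []
    for index, frame in enumerate(film_frames):
        # Even-indexed frames fill 2 fields, odd-indexed frames fill 3 fields.
        repeat = 2 if index % 2 == 0 else 3
        fields.extend([frame] * repeat)
    # Pair consecutive fields into interlaced video frames.
    return [(fields[i], fields[i + 1]) for i in range(0, len(fields) - 1, 2)]

video_frames = pulldown_2_3(["A", "B", "C", "D"])

# 4 film frames become 5 video frames, so 24 fps becomes a nominal 30 fps;
# NTSC runs at 1000/1001 of that nominal rate.
ntsc_rate = Fraction(24) * Fraction(5, 4) * Fraction(1000, 1001)

print(video_frames)
print(float(ntsc_rate))
```

In this sketch four film frames yield five video frames, two of which mix fields from different film frames, and the resulting rate of 30000/1001 frames/second is the 29.97 figure quoted above.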
Another FRUC approach uses weighted-adaptive motion compensated interpolation (WAMCI) to reduce the block artifacts caused by the deficiencies of motion estimation and block-based processing. This approach is based on an interpolation by the weighted sum of multiple motion compensated interpolation (MCI) images. The block artifacts on the block boundaries are also reduced in the proposed method by applying a technique similar to overlapped block motion compensation (OBMC). Specifically, to reduce blurring during the processing of overlapped areas, the method uses motion analysis to determine the type of block motion and applies OBMC adaptively. Experimental results indicate that the proposed approach achieves improved results, with significantly reduced block artifacts.
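The weighted sum of multiple MCI images may be sketched as follows. This is a minimal illustration only: the text above does not state how the weights are derived, so the weights, the function name `weighted_mci`, and the sample pixel values are all assumptions.

```python
def weighted_mci(candidates, weights):
    """Blend equally sized candidate images (lists of rows) pixel by pixel."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    height, width = len(candidates[0]), len(candidates[0][0])
    blended = []
    for y in range(height):
        row = []
        for x in range(width):
            # Normalized weighted sum of the co-located pixels.
            value = sum(w * img[y][x] for img, w in zip(candidates, weights)) / total
            row.append(value)
        blended.append(row)
    return blended

# Two hypothetical MCI candidates for the same interpolated frame.
forward = [[100, 100], [100, 100]]
backward = [[120, 80], [100, 140]]
result = weighted_mci([forward, backward], weights=[3.0, 1.0])
print(result)  # [[105.0, 95.0], [100.0, 110.0]]
```

A candidate judged more trustworthy would receive a larger weight, so that its pixels dominate the blend.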
Yet another FRUC approach uses vector reliability analysis to reduce artifacts caused by the use of any motion vectors that are inaccurately transmitted from the encoder. In this approach, motion estimation is used to construct motion vectors that are compared to transmitted motion vectors so as to determine the most desired approach for frame interpolation. In conventional up-conversion algorithms using motion estimation, the estimation process is performed using two adjacent decoded frames to construct the motion vectors that will allow a frame to be interpolated. However, these algorithms attempt to improve utilization of transmission bandwidth without regard for the amount of calculation required for the motion estimation operation. In comparison, in up-conversion algorithms using transmitted motion vectors, the quality of the interpolated frames depends largely on the motion vectors that are derived by the encoder. Using a combination of the two approaches, the transmitted motion vectors are first analyzed to decide whether they are usable for constructing interpolation frames. The method used for interpolation is then adaptively selected from three methods: local motion-compensated interpolation, global motion-compensated interpolation and frame-repeated interpolation.
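One way the adaptive selection among the three methods might be organized is sketched below. The decision rule, the thresholds `tolerance` and `reliable_share`, and the function name are hypothetical and are not taken from the approach described above; the sketch only illustrates comparing transmitted vectors against decoder-estimated vectors before choosing a method.

```python
def select_interpolation_method(transmitted, estimated, tolerance=1, reliable_share=0.8):
    """Pick an interpolation method for one frame (hypothetical rule).

    transmitted, estimated: per-block motion vectors as (dx, dy) tuples.
    """
    if not transmitted or len(transmitted) != len(estimated):
        return "frame_repeat"
    # Count blocks whose transmitted vector agrees with the estimated one.
    agreeing = sum(
        1
        for (tx, ty), (ex, ey) in zip(transmitted, estimated)
        if abs(tx - ex) <= tolerance and abs(ty - ey) <= tolerance
    )
    if agreeing / len(transmitted) >= reliable_share:
        return "local_mci"
    # Transmitted vectors look unreliable: fall back to a single global
    # vector only if the estimated motion is nearly uniform.
    mean_x = sum(ex for ex, _ in estimated) / len(estimated)
    mean_y = sum(ey for _, ey in estimated) / len(estimated)
    uniform = all(
        abs(ex - mean_x) <= tolerance and abs(ey - mean_y) <= tolerance
        for ex, ey in estimated
    )
    return "global_mci" if uniform else "frame_repeat"

print(select_interpolation_method([(2, 0), (2, 1), (3, 0), (2, 0)], [(2, 0)] * 4))
```

Frame repetition serves as the fallback in this sketch because it requires no motion information at all.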
Because FRUC techniques are generally implemented as post-processing functions in the video decoder, the video encoder is typically not involved in this operation. However, in an approach referred to as encoder-assisted FRUC (EA-FRUC), the encoder can determine whether transmission of certain information related to motion vectors or reference frames (e.g., residual data) may be eliminated while still allowing the decoder to autonomously regenerate major portions of frames without the eliminated vector or residual data. For example, a bidirectional predictive video coding method has been introduced as an improvement to B-frame coding in MPEG-2. In this method, the use of an error criterion is proposed to enable the application of true motion vectors in motion-compensated predictive coding. The distortion measure is based on the sum of absolute differences (SAD), but this distortion measure is known to be insufficient in providing a true distortion measure, particularly where the amount of motion between two frames in a sequence is to be quantified. Additionally, the variations are classified using fixed thresholds when, optimally, these thresholds should be variable, as the classifications are preferably content dependent.
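For reference, the sum of absolute differences between two equally sized blocks can be computed as in the following minimal sketch. The block contents are invented for illustration, and the function name `sad` is an assumption.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )

current = [[10, 20], [30, 40]]
spread_error = [[12, 18], [30, 45]]      # small errors spread over several pixels
single_error = [[19, 20], [30, 40]]      # one larger error in a single pixel

print(sad(current, spread_error))  # 9
print(sad(current, single_error))  # 9
```

The two candidate blocks differ from the current block in structurally different ways yet yield the same SAD, which is one simple way to see why SAD alone may not characterize distortion fully.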
FRUC video compression techniques, including those employing encoder enhanced information, use block-based motion prediction with translational motion models to model the motion of objects within video frames. Block-based motion prediction exploits the temporal correlation structure inherent to video signals. Translational motion modeling as used by block-based motion prediction may reduce or eliminate temporal redundancy in video signals for bodies which retain a rigid shape while going through translational motion in a plane more or less parallel to the lens of the video capturing device. The translational motion model uses two parameters per encoded block.
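The two parameters per block are the horizontal and vertical displacement of the block. A minimal sketch of block-based translational motion search follows, assuming an exhaustive search and an absolute-difference matching cost; the frame contents, helper names, and search range are hypothetical.

```python
def make_frame(height, width, patch_top, patch_left, patch_size=2, value=200):
    """Build a dark test frame containing one bright square patch."""
    frame = [[0] * width for _ in range(height)]
    for y in range(patch_top, patch_top + patch_size):
        for x in range(patch_left, patch_left + patch_size):
            frame[y][x] = value
    return frame

def block_cost(reference, current, top, left, size, dx, dy):
    """Absolute-difference cost of predicting a block with shift (dx, dy)."""
    return sum(
        abs(current[top + y][left + x] - reference[top + dy + y][left + dx + x])
        for y in range(size)
        for x in range(size)
    )

def best_translation(reference, current, top, left, size, search_range):
    """Exhaustively search for the (dx, dy) with the lowest cost."""
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ref_top, ref_left = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if ref_top < 0 or ref_left < 0:
                continue
            if ref_top + size > len(reference) or ref_left + size > len(reference[0]):
                continue
            cost = block_cost(reference, current, top, left, size, dx, dy)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best

reference = make_frame(6, 6, patch_top=1, patch_left=1)
current = make_frame(6, 6, patch_top=2, patch_left=3)
print(best_translation(reference, current, top=2, left=3, size=2, search_range=2))
# (0, -2, -1): the block is found 2 pixels left and 1 pixel up in the reference
```

Because the patch moves rigidly, a single (dx, dy) pair predicts the block with zero residual in this example.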
In motion-compensated prediction and transform coding based hybrid video compression, video frames are partitioned by conventional encoders according to use of the translational motion model, where partitions are generated in order to locate object bodies retaining a rigid shape while undergoing translational motion. For example, a video sequence of a person talking to the camera while a car passes by may be partitioned into objects including a still image representing a fixed background for the sequence, a video object representing the talking person's head, an audio object representing the voice associated with the person, and another video object representing the moving car as a sprite with a rectangular region of support. The location of the sprite on the still image may move temporally.
Unfortunately, translational model motion prediction cannot accurately predict or describe motion for objects in motion requiring more than two parameters per block. Independently moving objects in combination with camera motion and focal length change lead to a complicated motion vector field that has to be approximated efficiently for motion prediction. Consequently, the residual signal (also known as the prediction error) has considerable power and therefore video frames containing such movement are not efficient to compress. When video frames containing such objects are interpolated using block-based motion prediction, both the subjective and objective quality of the interpolated frame are low because the translational motion model framework is limited in its ability to describe block motion dynamics. Furthermore, when video sequences are partitioned according to translational model motion prediction, the efficiency of algorithms which handle the interpolations of objects undergoing arbitrary motion and deformations is limited.
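The limitation can be illustrated with a block that rotates about its centre. Rotation is chosen here only as an example of motion needing more than two parameters; it is not named in the text above, and the block size, angle, and function name are assumptions. A rotation is exactly representable by a six-parameter affine model, whereas the best single translation vector in the least-squares sense is the mean displacement over the block.

```python
import math

def rotation_displacements(size, angle_degrees):
    """Displacement (dx, dy) of each pixel of a size x size block
    rotated about its centre by the given angle."""
    angle = math.radians(angle_degrees)
    centre = (size - 1) / 2
    field = {}
    for y in range(size):
        for x in range(size):
            rx, ry = x - centre, y - centre
            new_x = math.cos(angle) * rx - math.sin(angle) * ry
            new_y = math.sin(angle) * rx + math.cos(angle) * ry
            field[(x, y)] = (new_x - rx, new_y - ry)
    return field

field = rotation_displacements(size=4, angle_degrees=10)

# Best single translation vector (least squares) is the mean displacement.
mean_dx = sum(dx for dx, _ in field.values()) / len(field)
mean_dy = sum(dy for _, dy in field.values()) / len(field)

# Largest per-pixel error left over after applying that single vector.
worst_error = max(
    math.hypot(dx - mean_dx, dy - mean_dy) for dx, dy in field.values()
)
print(mean_dx, mean_dy, worst_error)
```

The mean displacement is essentially zero, so the translational model predicts no motion at all, while the corner pixels of the block have in fact moved by a noticeable fraction of a pixel; that unmodeled motion ends up in the residual signal.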
What is desirable is an approach that provides high quality interpolated frames at the decoder device, that appropriately models moving objects while decreasing the amount of bandwidth potentially needed to transmit the information for performing the interpolation, and that also decreases the volume of calculation potentially needed to create these frames, so as to make it well suited to multimedia mobile devices that depend on low-power processing.