Field of the Invention
The present invention relates in general to the field of information processing, and more specifically to a system and method for efficiently using video processing resources by adjusting utilization of prediction error computational resources in video encoders.
Description of the Related Art
The number of electronic devices incorporating multimedia technology continues to proliferate. Multimedia technology enables a device to use and present multimedia data. Multimedia data can be used and presented in a variety of forms including full-motion video, animation, graphics, audio, and text. Multimedia data can be transmitted to a variety of devices for use. Such devices include personal computers, server computers, notebook computers, mobile phones, personal digital assistants, video playback devices, and any other device capable of rendering multimedia data.
One of the challenges with digital multimedia relates to the size of multimedia data and particularly video data. Uncompressed digital video data in particular requires a significant number of bits to represent each video frame. Each video frame can be divided into a number of pixels. Pixels represent the smallest unit of video that can be manipulated. Each pixel is represented by a set of bits. For 24-bit color video (8 bits for each of the three color components), each uncompressed video frame is, thus, represented by 24 bits times the number of pixels in each video frame for a 4:4:4 color format. For example, the relatively small 4:2:0 video frames of quarter common intermediate format (QCIF) have 176×144=25,344 luma pixels and 88×72×2=12,672 chroma pixels, which requires (25,344+12,672)×8 bits per frame for uncompressed video. To avoid human perceptible flicker and to maintain smooth motion, video data contains, for example, 30 frames or 60 fields per second of displayed video and, thus, (25,344+12,672)×8×30 uncompressed bits/second for QCIF video. It is currently impractical to transmit and process such a huge amount of data in a reasonable amount of time.
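The QCIF arithmetic above can be checked with a short calculation. The following Python sketch is illustrative only; the function name and its parameters are assumptions introduced here and do not come from any standard.

```python
def uncompressed_bits_per_second(width, height, bits_per_sample=8,
                                 frames_per_second=30):
    """Raw bit rate of 4:2:0 video.

    4:2:0 video carries one full-resolution luma plane plus two chroma
    planes, each subsampled by two horizontally and vertically.
    """
    luma_samples = width * height
    chroma_samples = 2 * (width // 2) * (height // 2)
    bits_per_frame = (luma_samples + chroma_samples) * bits_per_sample
    return bits_per_frame * frames_per_second


# QCIF: 176x144 luma samples, 30 frames per second.
qcif_rate = uncompressed_bits_per_second(176, 144)
print(qcif_rate)  # 9123840 bits/second, roughly 9.1 Mbit/s
```

Even this relatively small frame size therefore produces roughly 9.1 megabits of raw data every second.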
Digital video compression has become a critical enabling technology for multimedia storage and transmission. Because of the huge data rate of raw digital video data, compression techniques compress a video signal before it can be transmitted or stored. Differential pulse code modulation (“DPCM”) and the discrete cosine transform (“DCT”) are examples of coding technologies proven effective for successful video compression. DPCM and DCT have become the standard coding technology for current international digital video coding and decoding compression standards such as the International Telecommunication Union (“ITU”) H.264 standard, also known as the International Organization for Standardization/International Electrotechnical Commission (“ISO/IEC”) MPEG-4 Part 10 Advanced Video Coding (“AVC”) standard, and other previous H.26x and MPEG-x standards.
H.264 or MPEG-4 Part 10 AVC utilizes block-based motion compensation and motion estimation to encode video data. Video data includes a video sequence of progressive or interlaced video frames, which may be mixed together in the same sequence. Each video frame essentially represents a ‘picture’. When an object has only translational motion from one frame to the next, successive frames are very similar to preceding frames and, thus, exhibit a strong temporal correlation. The similarity between successive frames is referred to as “temporal redundancy.” To achieve compression, the temporal redundancy between adjacent frames can be exploited.
One embodiment of video compression involves inter-coding and intra-coding of video frames. At least one of the frames is “intracoded”, i.e. coded without using information other than information in the frame itself. Samples in the ‘intra’ frame are predicted using spatially neighboring samples of previously coded blocks. In addition to providing reference frames, intraframe coding can be used when frame-to-frame motion other than translational motion, such as camera pan, zoom, changes in luminance, or rotational motion, is present. The remaining frames in the sequence are “intercoded”.
Intercoding utilizes motion estimation and motion compensation. Motion estimation technology selects one or more previous or future frames as a reference(s). (Note: the description below is written in the context of one previous frame as the reference frame; however, multiple reference frames can be used.) Motion compensation predicts frames from one or more reference frames and codes the prediction. Motion estimation is the process of choosing a reference frame and determining the spatial displacement of an object between the reference frame and a current frame.
One technique of estimating motion uses a block-matching algorithm (“BMA”). Motion estimation examines the movement of objects in a video frame sequence to determine motion vectors representing the estimated motion of the objects. BMAs estimate motion on the basis of rectangular blocks and produce one motion vector for each block. As depicted in FIG. 1, block-matching techniques divide an M-by-N video frame 102 into M×N blocks of pixels 104 [1,1] through 104 [M,N], where “M” and “N” respectively represent the number of block rows and block columns in video frame 102. In at least one embodiment, some blocks overlap each other. The H.264/MPEG-4 Part 10 AVC standard supports dividing the sets of image data into m×n blocks that consist of data generally representing at least the luminance of pixels in each block. The block sizes generally range from 16×16 pixels to 4×4 pixels. A 16×16 set is generally referred to as a “macroblock” 106, and smaller sets are partitions of a macroblock. For example, a 16×16 macroblock can be partitioned into 16×8 (108), 8×16 (110), and 8×8 (112) blocks, and each 8×8 partition can be sub-partitioned into 8×8 (114), 4×8 (116), 8×4 (118), and 4×4 (120) blocks of pixel data. Smaller blocks can provide enhanced prediction accuracy for certain video content. Each frame can be divided into combinations of macroblocks, partitioned macroblocks, and sub-macroblock partitions to enhance prediction accuracy while controlling bit rates needed to code each frame. In at least one embodiment, the blocks can be of varying sizes.
FIG. 2 depicts frames and object movement used in a block-matching technique for intercoding video data. For each block in a current frame 202, motion estimation searches a predetermined block-matching search area 204 of the reference frame 206 for a block that best matches (“the best matching block”) the block in the current frame 202. Motion estimation uses an error measure to identify the best matching block. The search is generally confined to a subset of the macroblocks in the reference frame that represent the anticipated motion range of an object between the reference frame and the current frame. Motion estimation uses a motion vector 208 to represent the translation between the current block and the best matching block. The error measure used to identify the best matching block becomes a prediction error that represents the difference between the current block and the best matching block. The motion vector and the prediction error can be efficiently coded at a far lower bit rate than individually coding each successive video frame in a sequence of video frames. Thus, interframe redundancy is removed and data compression is achieved. A decoder reconstructs each frame from the motion vector, prediction error, and reference frames.
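The block-matching search described above can be sketched as an exhaustive (“full”) search over a square search area. The Python sketch below is a minimal illustration, not the method of any particular encoder: the function names, the use of a sum of absolute differences as the error measure, the square block shape, and the tiny 8×8 test frames are all assumptions made for the example.

```python
def block_error(current, reference, cur_row, cur_col, ref_row, ref_col, size):
    """Sum of absolute pixel differences between two size x size blocks."""
    total = 0
    for i in range(size):
        for j in range(size):
            total += abs(current[cur_row + i][cur_col + j]
                         - reference[ref_row + i][ref_col + j])
    return total


def full_search(current, reference, row, col, size, search_range):
    """Return ((dy, dx), error) for the best matching reference block."""
    rows, cols = len(reference), len(reference[0])
    best_vector, best_error = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            r, c = row + dy, col + dx
            # Skip candidates that fall outside the reference frame.
            if r < 0 or c < 0 or r + size > rows or c + size > cols:
                continue
            error = block_error(current, reference, row, col, r, c, size)
            if best_error is None or error < best_error:
                best_vector, best_error = (dy, dx), error
    return best_vector, best_error


def make_frame(obj_row, obj_col, size=8):
    """Black frame containing one bright 2x2 object (hypothetical test data)."""
    frame = [[0] * size for _ in range(size)]
    for i in range(2):
        for j in range(2):
            frame[obj_row + i][obj_col + j] = 200
    return frame


reference = make_frame(2, 2)   # object at rows 2-3, columns 2-3
current = make_frame(3, 4)     # object moved down 1 row, right 2 columns
vector, error = full_search(current, reference, row=2, col=2,
                            size=4, search_range=2)
print(vector, error)  # (-1, -2) 0
```

The returned vector points from the current block to its best match in the reference frame, so an object that moved down and to the right yields a vector pointing up and to the left. An error of zero means the prediction is exact and only the motion vector needs to be coded for this block.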
Researchers have investigated several error measures to determine the best matching block and to describe the prediction error. The mean absolute difference (“MAD”) is generally considered to be the most favored. The MAD is determined from the sum of absolute difference (“SAD”) divided by the m×n pixels in each block. The SAD is represented in Equation [1], and the MAD is represented in Equation [2]:
$$\mathrm{SAD} = \sum_{i=1}^{m} \sum_{j=1}^{n} \left| \mathrm{residual}_{i,j} \right| = \sum_{i=1}^{m} \sum_{j=1}^{n} \left| \mathrm{current\_block}_{i,j} - \mathrm{reference\_block}_{i,j} \right| \qquad [1]$$
and
$$\mathrm{MAD} = \frac{\mathrm{SAD}}{m \times n} \qquad [2]$$
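Equations [1] and [2] translate directly into code. The following Python sketch assumes blocks are given as lists of rows of integer pixel values; the function names and the 2×2 sample blocks are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def sad(current_block, reference_block):
    """Equation [1]: sum of absolute differences over an m x n block."""
    return sum(abs(cur - ref)
               for cur_row, ref_row in zip(current_block, reference_block)
               for cur, ref in zip(cur_row, ref_row))


def mad(current_block, reference_block):
    """Equation [2]: SAD divided by the m x n pixels in the block."""
    m = len(current_block)
    n = len(current_block[0])
    return sad(current_block, reference_block) / (m * n)


current_block = [[10, 12],
                 [14, 16]]
reference_block = [[9, 12],
                   [17, 16]]

print(sad(current_block, reference_block))  # 4  (1 + 0 + 3 + 0)
print(mad(current_block, reference_block))  # 1.0
```

Because m×n is constant for a given block size, ranking candidate blocks by SAD gives the same ordering as ranking them by MAD, which is why the division can often be skipped during a search.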
The better the prediction between the current block and the best matching block, i.e. the smaller the MAD of the best matching block, the smaller the prediction error will be, and, thus, the bit rate for each frame can be smaller. Usually, a rate-distortion measurement is used in motion estimation to balance the MAD and the cost of encoding motion vectors.
Sub-integer pixel motion compensation using interpolation can provide significantly better compression performance and visual quality than integer-pixel motion compensation. The H.264/MPEG-4 AVC has one-half and one-quarter pixel resolution for inter-coded macroblocks, i.e. the accuracy of motion compensation is a half or quarter pixel distance. If the horizontal and vertical components of the motion vector are integers, the current object actually exists in the reference frame in integer pixel position. If one or both components of the motion vector are sub-integers, then the current object exists in an interpolated position between adjacent samples in the reference frame. However, sub-integer pixel motion compensation comes at an increased expense of design complexity and computation time.
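To make the idea of an “interpolated position between adjacent samples” concrete, the sketch below interpolates along a single row of samples. It uses the six-tap filter with taps (1, −5, 20, 20, −5, 1) that H.264 specifies for half-sample luma positions and a rounded average for a quarter-sample position. Restricting the example to one dimension, the function names, and the sample row are simplifying assumptions; a real encoder interpolates in two dimensions and handles frame borders.

```python
HALF_SAMPLE_TAPS = (1, -5, 20, 20, -5, 1)


def half_sample(samples, k):
    """Value halfway between samples[k] and samples[k + 1] (8-bit video)."""
    if k < 2 or k + 4 > len(samples):
        raise ValueError("need two samples to the left and three to the right")
    window = samples[k - 2:k + 4]
    acc = sum(tap * s for tap, s in zip(HALF_SAMPLE_TAPS, window))
    # The taps sum to 32, so add 16 for rounding and shift right by 5.
    return min(255, max(0, (acc + 16) >> 5))


def quarter_sample(samples, k):
    """Value one quarter of the way from samples[k] towards samples[k + 1]."""
    return (samples[k] + half_sample(samples, k) + 1) >> 1


row = [10, 10, 10, 20, 20, 20]   # a step edge between index 2 and index 3
print(half_sample(row, 2))       # 15
print(quarter_sample(row, 2))    # 13
```

Each half-sample value costs six multiply-accumulate operations before any block matching is performed, which illustrates the added computation time noted above.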
U.S. Pat. No. 5,757,668, entitled “Device, Method and Digital Video Encoder of Complexity Scalable Block-Matching Motion Estimation Utilizing Adaptive Threshold Termination”, inventor Qin Fan Zhu, filed May 24, 1995, and issued May 26, 1998 (“Zhu Patent”) observes that the ultimate goal in practical video compression is not to minimize the prediction error (also referred to as a “matching error”) but to optimize the coded video quality under constraints of a given channel bandwidth and processing power. The Zhu Patent discusses terminating the search for a best matching block once the matching error is less than a predetermined threshold. The reasoning behind the Zhu Patent is that under certain circumstances, finding the best matching block neither improves the coded picture quality nor reduces the bitrate. The Zhu Patent identifies a threshold value T based upon a linear function of a quantization stepsize QP (also commonly referred to as a “quantization parameter”) and two coefficients “a” and “b” as depicted in Equation [3]:
T=a*QP+b  [3]
The coefficients “a” and “b” are monotonically non-increasing functions of a processing quota.
QP is a parameter typically used by a video encoder to regulate how much detail is preserved. Video encoders transform prediction errors into a frequency domain by a transform that approximates the DCT. QP determines the step size for associating the transformed coefficients with a finite set of steps. Large values of QP represent big steps that crudely approximate the transform, so that most of the signal can be captured by only a few coefficients. Small values of QP more accurately approximate the block's frequency spectrum, but at the cost of more bits. Thus, when QP is very small, almost all of the detail is retained. As QP is increased, some of that detail is aggregated so that the bit rate drops at the price of some increase in distortion and some loss of quality. Because the threshold T is a linear function of QP, the threshold T represents a measure of the quality of video frames.
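The effect of the step size can be illustrated with a plain uniform quantizer. This Python sketch is a simplification: it treats the step size directly as a number, whereas a standard such as H.264 maps QP to a step size through its own tables, and the coefficient values are hypothetical.

```python
def quantize(coefficients, step):
    """Map each transform coefficient to the index of its nearest step."""
    return [round(c / step) for c in coefficients]


def dequantize(levels, step):
    """Reconstruct approximate coefficients from the step indices."""
    return [level * step for level in levels]


# Hypothetical transform coefficients of one prediction-error block.
coefficients = [52.0, -13.5, 4.2, -1.1, 0.4]

fine = quantize(coefficients, step=2)
coarse = quantize(coefficients, step=16)

print(fine)                    # [26, -7, 2, -1, 0]
print(coarse)                  # [3, -1, 0, 0, 0]
print(dequantize(coarse, 16))  # [48, -16, 0, 0, 0]
```

With the small step, four of the five coefficients survive as nonzero values; with the large step only two survive, so fewer bits are needed, but the reconstructed values deviate further from the originals.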
The Zhu Patent compares the prediction error with the threshold T. If the prediction error is less than the threshold T, the search for the best matching block terminates because continuing the search will result in a nominal improvement of video quality at best. If the prediction error is greater than or equal to the threshold T, the motion estimation process continues searching for the best matching block. Thus, the Zhu Patent describes how to reduce searches for the best matching block.
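The threshold comparison described above can be sketched as a search loop that stops as soon as a candidate's error falls below T = a*QP + b. This is only an illustration of the general idea, not the Zhu Patent's implementation: the function names, the coefficient values for “a” and “b”, and the list of candidate errors are all hypothetical.

```python
def threshold(qp, a, b):
    """Equation [3]: T = a*QP + b."""
    return a * qp + b


def search_with_early_termination(candidate_errors, qp, a, b):
    """Scan candidate errors in order; stop once one is below T.

    candidate_errors may be any iterable, including a generator that
    computes each matching error lazily, so candidates that are never
    reached cost no computation.
    Returns (best_index, best_error, candidates_evaluated).
    """
    t = threshold(qp, a, b)
    best_index, best_error = None, None
    evaluated = 0
    for index, error in enumerate(candidate_errors):
        evaluated += 1
        if best_error is None or error < best_error:
            best_index, best_error = index, error
        if error < t:
            break  # good enough: further searching is unlikely to help
    return best_index, best_error, evaluated


# Hypothetical matching errors for six candidate blocks, in search order.
errors = [900, 640, 310, 120, 95, 40]

print(search_with_early_termination(errors, qp=28, a=4, b=20))  # (3, 120, 4)
print(search_with_early_termination(errors, qp=10, a=4, b=20))  # (5, 40, 6)
```

With QP=28 the threshold is 132, so the search stops after four candidates; with QP=10 the threshold drops to 60 and all six candidates are examined. A coarser quantizer therefore tolerates a larger matching error and ends the search sooner.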
However, prediction error reduction computations by the encoders and decoders of video processing systems continue to be very numerous and require a significant amount of power. Devices with limited computational resources and power reserves are still often strained by the prediction error reduction computational requirements of conventional video data processing.