The present invention relates to digital video signal compression and, in particular, to a particularly efficient signal encoding mechanism for encoding digital video signals according to digital video standards such as the ITU standard H.263.
With the advent of digital video products and services, such as Digital Satellite Service (DSS) and storage and retrieval of video streams on the Internet and, in particular, the World Wide Web, digital video signals are becoming ever present and drawing more attention in the marketplace. Because of limitations in (i) digital signal storage capacity, (ii) network and broadcast bandwidth, and (iii) client computer system processing bandwidth, compression of digital video signals has become paramount to digital video storage and transmission. As a result, many standards for compression and encoding of digital video signals have been promulgated. For example, the International Telecommunication Union (ITU) has promulgated the H.261 and H.263 standards for digital video encoding. Additionally, the International Standards Organization (ISO) has promulgated the Motion Picture Experts Group (MPEG), MPEG-1, and MPEG-2 standards for digital video encoding.
These standards specify with particularity the form of encoded digital video signals and how such signals are to be decoded for presentation to a viewer. However, significant discretion is left as to how the digital video signals are to be transformed from a native, uncompressed format to the specified encoded format. As a result, many different digital video signal encoders currently exist and many approaches are used to encode digital video signals with varying degrees of compression achieved.
In general, significant degrees of compression of motion video signals achieved by today's motion video signal encoders force a compromise between encoded signal quality and bandwidth consumed by the encoded video signal. Specifically, encoding a motion video signal such that more of the original content and quality of the motion video signal is preserved consumes additional bandwidth. Conversely, encoding a motion video signal so as to minimize consumed bandwidth generally degrades the quality of the encoded motion video signal.
In current systems for delivering motion video signals to a client computer system for display for a user, there are primarily two types of bandwidth which are limited. The first is delivery bandwidth, i.e., the bandwidth of the transmission medium through which the encoded motion video signal is delivered to the client computer system. The second is client processing bandwidth, i.e., the amount of processing capacity which is available to decode the encoded motion video signal within the client computer system. As greater and greater degrees of compression are realized by today's motion video signal encoders, greater and greater processing capacity is required by client computer systems which decode these encoded motion video signals. In general, such client computer systems must decode and display as many as thirty frames per second. If the available processing bandwidth is exceeded, some or all of the sequence of video images is lost and, therefore, so is the integrity of the motion video signal. If an encoded motion video signal errs on the side of conserving processing bandwidth, the quality of the motion video image can be compromised significantly.
The format of H.263 encoded digital video signals is known and is described more completely in "ITU-T H.263: Line Transmission of Non-Telephone Signals, Video Coding for Low Bitrate Communication" (hereinafter "ITU-T Recommendation H.263"). Briefly, a digital motion video image, which is sometimes called a video stream, is organized hierarchically into groups of pictures, each of which includes one or more frames. Each frame represents a single image of a sequence of images of the video stream and includes a number of macroblocks which define respective portions of the video image of the frame. An I-frame is encoded independently of all other frames and therefore completely represents an image of the sequence of images of the video stream. P-frames are motion-compensated frames and are therefore encoded in a manner which is dependent upon other frames. Specifically, a P-frame is a predictively motion-compensated frame and depends only upon one I-frame or, alternatively, another P-frame which precedes the P-frame in the sequence of frames of the video image. The H.263 standard also describes PB-frames; however, for the purposes of description herein, a PB-frame is treated as a P-frame.
All frames are compressed by reducing redundancy of image data within a single frame. Motion-compensated frames are further compressed by reducing redundancy of image data within a sequence of frames. Since a motion video signal includes a sequence of images which differ from one another only incrementally, significant compression can be realized by encoding a number of frames as motion-compensated frames, i.e., as P-frames. In addition, reconstructing motion-compensated frames represents a significant portion of the processing required to decode an encoded motion video signal.
In motion estimation, each macroblock of a frame is compared to a number of different equivalent-sized portions of a previous frame. In an exhaustive motion estimation search, a macroblock, which represents a 16-pixel by 16-pixel block of a frame, is compared to every possible 16-pixel by 16-pixel block of a previous frame, even blocks which are not aligned on macroblock boundaries. In pursuit of even better motion estimation, some systems interpolate half-pixels between pixels of the previous frame and compare each macroblock to each 16-pixel by 16-pixel block of half-pixels of the previous frame. However, exhaustive searching of every possible 16-pixel by 16-pixel block of pixels and every possible 16-pixel by 16-pixel block of half-pixels is, in terms of computation and processing resources, prohibitively expensive.
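The cost of an exhaustive whole-pixel search can be illustrated with a minimal sketch. It assumes frames are stored as lists of rows of integer luminance values and measures the mismatch as a sum of absolute differences; both the measure and the names `sad` and `exhaustive_search` are illustrative assumptions rather than part of the described encoder.

```python
def sad(cur, ref, top, left, size):
    """Sum of absolute differences between the size x size block `cur`
    and the block of `ref` whose upper-left pixel is at (top, left)."""
    return sum(
        abs(cur[r][c] - ref[top + r][left + c])
        for r in range(size)
        for c in range(size)
    )


def exhaustive_search(cur, ref, size):
    """Compare `cur` with every size x size block of `ref`, including
    blocks that are not aligned on macroblock boundaries.

    Returns (distortion, top, left) for the best match found."""
    best = None
    for top in range(len(ref) - size + 1):
        for left in range(len(ref[0]) - size + 1):
            d = sad(cur, ref, top, left, size)
            if best is None or d < best[0]:
                best = (d, top, left)
    return best
```

For a frame of H rows by W columns and 16-pixel macroblocks, this loop evaluates (H - 15) x (W - 15) candidate positions of 256 pixel comparisons each for every macroblock, which is why practical encoders avoid it.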
Accordingly, conventional motion estimation systems try to derive a motion vector between a macroblock and a 16-pixel by 16-pixel block of a previous frame without exhaustively comparing all possible blocks. One such system is the "three-stage log search plus half-pixel." However, encoding macroblocks to include motion vectors to 16-pixel by 16-pixel blocks of half-pixels requires substantial processing by a decoder of a client computer system which reconstructs the macroblocks from the motion vectors, since such a decoder must re-derive the half-pixels in decoding the macroblocks. Client computer systems in which decoders typically operate are generally smaller, slower, less expensive computer systems and therefore have less processing capacity than computers in which encoders operate. In addition, each frame must be decoded within a particular amount of time to display successive frames quickly enough that the viewer perceives motion in the video image. Additional processing requirements introduced by half-pixel encoding can push the aggregate processing requirements of motion video image decoding beyond the capability of the client computer system.
What is needed is a motion video signal encoding mechanism in which the benefits of half-pixel motion estimation are realized while minimizing the processing burden imposed upon the decoder of the client computer system.
In accordance with the present invention, a motion estimator/compensator determines whether to use half-pixel motion vector encoding by comparing the relative benefit of using half-pixel encoding to the processing cost of the client computer system in decoding the half-pixel encoded macroblock. Specifically, the motion estimator/compensator quantifies the following distortions: (i) a whole pixel distortion between a subject macroblock and a whole pixel pseudo-macroblock, (ii) a half-column pixel distortion between the subject macroblock and a half-column pixel pseudo-macroblock, (iii) a half-row pixel distortion between the subject macroblock and a half-row pixel pseudo-macroblock, and (iv) a half-column/half-row pixel distortion between the subject macroblock and a half-column/half-row pixel pseudo-macroblock. Each distortion represents the difference between the original macroblock and the macroblock encoded as a motion vector.
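The four distortions enumerated above can be sketched as follows. The sketch assumes that distortion is measured as a sum of absolute differences, that half-pixels are formed by averaging neighboring whole pixels with rounding, and that the reference frame extends at least one pixel beyond the candidate block to the right and below. The names `TAPS` and `distortions` are hypothetical.

```python
# Offsets (row, column) of the whole pixels averaged to form one pixel of
# each kind of pseudo-macroblock. Names and layout are illustrative.
TAPS = {
    "whole": [(0, 0)],
    "half_column": [(0, 0), (0, 1)],
    "half_row": [(0, 0), (1, 0)],
    "half_column_half_row": [(0, 0), (0, 1), (1, 0), (1, 1)],
}


def distortions(cur, ref, top, left):
    """Return the distortion between macroblock `cur` and each of the four
    pseudo-macroblocks derived from `ref` at position (top, left)."""
    size = len(cur)
    result = {}
    for mode, taps in TAPS.items():
        n = len(taps)
        total = 0
        for r in range(size):
            for c in range(size):
                s = sum(ref[top + r + dr][left + c + dc] for dr, dc in taps)
                predicted = (s + n // 2) // n  # rounded average of the taps
                total += abs(cur[r][c] - predicted)
        result[mode] = total
    return result
```

A macroblock that lies exactly halfway between two columns of the reference frame yields a half-column distortion of zero and larger values for the other three candidates.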
Each quantified distortion is combined with a respective processing burden associated with the specific type of half-pixel encoding. For example, a whole pixel processing burden represents the computational complexity of deriving a whole pixel pseudo-macroblock, i.e., a single read operation per pixel. A half-column pixel processing burden represents the computational complexity of deriving a half-column pseudo-macroblock, i.e., two read operations of adjacent memory locations, two addition operations, and one shift operation. A half-row processing burden represents the computational complexity of deriving a half-row pseudo-macroblock, i.e., two read operations of relatively distant memory locations, two addition operations, and one shift operation. A half-column/half-row processing burden represents the computational complexity of deriving a half-column/half-row pseudo-macroblock, i.e., four read operations of relatively distant memory locations, four addition operations, and one shift operation.
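The operation counts above can be made concrete with a minimal sketch of how a decoder might derive one pixel of each kind of pseudo-macroblock. The rounding constants (+1 and +2) are an assumption chosen to match the stated counts of additions and shifts; the function names are hypothetical, and `ref` is assumed to be a list of rows of integer pixel values.

```python
def whole_pixel(ref, r, c):
    # One read.
    return ref[r][c]


def half_column_pixel(ref, r, c):
    # Two reads of adjacent locations in the same row,
    # two additions (the sum and the rounding term), one shift.
    return (ref[r][c] + ref[r][c + 1] + 1) >> 1


def half_row_pixel(ref, r, c):
    # Two reads from different rows, which are a full row apart in memory,
    # two additions, one shift.
    return (ref[r][c] + ref[r + 1][c] + 1) >> 1


def half_column_half_row_pixel(ref, r, c):
    # Four reads spanning two rows, four additions, one shift.
    return (ref[r][c] + ref[r][c + 1]
            + ref[r + 1][c] + ref[r + 1][c + 1] + 2) >> 2
```

The half-row and half-column/half-row cases read from two different rows of the frame, which is why their memory accesses are described as relatively distant.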
In one embodiment, the difference between the half-column pixel distortion and the whole pixel distortion represents the improvement in encoded motion video signal quality achieved by using half-column pixel motion estimation in encoding the subject macroblock. The motion estimator/compensator compares this difference to a predetermined half-column threshold which represents an additional processing burden imposed upon a client computer system in decoding the subject macroblock if encoded using half-column pixel motion estimation. Specifically, the predetermined half-column threshold represents the difference between the whole pixel processing burden and the half-column processing burden. If the difference between distortions is less than the predetermined half-column threshold, the benefit of half-column motion estimation does not justify the additional processing burden imposed upon the client computer system and half-column motion estimation is not used in encoding the subject macroblock.
The motion estimator/compensator determines analogous differences for the half-row distortion and the half-column/half-row distortion and compares those distortion differences to predetermined half-row and half-column/half-row thresholds, respectively. The predetermined half-row and half-column/half-row thresholds represent additional processing burdens imposed upon a client computer system in decoding the subject macroblock if encoded using half-row or half-column/half-row motion estimation, respectively. In other words, the predetermined half-row and half-column/half-row thresholds represent (i) the difference between the half-row processing burden and the whole pixel processing burden and (ii) the difference between the half-column/half-row processing burden and the whole pixel processing burden, respectively.
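The threshold comparisons of this embodiment can be sketched as below. The threshold values are hypothetical placeholders, and the improvement is taken to be the whole pixel distortion minus the half-pixel distortion, so that a positive value means the half-pixel candidate matches better. The sketch reports only which types of half-pixel motion estimation pass their tests; it does not assume how the encoder chooses among several that pass.

```python
# Hypothetical thresholds in distortion units. Each stands for the extra
# decoding burden of one type of half-pixel motion estimation relative to
# whole pixel motion estimation.
THRESHOLDS = {
    "half_column": 40,
    "half_row": 60,
    "half_column_half_row": 100,
}


def justified_modes(whole_distortion, half_distortions, thresholds=THRESHOLDS):
    """Return the half-pixel modes whose improvement over whole pixel
    motion estimation is not less than the corresponding threshold."""
    passed = []
    for mode, distortion in half_distortions.items():
        improvement = whole_distortion - distortion
        if improvement >= thresholds[mode]:
            passed.append(mode)
    return passed
```

For example, with a whole pixel distortion of 500 and half-pixel distortions of 450, 470, and 380, the improvements are 50, 30, and 120, so the half-row candidate is rejected while the other two pass.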
In an alternative embodiment, each processing burden is weighted according to the processing capacity of a particular client computer system and the corresponding distortion is added to the weighted processing burden to quantify a distortion/processing burden combination. The particular one of the whole-pixel, half-column, half-row, and half-column/half-row combinations which has the smallest quantified value is selected as the best compromise between distortion and processing burden imposed upon the client computer system.
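A minimal sketch of this alternative embodiment follows. The burden values and the name `select_mode` are hypothetical; the only property taken from the description is that each distortion is added to its weighted processing burden and the smallest combination is selected.

```python
# Hypothetical relative processing burdens per pixel for each mode.
BURDENS = {
    "whole": 1,
    "half_column": 5,
    "half_row": 6,
    "half_column_half_row": 9,
}


def select_mode(distortions, client_weight, burdens=BURDENS):
    """Pick the mode with the smallest distortion plus weighted burden.

    `client_weight` grows as the client's processing capacity shrinks,
    so slower clients are steered toward cheaper modes."""
    costs = {
        mode: distortions[mode] + client_weight * burdens[mode]
        for mode in distortions
    }
    return min(costs, key=costs.get)
```

With distortions of 500, 450, 470, and 380 for the whole-pixel, half-column, half-row, and half-column/half-row candidates, a weight of 10 selects the half-column/half-row candidate, whereas a weight of 20 selects whole pixel encoding because the weighted burden then outweighs the reduction in distortion.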
Thus, each type of half-pixel motion estimation is evaluated individually and used for encoding the subject macroblock only if the benefit of the type of half-pixel motion estimation justifies the corresponding additional processing burden imposed upon a client computer system.
Since decoding macroblocks encoded using half-column motion estimation requires more processing than decoding macroblocks encoded using whole pixel motion estimation, the predetermined half-column threshold is greater than zero. In addition, decoding macroblocks encoded using half-row motion estimation requires more processing than decoding macroblocks encoded using half-column motion estimation, and decoding macroblocks encoded using half-column/half-row motion estimation requires more processing than decoding macroblocks encoded using half-row motion estimation. Accordingly, the predetermined half-column/half-row threshold is greater than the predetermined half-row threshold, and the predetermined half-row threshold is greater than the predetermined half-column threshold. Similarly, the half-column/half-row processing burden is greater than the half-row processing burden which is greater than the half-column processing burden which in turn is greater than the whole pixel processing burden.
Thus, in accordance with the principles of the present invention, encoded motion video signal quality is enhanced while ensuring the processing bandwidth of client computer systems is not exceeded in decoding the encoded motion video signals.