With the advent of digital video products and services, such as Digital Satellite Service (DSS) and storage and retrieval of video streams on the Internet and, in particular, the World Wide Web, digital video signals are becoming ever present and drawing more attention in the marketplace. Because of limitations in digital signal storage capacity and in network and broadcast bandwidth, compression of digital video signals has become paramount to digital video storage and transmission. As a result, many standards for compression and encoding of digital video signals have been promulgated. For example, the International Telecommunication Union (ITU) has promulgated the H.261 and H.263 standards for digital video encoding. Additionally, the International Organization for Standardization (ISO) has promulgated the Moving Picture Experts Group (MPEG) standards MPEG-1 and MPEG-2 for digital video encoding.
These standards specify with particularity the form of encoded digital video signals and how such signals are to be decoded for presentation to a viewer. However, significant discretion is left as to how the digital video signals are to be transformed from a native, uncompressed format to the specified encoded format. As a result, many different digital video signal encoders currently exist and many approaches are used to encode digital video signals with varying degrees of compression achieved.
In general, greater degrees of compression are achieved at the expense of video image signal loss and higher quality motion video signals are achieved at the expense of lesser degrees of compression and thus at the expense of greater bandwidth requirements. It is particularly difficult to balance image quality with available bandwidth when delivery bandwidth is limited. Such is the case in real-time motion video signal delivery such as video telephone applications and motion video on demand delivery systems. It is generally desirable to maximize the quality of the motion video signal as encoded without exceeding the available bandwidth of the transmission medium carrying the encoded motion video signal. If the available bandwidth is exceeded, some or all of the sequence of video images are lost and, therefore, so is the integrity of the motion video signal. If an encoded motion video signal errs on the side of conserving transmission medium bandwidth, the quality of the motion video image can be compromised significantly.
The format of H.263 encoded digital video signals is known and is described more completely in "ITU-T H.263: Line Transmission of Non-Telephone Signals, Video Coding for Low Bitrate Communication" (hereinafter "ITU-T Recommendation H.263"). Briefly, a digital motion video image, which is sometimes called a video stream, is organized hierarchically into groups of pictures, each of which includes one or more frames. Each frame represents a single image of a sequence of images of the video stream and includes a number of macroblocks which define respective portions of the video image of the frame. An I-frame is encoded independently of all other frames and therefore represents an image of the sequence of images of the video stream without reference to other frames. P-frames are motion-compensated frames and are therefore encoded in a manner which is dependent upon other frames. Specifically, a P-frame is a predictively motion-compensated frame and depends only upon one I-frame or, alternatively, another P-frame which precedes the P-frame in the sequence of frames of the video image. The H.263 standard also describes BP-frames; however, for the purposes of description herein, a BP-frame is treated as a P-frame.
All frames are compressed by reducing redundancy of image data within a single frame. Motion-compensated frames are further compressed by reducing redundancy of image data within a sequence of frames. Since a motion video signal includes a sequence of images which differ from one another only incrementally, significant compression can be realized by encoding a number of frames as motion-compensated frames, i.e., as P-frames. However, errors from noise introduced into the motion video signal or artifacts from encoding of the motion video signal can be perpetuated from one P-frame to the next and therefore persist as a rather annoying artifact of the rendered motion video image. It is therefore desirable to periodically send an I-frame to eliminate any such errors or artifacts. Conversely, I-frames require many times more bandwidth, e.g., on the order of ten times more bandwidth, than P-frames, so encoding I-frames too frequently consumes more bandwidth than necessary. Accordingly, determining when to include an I-frame, rather than a P-frame, in an encoded video stream is an important consideration when maximizing video image quality without exceeding available bandwidth.
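The bandwidth cost of I-frame frequency can be illustrated with a short calculation. The following is a minimal sketch only; the function name, the fixed P-frame size, and the ten-to-one size ratio are assumptions chosen to match the "order of ten times" figure above and are not taken from any particular encoder.

```python
def average_frame_bits(p_frame_bits: float, i_to_p_ratio: float,
                       i_frame_interval: int) -> float:
    """Average bits per frame when one of every `i_frame_interval`
    frames is an I-frame and the rest are P-frames.

    Assumes every P-frame has the same size and every I-frame is
    `i_to_p_ratio` times larger (an illustrative simplification).
    """
    if i_frame_interval < 1:
        raise ValueError("i_frame_interval must be at least 1")
    i_frame_bits = p_frame_bits * i_to_p_ratio
    total_bits = i_frame_bits + (i_frame_interval - 1) * p_frame_bits
    return total_bits / i_frame_interval


# Hypothetical stream: 1000-bit P-frames, I-frames ten times larger.
for interval in (1, 10, 100):
    print(interval, average_frame_bits(1000, 10, interval))
# 1   -> 10000.0  (every frame is an I-frame)
# 10  -> 1900.0
# 100 -> 1090.0
```

Under these assumptions, sending an I-frame every ten frames nearly doubles the average frame size relative to P-frames alone, while sending one every hundred frames adds only about nine percent.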
Another important consideration when maximizing video image quality within limited signal bandwidth is a quantization parameter Q. In encoding a video signal according to a compression standard such as H.263, a quantization parameter Q is selected as a representation of the compromise between image detail and the degree of compression achieved. In general, a greater degree of compression is achieved by sacrificing image detail, and image detail is enhanced by sacrificing the degree of achievable compression of the video signal.
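The compromise represented by the quantization parameter Q can be sketched with a simplified uniform quantizer. The step size of 2*Q, the truncation toward zero, and the sample coefficient values below are assumptions made for illustration; the exact quantization and reconstruction rules of H.263 are more detailed and are not reproduced here.

```python
def quantize(coefficients, q):
    """Map transform coefficients to integer levels using step 2*q.

    Simplified: truncates toward zero; not the exact H.263 rule.
    """
    step = 2 * q
    return [int(c / step) for c in coefficients]


def dequantize(levels, q):
    """Reconstruct approximate coefficients from quantized levels."""
    step = 2 * q
    return [level * step for level in levels]


def nonzero_count(levels):
    """Fewer nonzero levels generally means fewer bits to encode."""
    return sum(1 for level in levels if level != 0)


def total_abs_error(original, reconstructed):
    """Sum of absolute reconstruction errors (lost image detail)."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed))


coefficients = [100, -37, 12, 5, -3, 1]  # hypothetical values

for q in (2, 10):
    levels = quantize(coefficients, q)
    error = total_abs_error(coefficients, dequantize(levels, q))
    print(q, levels, nonzero_count(levels), error)
# q=2  -> [25, -9, 3, 1, 0, 0], 4 nonzero levels, error 6
# q=10 -> [5, -1, 0, 0, 0, 0],  2 nonzero levels, error 38
```

In this sketch the larger Q leaves fewer nonzero levels to encode, which corresponds to greater compression, at the cost of a larger reconstruction error, which corresponds to lost image detail.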
However, a particular quantization parameter Q which is appropriate for one motion video signal can be entirely inappropriate for a different motion video signal. For example, motion video signals representing a video image which changes only slightly over time, such as a news broadcast (generally referred to as "talking heads"), can be represented by relatively small P-frames since successive frames differ relatively little. As a result, each frame can include greater detail at the expense of less compression of each frame. Conversely, motion video signals representing a video image which changes significantly over time, such as fast motion sporting events, require larger P-frames since successive frames differ considerably. Accordingly, each frame requires greater compression at the expense of image detail.
Determining an optimum quantization parameter Q for a particular motion video signal can be particularly difficult. Such is especially true for motion video signals which include both periods of little motion and periods of significant motion. For example, a motion video signal representing a football game includes periods where both teams are stationary awaiting the snap of the football from the center to the quarterback and periods of sudden extreme motion. Selecting a quantization parameter Q which is too high results in sufficient compression that frames are not lost during high motion periods but also in unnecessarily poor image quality during periods where players are stationary or moving slowly between plays. Conversely, selecting a quantization parameter Q which is too low results in better image quality during periods of low motion but likely results in loss of frames due to exceeded available bandwidth during high motion periods.
A third factor in selecting a balance between motion video image quality and conserving available bandwidth is the frame rate of the motion video signal. A higher frame rate, i.e., more frames per second, provides an appearance of smoother motion and a higher quality video image. At the same time, sending more frames in a given period of time consumes more of the available bandwidth. Conversely, a lower frame rate, i.e., fewer frames per second, consumes less of the available bandwidth but provides a motion video signal which is more difficult for the viewer to perceive as motion between frames and, below some threshold, the motion video image is perceived as a "slide show," i.e., a sequence of discrete, still, photographic images. However, intermittent loss of frames resulting from exceeding the available bandwidth as a result of using an excessively high frame rate provides a "jerky" motion video image which is more annoying to viewers than a regular, albeit low, frame rate.
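The relationship between frame rate and available bandwidth reduces to simple arithmetic. The following sketch uses hypothetical function names and a hypothetical 28,800 bit-per-second channel; it ignores protocol overhead and any buffering, which a practical system would need to consider.

```python
def frame_budget_bits(bandwidth_bps: float, frames_per_second: float) -> float:
    """Average number of bits available to each frame at a given frame rate."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return bandwidth_bps / frames_per_second


def max_frame_rate(bandwidth_bps: float, average_frame_bits: float) -> float:
    """Highest sustainable frame rate for frames of a given average size."""
    if average_frame_bits <= 0:
        raise ValueError("average_frame_bits must be positive")
    return bandwidth_bps / average_frame_bits


print(frame_budget_bits(28800, 10))  # 2880.0 bits per frame
print(frame_budget_bits(28800, 15))  # 1920.0 bits per frame
print(max_frame_rate(28800, 2400))   # 12.0 frames per second
```

Raising the frame rate from 10 to 15 frames per second in this example cuts the budget for each frame by one third, so each frame must be compressed further if no frames are to be lost.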
While use of P-frames which avoid redundant information between successive frames reduces the amount of data needed to represent a particular frame of a motion video image, further reductions are achieved by a mechanism known as conditional replenishment. In conditional replenishment, a portion of a frame, typically a macroblock, is compared to the corresponding macroblock of the previous frame. If the differences between the two macroblocks are below a predetermined threshold, the current macroblock is not encoded at all and a flag is set to so indicate. The H.263 standard provides for such a flag in the standard format prescribed by the H.263 standard. In addition to reducing the required bandwidth of the encoded motion video signal, conditional replenishment reduces the amount of processing required to encode a motion video signal. In general, the most computationally expensive component of encoding a motion video signal is the estimation of and compensation for motion which is necessary in encoding motion-compensated frames such as P-frames. By quickly recognizing macroblocks which do not change significantly from frame to frame and therefore avoiding motion estimation and compensation for such macroblocks, the processing resources required to encode a motion video signal are significantly reduced.
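The comparison performed in conditional replenishment can be sketched as follows. This is an illustrative sketch only: the use of a sum of absolute differences as the measure of difference, the 16-by-16 macroblock size, the representation of a frame as a list of rows of luminance values, and all function names are assumptions and are not drawn from the H.263 standard or from any particular encoder.

```python
MACROBLOCK_SIZE = 16  # assumed macroblock dimension in pixels


def macroblock_difference(current, previous, top, left, size=MACROBLOCK_SIZE):
    """Sum of absolute pixel differences between co-located macroblocks."""
    return sum(
        abs(current[y][x] - previous[y][x])
        for y in range(top, top + size)
        for x in range(left, left + size)
    )


def skip_flags(current, previous, threshold, size=MACROBLOCK_SIZE):
    """Return a grid of flags, one per macroblock.

    True means the macroblock differs from the previous frame by less
    than `threshold` and would not be encoded. Frame dimensions are
    assumed to be exact multiples of `size`.
    """
    height, width = len(current), len(current[0])
    return [
        [
            macroblock_difference(current, previous, top, left, size) < threshold
            for left in range(0, width, size)
        ]
        for top in range(0, height, size)
    ]


# Hypothetical 32x32 frames: one large change and one very small change.
previous = [[0] * 32 for _ in range(32)]
current = [row[:] for row in previous]
current[5][20] = 200  # large change in the top-right macroblock
current[20][3] = 3    # tiny change in the bottom-left macroblock

print(skip_flags(current, previous, threshold=50))
# [[True, False], [True, True]]
```

Only the top-right macroblock exceeds the threshold in this example; the other three would be skipped, and no motion estimation would be performed for them.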
However, conditional replenishment generally introduces particularly annoying artifacts into a decoded motion video signal. The most annoying of such artifacts is best described by way of example. Consider a motion video signal which includes an interview of a person who is shown sitting and moving only slightly and predominantly with head movements (commonly referred to as a "talking head" image). The interviewee's head may occasionally be moved such that just the tips of her hair lie across a boundary between macroblocks. Although macroblocks in which the majority of her head is represented are replaced since corresponding macroblocks differ significantly, macroblocks in which only a very small bit of hair is represented can be sufficiently similar to the corresponding macroblock of the previous frame that the macroblock is not replaced by subsequently encoded macroblocks by operation of conditional replenishment. The proliferation of detached hair tips along rectangular macroblock boundaries is a particularly noticeable and annoying artifact.
A similar annoying effect occurs when an object spans two or more macroblocks and the object moves slightly such that part of the object in one macroblock is sufficiently different to cause a new macroblock to be coded in the subsequent frame while that part of the object in an adjacent macroblock changes so slightly that the macroblock is not replaced in the subsequent frame. Consider again the talking head image. Suppose that a macroblock border cuts horizontally across the bridge of the nose of the interviewee. Suppose further that her head moves slightly to one side. Her eyes have a stronger contrast than her nose and mouth; therefore, the macroblock which includes her eyes can change sufficiently to cause a new macroblock containing her eyes to be encoded in a subsequent frame while the macroblock including her nose and mouth changes insufficiently, due in part to the lower contrast of the nose and mouth, and is not encoded in the subsequent frame. The result is the perception by the viewer that her head is being twisted or stretched. This phenomenon is commonly known as "object shear" and is particularly annoying.
Conventional conditional replenishment mechanisms attempt to avoid such artifacts by setting a macroblock distortion threshold, i.e., the degree of similarity between macroblocks required to justify declining to encode and transmit the macroblock, so low that macroblocks which include changes as small as hair tips are detected as changed macroblocks and are therefore encoded and transmitted. However, setting a macroblock distortion threshold so low causes macroblocks which differ from corresponding macroblocks of a previous frame only by ordinary noise present in the macroblocks to be encoded and transmitted as well. As a result, any benefit of conditional replenishment in terms of less bandwidth used and fewer processing resources required is generally not realized when such a macroblock distortion threshold is excessively low. Of course, the danger in leaving such a macroblock distortion threshold too high is that the annoying artifacts described above proliferate.
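The difficulty with a single macroblock distortion threshold can be illustrated numerically. In the following sketch, the pixel values, the noise amplitude, the size of the "hair tip," and the use of a sum of absolute differences as the distortion measure are all hypothetical values chosen for illustration.

```python
SIZE = 16  # assumed macroblock dimension in pixels


def distortion(block_a, block_b):
    """Sum of absolute pixel differences between two macroblocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def is_encoded(distortion_value, threshold):
    """A macroblock is encoded only if its distortion reaches the threshold."""
    return distortion_value >= threshold


reference = [[128] * SIZE for _ in range(SIZE)]

# Ordinary noise: every pixel deviates by 2 from the reference.
noisy = [
    [128 + (2 if (x + y) % 2 == 0 else -2) for x in range(SIZE)]
    for y in range(SIZE)
]

# A "hair tip": four dark pixels intrude into an otherwise unchanged block.
hair_tip = [row[:] for row in reference]
for x in range(4):
    hair_tip[0][x] = 68

noise_distortion = distortion(noisy, reference)        # 512
hair_tip_distortion = distortion(hair_tip, reference)  # 240

for threshold in (200, 300, 600):
    print(threshold,
          is_encoded(noise_distortion, threshold),
          is_encoded(hair_tip_distortion, threshold))
# 200 -> True  True   (hair tip caught, but noise is encoded as well)
# 300 -> True  False  (noise is encoded while the hair tip is missed)
# 600 -> False False  (noise is skipped, and so is the hair tip)
```

Because low-level noise spread over an entire macroblock can accumulate more distortion than a small but visually significant change, no single threshold in this example both suppresses the noise and detects the hair tip.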
I-frame placement and quantization parameter Q selection combine to represent a compromise between motion video image quality and conservation of available bandwidth. However, to date, conventional motion video encoders have failed to provide satisfactory motion video image quality within the available bandwidth. In addition, conventional conditional replenishment mechanisms conserve valuable bandwidth but with an unacceptable degradation in motion video signal quality.