Video and image pre-processing technology is used widely in video processing, compression, broadcasting, storage and broadband networking, imaging, printing and other areas for achieving better quality and higher efficiency. In the video coding and transcoding fields, pre-processing provides the advantages of obvious visual quality enhancement and substantial compression efficiency by noise reduction and visual enhancement of corrupted or distorted video sequences from the capture source or transmission processes.
Pre-processing technology is also becoming necessary in video compression, storage and transportation, because high quality pre-processing provides a real technological edge by yielding both a high visual quality and a high compression ratio, particularly in the Video-over-IP, HDTV and related markets. To give a quantitative example, the advantageous pre-processing scheme in accordance with the present invention can result in a visual quality improvement of about 3 dB in signal to noise ratio and a coding efficiency gain of about 50-60%. This improvement is even greater than that achieved by the H.264 (Advanced Video Coding) standard over the MPEG-2 standard.
Furthermore, due to the nature of video sequences, motion estimation has been used in a variety of video processing and compression applications, including pre-processing. There are numerous motion estimation algorithms known in the literature for image processing, visual noise reduction filtering and video compression. However, many of these algorithms are too expensive in computational complexity to be implemented, or they are unsatisfactory in their performance. Indeed, some algorithms are both too expensive and unsatisfactory.
The new pre-processing optimization combined with later stage encoding or transcoding optimization in accordance with the statistical content block matching scheme of the present invention yields a more advanced and competitive approach.
The following is a brief overview of certain areas of information forming the background to the present invention.
(A) Encoding
The present invention is advantageously embodied in a video encoder. A conventional video encoder is preferably an encoder which utilizes a video compression algorithm to provide, for example, an MPEG-2 compatible bit stream. The MPEG-2 bit stream has six layers of syntax. These are a sequence layer (random access unit, context), Group of Pictures layer (random access unit, video coding), picture layer (primary coding layer), slice layer (resynchronization unit), macroblock layer (motion compensation unit), and block layer (DCT unit). The encoder distinguishes between three kinds of pictures, I (“intra”), P (“predictive”) and B (“bi-predictive”). A group of pictures (GOP) is a set of frames which starts with an I-picture and includes a certain number of P and B pictures. The number of pictures in a GOP may be fixed. The coding of I pictures results in the greatest number of bits. In an I-picture, each macroblock is divided into 8×8 blocks of pixels, and each block undergoes a discrete cosine transform (DCT) to form an 8×8 array of transform coefficients. The transform coefficients are then quantized with a variable quantizer matrix. The resulting quantized DCT coefficients are scanned using, e.g., zig-zag scanning, to form a sequence of DCT coefficients. The DCT coefficients are then organized into (run, level) pairs. The (run, level) pairs are then entropy encoded. In an I-picture, each macroblock is encoded according to this technique, which is known as spatial encoding.
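The scanning and (run, level) steps described above can be illustrated with a minimal sketch. The function names `zigzag_order` and `run_level_pairs` are hypothetical, and the sketch makes simplifying assumptions: it treats the DC coefficient like any other coefficient and simply drops trailing zeros, whereas an actual MPEG-2 bit stream codes these cases with dedicated syntax.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag scan order."""
    positions = [(r, c) for r in range(n) for c in range(n)]
    # Positions are grouped by anti-diagonal (row + col). The walking
    # direction alternates from one anti-diagonal to the next.
    return sorted(
        positions,
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def run_level_pairs(block):
    """Convert a quantized block into (run, level) pairs.

    run   = number of zero coefficients preceding a nonzero coefficient
    level = value of that nonzero coefficient
    Trailing zeros are dropped (simplification, see the note above).
    """
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        coeff = block[r][c]
        if coeff == 0:
            run += 1
        else:
            pairs.append((run, coeff))
            run = 0
    return pairs


# Illustrative quantized block: three nonzero low-frequency coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, 5, -3
pairs = run_level_pairs(block)  # [(0, 12), (0, 5), (1, -3)]
```

Because quantization tends to zero out high-frequency coefficients, the zig-zag order groups the zeros at the end of the sequence, which keeps the list of pairs short.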
In a P-picture, a decision is made to code the macroblock as an I macroblock or as a P macroblock. For each P macroblock, a prediction of the macroblock in a previous video picture is obtained. While this technique is discussed in more detail below, generally the prediction macroblock is identified by a motion vector which indicates the translation between the macroblock to be coded in the current picture and its “best match” prediction in a previous picture. The predictive error between the prediction macroblock and the current macroblock is then coded using the DCT, quantization, scanning, run, level pair encoding, and entropy encoding.
In the coding of a B-picture, a decision has to be made as to the coding of each macroblock. The choices are (a) intracoding (as in an I macroblock), (b) unidirectional backward predictive coding using a subsequent picture to obtain a motion compensated prediction, (c) unidirectional forward predictive coding using a previous picture to obtain a motion compensated prediction, and (d) bidirectional predictive coding wherein a motion compensated prediction is obtained by interpolating a backward motion compensated prediction and a forward motion compensated prediction. In the cases of forward, backward, and bidirectional motion compensated prediction, the predictive error is encoded using DCT, quantization, zig-zag scanning, run, level pair encoding, and entropy encoding.
B pictures have the smallest number of bits when encoded, then P pictures, with I pictures having the most bits when encoded. Thus, the greatest degree of compression is achieved for B pictures. For each of the I, B, and P pictures, the number of bits resulting from the encoding process can be controlled by controlling the quantizer step size. A macroblock of pixels or pixel errors which is coded using a large quantizer step size results in fewer bits than if a smaller quantizer step size is used. Other techniques may also be used to control the number of encoded bits.
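The effect of the quantizer step size can be sketched as follows. This is a simplified uniform quantizer with a hypothetical name (`quantize`) and made-up coefficient values; it does not reproduce the weighting matrices or rounding rules of any particular standard.

```python
def quantize(coeffs, step):
    """Uniform quantization: divide by the step size and truncate toward zero."""
    return [int(c / step) for c in coeffs]


def nonzero_count(levels):
    """Number of nonzero quantized coefficients (each one costs a coded pair)."""
    return sum(1 for q in levels if q != 0)


# Illustrative transform coefficients in scan order (values are made up).
coeffs = [620.0, -48.5, 31.0, 17.2, -9.8, 6.1, -3.3, 1.4]

fine = quantize(coeffs, 4)     # [155, -12, 7, 4, -2, 1, 0, 0]
coarse = quantize(coeffs, 16)  # [38, -3, 1, 1, 0, 0, 0, 0]
```

The larger step size leaves fewer nonzero coefficients and smaller levels, so fewer bits are needed, at the cost of a larger reconstruction error.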
(B) Motion Estimation
As indicated above, temporal encoding typically involves finding a prediction macroblock for each to-be-encoded macroblock. The prediction macroblock is subtracted from the to-be-encoded macroblock to form a prediction error macroblock. The individual blocks of the prediction error macroblock are then spatially encoded.
Each prediction macroblock originates in a picture other than the to-be-encoded picture, called a “reference picture.” A single prediction macroblock may be used to “predict” a to-be-encoded macroblock, or multiple prediction macroblocks, each originating in a different reference picture, may be interpolated, and the interpolated prediction macroblock may be used to “predict” the to-be-encoded macroblock. Preferably, the reference pictures themselves are first encoded and then decompressed or “decoded.” The prediction macroblocks used in encoding are selected from “reconstructed pictures” produced by the decoding process. Reference pictures temporally precede or succeed the to-be-encoded picture in the order of presentation or display. Based on these reference pictures, the I, P and B encoded pictures may be produced.
MPEG-2 supports several different types of prediction modes which can be selected for each to-be-encoded macroblock, based on the types of predictions that are permissible in that particular type of picture. Of the available prediction modes, two prediction modes are described below which are used to encode frame pictures. According to a “frame prediction mode” a macroblock of a to-be-encoded frame picture is predicted by a frame prediction macroblock formed from one or more reference frames. For example, in the case of a forward only predicted macroblock, the prediction macroblock is formed from a designated preceding reference frame. In the case of a backward only predicted macroblock, the prediction macroblock is formed from a designated succeeding reference frame. In the case of a bi-predicted macroblock, the prediction macroblock is interpolated from a first prediction macroblock formed from the designated preceding reference frame and a second prediction macroblock formed from the designated succeeding reference frame.
According to a “field prediction mode for frames” a macroblock of a to-be-encoded frame picture is divided into to-be-encoded top and bottom field macroblocks. A field prediction macroblock is separately obtained for each of the to-be-encoded top and bottom field macroblocks. Each field prediction macroblock is selected from top and bottom designated reference fields. The particular fields designated as reference fields depend on whether the to-be-encoded field macroblock is the first displayed field of a P-picture, the second displayed field of a P-picture or either field of a B-picture. Other well known prediction modes applicable to to-be-encoded field pictures include dual prime, field prediction of field pictures and 16×8 prediction. For the sake of brevity, these modes are not described herein.
Prediction macroblocks often are not at the same relative spatial position (i.e., the same pixel row and column) in the reference picture as the to-be-encoded macroblock spatial position in the to-be-encoded picture. Rather, a presumption is made that each prediction macroblock represents a similar portion of the image as the to-be-encoded macroblock, which image portion may have moved spatially between the reference picture and the to-be-encoded picture. As such, each prediction macroblock is associated with a motion vector, indicating a spatial displacement from the prediction macroblock's original spatial position in the reference field to the spatial position corresponding to the to-be-encoded macroblock. This process of displacing one or more prediction macroblocks using a motion vector is referred to as motion compensation.
In motion compensated temporal encoding, the best prediction macroblock(s) for each to-be-encoded macroblock is generally not known ahead of time. Rather, a presumption is made that the best matching prediction macroblock is contained in a search window of pixels of the reference picture around the spatial coordinates of the to-be-encoded macroblock (if such a prediction macroblock exists at all). Given a macroblock of size I×J pixels, and a search range of ±H pixels horizontally and ±V pixels vertically, the search window is of size (I+2H)(J+2V). A block matching technique may be used, whereby multiple possible prediction macroblock candidates at different spatial displacements (i.e., with different motion vectors) are extracted from the search window and compared to the to-be-encoded macroblock. The best matching prediction macroblock candidate may be selected, and its spatial displacement is recorded as the motion vector associated with the selected prediction macroblock. The operation by which a prediction macroblock is selected, and its associated motion vector is determined, is referred to as motion estimation.
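The block matching procedure described above can be sketched as an exhaustive search. The matching criterion used here, the sum of absolute differences (SAD), is an assumption, since the description does not name one. The function names and the small synthetic pictures are also hypothetical. In this sketch the motion vector is expressed as the displacement from the position of the to-be-encoded block to the position of the candidate in the reference picture.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the size x size block of `cur`
    at (cx, cy) and the size x size block of `ref` at (rx, ry)."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def full_search(cur, ref, cx, cy, size, h_range, v_range):
    """Test every displacement within +/-h_range and +/-v_range.

    Returns ((dx, dy), cost) for the lowest-cost candidate that lies
    entirely inside the reference picture.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-v_range, v_range + 1):
        for dx in range(-h_range, h_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate falls outside the picture
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Synthetic 16x16 reference picture in which every pixel value is unique.
ref = [[16 * y + x for x in range(16)] for y in range(16)]
# Current picture: the same content moved 2 pixels right and 1 pixel down.
cur = [[ref[(y - 1) % 16][(x - 2) % 16] for x in range(16)] for y in range(16)]

mv, cost = full_search(cur, ref, cx=4, cy=4, size=4, h_range=3, v_range=3)
# mv == (-2, -1), cost == 0: the best match lies 2 pixels to the left of
# and 1 pixel above the block's own position.
```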
Block matching in motion estimation requires identifying the appropriate search window for each to-be-encoded macroblock (that can possibly be temporally encoded). Then multiple candidate macroblocks of pixels must be extracted from each search window and compared to the to-be-encoded macroblock. According to the MPEG-2 chrominance format 4:2:0, for example, each macroblock includes a 2×2 arrangement of four (8×8 pixel) luminance blocks (illustratively, block matching is performed only on the luminance blocks). If each to-be-encoded picture is a CIF format picture (352×288 pixels for frames and 352×144 pixels for fields), then the number of to-be-encoded macroblocks is 396 for frame pictures and 198 for each field picture. According to MPEG-2, the search range can be as high as ±128 pixels in each direction. Furthermore, consider that MPEG-2 often provides a choice in selecting reference pictures for a to-be-encoded picture (i.e., a field-frame choice or a forward only, backward only or bi-predictive interpolated choice). In short, the number of potential candidate prediction macroblocks is very high. An exhaustive comparison of all prediction macroblock candidates to the to-be-encoded macroblock may therefore be too processing intensive for real-time encoding.
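The scale of an exhaustive search can be checked with a short calculation. The helper names below are hypothetical, and the candidate count assumes integer-pixel displacements and a single reference picture, so it understates the true number of choices.

```python
def macroblock_count(width, height, mb_size=16):
    """Number of mb_size x mb_size macroblocks in a picture."""
    return (width // mb_size) * (height // mb_size)


def search_window_pixels(i, j, h, v):
    """Pixels in the search window for an I x J macroblock: (I+2H)(J+2V)."""
    return (i + 2 * h) * (j + 2 * v)


def candidates_per_macroblock(h, v):
    """Integer-pixel displacements within +/-H horizontally, +/-V vertically."""
    return (2 * h + 1) * (2 * v + 1)


frame_mbs = macroblock_count(352, 288)            # 22 * 18 = 396
field_mbs = macroblock_count(352, 144)            # 22 * 9  = 198
window = search_window_pixels(16, 16, 128, 128)   # 272 * 272 = 73984
candidates = candidates_per_macroblock(128, 128)  # 257 * 257 = 66049
comparisons_per_frame = frame_mbs * candidates    # 26155404
```

Each of these comparisons involves 256 pixel differences for a 16×16 luminance macroblock, which is why reduced searches are attractive for real-time encoding.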
An exhaustive search can sometimes provide better memory access efficiency due to the overlap in pixels in each prediction macroblock candidate compared against a given to-be-encoded macroblock. For example, consider that a retrieved prediction macroblock candidate of 16×16 pixels includes a sub-array of 15×16 pixels of the prediction macroblock candidate to the immediate right or left (and of course a sub-array of 16×15 pixels of the prediction macroblock candidate immediately above or below). Thus only the missing 1×16 column of pixels need be retrieved to form the next left or right prediction macroblock candidate (or the missing 16×1 row of pixels need be retrieved to form the next above or below prediction macroblock candidate).
According to another technique, a hierarchical or telescopic search is performed, in which fewer than all possible choices are examined. These techniques, while computationally less demanding, are more likely to fail to obtain the optimal or best matching prediction macroblock candidate. As a result, more bits may be needed to encode the to-be-encoded macroblock in order to maintain the same quality than in the case where the best matching macroblock is obtained, or, if the number of bits per picture is fixed, the quality of the compressed picture will be degraded. Note also that the memory access efficiency is lower for the hierarchical search, since by definition, the number of overlapping pixels between successive prediction macroblock candidates will be lower.
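One common form of hierarchical search can be sketched as a two-level scheme: a coarse search on half-resolution pictures followed by a ±1 pixel refinement at full resolution. This particular structure, the SAD criterion and all names below are assumptions chosen for illustration; the description above does not specify them.

```python
def downsample(pic):
    """Halve the resolution by averaging each 2x2 group of pixels."""
    return [
        [
            (pic[y][x] + pic[y][x + 1] + pic[y + 1][x] + pic[y + 1][x + 1]) // 4
            for x in range(0, len(pic[0]) - 1, 2)
        ]
        for y in range(0, len(pic) - 1, 2)
    ]


def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def search(cur, ref, cx, cy, size, center, radius):
    """Test all displacements within `radius` of `center`.

    Returns (best displacement, its cost, number of candidates tested).
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost, tested = None, None, 0
    for dy in range(center[1] - radius, center[1] + radius + 1):
        for dx in range(center[0] - radius, center[0] + radius + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate falls outside the picture
            tested += 1
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost, tested


def hierarchical_search(cur, ref, cx, cy, size, radius):
    """Coarse search at half resolution, then +/-1 refinement at full resolution."""
    coarse_mv, _, n_coarse = search(
        downsample(cur), downsample(ref), cx // 2, cy // 2, size // 2, (0, 0), radius // 2
    )
    center = (2 * coarse_mv[0], 2 * coarse_mv[1])  # scale vector to full resolution
    mv, cost, n_fine = search(cur, ref, cx, cy, size, center, 1)
    return mv, cost, n_coarse + n_fine


# Synthetic 32x32 pictures: content moves 4 pixels right and 2 pixels down.
ref = [[32 * y + x for x in range(32)] for y in range(32)]
cur = [[ref[(y - 2) % 32][(x - 4) % 32] for x in range(32)] for y in range(32)]

mv, cost, tested = hierarchical_search(cur, ref, cx=8, cy=8, size=8, radius=6)
# 58 candidates are tested here, against 13 * 13 = 169 for a full +/-6 search.
```

In this smooth synthetic case the reduced search still finds the exact match. With real content, the coarse stage can lock onto a wrong displacement that the ±1 refinement cannot correct, which is the loss of optimality noted above.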
(C) Video Buffer Verifier
The encoding techniques described above produce a variable amount of encoded data for each picture (frame or field) of the video signal. The amount of encoded data produced for each picture depends on a number of factors including the amount of motion between the to-be-encoded picture and other pictures used as references for generating predictions therefor. For example, a video signal depicting a football game tends to have high motion pictures and a video signal depicting a talk show tends to have low motion pictures. Accordingly, the average amount of data produced for each picture of the football game video signal tends to be higher than the average amount of data produced for each picture of comparable quality of the talk show.
The allocation of bits from picture to picture or even within a picture may be controlled to generate a certain amount of data for that picture. However, the buffer at the decoder has a finite storage capacity. When encoding a video signal, a dynamically adjusted bit budget may be set for each picture to prevent overflow and underflow at the decoder buffer given the transmission bit rate, the storage capacity of the decoder buffer and the fullness of the decoder buffer over time. Note that varying the number of bits that can be allocated to a picture impacts the quality of the pictures of the video signal upon decoding.
The bit budget is set to prevent a decoder buffer underflow or overflow given a certain transmission channel bit rate. In order to prevent decoder buffer underflow and overflow, the encoder models the decoder buffer in order to determine the fullness of the decoder's buffer from time to time. The behavior of the decoder buffer is now considered in greater detail.
In modeling the decoder buffer, the encoder determines the buffer fullness of the decoder buffer. The encoder can know how many bits are present in the decoder buffer given the allocated transmission channel bit rate at which such pictures are transmitted to the decoder buffer, the delay between encoding a picture at the encoder and decoding a picture at the decoder, and the knowledge that the decoder buffer is assumed to remove the next to be decoded picture instantaneously at prescribed picture intervals. The encoder attempts to determine each maximum and minimum of the decoder buffer's fullness, which correspond to the number of bits in the buffer immediately before the decoder removes a picture and the number of bits in the buffer immediately after the decoder removes a picture, respectively. Given such information, the encoder can determine the number of bits to allocate to successive pictures to prevent decoder buffer underflows (when the decoder buffer does not have all of the bits of a picture in time for the decoder to decode them at a predefined decode time) or overflows (when the decoder buffer fullness exceeds the maximum decoder buffer storage capacity).
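The buffer model described above can be sketched as a simple simulation. It assumes a constant channel bit rate, instantaneous removal of each picture at fixed picture intervals, and a start-up delay expressed in picture intervals. The function name, parameter names and numbers are hypothetical, and the sketch is not the normative verifier of any standard.

```python
def simulate_decoder_buffer(picture_bits, bit_rate, picture_rate, capacity, startup_pictures):
    """Track decoder buffer fullness for a sequence of coded picture sizes.

    Returns one (before, after, status) tuple per picture, where `before`
    is a local maximum of fullness (just before the picture is removed)
    and `after` is a local minimum (just after it is removed).
    """
    bits_per_interval = bit_rate // picture_rate
    fullness = bits_per_interval * startup_pictures  # bits received before first decode
    trace = []
    for size in picture_bits:
        before = fullness
        if before > capacity:
            status = "overflow"   # more bits arrived than the buffer can hold
        elif before < size:
            status = "underflow"  # picture not completely received at decode time
        else:
            status = "ok"
        after = max(before - size, 0)
        trace.append((before, after, status))
        fullness = after + bits_per_interval  # channel refills during the interval
    return trace


# 3 Mbit/s channel, 30 pictures/s, so 100000 bits arrive per picture interval.
big_last = simulate_decoder_buffer(
    [150_000, 60_000, 60_000, 300_000], 3_000_000, 30, 400_000, 2
)
too_small = simulate_decoder_buffer([10_000] * 5, 3_000_000, 30, 400_000, 2)
```

In the first trace the last picture is larger than the buffer contents at its decode time, giving an underflow. In the second trace the pictures are so small that the buffer fills faster than it drains, giving an overflow. An encoder that tracks these values can adjust the bit budget of upcoming pictures before either event occurs.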
(D) Resolution/Standards Conversion
The use of high resolutions, high bit rates and/or inter-frame encoding can increase the difficulty of processing functions such as accessing stored compressed video streams, playing back more than one bit stream at the same time, and decoding/decompressing with trick modes such as fast forward and fast reverse. On the other hand, a compression system which utilizes compressed video bit streams having low resolution, low bit rate and/or only intra-frame encoding does not suffer these drawbacks. It is therefore desirable in many applications to provide a system in which multiple resolution and/or multiple bit rate versions of a given video signal can be compressed and stored. The high resolutions, high bit rates and inter-frame encoding can then be utilized when necessary, while the advantages of low resolution, low bit rates and intra-frame encoding can also be provided in appropriate applications.
Video servers represent another application in which storage of multiple versions of compressed video bit streams is desirable. Such video servers are used to deliver video bit streams to end users over data communication networks. For example, a World Wide Web server may be used to deliver video bit streams to different end users over different types of lines, including plain old telephone service (POTS) lines, integrated services digital network (ISDN) lines, T1 lines and the like. A version of a given compressed bit stream that may be suitable for a POTS user would be considered poor quality by a T1 user, and a bit stream suitable for a T1 user would be at too high a bit rate for a POTS user. It is therefore desirable for the video server to store a given video bit stream at multiple bit rates. The “optimal” resolution for a compressed video bit stream is the one that yields the best subjective video quality after decompression. This optimal resolution generally decreases with bit rate, such that it is desirable for the video server to compress the different bit rate streams at different resolutions.
Transcoding is the process of converting a media file or object from one format to another. Transcoding is often used to convert video formats (e.g., Beta to VHS, VHS to QuickTime, QuickTime to MPEG, etc.). It can also be used in applications such as fitting HTML files and graphics files to the unique constraints of mobile devices and other Web-enabled products.
(E) Re-Encoding
Many video encoding applications utilize statistical multiplexing techniques to combine several compressed video bit streams into a single multiplexed bit stream, e.g., for transmission on a single channel. The bit rate of a given compressed stream generally varies with time based on the complexity of the corresponding video signals. A statistical multiplexer attempts to estimate the complexity of the various video frame sequences of a video signal and allocates channel bits among the corresponding compressed video bit streams so as to provide an approximately constant level of video quality across all of the multiplexed streams. For example, a given video frame sequence with a relatively large amount of spatial activity or motion may be more complex than other sequences and therefore allocated more bits than the other sequences.
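The allocation step can be sketched as follows. Splitting the channel in direct proportion to a per-stream complexity estimate is an assumption made for illustration; the description above states only that more complex sequences receive more bits. The function name and the complexity values are hypothetical.

```python
def allocate_channel_bits(complexities, channel_bits):
    """Split `channel_bits` among streams in proportion to their complexity.

    Integer arithmetic is used so that the shares sum exactly to
    `channel_bits`; any rounding remainder goes to the most complex stream.
    """
    total = sum(complexities)
    if total == 0:
        raise ValueError("at least one stream must have nonzero complexity")
    shares = [channel_bits * c // total for c in complexities]
    shares[complexities.index(max(complexities))] += channel_bits - sum(shares)
    return shares


# Three streams; the first is estimated to be the most complex.
shares = allocate_channel_bits([3, 2, 2], 1_000_000)
# shares == [428572, 285714, 285714]
```

A practical multiplexer would recompute the complexity estimates periodically, for example once per picture or per group of pictures, and would also respect per-stream minimum and maximum rates.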
Some statistical multiplexers use only a priori statistics, while others use both a priori and a posteriori statistics in allocating available channel bits. A statistics gatherer and encoder element receives n video signals and gathers a priori statistics for them. These a priori statistics may include pre-encoding statistics gathered during the encoding of the respective video signal, or other a priori statistics (e.g., inter-pixel differences). To generate the a posteriori statistics, the compressed video bit streams and the a priori statistics are retrieved. A transcoder has a decoder portion which decodes the retrieved compressed video bit streams to reproduce the video signals and an encoder portion which re-encodes the reproduced video signals to produce re-compressed video signals. In re-encoding the reproduced video signals, the transcoder gathers a posteriori statistics indicating the complexity involved in re-encoding the reproduced video signals. These a posteriori statistics and the a priori statistics are used in allocating available channel bits to achieve a desired bit rate.