1. Field of the Invention
The present invention relates generally to the field of data compression, and, more specifically, to a system and method for reencoding segments of buffer constrained video streams, such as MPEG video streams.
2. Discussion of the Prior Art
Digital video compression techniques are essential to many applications because the storage and transmission of uncompressed video signals require very large amounts of memory and channel bandwidth. The dominant digital video compression techniques are specified by the international standards MPEG-1 (ISO/IEC 11172-2) and MPEG-2 (ISO/IEC 13818-2) developed by the Moving Picture Experts Group (MPEG), part of a joint technical committee of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). These standards were developed for coding of motion pictures and associated audio signals for a variety of applications involving the transmission and storage of compressed digital video, including high-quality digital television transmission via coaxial networks, fiber-optic networks, terrestrial broadcast or direct satellite broadcast; and interactive multimedia content stored on CD-ROM, Digital Tape, Digital Video Disk, and disk drives. The standards specify the syntax of the compressed bit stream and the method of decoding, but leave considerable latitude for novelty and variety in the algorithm employed in the encoder.
Some pertinent aspects of the MPEG-2 video compression standard will now be reviewed with further reference to commonly-owned U.S. Pat. No. 5,231,484, the contents and disclosure of which are incorporated by reference as if fully set forth herein.
To begin with, it will be understood that the compression of any data object, such as a page of text, an image, a segment of speech, or a video sequence, can be thought of as a series of steps, including: Step 1) a decomposition of that object into a collection of tokens; Step 2) the representation of those tokens by binary strings that have minimal length in some sense; and Step 3) the concatenation of the strings in a well-defined order. Steps 2 and 3 are lossless; i.e., the original data is faithfully recoverable upon reversal. Step 2 is known as entropy coding.
Step 1 can be either lossless or lossy. Most video compression algorithms are lossy because of stringent bit-rate requirements. A successful lossy compression algorithm eliminates redundant and irrelevant information, allowing relatively large errors where they are not likely to be visually significant and carefully representing aspects of a sequence to which the human observer is very sensitive. The techniques employed in the MPEG-2 standard for Step 1 can be described as predictive/interpolative motion-compensated hybrid DCT/DPCM coding. Huffman coding, also known as variable length coding, is used in Step 2. Although, as mentioned, the MPEG-2 standard is really a specification of the decoder and the compressed bit stream syntax, the following description of the MPEG-2 specification is, for ease of presentation, primarily from an encoder point of view.
The MPEG video standards specify a coded representation of video for transmission. The standards are designed to operate on interlaced or noninterlaced component video. Each picture has three components: luminance (Y), red color difference (CR), and blue color difference (CB). For 4:2:0 data, the CR and CB components each have half as many samples as the Y component in both horizontal and vertical directions. For 4:2:2 data, the CR and CB components each have half as many samples as the Y component in the horizontal direction but the same number of samples in the vertical direction. For 4:4:4 data, the CR and CB components each have as many samples as the Y component in both horizontal and vertical directions.
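By way of illustration only, the sampling relationships described above can be sketched in Python. The function name chroma_size and the use of strings to label the chroma formats are assumptions introduced for this sketch; they are not part of the MPEG syntax.

```python
def chroma_size(luma_width, luma_height, chroma_format):
    """Return (width, height) in samples of each of the CR and CB components.

    luma_width and luma_height are the dimensions of the Y component.
    """
    if chroma_format == "4:2:0":
        # Half as many samples horizontally and vertically.
        return luma_width // 2, luma_height // 2
    if chroma_format == "4:2:2":
        # Half as many samples horizontally, same number vertically.
        return luma_width // 2, luma_height
    if chroma_format == "4:4:4":
        # Same number of samples in both directions.
        return luma_width, luma_height
    raise ValueError("unknown chroma format: " + chroma_format)
```

For a 720 by 480 luminance component, for example, this gives 360 by 240 chrominance samples for 4:2:0 data and 360 by 480 for 4:2:2 data.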
An MPEG data stream consists of a video stream and an audio stream that are packed, with systems information and possibly other bit streams, into a systems data stream that can be regarded as layered. Within the video layer of the MPEG data stream, the compressed data is further layered. A description of the organization of the layers will aid in understanding the invention.
The layers pertain to the operation of the compression scheme as well as the composition of a compressed bit stream. The highest layer is the Video Sequence Layer, containing control information and parameters for the entire sequence. At the next layer, a sequence is subdivided into sets of consecutive pictures, each known as a Group of Pictures (GOP). A general illustration of this layer is shown in FIG. 3. Decoding may begin at the start of any GOP, essentially independent of the preceding GOPs. There is no limit to the number of pictures that may be in a GOP, nor do there have to be equal numbers of pictures in all GOPs.
The third or "Picture" layer is a single picture. A general illustration of this layer is shown in FIG. 4. The luminance component of each picture is subdivided into 16×16 regions; the color difference components are subdivided into appropriately sized blocks spatially co-situated with the 16×16 luminance regions; for 4:4:4 video, the color difference components are 16×16, for 4:2:2 video, the color difference components are 8×16, and for 4:2:0 video, the color difference components are 8×8. Taken together, these co-situated luminance region and color difference regions make up the fifth layer, known as "macroblock" (MB). Macroblocks in a picture are numbered consecutively in raster scan order.
Between the Picture and MB layers is the fourth or "Slice" layer. Each slice consists of some number of consecutive MBs. Slices need not be uniform in size within a picture or from picture to picture.
Finally, as shown in FIG. 5, each MB consists of four 8×8 luminance blocks and 8, 4, or 2 (for 4:4:4, 4:2:2 and 4:2:0 video) chrominance blocks. If the width of the luminance component in picture elements or pixels of each picture is denoted as C and the height as R (C is for columns, R is for rows), a picture is C/16 MBs wide and R/16 MBs high.
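The macroblock geometry just described can be sketched as follows. This is a minimal illustration only; the names macroblock_grid and blocks_per_macroblock are hypothetical, and the sketch assumes that C and R are exact multiples of 16, as the expressions C/16 and R/16 imply.

```python
# Number of 8x8 chrominance blocks per macroblock for each chroma format.
CHROMA_BLOCKS_PER_MB = {"4:4:4": 8, "4:2:2": 4, "4:2:0": 2}


def macroblock_grid(C, R):
    """Return (macroblocks wide, macroblocks high) for a C x R luminance picture."""
    if C % 16 or R % 16:
        # Assumption of this sketch: dimensions are multiples of 16.
        raise ValueError("C and R must be multiples of 16")
    return C // 16, R // 16


def blocks_per_macroblock(chroma_format):
    """Return the total number of 8x8 blocks in one macroblock."""
    # Four luminance blocks plus the format-dependent chrominance blocks.
    return 4 + CHROMA_BLOCKS_PER_MB[chroma_format]
```

A 720 by 480 picture is thus 45 MBs wide and 30 MBs high, and each 4:2:0 macroblock carries six 8×8 blocks.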
The Sequence, GOP, Picture, and Slice layers all have headers associated with them. The headers begin with byte-aligned "Start Codes" and contain information pertinent to the data contained in the corresponding layer.
A picture can be either field-structured or frame-structured. A frame-structured picture contains information to reconstruct an entire frame, i.e., two fields, of data. A field-structured picture contains information to reconstruct one field. If the width of each luminance frame (in picture elements or pixels) is denoted as C and the height as R (C is for columns, R is for rows), a frame-structured picture contains information for C×R pixels and a field-structured picture contains information for C×R/2 pixels.
A macroblock in a field-structured picture contains a 16×16 pixel segment from a single field. A macroblock in a frame-structured picture contains a 16×16 pixel segment from the frame that both fields compose; each macroblock contains a 16×8 region from each of two fields.
Each frame in an MPEG-2 sequence must consist of two coded field pictures or one coded frame picture. It is illegal, for example, to code two frames as one field-structured picture followed by one frame-structured picture followed by one field-structured picture; the legal combinations are: two frame-structured pictures, four field-structured pictures, two field-structured pictures followed by one frame-structured picture, or one frame-structured picture followed by two field-structured pictures. Therefore, while there is no frame header in the MPEG-2 syntax, conceptually one can think of a frame layer in MPEG-2.
Within a GOP, three "types" of pictures can appear. An example of the three types of pictures within a GOP is shown in FIG. 6. The distinguishing feature among the picture types is the compression method used. The first type, Intramode pictures or I pictures, are compressed independently of any other picture. Although there is no fixed upper bound on the distance between I pictures, it is expected that they will be interspersed frequently throughout a sequence to facilitate random access and other special modes of operation. Predictively motion-compensated pictures (P pictures) are reconstructed from the compressed data in that picture and two most recently reconstructed fields from previously displayed I or P pictures. Bidirectionally motion-compensated pictures (B pictures) are reconstructed from the compressed data in that picture plus two reconstructed fields from previously displayed I or P pictures and two reconstructed fields from I or P pictures that will be displayed in the future. Because reconstructed I or P pictures can be used to reconstruct other pictures, they are called reference pictures.
One very useful image compression technique is transform coding. In MPEG and several other compression standards, the discrete cosine transform (DCT) is the transform of choice. The compression of an I picture is achieved by the steps of 1) taking the DCT of blocks of pixels, 2) quantizing the DCT coefficients, and 3) Huffman coding the result. In MPEG, the DCT operation converts a block of 8×8 pixels into an 8×8 set of transform coefficients. The DCT transformation by itself is a lossless operation, which can be inverted to within the precision of the computing device and the algorithm with which it is performed.
The second step, quantization of the DCT coefficients, is the primary source of loss in the MPEG standards. Denoting the elements of the two-dimensional array of DCT coefficients by c_mn, where m and n can range from 0 to 7, aside from truncation or rounding corrections, quantization is achieved by dividing each DCT coefficient c_mn by w_mn × QP, with w_mn being a weighting factor and QP being the macroblock quantizer. Note that QP is applied to each DCT coefficient. The weighting factor w_mn allows coarser quantization to be applied to the less visually significant coefficients.
There can be several sets of these weights. For example, there can be one weighting factor for I pictures and another for P and B pictures. Custom weights may be transmitted in the video sequence layer, or default values may be used. The macroblock quantizer parameter is the primary means of trading off quality vs. bit rate in MPEG-2. It is important to note that QP can vary from MB to MB within a picture. This feature, known as adaptive quantization (AQ), permits different regions of each picture to be quantized with different step-sizes, and can be used to equalize (and optimize) the visual quality over each picture and from picture to picture. Typically, for example in MPEG test models, the macroblock quantizer is computed as a product of the macroblock masking factor and the picture nominal quantizer (PNQ).
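The quantization rule and the computation of the macroblock quantizer can be sketched as follows. This is a simplified illustration of the description above and not the normative MPEG-2 arithmetic: the function names are hypothetical, and truncation toward zero is assumed in place of the truncation or rounding corrections, which are not detailed here.

```python
def macroblock_quantizer(masking_factor, pnq):
    """QP as the product of the macroblock masking factor and the
    picture nominal quantizer (PNQ)."""
    return masking_factor * pnq


def quantize_block(coeffs, weights, qp):
    """Divide each DCT coefficient c_mn by w_mn * QP.

    coeffs and weights are equally sized two-dimensional lists.
    Assumption of this sketch: results are truncated toward zero.
    """
    return [
        [int(c / (w * qp)) for c, w in zip(coeff_row, weight_row)]
        for coeff_row, weight_row in zip(coeffs, weights)
    ]


# A 2x2 array is used for brevity; real MPEG blocks are 8x8.
example = quantize_block([[100, -50], [30, 7]], [[1, 2], [2, 4]], 4)
```

Larger weights or a larger QP drive more coefficients to zero, which is how the encoder trades quality for bit rate.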
Following quantization, the DCT coefficient information for each MB is organized and coded, using a set of Huffman codes. As the details of this step are not essential to an understanding of the invention and are generally understood in the art, no further description will be offered here.
Most video sequences exhibit a high degree of correlation between consecutive pictures. A useful method to remove this redundancy before coding a picture is motion compensation. MPEG-2 provides several tools for motion compensation (described below).
All the methods of motion compensation have the following in common. For each macroblock, one or more motion vectors are encoded in the bit stream. These motion vectors allow the decoder to reconstruct a macroblock, called the predictive macroblock. The encoder subtracts the predictive macroblock from the macroblock to be encoded to form the difference macroblock. The encoder uses tools to compress the difference macroblock that are essentially similar to the tools used to compress an intra macroblock.
The type of picture determines the methods of motion compensation that can be used. The encoder chooses from among these methods for each macroblock in the picture. A method of motion compensation is described by the macroblock mode and motion compensation mode used. There are four macroblock modes, intra (I) mode, forward (F) mode, backward (B) mode, and interpolative forward-backward (FB) mode. For I mode, no motion compensation is used. For the other macroblock modes, 16×16 (S) or 16×8 (E) motion compensation modes can be used. For F macroblock mode, dual-prime (D) motion compensation mode can also be used.
The MPEG standards may be used with both constant-bit-rate and variable-bit-rate transmission and storage media. The number of bits in each picture will be variable, due to the different types of picture processing, as well as the inherent variation with time of the spatio-temporal complexity of the scene being coded. The MPEG standards use a buffer-based rate control strategy, in the form of a Virtual Buffer Verifier (VBV), to put meaningful bounds on the variation allowed in the bit rate. As depicted in FIG. 1, the VBV is devised as a decoder buffer 101 followed by a hypothetical decoder 103, whose sole task is to place bounds on the number of bits used to code each picture so that the overall bit rate equals the target allocation and the short-term deviation from the target is bounded. The VBV can operate in either constant-bit-rate or variable-bit-rate mode.
In constant-bit-rate mode, the decoder buffer 101 is filled at a constant bit rate with compressed data in a bit stream from the storage or transmission medium. Both the buffer size and the bit rate are parameters that are transmitted in the compressed bit stream. After an initial delay, which is also derived from information in the bit stream, the hypothetical decoder 103 instantaneously removes from the buffer all of the data associated with the first picture. Thereafter, at intervals equal to the picture rate of the sequence, the decoder removes all data associated with the earliest picture in the buffer.
The operation of the VBV is shown by example in FIG. 7 which depicts the fullness of the decoder buffer over time. The buffer starts with an initial buffer fullness of Bi after an initial delay of time T0. The sloped line segments show the compressed data entering the buffer at a constant bit rate. The vertical line segments show the instantaneous removal from the buffer of the data associated with the earliest picture in the buffer. In this example, the pictures are shown to be removed at a constant interval of time T. In general, the picture display interval, i.e., the time interval between the removal of consecutive pictures, may be variable.
For the bit stream to satisfy the MPEG rate control requirements, it is necessary that all the data for each picture be available within the buffer at the instant it is needed by the decoder and that the decoder buffer does not overfill. These requirements translate to upper (U_k) and lower (L_k) bounds on the number of bits allowed in each picture (k). The upper and lower bounds for a given picture depend on the number of bits used in all the pictures preceding it. For example, the second picture may not contain more than U_2 bits since that is the number of bits available in the buffer when the second picture is to be removed, nor fewer than L_2 bits since removing fewer than L_2 bits would result in the buffer overflowing with incoming bits. It is a function of the encoder to produce bit streams that can be decoded by the VBV without error.
For constant-bit-rate operation, the buffer fullness just before removing a picture from the buffer is equal to the buffer fullness just before removing the previous picture minus the number of bits in the previous picture plus the product of the bit rate and the amount of time between removing the picture and the previous picture; i.e.,
buffer fullness before remove pic = buffer fullness before remove last pic − bits in last pic + (time between pic and last pic × bit rate)
The upper bound for the number of bits in a picture is equal to the buffer fullness just before removing that picture from the buffer. The lower bound is the greater of zero bits or the buffer fullness just before removing that picture from the buffer, plus the number of bits that will enter the buffer before the next picture is removed, minus the buffer size. The buffer fullness before removing a given picture depends on the initial buffer fullness and the number of bits in all of the preceding pictures, and can be calculated by using the above rules.
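The constant-bit-rate rules can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the picture interval is taken to be constant, and the lower bound is derived from the no-overflow condition, namely that the fullness after removing a picture plus the bits arriving before the next removal must not exceed the buffer size.

```python
def cbr_vbv_bounds(picture_bits, initial_fullness, buffer_size,
                   bit_rate, picture_interval):
    """Return a list of (fullness, lower, upper) tuples, one per picture.

    fullness is the buffer fullness just before the picture is removed;
    lower and upper bound the number of bits allowed in that picture.
    """
    incoming = bit_rate * picture_interval  # bits arriving between removals
    fullness = initial_fullness
    bounds = []
    for bits in picture_bits:
        upper = fullness  # cannot remove more than is in the buffer
        lower = max(0, fullness + incoming - buffer_size)  # avoid overflow
        bounds.append((fullness, lower, upper))
        # Fullness recursion for the next picture.
        fullness = fullness - bits + incoming
    return bounds


def cbr_compliant(picture_bits, initial_fullness, buffer_size,
                  bit_rate, picture_interval):
    """True if every picture size lies within its lower and upper bound."""
    bounds = cbr_vbv_bounds(picture_bits, initial_fullness, buffer_size,
                            bit_rate, picture_interval)
    return all(lo <= bits <= up
               for bits, (_, lo, up) in zip(picture_bits, bounds))
```

For instance, with a buffer size of 1000 bits, an initial fullness of 800 bits and 300 bits arriving per picture interval, the first picture must contain between 100 and 800 bits.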
Variable-bit-rate operation is similar to the above, except that the compressed bit stream enters the buffer at a specified maximum bit rate until the buffer is full, when no more bits are input. This translates to a bit rate entering the buffer that may be effectively variable, up to the maximum specified rate. An example plot of the VBV fullness under variable-bit-rate operation is shown in FIG. 8. The buffer operates similarly to the constant-bit-rate case except that the buffer fullness, by definition, cannot exceed the buffer size of Bmax. This leads to an upper bound on the number of bits produced for each picture, but no lower bound.
For variable bit rate operation, the buffer fullness just before removing a picture from the buffer is equal to the size of the buffer or to the buffer fullness just before removing the previous picture minus the number of bits in the previous picture plus the maximum bit rate times the amount of time between removing the picture and the previous picture, whichever is smaller; i.e.,
buffer fullness before remove pic = min(buffer fullness before remove last pic − bits in last pic + time between pic and last pic × bit rate, buffer size)
The upper bound for the number of bits in a picture is again equal to the buffer fullness just before removing that picture from the buffer. As mentioned earlier, the lower bound is zero. The buffer fullness before removing a given picture again depends on the initial buffer fullness and the number of bits in all of the preceding pictures, and can be calculated by using the above rules.
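The variable-bit-rate rule differs from the constant-bit-rate rule only in that the fullness is clamped to the buffer size. A minimal sketch follows, assuming a constant picture interval; the function name vbr_vbv_trace is hypothetical.

```python
def vbr_vbv_trace(picture_bits, initial_fullness, buffer_size,
                  max_bit_rate, picture_interval):
    """Return a list of (fullness, fits) tuples, one per picture.

    fullness is the buffer fullness just before the picture is removed,
    which is also the upper bound on the picture size; fits is True when
    the picture does not exceed that bound. The lower bound is zero.
    """
    incoming = max_bit_rate * picture_interval
    fullness = initial_fullness
    trace = []
    for bits in picture_bits:
        trace.append((fullness, bits <= fullness))
        # Input stops when the buffer is full, so fullness is clamped.
        fullness = min(fullness - bits + incoming, buffer_size)
    return trace
```

With a buffer size of 1000 bits, an initial fullness of 800 bits and at most 300 bits arriving per picture interval, a very small first picture simply fills the buffer to 1000 bits and no overflow is signalled.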
Decoding requires that all the data for each picture be available within the buffer at the instant it is needed by the decoder. It is the function of the encoder to produce bit streams that conform to the VBV requirements, i.e., no buffer underflow in either CBR or VBR operation and no buffer overflow in CBR operation.
The international standards MPEG-1 and MPEG-2 are widely employed in applications involving digital video compression. In the production of MPEG compressed digital video content, such as digital TV programs, DVDs, and other multimedia content, there are often some segments of the video with unsatisfactory visual picture quality due to the varying complexity of the video content. The existing approach is to encode the entire program again at a higher bit rate with the intention of improving the picture quality of these segments. Since programs or movies are typically one or two hours long, this reencoding approach not only consumes twice as much production time but can also waste a very significant amount of storage media and transmission bandwidth. For example, to reencode a two hour long movie at a bit rate of 5 Mbits/sec, which was previously coded at 3 Mbits/sec, will take at least another 2 hours and additional storage space of 14400 Mbits; this is a 67% increase. Furthermore, there is no guarantee that the reencoded video stream will have a satisfactory picture quality. The reencoding process may have to be repeated at an even higher bit rate, wasting a lot of time, storage space, and transmission bandwidth unnecessarily on the segments that already have satisfactory visual quality.
Therefore, better solutions to the reencoding problem are needed to greatly improve production efficiency and reduce cost.
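The storage figures in the example above can be checked with a short calculation. The helper name stream_size_mbits is hypothetical and is introduced only for this illustration.

```python
def stream_size_mbits(bit_rate_mbits_per_sec, duration_hours):
    """Size in Mbits of a stream coded at a constant average bit rate."""
    return bit_rate_mbits_per_sec * duration_hours * 3600


original = stream_size_mbits(3, 2)    # two-hour movie at 3 Mbits/sec
reencoded = stream_size_mbits(5, 2)   # the same movie at 5 Mbits/sec
additional = reencoded - original     # extra storage required
increase_percent = 100 * additional / original
```

Here original is 21600 Mbits, reencoded is 36000 Mbits, and the additional 14400 Mbits amounts to roughly a 67% increase.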
It is an object of the invention to provide a system and method for reencoding only the segments which have unsatisfactory visual quality in a video stream so that significant production time, storage space, and bandwidth can be saved.
According to one aspect of the invention, there is provided a system and reencode methodology that includes: 1) the step of allocating the target number of bits to each picture in the specified segment to be reencoded in accordance with the user desired average bit rate or the limits of the VBV buffer constraints; 2) the step of reencoding the segment according to the target bits allocation, and, 3) the steps of merging the stream of the reencoded segment with the original stream to replace the original segment and verifying that the merged stream will still conform to the VBV buffer constraints.
In one embodiment of the invention, the step 1) of allocating the target number of bits to the pictures uses the information obtained by an information collection unit. The information includes, for instance, the measures of the picture quality index, the complexity, the coding bits used, the average quantization scale, and the coding type of each picture coded in the original bit stream. It also includes the spatial and spatial-temporal activity measures as well as the information of GOP structure, resolution, format, etc. The information may be obtained either by on-line collection during the previous encoding which produced the original stream or by off-line analysis of the bit stream and video source. In the preferred embodiment, the information is collected during the encoding and stored in a statistical file. The target allocation step also preferably invokes a method for dividing the segment into intervals with relatively homogeneous contents. The available bits for re-encoding the segment are derived from the user desired average bit rate and are then allocated among the pictures in the segment based on the picture quality measures and the picture complexities derived from the information in the statistical file. The target allocation step 1) also implements a procedure for calculating the minimum required VBV buffer fullness at the end of the segment and a procedure for adjusting the target allocation so that the bit stream of the reencoded segment itself and the combined new bit stream after it is merged into the original bit stream will still conform to all the VBV buffer constraints.
The reencoding step 2) encodes the corresponding segment of the signal source again and produces a new bit stream for the segment along with the associated statistical file. During the encoding, this step may preferably invoke a rate control method so that the actual bits spent on each picture are close to the allocated target bits. The segment may be re-encoded at a higher or lower bit rate than the encoded bit rate for the segment of the original stream.
The merging step 3) first verifies that if the old segment in the original bit stream is replaced with the new bit stream generated by reencoding, the resulting stream is indeed compliant to the VBV buffer constraints. If this is true, the new bit stream is merged into the original bit stream to replace the old segment. In the preferred embodiment, the statistical file corresponding to the reencoded segment is also merged into the original statistical file so that the segments in successive reencoding may be overlapped.
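The verify-then-merge behavior of step 3) can be sketched as follows. This is an illustrative sketch only and not the procedure of the preferred embodiment: it assumes constant-bit-rate operation with a constant picture interval, represents each stream simply as a list of per-picture bit counts, and uses hypothetical function names.

```python
def vbv_compliant_cbr(picture_bits, initial_fullness, buffer_size, incoming):
    """Check a stream against the constant-bit-rate VBV rules.

    incoming is the number of bits entering the buffer per picture interval.
    """
    fullness = initial_fullness
    for bits in picture_bits:
        if bits > fullness:
            return False  # underflow: picture not fully in the buffer
        fullness = fullness - bits + incoming
        if fullness > buffer_size:
            return False  # overflow before the next picture is removed
    return True


def merge_if_compliant(original_bits, new_segment_bits, start, end,
                       initial_fullness, buffer_size, incoming):
    """Replace pictures start..end-1 with the reencoded segment.

    Returns the merged list if it satisfies the VBV constraints,
    otherwise None, leaving the original stream untouched.
    """
    merged = original_bits[:start] + new_segment_bits + original_bits[end:]
    if vbv_compliant_cbr(merged, initial_fullness, buffer_size, incoming):
        return merged
    return None
```

The check is run on the whole merged stream rather than on the segment alone, because the bits spent in the segment change the buffer fullness seen by every picture that follows it.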
Advantageously, the system and method of the invention are applicable to any encoded bit stream and encompass both audio and video applications.
Further features, aspects and advantages of the apparatus and methods of the present invention will become better understood with regard to the following description, appended claims, and accompanying drawings where:
FIG. 1 is a physical depiction of the Virtual Buffer Verifier;
FIG. 2 is an overview of a conventional video compression system;
FIG. 3 illustrates an exemplary set of Groups of Pictures (GOPs) in the GOP layer of compressed data within the video compression layer of an MPEG data stream;
FIG. 4 illustrates an exemplary Macroblock (MB) subdivision of a picture in the MB layer of compressed data within the video compression layer of an MPEG data stream;
FIG. 5 illustrates the Block subdivision of a Macroblock;
FIG. 6 illustrates the type of pictures in an exemplary Group of Pictures;
FIG. 7 illustrates an exemplary plot of the evolution of a virtual decoder buffer over time for operation in constant-bit-rate mode;
FIG. 8 illustrates an exemplary plot of the evolution of a virtual decoder buffer over time for operation in variable-bit-rate mode;
FIG. 9 illustrates a block diagram of the preferred embodiment of the present invention; and,
FIG. 10 illustrates an exemplary plot of the segments and intervals of pictures according to the principles of the invention.