The MPEG-2 standard is widely used for encoding video. According to this standard, pictures are both spatially and temporally encoded. Each picture is first divided into non-overlapping macroblocks, where each macroblock includes a 16×16 array of luminance samples and each 8×8 block or array of chrominance samples overlaid thereon. A decision is made to encode the macroblock as an inter macroblock, in which case the macroblock is both temporally and spatially encoded, or to encode the macroblock as an intra macroblock, in which case the macroblock is only spatially encoded. A macroblock is temporally encoded by an inter-picture motion compensation operation. According to such an operation, a prediction macroblock is identified for the to-be-motion compensated macroblock and is subtracted therefrom to produce a prediction error macroblock. The prediction macroblock originates in another picture, called a reference picture, or may be an interpolation of multiple prediction macroblocks, each originating in different reference pictures. The prediction macroblock need not have precisely the same spatial coordinates (pixel row and column) as the macroblock from which it is subtracted and in fact can be spatially offset therefrom. A motion vector is used to identify the prediction macroblock by its spatial shift and by the reference picture from which it originates. (When the prediction macroblock is an interpolation of multiple prediction macroblocks, a motion vector is obtained for each to-be-interpolated prediction macroblock.)
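By way of illustration only, the subtraction that produces a prediction error macroblock can be sketched in Python as follows. The function name `prediction_error`, its parameters, and the representation of a picture as a list of rows of sample values are assumptions made for this sketch. The sketch further assumes a whole-pixel motion vector and a single reference picture, whereas MPEG-2 also permits half-pixel offsets and interpolated predictions.

```python
def prediction_error(current, reference, row, col, motion_vector, size=16):
    """Return the prediction error for one macroblock (illustrative sketch).

    current, reference: pictures as lists of rows of luminance samples.
    row, col: top-left corner of the macroblock in the current picture.
    motion_vector: (dy, dx) whole-pixel shift into the reference picture
                   (an assumption; MPEG-2 also allows half-pixel shifts).
    """
    dy, dx = motion_vector
    error = []
    for r in range(size):
        error_row = []
        for c in range(size):
            # Sample of the prediction macroblock, spatially offset by the vector.
            predicted = reference[row + dy + r][col + dx + c]
            # Prediction error = current sample minus predicted sample.
            error_row.append(current[row + r][col + c] - predicted)
        error.append(error_row)
    return error
```

When the motion vector matches the true displacement of the picture content, every entry of the returned array is zero, which is what makes the prediction error macroblock inexpensive to encode spatially.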
Pictures may be classified as intra or I pictures, predictive or P pictures and bidirectionally predictive or B pictures. An I picture contains only intra macroblocks. A P picture may contain inter macroblocks, but only forward directed predictions from a preceding reference picture are permitted. A P picture can also contain intra macroblocks for which no adequate prediction was found. In addition, a dual prime prediction may be formed for a P picture macroblock in an interlaced picture, which is an interpolated prediction from the two immediately preceding reference fields. A B picture can contain intra macroblocks, inter macroblocks that are forward directed motion compensated, inter macroblocks that are backward directed motion compensated, i.e., predicted from a succeeding reference picture, and inter macroblocks that are bidirectionally motion compensated, i.e., predicted from an interpolation of prediction macroblocks in each of preceding and succeeding reference pictures. If the P or B pictures are interlaced, then each component field macroblock can be separately motion compensated or the two fields can be interleaved to form a frame macroblock and the frame macroblock can be motion compensated at once.
Spatial compression is performed on selected 8×8 luminance pixel blocks and selected 8×8 pixel chrominance blocks of selected prediction error macroblocks, or selected intra macroblocks. Spatial compression includes the steps of discrete cosine transforming each block, quantizing each block, zig-zag (or alternate) scanning each block into a sequence, run-level encoding the sequence and variable length encoding the run-level encoded sequence. Prior to discrete cosine transformation, a macroblock of a frame picture may optionally be formatted as a frame macroblock, including blocks containing alternating lines of samples from each of the two component field pictures of the frame picture, or as a field macroblock, where the samples from different fields are arranged into separate blocks of the macroblock. The quantizer scale factor may be changed on a macroblock-by-macroblock basis and the weighting matrix may be changed on a picture-by-picture basis. Macroblocks, or coded blocks thereof, may be skipped if they have zero (or nearly zero) valued coded data. Appropriate codes are provided into the formatted bitstream of the encoded video signal, such as non-contiguous macroblock address increments, or coded block patterns, to indicate skipped macroblocks and blocks.
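The quantizing, zig-zag scanning and run-level encoding steps described above can be sketched, for illustration only, as follows. The function names are hypothetical, and the sketch assumes that the 8×8 block has already been discrete cosine transformed. It also assumes a single uniform quantizer step with truncation toward zero, whereas MPEG-2 combines a weighting matrix with the quantizer scale factor and handles the DC coefficient of intra blocks separately. Variable length encoding of the resulting pairs is omitted.

```python
def quantize(block, step):
    """Uniformly quantize transform coefficients (simplifying assumption)."""
    return [[int(value / step) for value in row] for row in block]


def zigzag_order(n=8):
    """Return (row, col) positions of an n x n block in zig-zag scan order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Walk the anti-diagonals in turn; odd diagonals run top to bottom,
        # even diagonals run bottom to top.
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]),
    )


def run_level(sequence):
    """Encode a coefficient sequence as (zero run, nonzero level) pairs."""
    pairs, run = [], 0
    for value in sequence:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    # Trailing zeros are not emitted; an end-of-block code would imply them.
    return pairs


def scan_and_run_level(quantized_block):
    """Zig-zag scan a quantized block, then run-level encode the sequence."""
    sequence = [quantized_block[r][c] for r, c in zigzag_order(len(quantized_block))]
    return run_level(sequence)
```

Because quantization drives most high-frequency coefficients to zero, the zig-zag scan tends to gather the zeros into long runs, so a block is typically represented by only a few pairs.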
Additional formatting is applied to the variable length encoded sequence to aid in identifying the following items within the encoded bitstream: individual sequences of pictures, groups of pictures of the sequence, pictures of a group of pictures, slices (contiguous sequences of macroblocks of a single macroblock row) of pictures, macroblocks of slices and motion vectors and blocks of macroblocks. Some of the above layers are optional, such as the group of pictures layer and the slice layer, and may be omitted from the bitstream if desired. (If slice headers are included in the bitstream, one slice header is provided for each macroblock row.) Various parameters and flags are inserted into the formatted bitstream as well indicating each of the above noted choices (as well as others not described above). The following is a brief list of some of such parameters and flags: picture coding type (I, P, B), macroblock type (i.e., forward predicted, backward predicted, bidirectionally predicted, spatially encoded only), macroblock prediction type (field, frame, dual prime, etc.), DCT type (i.e., frame or field macroblock format for discrete cosine transformation), the quantizer scale code, etc. A repeat_first_field flag may be inserted into the encoded video signal to indicate that a field repeated during a telecine process of converting film frames to NTSC video (using the well known 3:2 pull-down technique) was omitted from the encoded video signal. In addition, error concealment motion vectors optionally may be provided with intra macroblocks for motion compensated recovery of another macroblock in the event the other macroblock is corrupted due to an error.
In encoding the video signal according to MPEG-2, the encoder must produce a bitstream which does not overflow or underflow the buffer of a decoder which decodes the video signal. To that end, the encoder models the decoder's buffer and, in particular, monitors the fullness of the decoder's buffer. The decoder buffer is presumed to fill with bits of the bitstream at a particular rate which is a function of the channel rate at a certain moment of time. Pictures are presumed to be instantly removed at a particular instant relative to the decode and presentation time of each picture. See U.S. patent application Ser. No. 09/084,690 for an in-depth discussion of the modeling of the decoder buffer by an encoder. Using such a model, the encoder can adjust various encoding parameters to control the number of bits produced for each encoded picture in an effort to prevent overflowing or underflowing the decoder's buffer. For example, the encoder can adjust the quantizer scale factor, encourage selection of certain types of encoding over others, add stuffing data to pictures, change the number of B and P pictures, change a threshold quality level used in determining whether to perform intra or inter coding of macroblocks, etc., to increase or reduce the number of bits produced for each picture. Generally speaking, the encoder forms a target bit budget for each picture, which is a function of, among other parameters, the channel rate, the decoder buffer size (normally assumed to be a certain constant), and the vacancy/occupancy of the decoder's buffer immediately before and after removal of the particular picture for which a budget is being generated. The encoder then adjusts its encoding in an attempt to achieve the target bit budget for the picture.
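The decoder buffer model described above can be sketched, for illustration only, as follows. The function name `simulate_decoder_buffer` and its parameters are hypothetical. The sketch assumes a constant channel rate, a constant picture rate, and instantaneous removal of one picture per picture interval; it is not the model of the referenced application.

```python
def simulate_decoder_buffer(picture_bits, channel_rate, picture_rate,
                            buffer_size, initial_fullness):
    """Track modeled decoder buffer fullness over a run of encoded pictures.

    picture_bits: number of bits in each encoded picture, in decode order.
    channel_rate: bits per second entering the buffer (assumed constant).
    picture_rate: pictures removed per second (assumed constant).
    Returns (status, trace), where status is "ok", "underflow" or "overflow"
    and trace lists the fullness just before each subsequent removal.
    """
    bits_per_interval = channel_rate / picture_rate
    fullness = initial_fullness
    trace = []
    for bits in picture_bits:
        if bits > fullness:
            # The picture has not fully arrived by its removal instant.
            return "underflow", trace
        fullness -= bits               # picture removed instantly
        fullness += bits_per_interval  # channel refills until next removal
        if fullness > buffer_size:
            # More bits arrived than the buffer can hold.
            return "overflow", trace
        trace.append(fullness)
    return "ok", trace
```

An encoder maintaining such a model would reduce the bits spent on upcoming pictures as the modeled fullness approaches underflow, and would spend more bits (or add stuffing data) as it approaches overflow.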
Occasionally, it is desired to re-encode a previously encoded video signal. For example, in some video server or network situations, it is desirable to re-encode the video signal in a fashion other than that in which it was originally encoded to meet network congestion/bandwidth availability constraints, to provide the video signal to different users with varying decoder capability, etc. In another example, a video signal is prepared in one format for professional delivery (for example, IF editing prior to broadcast), and is later to be delivered in a format suitable for consumer use (e.g., broadcast of the final edited version). U.S. patent application Ser. No. 08/775,313 teaches an advantageous transcoder which decodes a received, encoded video signal and then re-encodes the video signal. The transcoder taught in this incorporated patent application has a decoder which optionally provides auxiliary information or meta data to the encoder of the transcoder for facilitating the encoding. Such meta data may be indicative of different kinds of information contained in the bitstream as originally encoded, such as the motion vectors used, number of bits per picture, quantization scale factors, type of prediction for each macroblock, field or frame formats used for macroblocks, picture coding types and locations of repeat fields. Such meta data indicates various decisions previously made in encoding the video signal which can frequently be re-used in whole or in part. For example, repeat field decisions can usually be re-used so long as the video standard does not change. Alternatively, or additionally, motion vectors may be used wholly or to indicate a smaller search window for identifying candidate prediction macroblocks than would otherwise normally be necessary.
Generally speaking, it is desirable to use the same picture coding type and the same intra/inter macroblock decisions in the subsequent encoding of the transcoding operation as were used in originally encoding the video signal fed to the transcoder. This maintains picture quality. When encoding a picture, it is not necessary to use a fixed group of pictures structure, field/frame format or a regular field display code. Normally, an encoder has many choices in encoding a video signal, especially in regard to preventing decoder buffer overflow and underflow. However, in the case of a transcoder, the picture coding type and inter/intra macroblock decisions are preferably constrained to be the same during a successive encoding as they were during the previous encoding. As such, the encoder of a transcoder has only two options available for varying the encoding. First, while the transcoder's decoder decodes pictures of the bitstream, information regarding the decoded picture types can be gathered. The transcoder's encoder extrapolates from this information as to what picture types are expected and allocates bits accordingly. However, this solution does not work well if the group of pictures structure of the bitstream changes. For example, the group of pictures structure can change from IBBPBBPBBPBBI to IIIIIII. In such a case, the extrapolation of picture coding type will be erroneous. In the example above, the unanticipated rise in I picture frequency will result in an incorrect allocation of bits and degraded quality for unanticipated I pictures.
Second, the transcoder can make no assumption about picture types and simply scale the number of bits used in the original encoding according to the ratio of the bit rate of the originally encoded bitstream to the bit rate of the re-encoded bitstream produced by the transcoder. However, this solution does not work well if the bit rate of the originally encoded bitstream fed to the transcoder is far higher than the bit rate of the re-encoded bitstream produced by the transcoder. The reason for this is that the difference in the number of bits used for different picture coding types is inversely correlated with the bit rate of the signal. Thus, at very high bit rates, B pictures have a similar number of bits of encoded data as I pictures, yet at low bit rates, I pictures have far more bits of encoded data than B pictures. Scaling every picture by the same ratio therefore carries the relative allocation of the high bit rate signal into the low bit rate signal, where I pictures need a larger share of the bits.
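The simple scaling option can be sketched, for illustration only, as follows. The function name `scaled_bit_budgets` and its parameters are hypothetical, and the sketch assumes that the number of bits used for each picture in the original encoding is available to the transcoder's encoder.

```python
def scaled_bit_budgets(original_picture_bits, original_rate, output_rate):
    """Scale each picture's original bit count by the ratio of bit rates.

    original_picture_bits: bits used per picture in the original encoding.
    original_rate: bit rate of the originally encoded bitstream.
    output_rate: bit rate of the re-encoded bitstream.
    """
    scale = output_rate / original_rate
    # Every picture is scaled alike, regardless of its picture coding type.
    return [bits * scale for bits in original_picture_bits]
```

For example, `scaled_bit_budgets([400000, 100000, 100000], 8_000_000, 4_000_000)` returns `[200000.0, 50000.0, 50000.0]`; the first picture still receives four times the bits of each of the others, exactly as in the original encoding.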
A video program normally includes an encoded video signal and at least one encoded audio signal (although the video program can include a second encoded audio signal, a closed captioned text signal, and other auxiliary signals). Often, it is desired to combine multiple video programs and transmit the combined signal on a transmission channel having a particular channel bit rate. A preferred manner of combining and transmitting such video programs is to statistically and dynamically allocate the channel bit rate amongst all video programs. The dynamic statistical allocation can be done in a fashion to attempt to achieve the same quality over all combined video programs. For example, assume that first and second video programs are to be combined, wherein the first video program carries a low complexity video event, such as a talk show, and the second video program carries a high complexity event, such as a football game. As is known, a high complexity event with high inter-picture motion, such as a football game, will require a higher bit rate to maintain the same quality as a low complexity event with low inter-picture motion, such as a talk show. Thus, the first video program is likely to be allocated lower channel bit rates than the second video program in order to maintain the overall quality between the two video programs approximately the same.
U.S. patent application Ser. No. 08/775,313 teaches a system in which multiple encoders are provided, including one encoder for encoding a corresponding one of multiple video signals to be multiplexed together. While encoding these video signals, the encoders gather a priori "pre-encoding statistics" for these video signals such as: a number of bits generated for each compressed picture, an average quantization level, picture types, scene change locations and repeat field patterns. These statistics are stored in a storage medium. According to one embodiment, the encoded video signals are stored in encoded form in the storage medium (or another storage medium) as well. Multiple transcoders are provided including one transcoder for transcoding each encoded video signal. The above-noted, previously generated a priori, pre-encoding statistics are provided to a statistics computer. Using such statistics as an indication of the complexity of encoding the video signal, the statistics computer allocates a fraction of the transmission medium bit rate to each transcoder. The bit rates determined by the statistics computer may be generated in a fashion to approximately equalize the quality of each transcoded video signal at that moment in time. Each transcoder then adjusts its re-encoding according to the newly allocated bit rates. In particular, each transcoder adjusts the rate at which bits of the encoded video signal are presumed to fill the decoder buffer in the model of the decoder buffer maintained at the transcoder according to the newly allocated transmission rate. This in turn affects the number of bits each transcoder allocates to each picture during the re-encoding process.
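The allocation performed by the statistics computer can be sketched, for illustration only, as follows. The function name `allocate_channel_rate`, the use of a single complexity number per video program, and the strictly proportional rule are all assumptions made for this sketch; the referenced application is not stated here to use this particular formula.

```python
def allocate_channel_rate(complexities, channel_rate):
    """Split a channel bit rate among programs in proportion to complexity.

    complexities: mapping of program name to a non-negative complexity
                  measure derived from gathered statistics (hypothetical).
    channel_rate: total bit rate of the transmission channel.
    """
    total = sum(complexities.values())
    if total <= 0:
        # No usable statistics: fall back to an equal split (assumption).
        share = channel_rate / len(complexities)
        return {name: share for name in complexities}
    return {name: channel_rate * value / total
            for name, value in complexities.items()}
```

With a low complexity talk show assigned a complexity of 1.0 and a high complexity football game assigned 3.0, a channel of 8,000,000 bits per second would be divided into 2,000,000 and 6,000,000 bits per second, respectively. Each transcoder would then use its newly allocated rate as the fill rate in its model of the decoder buffer.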
The technique taught in this incorporated application can provide superior results in the statistical multiplexing scenario. However, this technique assumes that the raw, unencoded video is available for pre-encoding gathering of statistics. This is not always the case. For example, the originally encoded video signal may have been generated at a remote location and/or using an encoder not under the control of the operator who wishes to perform a subsequent transcoding. In addition, it is desirable to provide a solution for adjusting bit budgets for individual video signals even when the channel rate allocated to carrying the re-encoded video signal does not vary. It is therefore desirable to provide a more general solution to the bit allocation problem in the context of transcoding.