In a preferred embodiment of the invention, the video encoder is an MPEG-2 compliant encoder. The encoder receives a sequence of frames from a video source. The sequence of frames may be progressive or interlaced. Illustratively, the progressive sequence comprises 30 frames per second. In the case of an interlaced sequence, each frame comprises two fields. A top field comprises the even numbered rows and a bottom field comprises the odd numbered rows. Thus, in the case of an interlaced sequence, there are 60 fields per second.
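The relationship between an interlaced frame and its two fields can be illustrated with a minimal sketch. It assumes that a frame is held as a list of rows numbered from zero; the function name `split_fields` is hypothetical and is used only for illustration.

```python
def split_fields(frame):
    """Split a frame (a list of rows, numbered from 0) into its two fields."""
    top = frame[0::2]     # even numbered rows: 0, 2, 4, ...
    bottom = frame[1::2]  # odd numbered rows: 1, 3, 5, ...
    return top, bottom
```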
The video source may be any source of a digital video signal such as a video camera or a telecine machine. A telecine machine converts a film comprising 24 frames per second into a 60 field per second digital video signal using 3:2 pull down. The 3:2 pull down technique provides for generating two video fields and three video fields for alternating film frames. For a film frame which is converted into three video fields, the third field is a repeat of the first field.
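The 3:2 pull down cadence described above may be sketched as follows. This is an illustrative sketch only: the function name `pulldown_32` is hypothetical, and the choice that the first film frame is the one converted into three fields is an assumption, since the text does not say which frame of each pair receives the extra field.

```python
def pulldown_32(film_frames):
    """Map film frames to a list of (frame, parity) video fields."""
    fields = []
    parity = "top"  # parity of the next field to be emitted
    for i, frame in enumerate(film_frames):
        # Assumption: even-indexed film frames yield three fields, odd-indexed two.
        count = 3 if i % 2 == 0 else 2
        for _ in range(count):
            fields.append((frame, parity))
            parity = "bottom" if parity == "top" else "top"
    return fields
```

Because field parity alternates continuously, the third field of a three-field frame has the same parity as its first field and is therefore a repeat of it. Twenty-four film frames yield 12 x 3 + 12 x 2 = 60 fields.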
The video encoder utilizes a compression algorithm to generate an MPEG-2 compliant bit stream from the input sequence of frames. (See ISO/IEC 13818-2.)
The MPEG-2 bit stream has six layers of syntax. These are the sequence layer (random access unit, context), the Group of Pictures layer (random access unit, video coding), the picture layer (primary coding layer), the slice layer (resynchronization unit), the macroblock layer (motion compensation unit) and the block layer (DCT unit). A group of pictures (GOP) is a set of frames which starts with an I-frame and includes a certain number of P and B frames. The number of frames in a GOP may be fixed or may be variable. Each frame is divided into macroblocks. Illustratively, a macroblock comprises four luminance blocks and two chrominance blocks. Each block is 8×8 pixels.
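The division of a macroblock into luminance blocks may be sketched as follows. The sketch assumes that the luminance part of a macroblock is a 16×16 array held as a list of rows, which is consistent with four 8×8 luminance blocks; the function name `luminance_blocks` is hypothetical.

```python
def luminance_blocks(mb):
    """Split a 16x16 luminance array into four 8x8 blocks.

    Blocks are returned in the order top-left, top-right,
    bottom-left, bottom-right.
    """
    return [[row[x:x + 8] for row in mb[y:y + 8]]
            for y in (0, 8) for x in (0, 8)]
```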
The encoder distinguishes between three kinds of frames (or pictures): I, P, and B. Typically, the coding of I frames results in the most bits. In an I-frame, each macroblock is coded as follows. Each 8×8 block of pixels in a macroblock undergoes a discrete cosine transform (DCT) to form an 8×8 array of transform coefficients. The transform coefficients are then quantized with a variable quantizer matrix. Quantization involves dividing each DCT coefficient F[v][u] by a quantizer step size. The quantizer step size for each AC DCT coefficient is determined by the product of a weighting matrix element W[v][u] and a quantization scale factor (also known as mquant). As is explained below, in some cases the quantization scale factor Q_n for a macroblock n is a product of a rate control quantization scale factor Q_n^R and a masking activity quantization scale factor QS_n. However, this factorization of the quantization scale factor Q_n is optional. The use of a quantization scale factor permits the quantization step size for each AC DCT coefficient to be modified at the cost of only a few bits. The quantization scale factor is selected for each macroblock.
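The quantization of the AC coefficients may be sketched as follows. This is a simplified illustration rather than the arithmetic of ISO/IEC 13818-2: the step size is taken to be exactly the product of the weighting matrix element and the quantization scale factor with no normalization constant, the quotient is truncated toward zero, and the DC coefficient is left untouched. All three choices, and the function names, are assumptions made for the sketch.

```python
def quantizer_scale(q_rate, q_masking=1.0):
    """Optional factorization: rate control factor times masking activity factor."""
    return q_rate * q_masking


def quantize_ac(coeffs, weights, q_scale):
    """Quantize the AC coefficients of an 8x8 block of DCT coefficients."""
    out = [row[:] for row in coeffs]
    for v in range(8):
        for u in range(8):
            if v == 0 and u == 0:
                continue  # assumption: the DC coefficient is handled elsewhere
            step = weights[v][u] * q_scale
            out[v][u] = int(coeffs[v][u] / step)  # assumption: truncate toward zero
    return out
```

With a weighting element of 16, a coefficient of 100 quantizes to 3 at a scale factor of 2 and to 1 at a scale factor of 4, which illustrates how a single per-macroblock value changes the step size of every AC coefficient.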
The resulting quantized DCT coefficients are scanned (e.g., using zig-zag scanning) to form a sequence of DCT coefficients. The DCT coefficients are then organized into run-level pairs. The run-level pairs are then encoded using a variable length code (VLC). In an I-frame, each macroblock is encoded according to this technique.
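The scanning and run-level steps described above may be sketched as follows. The sketch is illustrative: the function names are hypothetical, the whole block (including the first coefficient) is passed through the run-level step, and trailing zeros are simply dropped on the assumption that an end-of-block code would follow. The variable length coding itself is not shown.

```python
def zigzag_order(n=8):
    """Return (row, column) positions of an n x n block in zig-zag order."""
    def key(pos):
        v, u = pos
        d = v + u  # index of the anti-diagonal
        # Odd diagonals run top-right to bottom-left, even ones the reverse.
        return (d, v if d % 2 else -v)
    return sorted(((v, u) for v in range(n) for u in range(n)), key=key)


def zigzag_scan(block):
    """Flatten a square block into a sequence using zig-zag order."""
    return [block[v][u] for v, u in zigzag_order(len(block))]


def run_level_pairs(scanned):
    """Convert a scanned sequence into (run of zeros, nonzero level) pairs."""
    pairs, run = [], 0
    for c in scanned:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs  # trailing zeros are implied by an end-of-block code
```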
In a P-frame, a decision is made to code each macroblock as an I macroblock, which macroblock is then encoded according to the technique described above, or to code the macroblock as a P macroblock. For each P macroblock, a prediction of the macroblock is obtained from a previous video frame. The prediction is identified by a motion vector which indicates the translation between the macroblock to be coded in the current frame and its prediction in the previous frame. (A variety of block matching algorithms can be used to find the particular macroblock in the previous frame which is the best match with the macroblock to be coded in the current frame. This "best match" macroblock becomes the prediction for the current macroblock.) The predictive error between the predictive macroblock and the current macroblock is then coded using the DCT, quantization, zig-zag scanning, run-level pair encoding, and VLC encoding.
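One possible block matching algorithm is an exhaustive search that minimizes the sum of absolute differences (SAD). The text does not say which algorithm is used, so the following is only a sketch of one common choice. Frames are assumed to be lists of rows of pixel values, and the function names and the `search` range parameter are hypothetical.

```python
def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences between a block and a displaced reference block."""
    return sum(abs(cur[top + j][left + i] - ref[top + dy + j][left + dx + i])
               for j in range(size) for i in range(size))


def best_match(cur, ref, top, left, size=16, search=7):
    """Exhaustive search; returns ((dy, dx), cost) of the best matching block."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall partly outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(cur, ref, top, left, dy, dx, size)
            if best is None or cost < best[1]:
                best = ((dy, dx), cost)
    return best
```

The returned displacement plays the role of the motion vector, and the pixel-wise difference between the current block and the matched block is the predictive error that is passed to the DCT.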
In the coding of a B-frame, a decision has to be made as to the coding of each macroblock. The choices are (a) intracoding (as in an I macroblock), (b) unidirectional forward predictive coding using a previous frame to obtain a motion compensated prediction, (c) unidirectional backward predictive coding using a subsequent frame to obtain a motion compensated prediction, and (d) bidirectional predictive coding, wherein a motion compensated prediction is obtained by interpolating a backward motion compensated prediction and a forward motion compensated prediction. In the cases of forward, backward, and bidirectional motion compensated prediction, the predictive error is encoded using DCT, quantization, zig-zag scanning, run-level pair encoding and VLC encoding.
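The interpolation used in bidirectional predictive coding may be sketched as a pixel-wise average of the two predictions. The rounding rule shown (add one, then halve) and the function names are assumptions made for illustration; blocks are assumed to be lists of rows of non-negative pixel values.

```python
def interpolate(forward, backward):
    """Average a forward and a backward prediction, rounding halves upward."""
    return [[(f + b + 1) // 2 for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(forward, backward)]


def residual(current, prediction):
    """Predictive error that is subsequently transformed and quantized."""
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(current, prediction)]
```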
The P frame may be predicted from an I frame or another P frame. The B frame may also be predicted from an I frame or a P frame. No predictions are made from B frames.
B frames have the smallest number of bits when encoded, then P frames, with I frames having the most bits when encoded. Thus, the greatest degree of compression is achieved for B frames. For each of the I, B, and P frames, the number of bits resulting from the encoding process can be controlled by controlling the quantizer step size (adaptive quantization) used to code each macroblock. A macroblock of pixels or pixel errors which is coded using a large quantizer step size results in fewer bits than if a smaller quantizer step size is used.
After encoding by the video encoder, the bit stream is stored in an encoder output buffer. Then, the encoded bits are transmitted via a channel to a decoder, where the encoded bits are received in a buffer of the decoder, or the encoded bits may be stored in a storage medium.
The order of the frames in the encoded bit stream is the order in which the frames are decoded by the decoder. This may be different from the order in which the frames arrived at the encoder. The reason for this is that the coded bit stream contains B frames. In particular, it is necessary to code the I and P frames used to anchor a B frame before coding the B frame itself.
Consider the following sequence of frames received at the input of a video encoder and the indicated coding type (I, P or B) to be used to code each frame:
______________________________________
 1  2  3  4  5  6  7  8  9 10 11 12 13
 I  B  B  P  B  B  P  B  B  I  B  B  P
______________________________________
For this example there are two B-frames between successive coded P-frames and also two B-frames between successive coded I- and P-frames. Frame "1I" is used to form a prediction for frame "4P", and frames "1I" and "4P" are both used to form predictions for frames "2B" and "3B". Therefore, the order of coded frames in the coded sequence shall be "1I", "4P", "2B", "3B". Thus, at the encoder output, in the coded bit stream, and at the decoder input, the frames are reordered as follows:
______________________________________
 1  4  2  3  7  5  6 10  8  9 13 11 12
 I  P  B  B  P  B  B  I  B  B  P  B  B
______________________________________
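The reordering from input order to coded order may be sketched as follows: each B frame is held back until the next I or P frame, which anchors it, has been emitted. The function name `coding_order` is hypothetical, and flushing any B frames left over at the end of the input is an assumption made so that the sketch terminates cleanly.

```python
def coding_order(display):
    """Reorder (number, type) frames from input order to coded order."""
    out, pending_b = [], []
    for number, kind in display:
        if kind == "B":
            pending_b.append((number, kind))  # wait for the next anchor frame
        else:
            out.append((number, kind))        # the I or P anchor goes first
            out.extend(pending_b)             # then the B frames it anchors
            pending_b = []
    out.extend(pending_b)  # assumption: flush trailing B frames at end of input
    return out


frames = list(zip(range(1, 14), "IBBPBBPBBIBBP"))
print([number for number, _ in coding_order(frames)])
```

For the thirteen-frame example, the printed frame numbers follow the coded order 1, 4, 2, 3, 7, 5, 6, 10, 8, 9, 13, 11, 12.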
In the case of interlaced video the following applies. Each frame of interlaced video consists of two fields. The MPEG-2 specification allows the frame to be encoded as a frame picture or the two fields to be encoded as two field pictures. Frame encoding or field encoding can be adaptively selected on a frame-by-frame basis. Frame encoding is typically preferred when the video scene contains significant detail with limited motion. Field encoding, in which the second field can be predicted from the first, works better when there is fast movement.
For field prediction, predictions are made independently for the macroblocks of each field by using data from one or more previous fields (P field) or previous and subsequent fields (B field). For frame prediction, predictions are made for the macroblocks in a frame from a previous frame (P frame) or from a previous and subsequent frame (B frame). Within a field picture, all predictions are field predictions. However, in a frame picture either field prediction or frame prediction may be selected on a macroblock by macroblock basis.
An important aspect of any video encoder is rate control. The purpose of rate control is to maximize the perceived quality of the encoded video when it is decoded at a decoder by intelligently allocating the number of bits used to encode each frame and each macroblock within a frame. Note that the encoder may be a constant bit rate (CBR) encoder or a variable bit rate (VBR) encoder. In the case of a constant bit rate encoder, the sequence of bit allocations to successive frames ensures that an assigned channel bit rate is maintained and that decoder buffer exceptions (overflow or underflow of the decoder buffer) are avoided. In the case of a VBR encoder, the constraints are reduced. It may only be necessary to ensure that a maximum channel rate is not exceeded so as to avoid decoder buffer underflow.
In order to prevent a decoder buffer exception, the encoder maintains a model of the decoder buffer. This model maintained by the encoder is known as the video buffer verifier (VBV) buffer. The VBV buffer models the decoder buffer occupancy. Depending on the VBV occupancy level, the number of bits which may be budgeted for a particular frame may be increased or decreased to avoid a decoder buffer exception.
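How a frame budget might be adjusted against the VBV occupancy can be sketched with a simplified constant bit rate model. The model is an assumption made for illustration: the buffer gains a fixed number of bits per frame interval (bit rate divided by frame rate) and loses all the bits of a frame at the instant that frame is decoded. The function and parameter names are hypothetical.

```python
def clamp_frame_budget(target_bits, fullness, fill_per_frame, vbv_size):
    """Limit a frame's bit budget so the modeled buffer stays within bounds."""
    # Underflow: a frame cannot use more bits than the buffer currently holds.
    upper = fullness
    # Overflow: enough bits must be removed that the next fill still fits.
    lower = max(0, fullness + fill_per_frame - vbv_size)
    return min(max(target_bits, lower), upper)


def vbv_after_frame(fullness, frame_bits, fill_per_frame):
    """Occupancy after removing one frame and receiving one frame interval of bits."""
    return fullness - frame_bits + fill_per_frame
```

In this sketch a budget that would drain the buffer is reduced to the current occupancy, and a budget that would let the buffer overflow is raised to the minimum number of bits that must be consumed.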
It is an object of the present invention to provide a rate control technique for an MPEG-2 compliant encoder.
Specifically, it is an object of the invention to provide a rate control technique for a constant bit rate, real time MPEG-2 compliant encoder.
It is also an object of the invention to provide a rate control technique for a variable bit rate, non-real time MPEG-2 compliant encoder.