Video encoder optimization for bit rate reduction of the compressed bitstreams and high visual quality preservation of the decoded video sequences encompasses solutions such as rate-distortion optimized mode decisions and parameter selections, frame type selections, background modeling, quantization modeling, perceptual modeling, analysis-based encoder control and rate control.
Generally, many video coding algorithms first partition each frame or video object plane (herein, “picture”) into small subsets of pixels, called “pixelblocks” herein. Then each pixelblock is coded using some form of predictive coding method such as motion compensation. Some video coding standards, e.g., ISO MPEG or ITU H.264, use different types of predicted pixelblocks in their coding. In one scenario, a pixelblock may be one of three types: Intra (I) pixelblock that uses no information from other pictures in its coding, Unidirectionally Predicted (P) pixelblock that uses information from one preceding picture, and Bidirectionally Predicted (B) pixelblock that uses information from one preceding picture and one future picture.
Consider the case where all pixelblocks within a given picture are coded according to the same type. Thus, the sequence of pictures to be coded might be represented as
I1 B2 B3 B4 P5 B6 B7 B8 B9 P10 B11 P12 B13 I14 . . .
This is shown graphically in FIG. 5(a), where designations I, P, B indicate the picture type and the number indicates the camera or display order in the sequence. In this scenario, picture I1 uses no information from other pictures in its coding. P5 uses information from I1 in its coding. B2, B3, B4 all use information from both I1 and P5 in their coding.
Since B pictures use information from future pictures, the transmission order is usually different than the display order. For the above sequence, transmission order might occur as follows:
I1 P5 B2 B3 B4 P10 B6 B7 B8 B9 P12 B11 I14 B13 . . .
This is shown graphically in FIG. 5(b).
Thus, when it comes time to decode B2, for example, the decoder will have already received and stored the information in I1 and P5 necessary to decode B2, and similarly for B3 and B4. The receiver then reorders the sequence for proper display. In this operation, I and P pictures are often referred to as “stored pictures.”
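The reordering described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each picture is named by its type letter followed by its display index and that every B picture depends only on the nearest stored pictures on either side; the function name `transmission_order` is hypothetical.

```python
def transmission_order(display_order):
    """Reorder pictures so each B picture follows both stored pictures it uses."""
    out = []
    pending_b = []
    for pic in display_order:
        if pic[0] == "B":
            pending_b.append(pic)   # hold until the next stored picture is sent
        else:
            out.append(pic)         # the I or P picture is sent first
            out.extend(pending_b)   # then the B pictures that precede it in display order
            pending_b.clear()
    out.extend(pending_b)           # trailing B pictures with no later stored picture
    return out


display = "I1 B2 B3 B4 P5 B6 B7 B8 B9 P10 B11 P12 B13 I14".split()
print(" ".join(transmission_order(display)))
# prints: I1 P5 B2 B3 B4 P10 B6 B7 B8 B9 P12 B11 I14 B13
```

Applied to the display-order sequence of FIG. 5(a), the sketch reproduces the transmission order of FIG. 5(b).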
The coding of the P pictures typically utilizes Motion Compensation predictive coding, wherein a Motion Vector is computed for each pixelblock in the picture. Using the motion vector, a prediction pixelblock can be formed by translation of pixels in the aforementioned previous picture. The difference between the actual pixelblock in the P picture and the prediction pixelblock (the residual) is then coded for transmission.
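As a minimal sketch of this step, the following Python forms a prediction pixelblock by translating pixels of a previous picture and then computes the residual. It assumes integer-pixel motion vectors given as (dy, dx) and performs no boundary handling; the names `predict_block` and `residual` are illustrative, not taken from any standard.

```python
def predict_block(ref, top, left, mv, size):
    """Copy a size x size block from ref, displaced from (top, left) by mv = (dy, dx)."""
    dy, dx = mv
    return [[ref[top + dy + r][left + dx + c] for c in range(size)]
            for r in range(size)]


def residual(actual, prediction):
    """Pixel-wise difference between the actual block and its prediction."""
    return [[a - p for a, p in zip(row_a, row_p)]
            for row_a, row_p in zip(actual, prediction)]


# Toy 4x4 previous picture with distinct pixel values.
ref = [[r * 4 + c for c in range(4)] for r in range(4)]

# The 2x2 block at (top=1, left=1) of the current picture shows content that
# sat one pixel to the left in the previous picture.
cur_block = [[ref[1][0], ref[1][1]],
             [ref[2][0], ref[2][1]]]

pred = predict_block(ref, top=1, left=1, mv=(0, -1), size=2)
res = residual(cur_block, pred)
print(pred)  # [[4, 5], [8, 9]]
print(res)   # [[0, 0], [0, 0]]
```

When the motion vector matches the true displacement, as in this toy case, the residual is all zeros and costs very few bits to code.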
Each motion vector may also be transmitted via predictive coding. That is, a prediction is formed using nearby motion vectors that have already been sent, and then the difference between the actual motion vector and the prediction is coded for transmission. Each B pixelblock typically uses two motion vectors, one for the aforementioned previous picture and one for the future picture. From these motion vectors, two prediction pixelblocks are computed, which are then averaged together to form the final prediction. As above, the difference between the actual pixelblock in the B picture and the prediction pixelblock is then coded for transmission.
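The averaging of the two prediction pixelblocks might look as follows. This is a sketch only: the round-half-up integer average is an assumption made for illustration, and the function name `average_predictions` is hypothetical.

```python
def average_predictions(pred_prev, pred_next):
    """Average two prediction blocks pixel by pixel (round-half-up, an assumption)."""
    return [[(a + b + 1) // 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(pred_prev, pred_next)]


pred_prev = [[10, 20], [30, 40]]   # prediction from the previous picture
pred_next = [[20, 21], [30, 44]]   # prediction from the future picture
final = average_predictions(pred_prev, pred_next)
print(final)  # [[15, 21], [30, 42]]
```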
As with P pixelblocks, each motion vector of a B pixelblock may be transmitted via predictive coding. That is, a prediction is formed using nearby motion vectors that have already been transmitted, and then the difference between the actual motion vector and the prediction is coded for transmission.
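A short sketch of predictive motion vector coding follows. The text does not say how the prediction is formed from nearby motion vectors; the component-wise median of three neighbors used here is an assumption chosen for illustration, and all function names are hypothetical.

```python
def median_predictor(neighbors):
    """Component-wise median of three already-transmitted neighboring motion vectors."""
    xs = sorted(v[0] for v in neighbors)
    ys = sorted(v[1] for v in neighbors)
    return (xs[1], ys[1])


def encode_mv(mv, neighbors):
    """Return the difference between the actual motion vector and its prediction."""
    px, py = median_predictor(neighbors)
    return (mv[0] - px, mv[1] - py)


def decode_mv(diff, neighbors):
    """Rebuild the motion vector from the received difference and the same prediction."""
    px, py = median_predictor(neighbors)
    return (diff[0] + px, diff[1] + py)


neighbors = [(4, 0), (5, -1), (3, 0)]   # vectors the decoder already holds
diff = encode_mv((5, 0), neighbors)
print(diff)                        # (1, 0)
print(decode_mv(diff, neighbors))  # (5, 0)
```

Because the coder and decoder form the same prediction from the same neighbors, only the (usually small) difference needs to be sent.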
However, with B pixelblocks the opportunity exists for interpolating the motion vectors from those in the co-located or nearby pixelblocks of the stored pictures. The interpolated value may then be used as a prediction and the difference between the actual motion vector and the prediction coded for transmission. Such interpolation is carried out both at the coder and decoder.
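One common way to carry out such interpolation is to scale the motion vector of the co-located pixelblock by the ratio of temporal distances. The sketch below assumes that approach; the parameter names `tb` (distance from the previous stored picture to the B picture) and `td` (distance between the two stored pictures) are illustrative, and the floor division stands in for whatever rounding rule a given codec specifies.

```python
def interpolate_mvs(mv_col, tb, td):
    """Derive forward and backward vectors for a B pixelblock from a co-located vector.

    mv_col: motion vector of the co-located block in the future stored picture,
            pointing to the previous stored picture.
    tb:     temporal distance from the previous stored picture to the B picture.
    td:     temporal distance between the two stored pictures.
    """
    fwd = (mv_col[0] * tb // td, mv_col[1] * tb // td)   # toward the previous picture
    bwd = (fwd[0] - mv_col[0], fwd[1] - mv_col[1])       # toward the future picture
    return fwd, bwd


# B2 lies between I1 and P5: tb = 1, td = 4.
fwd, bwd = interpolate_mvs(mv_col=(8, -4), tb=1, td=4)
print(fwd)  # (2, -1)
print(bwd)  # (-6, 3)
```

Both coder and decoder can run the same computation, so the interpolated vectors need not be transmitted.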
In some cases, the interpolated motion vector is good enough to be used without any correction, in which case no differential motion vector data need be sent for these B pixelblocks. This is referred to as Direct Mode in H.263 and H.264. It works particularly well when the camera is slowly panning across a stationary background.
Within each picture, the pixelblocks may also be coded in many ways. For example, a pixelblock may be divided into smaller sub-blocks, with motion vectors computed and transmitted for each sub-block. The shape of the sub-blocks may vary and need not be square.
Within a P or B picture, some pixelblocks may be better coded without using motion compensation, i.e., they would be coded as Intra (I) pixelblocks. Within a B picture, some pixelblocks may be better coded using unidirectional motion compensation, i.e., they would be coded as forward predicted or backward predicted depending on whether a previous picture or a future picture is used in the prediction.
Prior to transmission, the prediction error of a pixelblock or sub-block is typically transformed by an orthogonal transform such as the Discrete Cosine Transform or an approximation thereto. The result of the transform operation is a set of transform coefficients equal in number to the number of pixels in the pixelblock or sub-block being transformed. At the receiver/decoder, the received transform coefficients are inverse transformed to recover the prediction error values to be used further in the decoding.
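The transform and its inverse can be illustrated with a one-dimensional orthonormal DCT. This is a sketch for intuition only: practical coders apply the transform in two dimensions (rows, then columns) and often use integer approximations, and the function names `dct` and `idct` are illustrative.

```python
import math


def dct(x):
    """Orthonormal DCT-II: n input samples give n transform coefficients."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out


def idct(c):
    """Inverse of dct(): recovers the n samples from the n coefficients."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1 / n)
        s += sum(c[k] * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out


errors = [3.0, 4.0, 4.0, 5.0, 5.0, 4.0, 3.0, 3.0]   # toy prediction-error row
coeffs = dct(errors)
recovered = idct(coeffs)
print(len(coeffs) == len(errors))   # True
```

The number of coefficients equals the number of input samples, and the inverse transform recovers the prediction errors up to floating-point precision.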
Not all the transform coefficients need be transmitted for acceptable video quality. Depending on the available transmission bit rate, more than half, and sometimes much more than half, of the transform coefficients may be deleted and not transmitted. At the decoder, their values are replaced by zeros prior to the inverse transform.
Also, prior to transmission, the transform coefficients are typically quantized and entropy coded. Quantization involves representation of the transform coefficient values by a finite subset of possible values, which reduces the accuracy of transmission and often forces small values to zero, further reducing the number of coefficients that are sent. Typically, quantization divides each transform coefficient by a quantizer step size Q and rounds the result to the nearest integer. For example, the transform coefficient C would be quantized to the value Cq according to:
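$$C_q = \left\lfloor \frac{C + Q/2}{Q} \right\rfloor$$
Here the brackets denote truncation to an integer, so that adding Q/2 before dividing by Q rounds a nonnegative C/Q to the nearest integer. The resulting integers are then entropy coded using variable word-length codes such as Huffman codes or arithmetic codes.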
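A minimal Python sketch of this quantization rule follows, assuming integer coefficients and an integer step size. The quantization formula covers nonnegative coefficients; applying it to the magnitude and restoring the sign afterward is an assumption made here, as is the simple reconstruction in `dequantize`.

```python
def quantize(c, q):
    """Divide by step size q and round to the nearest integer (sign handled separately)."""
    sign = -1 if c < 0 else 1
    return sign * ((abs(c) + q // 2) // q)


def dequantize(cq, q):
    """Decoder-side reconstruction: scale the integer level back by the step size."""
    return cq * q


coeffs = [100, -37, 12, 3, -2, 0]
levels = [quantize(c, 8) for c in coeffs]
print(levels)                               # [13, -5, 2, 0, 0, 0]
print([dequantize(v, 8) for v in levels])   # [104, -40, 16, 0, 0, 0]
```

The small coefficients 3 and -2 are forced to zero, which illustrates how quantization reduces the number of coefficients that must be sent.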
The sub-block size and shape used for motion compensation may not be the same as the sub-block size and shape used for the transform. For example, 16×16, 16×8, 8×16 pixels or smaller sizes are commonly used for motion compensation whereas 8×8 or 4×4 pixels are commonly used for transforms. Indeed the motion compensation and transform sub-block sizes and shapes may vary from pixelblock to pixelblock.
A video encoder must decide what is the best way amongst all of the possible methods (or modes) to code each pixelblock. This is known as the mode selection problem. Depending on the pixelblock size and shape, there exist several modes for intra and inter cases, respectively.
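One widely used way to frame mode selection is to minimize a Lagrangian rate-distortion cost J = D + λR over the candidate modes. The sketch below assumes that formulation; the mode names and the distortion and rate figures are made-up illustrative values, not measurements.

```python
def select_mode(candidates, lam):
    """Pick the mode with the lowest cost J = D + lam * R.

    candidates maps a mode name to (distortion, rate_in_bits).
    """
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])


# Hypothetical (distortion, bits) pairs for one pixelblock.
candidates = {
    "intra":       (120.0, 90),
    "inter_16x16": (150.0, 30),
    "inter_8x8":   (100.0, 70),
}

print(select_mode(candidates, lam=1.0))  # inter_8x8   (costs 210, 180, 170)
print(select_mode(candidates, lam=2.0))  # inter_16x16 (costs 300, 210, 240)
```

A larger λ weights bits more heavily, so the choice shifts toward the cheaper mode even though its distortion is higher.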
A video encoder must also decide how many B pictures, if any, are to be coded between each I or P picture. This is known as the frame type selection problem, for which ad hoc solutions have typically been used. Typically, if the motion in the scene is very irregular or if there are frequent scene changes, then very few, if any, B pictures should be coded. On the other hand, if there are long periods of slow motion or camera pans, then coding many B pictures will result in a significantly lower overall bit rate. Moreover, a larger number of coded B frames makes temporal/computational scalability possible at the decoder without greatly affecting either the visual quality of the decoded sequence or the computational complexity of the decoder. Consequently, platforms and systems with various CPU and memory capabilities can make use of streams coded using numerous B frames.
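The rule of thumb in the preceding paragraph can be expressed as a toy heuristic. The sketch below is purely hypothetical: the per-picture motion scores, the threshold, and the cap on consecutive B pictures are invented for illustration and do not represent any particular encoder's method.

```python
def choose_num_b(motion_scores, scene_change, max_b=4, threshold=4.0):
    """Toy rule: allow B pictures only while motion stays slow and regular.

    motion_scores: hypothetical per-picture motion activity for upcoming pictures.
    scene_change:  True if a scene change is detected in the upcoming pictures.
    """
    if scene_change:
        return 0                    # frequent scene changes: code no B pictures
    count = 0
    for score in motion_scores[:max_b]:
        if score > threshold:       # motion became too large or irregular
            break
        count += 1
    return count


print(choose_num_b([1.0, 2.0, 6.0, 1.0], scene_change=False))  # 2
print(choose_num_b([0.5] * 10, scene_change=False))            # 4
print(choose_num_b([0.5] * 10, scene_change=True))             # 0
```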
Modern encoders typically fix the number of B frames between each I or P picture at one or two. This predetermined and somewhat arbitrary choice is motivated by experimental work showing that, for most video sequences, it reduces the bit rate without adversely affecting the visual quality of the decoded sequences. The opportunity exists, however, to reduce the bit rate much further for sequences that exhibit slow motion or camera pans by increasing the number of B frames. It is believed that current coding systems do not take advantage of this opportunity, due to (a) the difficulty of the I/P/B decision and (b) the increase in the encoder's computational complexity that implementing the frame type decision would entail. Indeed, the appropriate number of B frames not only depends on both the temporal and spatial characteristics of the sequence but may also vary across the sequence, because the motion characteristics often change and different numbers of B frames are typically required for different parts of the sequence. Accordingly, there is a need in the art for a computationally inexpensive coding assignment scheme that dynamically assigns a number of B pictures to occur between reference pictures (I- and P-pictures) based on picture content.