The present invention relates to the field of encoder control. More specifically, the present invention relates to frame encoding decisions for a video sequence to achieve a minimum sequence cost within a sequence resource budget.
Video compression takes advantage of spatial and temporal redundancies in video data, with the intention of reducing the size of the video data while maintaining the quality as much as possible. Such compression is referred to as lossy (as opposed to lossless), since the original data cannot be recovered exactly. Most modern codecs take advantage of correlation among video frames and transmit the differences between a current frame and a predicted frame, compactly represented by prediction parameters. The predicted frame is usually similar to the original one. The residual information, i.e., the prediction error, is transmitted together with the prediction parameters. The error signal is usually much smaller than the original signal and can be compactly represented using a spatial transform, such as the discrete cosine transform (DCT), with subsequent quantization of the transform coefficients. The quantized coefficients and the prediction parameters are entropy-encoded to further reduce their redundancy.
Inter-prediction exploits temporal redundancy by using temporal prediction. Due to the high level of similarity among consecutive frames, temporal prediction can largely reduce the information required to represent the video data. The efficiency of temporal prediction can be further improved by taking into account the object motion in the video sequence. The motion parameters associated with temporal prediction are transmitted to the decoder for reconstructing the temporal prediction. This type of prediction is used in MPEG-like codecs. The process of producing inter-prediction is called motion estimation. Typically, it is performed in a block-wise manner, and many modern codecs support motion estimation with blocks of adaptive size.
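The block-wise motion estimation described above can be illustrated with a minimal full-search sketch that minimizes the sum of absolute differences (SAD) between the current block and candidate blocks in the reference frame. This is an illustrative example only (function names and the small search range are assumptions); practical encoders use much faster hierarchical or predictive searches.

```python
import numpy as np

def motion_estimate(cur_block, ref_frame, x, y, search_range=4):
    """Full-search block motion estimation using SAD as the matching
    criterion. (x, y) is the block's position in the current frame;
    the returned motion vector (dy, dx) points into the reference frame."""
    bh, bw = cur_block.shape
    best_sad, best_mv = float("inf"), (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = y + dy, x + dx
            # Skip candidates that fall outside the reference frame.
            if ry < 0 or rx < 0 or ry + bh > ref_frame.shape[0] or rx + bw > ref_frame.shape[1]:
                continue
            cand = ref_frame[ry:ry + bh, rx:rx + bw]
            sad = np.abs(cur_block.astype(int) - cand.astype(int)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

The motion vector with the smallest SAD is what would be transmitted to the decoder as a prediction parameter.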
On the other hand, intra-prediction takes advantage of spatial redundancy and predicts portions of a frame (blocks) from neighboring blocks within the same frame. Such prediction is usually aware of spatial structures that may occur in a frame, namely smooth regions and edges. In general, a larger block size is more coding-efficient for smooth areas, and a smaller block size is more coding-efficient for areas with more texture variation. In the latter case, the prediction based on neighboring blocks can improve coding efficiency, and a directional prediction can further improve the efficiency. Such intra-prediction is used in the recent H.264 Advanced Video Coding (AVC) standard.
The actual reduction in the amount of transmitted information is performed in the transmission of the residual. The residual frame is divided into blocks, each of which undergoes a DCT-like transform. The transform coefficients undergo quantization, usually performed by scaling and rounding. Quantization allows representing the coefficients with less precision, thus reducing the amount of information required. The quantized transform coefficients are transmitted using entropy coding. This type of coding exploits the statistical characteristics of the underlying data.
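The scaling-and-rounding quantization of transform coefficients can be sketched as follows. This is a minimal illustration with a single uniform quantization step (actual codecs apply per-frequency scaling matrices and more elaborate rounding offsets):

```python
import numpy as np

def quantize(coeffs, qstep):
    # Scaling and rounding: divide by the step and round to the nearest integer.
    return np.round(coeffs / qstep).astype(int)

def dequantize(levels, qstep):
    # Decoder-side reconstruction: multiply the integer levels back by the step.
    return levels * qstep
```

The integer levels carry less precision than the original coefficients: after dequantization, each reconstructed coefficient differs from the original by at most half the quantization step, which is the distortion traded for a smaller bit count.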
Color video is often represented in RGB color coordinates. However, in most video coding systems, the encoder uses the YCbCr color space because it is a more compact representation. Y is the luminance component (luma), and Cb and Cr are the chrominance components (chroma) of the color video. The chroma is typically down-sampled to half the frame size in each direction because human eyes are less sensitive to the chroma signals; this format is referred to as the 4:2:0 format.
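The conversion to YCbCr with 4:2:0 chroma down-sampling can be sketched as below. The conversion coefficients assumed here are the common full-range BT.601 set, and the 2x2 averaging is one simple choice of down-sampling filter; both are illustrative assumptions rather than a mandate of any particular standard.

```python
import numpy as np

def rgb_to_ycbcr420(rgb):
    """Convert an RGB frame to Y, Cb, Cr (BT.601 full-range coefficients)
    and down-sample each chroma plane by 2 in each direction (4:2:0)."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    # 4:2:0 down-sampling: average each 2x2 chroma neighborhood.
    cb420 = cb.reshape(cb.shape[0] // 2, 2, cb.shape[1] // 2, 2).mean(axis=(1, 3))
    cr420 = cr.reshape(cr.shape[0] // 2, 2, cr.shape[1] // 2, 2).mean(axis=(1, 3))
    return y, cb420, cr420
```

After this step, each chroma plane holds a quarter of the luma samples, which is where the compactness of the 4:2:0 representation comes from.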
The performance of a lossy video codec is measured as the tradeoff between the amount of bits required to describe the data and the distortion introduced by the compression, referred to as the rate-distortion (RD) curve. As the distortion criterion, the mean squared error (MSE) is usually used. The MSE is often converted into logarithmic units and represented as the peak signal-to-noise ratio (PSNR),
p = 10·log10(ymax^2/d)          (1)

where d is the MSE and ymax is the maximum allowed value of the luma pixels, typically 255 if the luma data has an 8-bit precision and is represented in the range 0, . . . , 255.
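Equation (1) translates directly into code; the small function below computes the MSE between an original and a reconstructed frame and converts it to PSNR in dB (it assumes the frames differ, i.e., d > 0):

```python
import numpy as np

def psnr(orig, recon, ymax=255.0):
    """PSNR in dB per Eq. (1): p = 10 * log10(ymax^2 / d),
    where d is the mean squared error between the two frames."""
    d = np.mean((orig.astype(float) - recon.astype(float)) ** 2)
    return 10.0 * np.log10(ymax ** 2 / d)
```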
The H.264 AVC is one of the most recent standards in video compression, offering significantly better compression rate and quality compared to the previous MPEG-2 and MPEG-4 standards and targeted to high definition (HD) content. For example, H.264 delivers the same quality as MPEG-2 at a third to half the data rate.
The encoding process can be briefly described as follows: a frame undergoing encoding is divided into non-overlapping macroblocks, each containing 16×16 luma pixels and 8×8 chroma pixels. Within each frame, macroblocks are arranged into slices, where a slice is a continuous raster scan of macroblocks. Each slice can be encoded independently of the others. The main slice types are P and I. An I slice may contain only I macroblocks; a P slice may contain P or I macroblocks. The macroblock type determines the way it is predicted. P refers to inter-predicted macroblocks; such macroblocks are subdivided into smaller blocks. I refers to intra-predicted macroblocks; such macroblocks are divided into 4×4 blocks (the luma component is divided into 16 blocks; each chroma component is divided into 4 blocks). In I mode (intra-prediction), the prediction macroblock is formed from pixels in the neighboring blocks in the current slice that have been previously encoded, decoded and reconstructed, prior to applying the in-loop deblocking filter. The reconstructed macroblock is formed by imitating the decoder operation in the encoder loop. In P mode (inter-prediction), the prediction macroblock is formed by motion compensation from reference frames. The prediction macroblock is subtracted from the current macroblock. The error undergoes transform, quantization and entropy coding. According to the length of the entropy code, the best prediction mode is selected (i.e., the choice between an I and a P macroblock, the motion vectors in the case of a P macroblock and the prediction mode in the case of an I macroblock). The encoded residual for the macroblock in the selected best mode is sent to the bitstream.
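The "length of the entropy code" criterion above amounts to picking the candidate mode whose encoded residual plus side information is shortest. The sketch below illustrates only that final selection step; the mode names and bit lengths are hypothetical precomputed values, not output of an actual encoder.

```python
def select_mode(candidates):
    """Pick the prediction mode with the shortest entropy-coded
    representation. `candidates` maps a mode name to the bit length
    of its encoded residual plus side information."""
    return min(candidates, key=candidates.get)

# Hypothetical bit counts for one macroblock's candidate modes.
modes = {"I4x4": 412, "P16x16": 198, "P8x8": 231}
best = select_mode(modes)  # "P16x16" - the cheapest candidate
```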
A special type of frame referred to as instantaneous data refresh (IDR) is used as a synchronization mechanism, in which the reference buffers are reset as if the decoder started "freshly" from the beginning of the sequence. An IDR frame is always an I-frame. The use of IDR allows, for example, starting the decoding of a bitstream not necessarily from its beginning. The set of P and I frames between two IDRs is called a group of pictures (GOP). A GOP always starts with an IDR frame. The maximum GOP size is limited by the standard.
The operating point on the RD curve is controlled by the quantization parameter, determining the “aggressiveness” of the residual quantization and the resulting distortion. In the H.264 standard, the quantization parameter is an integer in the range 0, . . . , 51, denoted here by q′. The quantization step doubles for every increment of 6 in q′. Sometimes, it is more convenient to use the quantization step rather than the quantization parameter, computed according to
q = 0.85 · 2^((q′ − 12)/6)          (2)
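Equation (2) can be computed directly, and the doubling of the step for every increment of 6 in q′ follows immediately from the exponent:

```python
def qstep(qp):
    """Quantization step per Eq. (2): q = 0.85 * 2^((q' - 12) / 6),
    where qp is the integer quantization parameter q' in 0..51."""
    return 0.85 * 2.0 ** ((qp - 12) / 6.0)
```

For example, qstep(12) gives 0.85, and qstep(18) gives twice that, matching the stated doubling behavior.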
In the following, q and q′ are used interchangeably.
Theoretically, optimal resource allocation for a video sequence requires encoding the sequence with every set of parameters and selecting the one achieving the best result. However, such an approach is impractical due to the very large number of possible combinations of parameters, which leads to a prohibitive computational complexity. Suboptimal resource allocation approaches usually try to model some typical behavior of the encoder as a function of the encoding parameters. If the model has an analytical expression that can be efficiently computed, the optimization problem can be practically solved using mathematical optimization. However, since the model only approximates the behavior of the encoder, the parameters selected using it may be suboptimal. The main difference between existing encoders is the decision process carried out by the bitrate controller that produces the encoder control parameters. Usually, the encoder parameters are selected to achieve the best tradeoff between the video quality and the bitrate of the produced stream. Controllers of this type are referred to as rate-distortion optimized (RDO) controllers.
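One common textbook form of such model-based RDO is a Lagrangian cost J = D + λ·R, minimized over the candidate parameter settings; this sketch uses that form with hypothetical model predictions for illustration, and is not presented as the specific model of the present invention.

```python
def rd_cost(distortion, rate, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * rate

# Hypothetical model predictions (D in MSE, R in bits) for three
# candidate quantization steps of one frame.
candidates = [
    {"q": 2.0, "D": 4.0,  "R": 900.0},
    {"q": 4.0, "D": 16.0, "R": 500.0},
    {"q": 8.0, "D": 64.0, "R": 260.0},
]
lam = 0.05
best = min(candidates, key=lambda c: rd_cost(c["D"], c["R"], lam))
```

Because the model, not an actual trial encoding, supplies D and R, the minimization is cheap; the price is that the selected parameters inherit the model's approximation error.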
Parameters controlled by the bitrate controller typically include, on the sequence level, the frame type and, if the frame is a P-frame, its reference frame or frames; and, on the frame level, the macroblock type and quantization parameter for each macroblock. For this reason, it is natural and common to distinguish between two levels of bitrate control: sequence-level and frame-level. The sequence-level controller is usually responsible for the frame type selection and allocation of the bit budget for the frame, and the frame-level controller is responsible for selection of the quantization parameter for each macroblock within the frame.
In a conventional coding system, the sequence-level control is very limited. The frame type of a frame in a sequence is usually based on its order in a GOP according to a pre-determined pattern. For example, the IBBPBBP . . . pattern is often used in the MPEG standards, where B is a bi-directionally predicted frame. The GOPs may be formed by partitioning the sequence into GOPs of fixed size. However, when a scene change is detected, it may trigger the start of a new GOP for better coding efficiency. The bit rate allocated to each frame is usually based on a rate control strategy. Such resource allocation for sequence encoding exercises only limited optimization and leaves room for improvement. Based on the discussion presented here, there is a need for optimal resource allocation for the video sequence as well as for the video frame. The current invention addresses the resource allocation for the video sequence.