The invention relates to the field of video compression, or more particularly, video encoder control and resource allocation to optimize video quality while maintaining the utilized resources under given constraints.
Video data can be thought of as a three-dimensional array of color pixels. Two dimensions serve as spatial (horizontal and vertical) directions, and the third dimension represents the time domain. A frame is a still picture corresponding to a single point in time. Video data often contains spatial and temporal redundancy. Spatial redundancy refers to the fact that small portions of a frame are usually similar to neighboring portions within the same frame. Temporal redundancy is expressed in two subsequent frames being similar up to displacement of objects due to motion.
Video compression takes advantage of these redundancies to reduce the size of the video data while maintaining the content without substantial visible loss. Such compression is referred to as lossy (as opposed to lossless), since the original data cannot be recovered exactly. Most modern codecs (coders/decoders) operate by constructing a prediction of the video frame from the redundant information. The prediction parameters are usually represented much more compactly than the actual pixels. The predicted frame is usually similar to the original one, yet differences may appear at some pixels. For this reason, the residual (prediction error) is transmitted together with the prediction parameters. The residual is also usually compactly represented using a spatial transform, such as the discrete cosine transform (DCT), with subsequent quantization of the transform coefficients. The quantized coefficients and the prediction parameters are then entropy-encoded to further reduce their redundancy.
Inter-prediction accounts for the most significant reduction of the video data size in video codecs. It takes advantage of temporal redundancy, predicting a frame from nearby frames (referred to as reference frames) by moving parts of these frames, thus trying to compensate for the motion of the objects in the scene. The description of the motion (motion vectors) is part of the prediction parameters that are transmitted. This type of prediction is used in MPEG-like codecs. The process of producing inter-prediction is called motion estimation. Typically, motion estimation is performed in a block-wise manner. Many modern codecs support motion estimation with blocks of adaptive size.
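Block-wise motion estimation can be illustrated with a minimal sketch. It assumes an exhaustive (full) search with the sum of absolute differences (SAD) as the matching criterion; neither choice is prescribed by the text above, and the function names, frame size and pixel pattern are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """SAD between the n x n block of `cur` at (cx, cy) and of `ref` at (rx, ry)."""
    total = 0
    for dy in range(n):
        for dx in range(n):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def full_search(cur, ref, cx, cy, n, search_range):
    """Return ((mx, my), cost): the displacement into `ref` with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), sad(cur, ref, cx, cy, cx, cy, n)
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = cx + mx, cy + my
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if cost < best_cost:
                best_mv, best_cost = (mx, my), cost
    return best_mv, best_cost


# Toy data (assumption): the current frame is the reference frame displaced
# by 2 pixels horizontally and 1 pixel vertically.
SIZE = 16
ref_frame = [[5 * x + 9 * y for x in range(SIZE)] for y in range(SIZE)]
cur_frame = [
    [ref_frame[min(y + 1, SIZE - 1)][min(x + 2, SIZE - 1)] for x in range(SIZE)]
    for y in range(SIZE)
]
mv, cost = full_search(cur_frame, ref_frame, cx=4, cy=4, n=4, search_range=3)
print(mv, cost)  # prints (2, 1) 0
```

Practical encoders generally replace the exhaustive search with faster strategies, since its cost grows with the square of the search range.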
By contrast, intra-prediction takes advantage of spatial redundancy and predicts portions of a frame (blocks) from neighboring blocks within the same frame. Such prediction usually takes account of spatial structures that may occur in a frame, such as smooth regions and edges. In the latter case, the prediction is directional. This type of prediction is used in the recent H.264 Advanced Video Coding (AVC) standard and does not appear in previous versions of the MPEG standard.
The actual reduction in the amount of transmitted information is performed in the transmission of the residual. The residual frame is divided into blocks, each of which undergoes a DCT-like transform. The transform coefficients undergo quantization, usually performed by scaling and rounding. Quantization allows the coefficients to be represented with less precision, thus reducing the amount of information required. Quantized transform coefficients are transmitted using entropy coding (for example, arithmetic coding), which takes into account the statistics of the quantized coefficients. This type of coding is lossless and exploits further redundancy that may exist in the transform coefficients.
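The transform and quantization steps can be sketched as follows. This is an illustration only: it assumes a floating-point orthonormal DCT-II on a 4×4 block as a stand-in for the codec's DCT-like transform, and a single uniform quantization step for all coefficients. All function names are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis C, so that coefficients = C * block * C^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def forward_transform(block):
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def inverse_transform(coeffs):
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def quantize(coeffs, step):
    """Scale and round: the lossy part of the pipeline."""
    return [[round(v / step) for v in row] for row in coeffs]


def dequantize(levels, step):
    return [[level * step for level in row] for row in levels]


# A flat residual block concentrates all of its energy in the DC coefficient.
flat = [[4] * 4 for _ in range(4)]
levels = quantize(forward_transform(flat), step=8)
print(levels)  # prints [[2, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

Only the integer levels would be passed to the entropy coder; the decoder applies `dequantize` and `inverse_transform` to recover an approximation of the residual.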
The encoder usually operates with video represented in the YCbCr color space, where the achromatic Y channel is called the luma, and the chromatic channels (Cb and Cr) are called chroma. The chroma is typically downsampled to half the frame size in each direction, which is referred to as the 4:2:0 format. Since the human eye is mostly sensitive to achromatic artifacts, most of the effort goes into achieving better quality of the luma rather than the chroma.
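The 4:2:0 downsampling of a chroma plane can be sketched as below. The sketch assumes a simple 2×2 averaging filter with rounding; the text does not specify the filter, and real systems may use a different one. The function name is hypothetical.

```python
def downsample_420(plane):
    """Halve a chroma plane in each direction by averaging 2x2 neighborhoods."""
    height, width = len(plane), len(plane[0])
    if height % 2 or width % 2:
        raise ValueError("plane dimensions must be even")
    return [
        [
            # "+ 2" rounds to nearest before the integer division by 4.
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


cb = [
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [30, 30, 40, 40],
    [30, 30, 40, 40],
]
print(downsample_420(cb))  # prints [[10, 20], [30, 40]]
```

Applied to a 16×16 chroma area, this yields the 8×8 chroma block that accompanies a 16×16 luma block in the 4:2:0 format.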
The performance of a lossy video codec is measured as the tradeoff between the number of bits required to describe the data and the distortion introduced by the compression, referred to as the rate-distortion (RD) curve. The RD curve gives the lowest possible distortion for a given rate. As the distortion criterion, the mean squared error (MSE) of the luma channel is usually used. The MSE is often converted into logarithmic units and represented as the peak signal-to-noise ratio (PSNR),
$$p = 10 \log_{10}\left(\frac{y_{\max}^{2}}{d}\right) \tag{1}$$
where d is the MSE and y_max is the maximum allowed value of the luma pixels, typically 255 if the luma data has 8-bit precision and is represented in the range from 0 to 255.
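Equation (1) can be evaluated directly. The sketch below is a minimal illustration with hypothetical function names; returning infinity for identical frames (zero MSE) is a convention chosen here, not something stated in the text.

```python
import math


def luma_mse(original, decoded):
    """Mean squared error between two equally sized luma planes."""
    count = len(original) * len(original[0])
    total = sum(
        (o - d) ** 2
        for row_o, row_d in zip(original, decoded)
        for o, d in zip(row_o, row_d)
    )
    return total / count


def psnr(d, y_max=255):
    """PSNR in dB per equation (1): 10 * log10(y_max^2 / d)."""
    if d == 0:
        return math.inf  # assumption: identical frames map to infinite PSNR
    return 10 * math.log10(y_max ** 2 / d)


original = [[100, 102], [98, 101]]
decoded = [[101, 102], [97, 103]]
d = luma_mse(original, decoded)  # (1 + 0 + 1 + 4) / 4 = 1.5
print(round(psnr(d), 2))
```

Because of the logarithm, halving the MSE raises the PSNR by about 3 dB regardless of the starting point.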
H.264 AVC is one of the recent video compression standards, offering significantly better compression rate and quality compared to the previous MPEG-2 and MPEG-4 standards, and targeted at high definition (HD) content. For example, H.264 delivers the same quality as MPEG-2 at a third to half the data rate. The encoding process can be summarized as follows: a frame undergoing encoding is divided into non-overlapping macroblocks, each containing 16×16 luma pixels and 8×8 pixels of each chroma component (in the most widely used 4:2:0 format). Within each frame, macroblocks are arranged into slices, where a slice is typically a continuous raster scan of macroblocks. Each slice within a frame can be encoded and decoded independently of the others. The main slice types are P, B and I. An I slice may contain only I macroblocks; a P slice may contain P or I macroblocks. The macroblock type determines the way it is predicted. P refers to inter-predicted macroblocks. Macroblocks predicted in this way are usually sub-divided into smaller blocks (of sizes 8×16, 16×8, 8×8, 8×4, 4×8 or 4×4). I refers to intra-predicted macroblocks. Macroblocks predicted in this way are divided into 4×4 blocks (the luma component is divided into 16 blocks; each chroma component is divided into 4 blocks). In I mode (intra-prediction), the prediction macroblock is formed from pixels in the neighboring blocks in the current slice that have been previously encoded, decoded and reconstructed, prior to applying the in-loop deblocking filter. The reconstructed macroblock is formed by implementing the decoder in the encoder loop. In P mode (inter-prediction), the prediction macroblock is formed by motion compensation from reference frames. The prediction macroblock is subtracted from the current macroblock. The resulting error undergoes transform, quantization and entropy coding.
According to the resulting bit rate and the corresponding distortion, the best prediction mode is selected (i.e., the choice between an I or a P macroblock, the motion vectors in the case of a P macroblock and the prediction mode in the case of an I macroblock). The residual for the macroblock is encoded in the selected best mode accordingly and the compressed output is sent to the bitstream. The B, or bi-predicted, slices are similar to P slices, with the exception that inter-prediction can be performed from two reference frames.
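One common way to combine bit rate and distortion into a single mode decision is a Lagrangian cost J = D + λR. The text above does not state which criterion is used, so the sketch below is an assumption; the mode labels, the numbers and the value of λ are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    mode: str          # hypothetical label, e.g. "I" or "P"
    distortion: float  # distortion after encoding in this mode (e.g. SSD)
    bits: int          # bits produced by encoding in this mode


def select_mode(candidates, lam):
    """Pick the candidate minimizing the assumed cost J = D + lam * R."""
    if not candidates:
        raise ValueError("at least one candidate mode is required")
    return min(candidates, key=lambda c: c.distortion + lam * c.bits)


candidates = [
    Candidate(mode="I", distortion=40.0, bits=300),
    Candidate(mode="P", distortion=55.0, bits=120),
]

# With lam = 0 only distortion matters; a larger lam penalizes expensive modes.
print(select_mode(candidates, lam=0.0).mode)  # prints I
print(select_mode(candidates, lam=0.2).mode)  # prints P
```

The multiplier λ sets the exchange rate between bits and distortion, so the same candidates can yield different decisions at different operating points.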
The operating point on the RD curve is controlled by the quantization parameter, determining the “aggressiveness” of the residual quantization and the resulting distortion. In the H.264 standard, the quantization parameter is an integer in the range from 0 to 51, denoted here by q′. The quantization step doubles for every increment of 6 in q′. Sometimes, it is more convenient to use the quantization step rather than the quantization parameter, computed according to
$$q = 0.85 \cdot 2^{\frac{q' - 12}{6}}. \tag{2}$$
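Equation (2) translates into a one-line function. The sketch below uses a hypothetical function name and adds a range check for the 0 to 51 interval stated above.

```python
def quantization_step(qp):
    """Quantization step q for quantization parameter q' (here `qp`), per equation (2)."""
    if not 0 <= qp <= 51:
        raise ValueError("quantization parameter must be in the range 0..51")
    return 0.85 * 2 ** ((qp - 12) / 6)


# The step doubles for every increment of 6 in the quantization parameter.
for qp in (12, 18, 24):
    print(qp, quantization_step(qp))
```

At q′ = 12 the exponent vanishes and the step equals 0.85; each further increment of 6 multiplies it by two.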
Since only the decoder is standardized in the MPEG standards, many aspects of the way in which the encoder produces a standard-compliant stream are left to the discretion of a specific encoder design. The main difference among existing encoders is the decision process carried out by the bitrate controller that produces the encoder control parameters. Usually, the encoder parameters are selected in a way to achieve the best tradeoff between video quality and bitrate of the produced stream. Controllers of this type are referred to as RDO (rate-distortion optimization).
Parameters controlled by the bitrate controller typically include, at the sequence level, the frame type and, if the frame is a P-frame, the reference frame or frames; and, at the frame level, the macroblock type and the quantization parameter for each macroblock. For this reason, it is natural and common to distinguish between two levels of bitrate control: sequence-level and frame-level. The sequence-level controller is usually responsible for the frame type selection and the allocation of the bit budget for the frame, and the frame-level controller is responsible for the selection of the quantization parameter for each macroblock within the frame.
In a broader perspective, RD optimization is a particular case of the optimal resource allocation problem, in which the available resources (bitrate, computational time, power dissipation, etc.) are distributed in a way that maximizes some quality criterion. Hence, hereinafter we use the term resource allocation to refer to the RD optimization-type problems discussed here. The resources we consider specifically are computational time and bitrate, but additional resources such as memory use and power dissipation may also be included.
Theoretically, optimal resource allocation can be achieved by running the encoding with different sets of parameters and selecting the best outcome. Such an approach is impractical because the very large number of possible parameter combinations, together with the high complexity of a single encoding pass, results in prohibitive computational complexity. Suboptimal resource allocation approaches usually try to model some typical behavior of the encoder as a function of the parameters; if the model has an analytical expression that can be efficiently computed, the optimization problem becomes practically solvable. However, since the model only approximates the behavior of the encoder, the parameters selected using it may be suboptimal.
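The model-based approach can be illustrated with a toy example. It assumes a hypothetical rate model in which the predicted number of bits is alpha / q + beta, with q the quantization step of equation (2). The model form and the coefficients alpha and beta are assumptions made purely for illustration and are not taken from the text.

```python
def quantization_step(qp):
    """Quantization step for quantization parameter `qp`, per equation (2)."""
    return 0.85 * 2 ** ((qp - 12) / 6)


def predicted_bits(qp, alpha, beta):
    """Hypothetical analytical rate model: bits = alpha / q + beta."""
    return alpha / quantization_step(qp) + beta


def choose_qp(bit_budget, alpha, beta):
    """Smallest quantization parameter whose predicted rate fits the budget.

    The model is decreasing in qp, so the first parameter that fits gives the
    finest quantization the budget allows. If none fits, 51 is returned.
    """
    for qp in range(0, 52):
        if predicted_bits(qp, alpha, beta) <= bit_budget:
            return qp
    return 51


# Evaluating the model 52 times replaces 52 trial encodings.
print(choose_qp(bit_budget=700, alpha=1000.0, beta=50.0))  # prints 18
```

The selection costs only a handful of arithmetic operations, but its quality depends entirely on how well the assumed model matches the real encoder.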
Therefore, it becomes apparent that a computationally feasible and accurate method of optimal resource allocation is very desirable for video compression. Such a method will achieve the best visual quality for any given resource constraints. As will be shown, the invention provides such a method and a related system.