Video coding systems are widely deployed to reduce the bandwidth needed to represent, store and transmit digital video signals. Commonly used video coding systems include block-based video coding systems, region-based video coding systems, and wavelet-based video coding systems, among others.
The block-based video coding system is one widely used type of system for compressing digital video signals. Examples of such coding systems include international video coding standards such as MPEG-1/2/4, H.264 (see reference 1), and VC-1 (see reference 2); coding systems from On2 Technologies such as VP-6, VP-7 and VP-8; the Dirac codec; and the Theora video codec, among others.
FIG. 1 shows a block diagram of a generic block-based video coding system. An input video signal (102) is processed block by block. A commonly used video block unit consists of N×M pixels where usually N=M=16 (also commonly referred to as a “macroblock”). For each input video block, spatial prediction (160) and/or temporal prediction (162) may be performed. Spatial prediction uses the already coded neighboring blocks in the same video frame/slice to predict the current video block. Spatial prediction is also commonly referred to as “intra prediction.” Spatial prediction may be performed using video blocks or regions of various sizes; for example, H.264/AVC allows block sizes of 4×4, 8×8, and 16×16 pixels for spatial prediction of the luminance component of the video signal. On the other hand, temporal prediction uses information from previously coded, usually neighboring, video frames/slices to predict the current video block. Temporal prediction is also commonly referred to as “inter prediction” and/or “motion prediction.” Similar to spatial prediction, temporal prediction can also be performed on video blocks or regions of various sizes and shapes; for example, for the luminance component, H.264/AVC allows block based inter prediction using block sizes such as 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4. Multiple reference and multi-hypothesis prediction, where plural references are available for prediction, which can further be combined linearly or non-linearly, can also be considered.
After prediction, the prediction block is subtracted from the original video block at summer (116). The residual block is transformed at transform unit (104) and quantized at quantization unit (106). The quantized residual transform coefficients are then sent to entropy coding unit (108) to be entropy coded to further reduce bit rate. Various entropy coding methods or modes may be applied. For example, H.264/AVC allows two entropy coding modes, the Context Adaptive Variable Length Coding (CAVLC) mode and the Context Adaptive Binary Arithmetic Coding (CABAC) mode. The entropy coded residual coefficients are then packed to form part of an output video bitstream (120).
The quantized transform coefficients are inverse quantized at inverse quantization unit (110) and inverse transformed at inverse transform unit (112) to obtain the reconstructed residual block. The reconstructed residual block is then added to the prediction video block at summer (126) to form a reconstructed video block. The reconstructed video block may go through additional filtering at loop filter unit (166) to reduce certain coding artifacts. For example, the in-loop deblocking filter in H.264/AVC is a form of loop filtering performed at unit (166) that removes and/or reduces blocking artifacts that may be visually objectionable. After loop filtering, the reconstructed video block is stored in reference picture store (164) for use as prediction of other video blocks in the same video frame/slice and/or in future (in terms of coding order) video frames/slices.
The encoder shown in FIG. 1 uses a mode decision and general encoder control logic unit (180) to choose the best coding mode for the current video block, usually based on a pre-defined criterion, e.g., the Lagrangian rate distortion cost

$$J(\lambda) = D(r) + \lambda \cdot r \qquad (1)$$

where r is the rate or number of bits needed to code the video block, D is the distortion (e.g., SSE or Sum of Squared Error, SAD or Sum of Absolute Differences, etc.) between the reconstructed video block and the original video block, and λ is the Lagrangian lambda factor (see reference 3). Joint optimization using multiple other parameters beyond rate and distortion, such as power consumption, implementation complexity, and/or implementation cost, can also be considered. The rate r can be the true rate required for encoding but can also be an estimate; distortion D can be based on a variety of distortion models, some of which may also account for impact on subjective quality perception and the human visual system. After mode decision, the coding mode (intra or inter coding), prediction information (spatial prediction mode and transform type if intra coded, motion partitioning, bi-predictive or uni-predictive motion compensated prediction if inter coded, etc.), and other motion information (reference frame index, motion vectors, illumination change parameters, etc.) are sent to entropy coding unit (108) to be further compressed to reduce bit rate. The entropy coded mode and motion information are also packed to form part of video bitstream (120).
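As an illustration of how equation (1) drives mode decision, the following minimal sketch picks the candidate with the lowest Lagrangian cost. The mode names and the distortion and rate figures are hypothetical values chosen for the example; they do not come from any real encoder.

```python
from typing import NamedTuple, Sequence


class ModeResult(NamedTuple):
    name: str          # hypothetical coding mode label
    distortion: float  # D, e.g. SSE between reconstructed and original block
    rate_bits: float   # r, bits needed to code the block in this mode


def lagrangian_cost(distortion: float, rate_bits: float, lam: float) -> float:
    """Equation (1): J = D + lambda * r."""
    return distortion + lam * rate_bits


def choose_best_mode(candidates: Sequence[ModeResult], lam: float) -> ModeResult:
    """Return the candidate with the smallest Lagrangian cost."""
    return min(candidates, key=lambda c: lagrangian_cost(c.distortion, c.rate_bits, lam))


# Made-up measurements for three candidate modes of one block.
candidates = [
    ModeResult("intra_16x16", 1200.0, 40.0),
    ModeResult("inter_16x16", 900.0, 65.0),
    ModeResult("inter_8x8", 700.0, 120.0),
]

best_low_lambda = choose_best_mode(candidates, lam=1.0)    # distortion dominates
best_high_lambda = choose_best_mode(candidates, lam=20.0)  # rate dominates
```

With a small λ the low-distortion, high-rate mode wins; with a large λ the cheapest mode in bits wins, which is the tradeoff the Lagrangian factor is meant to control.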
Transform and quantization combined may reduce the bit rate associated with coding the prediction residual block. Quantization of the transformed residual coefficients introduces video quality loss. The degree of quantization is controlled by the value of a quantization parameter (QP) and directly reflects the degree of quality loss. That is, a higher QP value is usually associated with more aggressive quantization and consequently worse reconstructed video quality; and a lower QP value means less aggressive quantization and therefore usually better reconstructed video quality. Some video coding systems (e.g., the H.264/AVC video coding standard) allow macroblock level QP variation. For such systems, depending on the characteristics of the input video block, the encoder may choose to apply more or less quantization to obtain either higher compression or better visual quality of the reconstructed video signal. Specifically, the QP value used to quantize a given input video block may be chosen by the encoder in order to optimize the rate-distortion cost function given in equation (1) or any other predefined criterion; and the optimal QP value selected by the encoder may be signaled to the decoder as a part of the video bitstream (120).
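The relationship between QP and quality loss can be sketched numerically. The sketch below assumes the step-size table commonly cited for H.264/AVC, in which the quantization step doubles for every increase of 6 in QP; that table is not given in the text above and is used here only for illustration. The quantizer is a plain floating-point scalar quantizer, not the integer arithmetic of a real codec.

```python
# Base step sizes for QP 0..5 as commonly tabulated for H.264/AVC (assumption:
# not stated in the surrounding text); the step doubles every 6 QP values.
BASE_QSTEP = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]


def qstep(qp: int) -> float:
    """Quantization step size for a QP in the 0..51 range assumed here."""
    if not 0 <= qp <= 51:
        raise ValueError("QP outside the 0..51 range assumed in this sketch")
    return BASE_QSTEP[qp % 6] * (2 ** (qp // 6))


def quantize(coef: float, qp: int) -> int:
    """Simple scalar quantization with round-half-up on the magnitude."""
    sign = -1 if coef < 0 else 1
    return sign * int(abs(coef) / qstep(qp) + 0.5)


def reconstruct(level: int, qp: int) -> float:
    """Inverse quantization: scale the level back by the step size."""
    return level * qstep(qp)


coef = 37.0  # hypothetical transform coefficient
errors = {qp: abs(coef - reconstruct(quantize(coef, qp), qp)) for qp in (22, 40, 46)}
```

For this coefficient the reconstruction error grows as QP increases, and at the highest QP tried the coefficient is quantized to zero.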
Furthermore, at quantization unit (106), during quantization of transform coefficients, the encoder may apply a more sophisticated quantization process (such as the trellis quantization process used in the JPEG2000 system (see references 4 and 11)) instead of a simple scalar quantization process to achieve better coding performance. Other tools often used as part of the quantization process include quantization matrices and quantization offsets. A given quantization matrix specifies the quantization scaling factor (degree of quantization) that the encoder and decoder wish to apply to each coefficient in a block. For example, for a 4×4 residual block, the corresponding quantization matrix would be a 4×4 matrix, with each matrix element specifying the quantization scaling factor for the corresponding coefficient in the block. An example quantization matrix that may be used on an inter-coded 4×4 residual block of the luminance component is given below. Different quantization matrices may be used for other types of residual blocks, such as a 4×4 chroma residual block, an 8×8 luma residual block, an 8×8 chroma residual block, etc., since the characteristics of these residual blocks could be different. Intra-coded and inter-coded blocks may also use different quantization matrices. In addition to quantization matrices, quantization offsets corresponding to different coefficient positions can also be considered: they can be treated solely as part of the encoding process, and/or they can also be treated as part of the decoding process by signaling such information to the decoder and by accounting for these parameters during reconstruction.
$$\mathrm{QUANT\_INTER\_4{\times}4\_LUMA} = \begin{bmatrix} 17 & 17 & 16 & 16 \\ 17 & 16 & 15 & 15 \\ 16 & 15 & 15 & 15 \\ 16 & 15 & 15 & 14 \end{bmatrix}$$
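One way to picture how such a matrix acts on a block is sketched below. It assumes that a weight of 16 is neutral (as with the flat default scaling lists of H.264/AVC), so each coefficient's effective step is the base step scaled by weight/16. This is a simplified floating-point illustration, not the normative integer procedure of any standard; `NEUTRAL_WEIGHT`, `base_step` and the input block are hypothetical.

```python
QUANT_INTER_4X4_LUMA = [
    [17, 17, 16, 16],
    [17, 16, 15, 15],
    [16, 15, 15, 15],
    [16, 15, 15, 14],
]
NEUTRAL_WEIGHT = 16  # assumption: a weight of 16 leaves the base step unchanged


def quantize_block(coefs, matrix, base_step):
    """Quantize each coefficient with a step scaled by its matrix weight."""
    levels = []
    for coef_row, weight_row in zip(coefs, matrix):
        row = []
        for coef, weight in zip(coef_row, weight_row):
            step = base_step * weight / NEUTRAL_WEIGHT
            sign = -1 if coef < 0 else 1
            # Round-half-up on the magnitude, then restore the sign.
            row.append(sign * int(abs(coef) / step + 0.5))
        levels.append(row)
    return levels


# Hypothetical block in which every transform coefficient equals 40.
block = [[40.0] * 4 for _ in range(4)]
levels = quantize_block(block, QUANT_INTER_4X4_LUMA, base_step=8.0)
```

With identical input coefficients, only the position with the smallest weight (14, bottom right) is quantized to a larger level, showing that a smaller matrix entry means finer quantization at that position.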
The Rate Distortion Optimized Quantization (RDOQ) algorithm (see references 5, 6 and 10), currently available in the JM H.264/AVC reference software (see reference 7) and in the new JMKTA software (see reference 8) used in the development of next generation video coding standards, includes two components: 1) macroblock level QP variation, and 2) trellis-like quantization of residual coefficients. Using macroblock QP variation, the encoder tries to determine the best QP value for each macroblock given a rate-distortion optimized criterion and signals the decision using the delta QP syntax element supported in H.264/AVC. Furthermore, the RDOQ algorithm (see references 5 and 6) also applies a rate distortion optimized decision during quantization of residual transform coefficients at quantization unit (106). Specifically, for each non-zero coefficient having value v≠0, the encoder chooses to quantize the given coefficient to one of up to three possible values, ceiling(v), floor(v), and 0, based on a rate-distortion optimized decision process.
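The per-coefficient choice among ceiling(v), floor(v), and 0 can be sketched as follows. The sketch assumes v is the coefficient magnitude divided by the quantization step, and it uses a toy rate model (`toy_rate_bits`) invented for this example; a real encoder would estimate rate from its CAVLC or CABAC state. It also decides each coefficient in isolation, whereas the trellis-like process in RDOQ considers coefficients jointly.

```python
import math


def toy_rate_bits(level: int) -> float:
    """Hypothetical rate model: larger levels cost more bits."""
    if level == 0:
        return 1.0
    return 2.0 * math.floor(math.log2(abs(level))) + 3.0


def rdoq_level(coef: float, step: float, lam: float) -> int:
    """Pick the level among floor(v), ceil(v) and 0 with the lowest J = D + lam * r."""
    magnitude = abs(coef)
    v = magnitude / step
    candidates = sorted({0, math.floor(v), math.ceil(v)})

    def cost(level: int) -> float:
        error = magnitude - level * step  # reconstruction error for this level
        return error * error + lam * toy_rate_bits(level)

    best = min(candidates, key=cost)
    return best if coef >= 0 else -best


# Hypothetical coefficient 13 with step 8, so v = 1.625.
level_no_rate_penalty = rdoq_level(13.0, 8.0, lam=0.0)
level_moderate_lambda = rdoq_level(13.0, 8.0, lam=10.0)
level_large_lambda = rdoq_level(13.0, 8.0, lam=100.0)
```

As λ grows, the chosen level moves from ceiling(v) to floor(v) and finally to 0, trading reconstruction accuracy for fewer bits.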
FIG. 2 shows a flow chart of an example coding mode and QP decision process of the RDOQ algorithm that the video encoder may use at the mode decision and general control logic unit (180). An example of a predefined criterion that the encoder may use to perform mode decision is the Lagrangian rate-distortion cost in equation (1).
According to FIG. 2, the encoder mode decision unit (180) examines each QP value (202) and each valid coding mode (204) in order to minimize the rate distortion cost of encoding the current video block. For each QP and each coding mode, the prediction block or blocks and the residual block or blocks are formed (206). The residual block is then transformed and quantized (208), and the resulting rate, distortion, and Lagrangian cost associated with the current video block are calculated (210). The current coding mode and QP parameters are marked and stored (214) if they bring reduction in rate distortion cost; eventually, the optimal coding parameters (coding mode and QP) for the current video block are output to the entropy coding unit (108) to be entropy coded and packed into the video bitstream (120).
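The nested loops of FIG. 2 can be sketched as below. The `evaluate` callback stands in for boxes (206) through (210) and is a hypothetical interface; here it is backed by a small table of made-up distortion and rate values rather than real prediction, transform and quantization.

```python
def rdoq_mode_qp_decision(qp_values, coding_modes, evaluate, lam):
    """Exhaustive search over QP values and coding modes, as in FIG. 2.

    evaluate(mode, qp) is assumed to return (distortion, rate_bits).
    Returns (best_cost, best_mode, best_qp).
    """
    best = None
    for qp in qp_values:              # QP loop (202)
        for mode in coding_modes:     # coding mode loop (204)
            distortion, rate_bits = evaluate(mode, qp)  # boxes (206)-(210)
            cost = distortion + lam * rate_bits
            if best is None or cost < best[0]:
                best = (cost, mode, qp)                 # mark and store (214)
    return best


# Made-up (distortion, rate) results for two modes at two QP values.
TOY_RESULTS = {
    ("intra", 24): (100.0, 90.0),
    ("intra", 26): (140.0, 70.0),
    ("inter", 24): (80.0, 100.0),
    ("inter", 26): (110.0, 60.0),
}


def toy_evaluate(mode, qp):
    return TOY_RESULTS[(mode, qp)]


best_cost, best_mode, best_qp = rdoq_mode_qp_decision(
    [24, 26], ["intra", "inter"], toy_evaluate, lam=1.0
)
```

The number of `evaluate` calls is the product of the number of QP values and the number of coding modes, which is the source of the encoding time increase associated with the QP loop.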
As can be seen from FIG. 2, because of the additional QP loop (202) used in RDOQ, the overall encoding process takes longer. If brute-force search (a search that tries exhaustively all possible combinations of modes, motion vector, reference, QP, and coefficient adjustment, among others) for the optimal coding mode and the optimal QP is used, then the encoding process may become significantly slower. Overall, the encoding time for not using and using RDOQ may be approximated by equations (2) and (3), respectively:

$$T_{\text{RDOQ}}^{\text{off}} \approx M \cdot t \qquad (2)$$

$$T_{\text{RDOQ}}^{\text{on}} \approx N \cdot M \cdot (1+\delta) \cdot t \qquad (3)$$

where t is the average time used to evaluate one coding mode for each block, N is the number of QP values tested, M is the number of coding modes, and δ is the additional coding time incurred by the use of trellis-like quantization process relative to the use of a non-trellis based scalar quantization process at box (208).
It should be noted that a number of approximations and simplifications are used to derive equations (2) and (3). For example, it is assumed that the time needed to evaluate each coding mode is the same (in reality some modes are more complex to evaluate, and different entropy coding processes can also have very different impact on the evaluation process). It is also assumed that the time needed to perform quantization is the same regardless of the value of the QP (in reality, smaller QPs result in more non-zero coefficients and hence a longer quantization process). It is also assumed that a basic mode decision process (e.g., exhaustive mode decision) is used. Given these assumptions and simplifications, the increase in encoding time due to using the RDOQ algorithm is therefore approximately equal to:

$$\frac{T_{\text{RDOQ}}^{\text{on}}}{T_{\text{RDOQ}}^{\text{off}}} \approx N \cdot (1+\delta)$$
Assuming that 5 QP values (N=5) are evaluated for each macroblock, and the time overhead due to a more sophisticated quantization process at step (208) (e.g., the trellis-like quantization used in RDOQ) is δ=20%, then the overall encoding time increase due to the RDOQ algorithm is approximately 6×. Therefore, while the RDOQ algorithm may bring significant coding performance gains, the significantly prolonged encoding time (if a brute force search approach is used) may render it unusable for most video coding applications.
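The arithmetic behind this estimate can be checked with a short sketch of equations (2) and (3). The values chosen for M and t are arbitrary placeholders, since both cancel in the ratio.

```python
def time_rdoq_off(num_modes: int, t: float) -> float:
    """Equation (2): encoding time per block without RDOQ."""
    return num_modes * t


def time_rdoq_on(num_qps: int, num_modes: int, delta: float, t: float) -> float:
    """Equation (3): encoding time per block with brute-force RDOQ."""
    return num_qps * num_modes * (1.0 + delta) * t


# N = 5 and delta = 20% as in the text; M = 10 and t = 1.0 are arbitrary.
off = time_rdoq_off(num_modes=10, t=1.0)
on = time_rdoq_on(num_qps=5, num_modes=10, delta=0.2, t=1.0)
slowdown = on / off  # approximately N * (1 + delta) = 5 * 1.2 = 6
```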
Some speedup algorithms for QP decision already exist in the JM and JMKTA software (see references 7 and 8). They include the following techniques:
According to a first technique, the optimal intra and inter prediction parameters (intra modes, intra prediction vs. bi-predictive vs. uni-predictive motion compensation, motion partition size, reference frame indices, motion vectors, etc) may remain nearly the same regardless of the QP value. Therefore, motion search and intra partition search can be performed only once during the coding loops of FIG. 2. This significantly reduces the complexity due to intra prediction or motion estimation in units such as the spatial prediction unit (160) and the motion prediction unit (162).
According to a second technique, during the QP loop (202), not all QP values need to be evaluated. For example, the QP values of neighboring video blocks may be used to predict the QP value for the current block, and only QP values within a narrow range of the predicted QP based on the neighboring QPs may be evaluated during the QP loop (202) in FIG. 2 (see also reference 9). This reduces the number of QPs evaluated for each video block during QP loop (202), and hence the encoding time.
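A minimal sketch of this idea follows. The text does not specify how the QP is predicted from the neighbors, so the rounded mean used here is an assumption, as are the `radius` parameter, the fallback `default_qp`, and the 0..51 clipping range.

```python
def candidate_qps(neighbor_qps, default_qp, radius=1, qp_min=0, qp_max=51):
    """Return the QP values to test, centered on a QP predicted from neighbors.

    The predictor (integer-rounded mean of the neighbor QPs) is an assumption
    made for illustration only.
    """
    if neighbor_qps:
        n = len(neighbor_qps)
        predicted = (sum(neighbor_qps) + n // 2) // n  # integer rounded mean
    else:
        predicted = default_qp  # no coded neighbors available
    low = max(qp_min, predicted - radius)
    high = min(qp_max, predicted + radius)
    return list(range(low, high + 1))


from_neighbors = candidate_qps([26, 28, 30], default_qp=26)       # around 28
no_neighbors = candidate_qps([], default_qp=26, radius=2)         # around 26
clipped_at_top = candidate_qps([51, 51], default_qp=26, radius=2) # clipped to 51
```

With `radius=1` only three QP values are tested per block, so N in equation (3) drops accordingly.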
Denote the optimal coding mode chosen for the first QP value as best_mode_first. According to a third technique, when evaluating subsequent QP values, only best_mode_first is evaluated within the coding mode loop (204), while all other coding modes are disallowed (see reference 7). In this way, full mode decision is performed only once for the first QP value. For all subsequent QP values, prediction, transform, quantization, and calculation of rate-distortion costs are performed for only one coding mode (best_mode_first). However, since the overall best coding mode (best_mode_overall) may not emerge at the first QP value, coding performance may be penalized significantly.
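The third technique and its potential penalty can be sketched as below. The `evaluate` callback and the table of distortion and rate values are hypothetical; the numbers are chosen so that the overall best mode does not emerge at the first QP value.

```python
def decide_with_best_mode_first(qp_values, coding_modes, evaluate, lam):
    """Full mode decision at the first QP only; later QPs test best_mode_first.

    evaluate(mode, qp) is assumed to return (distortion, rate_bits).
    Returns (best_cost, best_mode, best_qp, evaluations).
    """
    evaluations = 0
    best = None
    # Full coding mode loop (204) for the first QP value only.
    for mode in coding_modes:
        distortion, rate_bits = evaluate(mode, qp_values[0])
        evaluations += 1
        cost = distortion + lam * rate_bits
        if best is None or cost < best[0]:
            best = (cost, mode, qp_values[0])
    best_mode_first = best[1]
    # Subsequent QP values: evaluate best_mode_first only.
    for qp in qp_values[1:]:
        distortion, rate_bits = evaluate(best_mode_first, qp)
        evaluations += 1
        cost = distortion + lam * rate_bits
        if cost < best[0]:
            best = (cost, best_mode_first, qp)
    return best + (evaluations,)


# Made-up results: "inter" at QP 28 is the overall best (cost 130 at lam = 1).
TOY_RESULTS = {
    ("intra", 26): (100.0, 50.0),
    ("inter", 26): (120.0, 40.0),
    ("intra", 28): (130.0, 30.0),
    ("inter", 28): (100.0, 30.0),
}

result = decide_with_best_mode_first(
    [26, 28], ["intra", "inter"], lambda m, q: TOY_RESULTS[(m, q)], lam=1.0
)
```

In this toy case the shortcut needs three evaluations instead of four but settles on a cost of 150, whereas the exhaustive search would have found the cost of 130 for the other mode at the second QP value.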
According to a fourth technique, early termination of the QP loop (202) may be invoked when certain conditions are met. For example, if the best coding mode found so far contains no non-zero residual coefficients (coded_block_pattern=0), then the QP loop may be terminated early (see reference 7).
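Early termination of the QP loop can be sketched as follows. The `evaluate` callback is a hypothetical interface that also reports the number of non-zero quantized coefficients, standing in for the coded_block_pattern check; the toy values are made up.

```python
def qp_loop_with_early_termination(qp_values, coding_modes, evaluate, lam):
    """QP loop (202) that stops once the best mode so far has no residual.

    evaluate(mode, qp) is assumed to return (distortion, rate_bits, nonzero_coeffs).
    Returns (best, qps_tested), where best is a dict describing the best choice.
    """
    best = None
    qps_tested = 0
    for qp in qp_values:
        qps_tested += 1
        for mode in coding_modes:
            distortion, rate_bits, nonzero_coeffs = evaluate(mode, qp)
            cost = distortion + lam * rate_bits
            if best is None or cost < best["cost"]:
                best = {"cost": cost, "mode": mode, "qp": qp,
                        "nonzero_coeffs": nonzero_coeffs}
        # Analogue of coded_block_pattern = 0: nothing left to quantize.
        if best["nonzero_coeffs"] == 0:
            break
    return best, qps_tested


# Made-up results: the "skip"-like mode has no non-zero coefficients.
TOY_RESULTS = {
    ("skip", 30): (200.0, 2.0, 0),
    ("inter", 30): (190.0, 40.0, 3),
    ("skip", 31): (210.0, 2.0, 0),
    ("inter", 31): (200.0, 35.0, 2),
}

best, qps_tested = qp_loop_with_early_termination(
    [30, 31], ["skip", "inter"], lambda m, q: TOY_RESULTS[(m, q)], lam=1.0
)
```

Here the loop stops after the first QP value because the best mode found already has no non-zero residual coefficients.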
Simulations show that the currently available RDOQ speedup techniques may not always offer the best coding time vs. coding performance tradeoff. For example, they may offer insufficient encoding time reduction and/or they may incur too much coding performance penalty.