One of the requisite technical elements of multimedia applications is image compression technology, which maximizes the use of available storage and transmission resources. Representative image compression techniques include MPEG-1/2/4, H.261/262/263, and H.264. The latest of these standards, H.264, is a high-performance compression standard that provides more than twice the compression efficiency of MPEG-2. Since H.264 can provide digital-television-level video quality at a bit rate of less than 2 Mbps (megabits per second), it is used in various multimedia application fields, such as video streaming over a third generation wireless network, portable multimedia broadcasting such as Digital Multimedia Broadcasting (DMB), and Internet Protocol Television (IP-TV) over a current generation network such as Asymmetric Digital Subscriber Line (ADSL).
H.264 is basically a hybrid codec, like the conventional MPEG and H-series video compression standards, and is based on motion estimation/compensation and transformation/quantization techniques. However, H.264 allows motion compensation with far more variable block sizes than the conventional standards and allows motion estimation from several reference images. It also significantly increases the degree of freedom of an encoder by introducing techniques not found in the conventional standards, such as ¼-pixel motion compensation, pixel-domain intra prediction, an integer transform in which the mismatch problem has been solved, and an in-loop deblocking filter. Thus, if a ‘well-designed encoder’ is used, high compression performance can be provided. A ‘well-designed encoder’ is an encoder that approaches the best achievable compression performance by searching the various compression methods provided by the H.264 standard, evaluating their calculation results, and selecting the method having the highest compression performance. However, such a ‘well-designed encoder’ has very high calculation complexity, in proportion to its degree of freedom. This is described in more detail below with reference to an example of a conventional encoder used in the H.264 video standard.
An H.264 codec receiving the frames of a video performs encoding on a frame-by-frame basis, decodes the result, stores the decoding result in a Decoded Picture Buffer (DPB), and uses the decoding result as a reference image for motion estimation when a subsequently input frame is encoded. Several decoded images can be stored in the DPB, and the maximum number of decoded images that can be stored depends on the profile and level. A current input image is encoded in non-overlapping 16×16 basic units called macroblocks. Actual encoding is achieved by performing motion estimation/compensation and mode decision for each macroblock and performing integer transformation and quantization of the difference between the original image and the image motion-compensated in the optimal mode. Each encoded macroblock is dequantized and inverse transformed, and a difference image is thereby restored. Each decoded macroblock is generated by adding the difference image and the motion-compensated image, and the decoded macroblocks are gathered. The gathered decoded macroblocks are loop-filtered on a slice-by-slice basis, and the results are stored in the DPB. This completes the encoding of one slice. Unlike in the conventional standards, a slice in H.264 can be defined using various flexible structures within a single frame. However, to aid understanding of the configuration and operation of the present invention, a single slice is defined in the following description as the entirety of a single frame input to an encoder.
FIG. 1 illustrates coding modes in H.264, wherein FIG. 1A illustrates inter macroblock available modes, FIG. 1B illustrates 16×16 intra macroblock available modes, and FIG. 1C illustrates 4×4 intra macroblock available modes. There are 21 macroblock coding modes defined in H.264. That is, the inter macroblock available modes illustrated in FIG. 1A include 5 motion compensation modes, i.e. SKIP, 16×16, 8×16, 16×8, and 8×8 motion compensation modes, wherein 3 sub-modes, i.e. 8×4, 4×8, and 4×4 sub-modes, exist for each 8×8 sub-block in the 8×8 motion compensation mode. When a macroblock is intra encoded, one of 4 16×16 intra prediction modes (FIG. 1B) or one of 9 4×4 intra prediction modes (FIG. 1C) is selected.
In order to encode a macroblock using one of the 21 illustrated available coding modes, an encoder must select a mode having the highest encoding efficiency by comparing encoding results obtained using the 21 coding modes to each other. For example, an H.264 standard encoder may obtain an optimal motion vector for each of the 7 inter macroblock available modes excluding the SKIP mode. The optimal motion vector among candidate motion vectors is obtained by minimizing Equation 1 below.

$$J_{motion} = SAD + \lambda_{motion} \cdot R_{motion} \qquad \text{(Equation 1)}$$
Here, λmotion denotes a Lagrangian coefficient for motion estimation, and Rmotion denotes the number of bits needed to encode a candidate motion vector (mvx, mvy). SAD (Sum of Absolute Difference) denotes the sum of absolute values of differences between pixels of a motion-compensated macroblock generated using the candidate motion vector and pixels of a macroblock of an original image. If it is assumed that a candidate motion vector is (mvx, mvy), SAD is defined by Equation 2.
$$SAD = \sum_{x,y} \left| f_t(x, y) - \hat{f}_{t-n}(x - mvx,\; y - mvy) \right| \qquad \text{(Equation 2)}$$
Here, $f_t(x, y)$ denotes the pixel located at row x and column y of the current input frame, and $\hat{f}_{t-n}(x, y)$ denotes the pixel located at row x and column y of the nth frame in the DPB.
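As a minimal sketch of Equations 1 and 2, the following Python fragment computes the SAD of one block for a candidate motion vector and the resulting motion cost. The function names, the `frame[x][y]` list-of-rows layout, and the clamping of reference coordinates at the frame border are assumptions made for illustration; they are not taken from the text or from the H.264 standard.

```python
def sad(cur, ref, x0, y0, height, width, mvx, mvy):
    """SAD of Equation 2 for the block whose top-left pixel is (x0, y0).

    cur and ref are lists of rows, indexed frame[x][y] (x = row, y = column).
    Reference coordinates are clamped to the frame (illustrative assumption).
    """
    rows, cols = len(ref), len(ref[0])
    total = 0
    for x in range(x0, x0 + height):
        for y in range(y0, y0 + width):
            rx = min(max(x - mvx, 0), rows - 1)
            ry = min(max(y - mvy, 0), cols - 1)
            total += abs(cur[x][y] - ref[rx][ry])
    return total


def motion_cost(sad_value, lambda_motion, r_motion):
    """J_motion of Equation 1: distortion plus weighted motion-vector bits."""
    return sad_value + lambda_motion * r_motion
```

For a 16×16 macroblock, `height` and `width` are both 16, so one call performs 256 subtractions, absolute values, and additions.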
Thus, in the 16×16 motion compensation mode, the SAD for each candidate motion vector is calculated by subtracting each of the 16×16 pixels of a macroblock, taking the absolute value of each difference, and adding the absolute values. In the other motion compensation modes, these calculations are performed only for the pixels of each smaller block, so the amount of SAD calculation per block is smaller. However, since blocks belonging to the same macroblock may have different optimal motion vectors, motion estimation must be performed for each block. The set of candidate motion vectors generally depends on the size of the search window. If a search range of ±32 pixels is used in each direction, a total of 65×65 candidate motion vectors exist, i.e. (−32, −32), (−32, −31), (−32, −30), . . . , (−32, 32), (−31, −32), (−31, −31), . . . , (−31, 32), . . . , (32, 32). That is, to find an optimal motion vector in the 16×16 motion compensation mode, the SAD calculation for 16×16 pixels must be performed 65×65 times, and the candidate motion vector minimizing Equation 1 must be selected from among the 65×65 candidates. Likewise, to find an optimal motion vector in the 16×8 motion compensation mode, the SAD calculation for 16×8 pixels must be performed 65×65 times for each 16×8 block. If an encoder uses several reference images, the entire optimal motion estimation must be repeated for each reference image, and a candidate motion vector minimizing Equation 1 must be obtained for each mode and each block. The optimal motion vectors obtained for the available modes are then refined by an additional search of a few locations adjacent to each optimal motion vector. This refinement is performed by calculating Equation 1 at those locations, where the pixel values at those locations are obtained using a 6-tap Low Pass Filter (LPF) and a 2-tap LPF.
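The exhaustive search described above can be sketched as follows. This is an illustration under stated assumptions, not the reference encoder: `mv_bits` is a hypothetical stand-in for the real motion vector rate, frames are lists of rows indexed `frame[x][y]`, and reference coordinates are clamped at the frame border.

```python
def block_sad(cur, ref, x0, y0, height, width, mvx, mvy):
    """SAD of Equation 2, with reference coordinates clamped to the frame."""
    rows, cols = len(ref), len(ref[0])
    total = 0
    for x in range(x0, x0 + height):
        for y in range(y0, y0 + width):
            rx = min(max(x - mvx, 0), rows - 1)
            ry = min(max(y - mvy, 0), cols - 1)
            total += abs(cur[x][y] - ref[rx][ry])
    return total


def mv_bits(mvx, mvy):
    """Hypothetical rate model: longer vectors cost more bits."""
    return abs(mvx) + abs(mvy) + 2


def candidate_count(search_range):
    """Number of integer candidates for a +/- search_range window."""
    return (2 * search_range + 1) ** 2


def full_search(cur, ref, x0, y0, height, width, search_range, lam):
    """Return (cost, mvx, mvy) minimizing Equation 1 over all candidates."""
    best = None
    for mvx in range(-search_range, search_range + 1):
        for mvy in range(-search_range, search_range + 1):
            cost = (block_sad(cur, ref, x0, y0, height, width, mvx, mvy)
                    + lam * mv_bits(mvx, mvy))
            if best is None or cost < best[0]:
                best = (cost, mvx, mvy)
    return best
```

With a range of ±32, `candidate_count(32)` is 4225, so a single 16×16 block needs 4225 × 256 = 1,081,600 absolute-difference operations per reference image before any sub-pixel refinement.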
After estimating an optimal motion vector for each of the inter macroblock available modes from the calculation results, an optimal coding mode is decided through comparison with the intra macroblock available modes. The optimal coding mode is a coding mode minimizing Equation 3 for the 21 available modes illustrated in FIG. 1.

$$J_{mode} = SSD + \lambda_{mode} \cdot R_{mode} \qquad \text{(Equation 3)}$$
Here, λmode denotes a Lagrangian coefficient for mode decision, and Rmode denotes the number of bits used to encode a macroblock in a current candidate mode. SSD (Sum of Squared Distortion) denotes a value obtained by adding squares of differences between pixels of a decoded macroblock and pixels of a corresponding macroblock of an original image. If it is assumed that $f_t(x, y)$ denotes the pixel located at row x and column y of an original image, and $\hat{f}_t(x, y)$ denotes the pixel located at row x and column y of a decoded image, SSD is defined by Equation 4.
$$SSD = \sum_{x,y} \left[ f_t(x, y) - \hat{f}_t(x, y) \right]^2 \qquad \text{(Equation 4)}$$
Thus, mode decision is performed by encoding the current macroblock in each of the 21 available modes illustrated in FIG. 1 to obtain Rmode, the number of encoding bits, decoding the encoded macroblock to obtain the SSD of Equation 4, and comparing the cost functions of Equation 3. When encoding and decoding are performed in the inter macroblock available modes, the optimal motion vector of each mode obtained from motion estimation is used; in the SKIP mode, the motion vector to be used is calculated from already encoded adjacent macroblocks.
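The comparison of Equation 3 across candidate modes can be sketched as below. The sketch assumes that each mode has already been trial-encoded and decoded, so that its SSD and bit count are available; the `trials` dictionary and the function names are hypothetical.

```python
def ssd(original, decoded):
    """SSD of Equation 4 between two equally sized blocks (lists of rows)."""
    return sum(
        (o - d) ** 2
        for row_o, row_d in zip(original, decoded)
        for o, d in zip(row_o, row_d)
    )


def mode_cost(ssd_value, lambda_mode, r_mode):
    """J_mode of Equation 3."""
    return ssd_value + lambda_mode * r_mode


def decide_mode(trials, lambda_mode):
    """Pick the mode with the smallest J_mode.

    trials maps a mode name to (ssd_value, r_mode) from a trial encode.
    """
    return min(
        trials,
        key=lambda mode: mode_cost(trials[mode][0], lambda_mode, trials[mode][1]),
    )
```

The Lagrangian coefficient sets the trade-off: with the same trial results, a larger `lambda_mode` favors modes that spend fewer bits, and a smaller one favors modes with lower distortion.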
The above-described motion estimation and mode decision method in an encoder according to the H.264 video standard requires a large amount of calculation and is the most complex component of the encoder, accounting for 60 to 70% of the entire encoder complexity. Thus, to develop a “well-designed fast encoder”, it is necessary both to speed up this most complex component of the encoder and to minimize the image quality degradation caused by the speed-up.
Representative conventional schemes for quickly performing this complex H.264 mode decision process will now be described. The common basic idea of these schemes is to decrease the complexity of an encoder by combining the motion estimator and the mode decision unit so that motion estimation and mode decision calculations are not performed for modes that are predicted to occur rarely among the available modes. According to C. Sampath Kannangara, Iain E. G. Richardson, Maja Bystrom, Jose R. Solera, Yafan Zhao, Andrew MacLennan, and Robert Cooney (“Low-complexity skip prediction for H.264 through Lagrangian cost estimation”, IEEE Trans. Circuits and Syst. for Video Technol., vol. 16, no. 2, pp. 202-208, February 2006), the SKIP mode is examined first in the first phase of coding mode decision, and if the result indicates a high possibility of encoding in the SKIP mode, high speed is achieved by omitting all remaining mode decision calculations and encoding the current macroblock in the SKIP mode. That is, the cost function of Equation 3 is first obtained for the case where the current macroblock is encoded in the SKIP mode, and this cost is compared to a specific threshold to determine whether the calculations related to motion estimation and mode decision for the other available modes are performed. However, since this method considers only the SKIP mode and performs the sequential calculations as usual for the other coding modes, the achievable speed-up of the encoder is limited.
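The SKIP-first test described in this paragraph can be sketched as follows. The sketch follows the description given here rather than the cited paper itself; `evaluate_other_modes` is a hypothetical callback that runs the full motion estimation and mode decision for the remaining modes and returns the best of them.

```python
def decide_with_skip_check(j_skip, threshold, evaluate_other_modes):
    """Return (mode, cost), skipping the expensive search when SKIP is cheap.

    j_skip: J_mode of Equation 3 for the SKIP mode.
    evaluate_other_modes: callable returning (best_mode, best_cost) for
        all non-SKIP modes; it is only called when the early exit fails.
    """
    if j_skip < threshold:
        # Early exit: no other mode is evaluated.
        return "SKIP", j_skip
    other_mode, other_cost = evaluate_other_modes()
    if j_skip <= other_cost:
        return "SKIP", j_skip
    return other_mode, other_cost
```

The saving comes entirely from the first branch, which is why the gain is limited when few macroblocks fall below the threshold.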
To overcome this limitation, methods have been used that decide a priority for all coding modes including the SKIP mode, sequentially calculate the cost function of Equation 3 according to the decided priority, and compare the results to a series of adaptive thresholds to determine whether the calculations for the remaining modes are performed.
The conventional techniques pursuing fast mode decision for all coding modes including the SKIP mode can be largely divided into two categories. The first category includes methods that reduce the total amount of calculation by performing a specific calculation to decide candidate modes suitable for the current macroblock and comparing the cost function of Equation 3 for only the decided candidate modes. The methods of Qionghai Dai, Dongdong Zhu, and Rong Ding (“Fast mode decision for inter prediction in H.264”, in Proc. IEEE ICIP, October 2004, vol. 1, pp. 119-122); Hyungjoon Kim and Yucel Altunbasak (“Low-complexity macroblock mode selection for H.264/AVC encoders”, in Proc. IEEE ICIP, October 2004, vol. 2, pp. 765-768); and Andy C. Yu and Graham R. Martin (“Advanced block size selection algorithm for inter frame coding in H.264/MPEG-4 AVC”, in Proc. IEEE ICIP, October 2004, vol. 1, pp. 95-98) belong to the first category.
In the method of Qionghai Dai et al., the encoder is sped up by performing the motion estimation and mode decision calculations on an image having ¼ the resolution of the original image to be encoded, selecting specific candidate modes based on the optimal mode obtained in the low-resolution image, and performing mode decision in the full-resolution original image for only the selected modes. The candidate mode selection table used in this method is shown in Table 1.
TABLE 1

Mode obtained in a low-resolution image (Macroblock mode | Inter mode smaller than 8×8) | Candidate modes
SKIP | (none) | SKIP, P16×16
I16×16 | (none) | I16×16
I4×4 | (none) | I16×16, I4×4
P8×8 | SKIP | SKIP, P16×16
P8×8 | P8×8 | P16×16, P8×8
P8×8 | P8×4 | P16×8, P8×4
P8×8 | P4×8 | P8×16, P4×8
P8×8 | P4×4 | P8×8, P4×8, P8×4, P4×4
In more detail, a 7-tap LPF is applied to the original resolution image in the horizontal and vertical directions, and the image is down-sampled by ½ in each of the horizontal and vertical directions to obtain a ¼-resolution image. For each macroblock of the obtained low-resolution image, the motion estimation and mode decision calculations of the example encoder provided with the H.264 standard are performed for all of the intra macroblock available modes and for the inter macroblock available modes having a size smaller than 8×8. While the mode decision calculation is performed, the two modes having the smallest encoding cost function values of Equation 3 are selected, the corresponding candidate modes are taken from Table 1, and the motion estimation and mode decision of the original resolution macroblock are performed for those candidate modes. In Table 1, I denotes an intra mode, and P denotes an inter mode. For example, P16×16 denotes a 16×16-sized inter mode, and I4×4 denotes a 4×4-sized intra mode. This method decreases the calculation complexity of a well-designed encoder by effectively limiting the candidate modes to those having a high possibility of being used for the encoding. However, since only the inter macroblock available modes smaller than 8×8 can be searched in the pre-processing of the low-resolution image, many candidate modes must still be searched to decide the actual coding mode in the original resolution, as illustrated in Table 1, and thus the performance gain is limited.
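The table lookup can be sketched as a dictionary keyed by the low-resolution result. The row grouping below is one reading of Table 1 (a macroblock mode plus, for P8×8, a sub-mode), and merging the candidates of the two best low-resolution modes into one list is likewise an assumption about how the two selections are combined; all names are hypothetical.

```python
# (low-resolution macroblock mode, sub-mode or None) -> full-resolution candidates
TABLE_1 = {
    ("SKIP", None): ["SKIP", "P16x16"],
    ("I16x16", None): ["I16x16"],
    ("I4x4", None): ["I16x16", "I4x4"],
    ("P8x8", "SKIP"): ["SKIP", "P16x16"],
    ("P8x8", "P8x8"): ["P16x16", "P8x8"],
    ("P8x8", "P8x4"): ["P16x8", "P8x4"],
    ("P8x8", "P4x8"): ["P8x16", "P4x8"],
    ("P8x8", "P4x4"): ["P8x8", "P4x8", "P8x4", "P4x4"],
}


def full_resolution_candidates(best_low_res_modes):
    """Merge the Table 1 candidates of the best low-resolution modes.

    best_low_res_modes: list of TABLE_1 keys, best first.
    Duplicates are dropped while the first-seen order is kept.
    """
    candidates = []
    for key in best_low_res_modes:
        for mode in TABLE_1[key]:
            if mode not in candidates:
                candidates.append(mode)
    return candidates
```

Even in this reduced form, a low-resolution result of P4×4 still leaves four inter modes to search at full resolution, which illustrates the limitation noted above.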
On the other hand, in the method of Hyungjoon Kim et al., the total amount of calculation is decreased by performing a fast mode search in the original resolution based on the Sum of Absolute Transformed Difference (SATD) for all candidate modes and performing the actual mode decision calculation for only a few best candidate modes selected according to the search result. In more detail, the optimal motion vectors minimizing Equation 1 are obtained in the inter macroblock available modes, and the encoding cost function of Equation 5 is calculated for each of the available inter and intra macroblock modes illustrated in FIG. 1.

$$J_{SATD} = SATD + \lambda_{mode} \cdot R_{est} \qquad \text{(Equation 5)}$$
Here, SATD denotes the value obtained by applying a Hadamard transform to the difference between a motion-compensated or intra-predicted macroblock and the original macroblock to be encoded and summing the absolute values of the transform coefficients, and Rest denotes the number of bits used to encode the macroblock header and the motion vector. The coding mode of the current macroblock is decided by selecting the N candidate modes having the smallest values of Equation 5 from among the 21 available modes and performing the actual mode decision of Equation 3 for those N candidate modes. Since SATD can be computed using only simple integer calculations and Rest can easily be implemented by table lookup, the method of Hyungjoon Kim et al. can perform fast mode decision without degrading image quality compared to the full mode decision method of Equation 3. However, this technique has the problems that the number (N) of candidate modes cannot be adaptively changed according to the video characteristics and that the high calculation load of the motion estimator associated with the mode decision is not reduced along with it.
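A minimal sketch of the SATD pre-selection follows. It assumes a 4×4 Hadamard transform applied to one 4×4 difference block with no normalization factor, and it takes the per-mode values of Equation 5 as already computed; the block size, the lack of scaling, and all names are illustrative assumptions.

```python
# 4x4 Hadamard matrix (symmetric, entries +1/-1).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul4(a, b):
    """Product of two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd4x4(original, predicted):
    """Sum of absolute Hadamard-transformed differences of one 4x4 block."""
    diff = [[o - p for o, p in zip(row_o, row_p)]
            for row_o, row_p in zip(original, predicted)]
    transformed = matmul4(matmul4(H4, diff), H4)  # H4 equals its transpose
    return sum(abs(v) for row in transformed for v in row)


def top_n_modes(j_satd_by_mode, n):
    """Return the n modes with the smallest J_SATD (Equation 5)."""
    return sorted(j_satd_by_mode, key=j_satd_by_mode.get)[:n]
```

Only the modes returned by `top_n_modes` would then go through the full encode and decode needed for Equation 3. Because `n` is a fixed argument here, the sketch also shows why the number of candidates does not adapt to the video content.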
In the method of Andy C. Yu et al., the candidate modes to be searched are limited by measuring the complexity (activity) of the current unit to be encoded and the motion consistency of its sub-blocks and comparing the measured results to experimental thresholds. In more detail, a complexity ratio Rc represented by Equation 6 is obtained for the current macroblock to be encoded.
$$R_c = \frac{\ln(E_{AC})}{\ln(E_{max})} \qquad \text{(Equation 6)}$$
Here, EAC denotes the total energy of the AC (high-frequency) coefficients of the current macroblock, and Emax denotes the maximum variance of the current macroblock.
The obtained complexity ratio Rc is compared to the experimental threshold. If Rc is less than the threshold, the current macroblock is classified as belonging to a homogeneous area; otherwise, it is classified as belonging to a heterogeneous area. If the current macroblock belongs to a homogeneous area and the macroblock at the same position in the previous frame was encoded with a block size of not less than 8×8, the mode decision of Equation 3 is performed with the candidate modes of the current macroblock limited to SKIP, P16×16, and all the available intra modes. Otherwise, 4 motion vectors minimizing Equation 1 are estimated for the 8×8-sized blocks belonging to the current macroblock, and the macroblock is classified as a continuous motion macroblock or a discontinuous motion macroblock by obtaining the maximum absolute value of the differences between the 4 estimated optimal motion vectors and comparing it to a threshold. If the current macroblock is classified as a continuous motion macroblock, the mode decision of Equation 3 is performed for SKIP, P16×16, P16×8, P8×16, and all the available intra macroblock modes; otherwise, the coding mode of the macroblock is set by performing the mode decision of Equation 3 for all 21 available modes. This method decreases the complexity of an encoder by properly limiting the candidate modes for the mode decision of Equation 3 using the complexity and motion consistency of the current macroblock to be encoded.
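The two-stage classification can be sketched as follows. Several details are assumptions made only for illustration: modes are represented as families (for example "I4x4" stands for all nine 4×4 intra prediction modes), the motion vector difference is taken as the largest per-component difference over all pairs of the four 8×8 vectors, and "continuous" is taken to mean that this difference is below the threshold.

```python
import math

INTRA = ["I16x16", "I4x4"]
ALL_FAMILIES = ["SKIP", "P16x16", "P16x8", "P8x16",
                "P8x8", "P8x4", "P4x8", "P4x4"] + INTRA


def complexity_ratio(e_ac, e_max):
    """R_c of Equation 6."""
    return math.log(e_ac) / math.log(e_max)


def max_mv_difference(mvs):
    """Largest absolute component difference between any two vectors."""
    return max(
        max(abs(a[0] - b[0]), abs(a[1] - b[1]))
        for i, a in enumerate(mvs)
        for b in mvs[i + 1:]
    )


def select_candidates(rc, rc_threshold, colocated_is_8x8_or_larger,
                      mvs_8x8, mv_threshold):
    """Return the mode families to evaluate with Equation 3."""
    homogeneous = rc < rc_threshold
    if homogeneous and colocated_is_8x8_or_larger:
        return ["SKIP", "P16x16"] + INTRA
    # In a real encoder the four 8x8 vectors would be estimated only here.
    if max_mv_difference(mvs_8x8) < mv_threshold:  # continuous motion
        return ["SKIP", "P16x16", "P16x8", "P8x16"] + INTRA
    return list(ALL_FAMILIES)
```

Note that `INTRA` appears in every branch, which reflects the point that this scheme never prunes the intra modes.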
However, because this method reflects only the characteristics of the macroblock to be encoded, it does not use information on adjacent macroblocks or previously encoded macroblocks. In addition, the intra macroblock available modes cannot be limited, and candidate modes are basically selected only by distinguishing inter macroblock available modes having a large sub-block size from those having a small sub-block size. Thus, the performance improvement is limited.
Although the conventional techniques for fast H.264 coding mode decision described above speed up the decision for all H.264 coding modes including the SKIP mode, they perform the mode calculations in a predetermined sequence regardless of image or coding characteristics, and consequently an optimal mode is obtained only after calculating a large number of modes. In contrast, the present invention, which will be described later, differs fundamentally from these conventional techniques in that a relatively large reduction in complexity can be obtained by deciding candidate modes through a statistical characteristic based on an encoding history.
Meanwhile, the second category of conventional techniques performing fast mode decision for all coding modes including the SKIP mode includes methods that reduce the total amount of calculation by removing from the candidate modes those cases that rarely occur as the optimal mode, using a global statistical characteristic of the optimal mode, as disclosed by Lidong Xu and Xinggang Lin (“Fast mode decision for inter frames in H.264/AVC”, in Proc. IEEE ISCIT, October 2005, vol. 1, pp. 433-436) and by Dongming Zhang, Yanfei Shen, Shouxun Lin, and Yongdong Zhang (“Fast inter frame encoding based on modes pre-decision in H.264”, in Proc. IEEE International Conf. on Multimedia and Expo, ICME, July 2005, pp. 530-533). First, Lidong Xu et al. analyzed how the occurrence frequency of each mode, including modes with segmented spaces of various sizes, changes, using statistics of the results of full mode decision in H.264. A search sequence for the modes was determined from the analysis result. While the mode decision calculation is performed in the determined sequence, the resulting cost function value of each selected mode is compared with a predetermined threshold to determine whether the mode decision ends early or whether the search of a specific mode is omitted. This method will now be described in more detail.
The cost function Jmode of Equation 3 is first obtained for the SKIP mode, which requires the least motion estimation calculation, and is called Jmode(SKIP). Jmode(SKIP) is compared to a threshold T1, and if Jmode(SKIP)<T1, the SKIP mode is decided as the optimal mode of the current macroblock, and all subsequent mode decision calculations are omitted. If Jmode(SKIP)≥T1, Jmode(SKIP) is compared to a second threshold T2, and if Jmode(SKIP)<T2, the mode search for the intra macroblock available modes is not performed. If the current macroblock is not decided as the SKIP mode in this first step, an optimal motion vector of Equation 1 is estimated for the P16×16 mode, and the cost function of Equation 3 is calculated using the estimated motion vector and is called Jmode(P16×16). A third threshold T3 is then introduced: if Jmode(SKIP)<T2 and Jmode(SKIP)<Jmode(P16×16)+T3, the mode having the smaller value between Jmode(P16×16) and Jmode(SKIP) is decided as the optimal mode for encoding the current macroblock, and all subsequent mode decision calculations are stopped. This rule selects cases in which the P16×16 mode improves on the SKIP mode, but not by a large amount, because in such cases it is highly likely that adopting a mode having smaller segmented spaces would not improve the cost function much either.
If neither the SKIP mode nor the P16×16 mode is decided because the above conditions are not satisfied, an optimal motion vector of Equation 1 is estimated for each of the P16×8 and P8×16 modes, which have the next smaller segmented space, the cost functions of Equation 3 are obtained using the estimated results, and the smaller of the two cost functions is called Jmode(P16). If Jmode(P16) is greater than Jmode(SKIP) or Jmode(P16×16), the cost function is increasing as the segmented space becomes smaller, and thus the mode having the smaller value between Jmode(P16×16) and Jmode(SKIP) is decided as the optimal mode for encoding the current macroblock, and all subsequent mode decision calculations are stopped. In a case that does not correspond to any of the cases described above, motion vector estimation satisfying Equation 1 is performed for the P8×8 mode and all other inter macroblock available modes having a smaller segmented space than the P8×8 mode, and the cost function of Equation 3 is calculated using the motion vector estimation results. These results are compared to the cost functions of Equation 3 for the intra macroblock available modes, and the optimal mode is decided by an all-mode search as in the conventional full search. Note that whether the intra macroblock available modes are searched is determined by the result of the comparison between Jmode(SKIP) and T2 in the early stage of this method.
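The threshold cascade described in the two preceding paragraphs can be sketched as follows. The callback `cost(mode)` is a hypothetical function returning the Equation 3 cost of a mode (including any motion estimation it needs), modes are represented as families, and the thresholds `t1`, `t2`, and `t3` correspond to T1, T2, and T3. The list of evaluated modes is returned only to make the saved work visible.

```python
SMALL_INTER = ["P8x8", "P8x4", "P4x8", "P4x4"]
INTRA = ["I16x16", "I4x4"]


def fast_mode_decision(cost, t1, t2, t3):
    """Return (best_mode, evaluated_modes) following the cascade above."""
    evaluated = {}

    def j(mode):
        # Evaluate each mode at most once and remember the order.
        if mode not in evaluated:
            evaluated[mode] = cost(mode)
        return evaluated[mode]

    def best_of(modes):
        return min(modes, key=j)

    # Step 1: SKIP against T1.
    if j("SKIP") < t1:
        return "SKIP", list(evaluated)
    # Intra modes are searched only if J(SKIP) is not below T2.
    use_intra = j("SKIP") >= t2
    # Step 2: P16x16 against SKIP with margin T3.
    if j("SKIP") < t2 and j("SKIP") < j("P16x16") + t3:
        return best_of(["SKIP", "P16x16"]), list(evaluated)
    # Step 3: P16x8 / P8x16; stop if the cost is rising.
    j_p16 = min(j("P16x8"), j("P8x16"))
    if j_p16 > j("SKIP") or j_p16 > j("P16x16"):
        return best_of(["SKIP", "P16x16"]), list(evaluated)
    # Step 4: remaining inter modes, plus intra if allowed.
    for mode in SMALL_INTER + (INTRA if use_intra else []):
        j(mode)
    return best_of(list(evaluated)), list(evaluated)
```

Since the thresholds are plain arguments that stay fixed for every macroblock, the sketch also reflects the fixed search order and fixed thresholds of this approach.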
This method significantly reduced the complexity of H.264 encoding by effectively using the statistical characteristic that most background portions of video screens are encoded in a mode having a large segmented space. However, it does not adaptively use the characteristics of temporally varying video screens; that is, the coding mode search sequence for all macroblocks is fixed by a general statistical characteristic. As a result, the degree of speed-up varies greatly from scene to scene. In addition, although the degree of speed-up and the decoded image quality change significantly with the threshold setting, fixed thresholds without adaptability are used, and thus the gain is not uniform across the video to be processed and the encoding environment.
Meanwhile, Dongming Zhang et al. performed a statistical analysis of the optimal mode occurrence frequency, similar to that of Lidong Xu et al., for the case of using a plurality of reference images, and used the analysis result to considerably limit the candidate modes searched in the second and subsequent reference images. This method will now be described in more detail.
By calculating the cost function of Equation 1 for the SKIP mode and comparing the cost function to a threshold, it is determined whether the mode decision calculation is stopped. If the SKIP mode is not decided, the optimal mode minimizing the cost function of Equation 3 is decided for all available modes using the first reference image, as in the example H.264 encoder, and the following intermediate variables are set:
BetterIntraMode: the mode having the smaller cost function value of Equation 3 from among I16×16 and I4×4;
BestMode: the mode having the minimum cost function value of Equation 3 from among all available modes; and
CostBestMode: the cost function value of Equation 3 in BestMode.
Using the set intermediate variables, candidate modes for performing a mode decision search in a subsequent reference image are set as described below. This candidate mode setting method reflects a statistical characteristic of an optimal coding mode.
When BestMode is P16×8, P16×8 and P8×8 are set as candidate modes, and if BetterIntraMode is I4×4, all available modes including a segmented space smaller than 8×8 are added to the candidate modes.
When BestMode is P8×16, P8×16 and P8×8 are set as candidate modes, and if BetterIntraMode is I4×4, all available modes including a segmented space smaller than 8×8 are added to the candidate modes.
In the other cases, P8×8 is set as a candidate mode, and if BetterIntraMode is I16×16, P16×16 is added as a candidate mode.
An optimal coding mode is selected by performing the motion estimation of Equation 1 and calculating the encoding cost function of Equation 3 for the set candidate modes. The selected optimal coding mode is called BestModeNew, and its cost function value is called CostBestModeNew. If CostBestModeNew>CostBestMode, the current macroblock is encoded in BestMode, and no further reference image search is performed. If CostBestModeNew≤CostBestMode, BestMode and CostBestMode are updated to BestModeNew and CostBestModeNew, and candidate mode setting and the optimal coding mode search are performed for the subsequent reference image. By repeating the above-described process, fast coding mode decision for all reference images is performed.
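The candidate setting rules and the loop over further reference images can be sketched as follows. The stopping test is taken here to compare the cost values of the new and current best modes, which is an interpretation of the description; `cost(ref_index, mode)` is a hypothetical callback returning the Equation 3 cost of a mode in a given reference image, and the remaining names are illustrative.

```python
SMALLER_THAN_8X8 = ["P8x4", "P4x8", "P4x4"]


def next_reference_candidates(best_mode, better_intra_mode):
    """Candidate modes for the next reference image."""
    if best_mode in ("P16x8", "P8x16"):
        candidates = [best_mode, "P8x8"]
        if better_intra_mode == "I4x4":
            candidates += SMALLER_THAN_8X8
        return candidates
    candidates = ["P8x8"]
    if better_intra_mode == "I16x16":
        candidates.append("P16x16")
    return candidates


def search_further_references(best_mode, best_cost, better_intra_mode,
                              num_refs, cost):
    """Search reference images 1..num_refs-1 after a full search in image 0.

    Returns (best_mode, best_cost, best_ref_index).
    """
    best_ref = 0
    for ref in range(1, num_refs):
        candidates = next_reference_candidates(best_mode, better_intra_mode)
        new_mode = min(candidates, key=lambda m: cost(ref, m))
        new_cost = cost(ref, new_mode)
        if new_cost > best_cost:
            break  # no improvement: stop searching further references
        best_mode, best_cost, best_ref = new_mode, new_cost, ref
    return best_mode, best_cost, best_ref
```

In this sketch each additional reference image costs at most five mode evaluations, compared with a full search in the first reference image.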
Although this method significantly reduced the mode decision calculation complexity of an H.264 encoder through statistical optimal mode occurrence frequency analysis in an encoding environment using a plurality of reference images, the candidate mode decision method is fixed by generalizing statistical characteristics found across several video screens, as in the case of Lidong Xu et al., and thus the characteristics of temporally varying video screens cannot be adaptively used.