The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
As information and communication technologies, including the Internet, have developed, video communication has come into wide use alongside voice communication. Conventional text-based communication is no longer sufficient to satisfy the diverse demands of consumers, so multimedia services capable of carrying diverse types of information such as text, video, and music are increasingly provided. Because multimedia data is large, it requires high-capacity storage media and a wide transmission bandwidth. Therefore, compression coding techniques are necessary for transmitting multimedia data, including text, video, and audio data.
The basic principle of data compression is to remove redundancy from the data. Data can be compressed by removing spatial redundancy, such as the repetition of the same color or object within an image; temporal redundancy, such as the repetition of the same note in audio or little change between adjacent frames of a moving picture; or psychovisual redundancy, which exploits the fact that human visual perception is insensitive to high frequencies.
Among video compression methods, H.264/AVC has recently drawn increasing interest because of its improved compression efficiency over MPEG-4 (Moving Picture Experts Group-4).
H.264, a digital video codec standard with a very high data compression rate, is also referred to as MPEG-4 Part 10 or AVC (Advanced Video Coding). It was standardized jointly by the VCEG (Video Coding Experts Group) of the ITU-T (International Telecommunication Union Telecommunication Standardization Sector) and the MPEG of ISO/IEC (International Organization for Standardization/International Electrotechnical Commission), which together formed the Joint Video Team (JVT).
Various methods have been proposed to improve compression efficiency in compression encoding; representative ones are temporal prediction and spatial prediction.
Temporal prediction predicts a current block 112 of a current frame 110 with reference to a reference block 122 of another, temporally adjacent frame 120, as shown in FIG. 1. That is, in inter-predicting the current block 112 of the current frame 110, the temporally adjacent reference frame 120 is searched for the reference block 122 that is most similar to the current block 112. Here, the reference block 122 is the block that best predicts the current block 112; for example, the block having the smallest SAD (Sum of Absolute Differences) from the current block 112 can be chosen as the reference block 122. The reference block 122 becomes the predicted block of the current block 112, and a residual block is generated by subtracting the reference block 122 from the current block 112. The residual block is encoded and inserted into a bitstream. The relative displacement between the position of the current block 112 in the current frame 110 and the position of the reference block 122 in the reference frame 120 is the motion vector 130, which is encoded as well, like the residual block. Temporal prediction is also referred to as inter prediction or inter-frame prediction.
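The block-matching step described above can be sketched in Python as an exhaustive (full) search that minimizes SAD. This is a minimal illustration under stated assumptions, not the method of any particular encoder: frames are assumed to be lists of rows of integer luma samples, only integer-pixel displacements within a hypothetical `search_range` are tested, and the function names `sad` and `full_search` are invented for this example.

```python
from typing import List, Tuple

Frame = List[List[int]]  # rows of luma samples, indexed frame[y][x]


def sad(cur: Frame, ref: Frame, cx: int, cy: int, rx: int, ry: int, n: int) -> int:
    """Sum of absolute differences between the n x n block at (cx, cy) in cur
    and the n x n block at (rx, ry) in ref."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def full_search(cur: Frame, ref: Frame, cx: int, cy: int, n: int,
                search_range: int) -> Tuple[Tuple[int, int], Frame]:
    """Test every integer displacement within +/- search_range and return the
    motion vector (dx, dy) with the smallest SAD plus the residual block."""
    h, w = len(ref), len(ref[0])
    best = None  # (cost, dx, dy)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > w or ry + n > h:
                continue  # candidate would extend outside the reference frame
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    _, dx, dy = best
    # Residual = current block minus the best-matching (predicted) block.
    residual = [[cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i]
                 for i in range(n)] for j in range(n)]
    return (dx, dy), residual
```

For example, if the content of `ref` appears in `cur` such that the best match for the current block lies two pixels to the right and one pixel below its own position, `full_search` returns the motion vector `(2, 1)` and an all-zero residual. Practical encoders usually replace the exhaustive loop with faster search patterns and add sub-pixel refinement.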
Spatial prediction obtains the predicted pixel values of a target block from the reconstructed pixel values of reference blocks adjacent to the target block within the same frame. It is also referred to as directional intra prediction (hereinafter simply referred to as “intra prediction”) or intra-frame prediction. H.264 defines encoding and decoding using intra prediction.
Intra prediction predicts the values of a current subblock by copying the adjacent pixels located above and to the left of the subblock along a determined direction, and encodes only the difference. In the H.264 intra prediction scheme, the predicted block for a current block is generated from blocks that precede it in coding order, and what is coded is the residual obtained by subtracting the predicted block from the current block. For each block, an H.264 video encoder selects, from among the prediction modes, the mode that yields the smallest difference between the current block and the predicted block.
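The per-block mode decision can be sketched as picking the candidate predicted block with the smallest SAD. Using SAD as the "difference" measure is an assumption made for illustration (practical encoders often use a rate-distortion cost instead), and `block_sad` and `choose_mode` are hypothetical names; the candidate predicted blocks are assumed to have been generated already, one per mode.

```python
from typing import Dict, List, Tuple

Block = List[List[int]]


def block_sad(a: Block, b: Block) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def choose_mode(current: Block, candidates: Dict[int, Block]) -> Tuple[int, Block]:
    """Return (best_mode, residual), where best_mode minimises the SAD between
    the current block and its predicted block; ties go to the lower mode number."""
    best_mode = min(sorted(candidates),
                    key=lambda m: block_sad(current, candidates[m]))
    # Only this residual (plus the mode number) would be passed on for coding.
    residual = [[c - p for c, p in zip(rc, rp)]
                for rc, rp in zip(current, candidates[best_mode])]
    return best_mode, residual
```

Because `min` returns the first minimum of the sorted mode list, equal costs resolve to the smaller mode number, which keeps the choice deterministic.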
For a 4×4 luma block and an 8×8 luma block, H.264 intra prediction defines the nine prediction modes shown in FIG. 2, which differ in prediction direction and in the positions of the adjacent pixels used to generate the predicted pixel values. According to their prediction directions, the nine prediction modes are the vertical prediction mode (prediction mode 0), the horizontal prediction mode (prediction mode 1), the DC prediction mode (prediction mode 2), the diagonal_down_left prediction mode (prediction mode 3), the diagonal_down_right prediction mode (prediction mode 4), the vertical_right prediction mode (prediction mode 5), the horizontal_down prediction mode (prediction mode 6), the vertical_left prediction mode (prediction mode 7), and the horizontal_up prediction mode (prediction mode 8). For a 4×4 block, the DC prediction mode uses the average value of the eight adjacent pixels.
Four intra prediction modes are used for intra prediction of a 16×16 luma block: the vertical prediction mode (prediction mode 0), the horizontal prediction mode (prediction mode 1), the DC prediction mode (prediction mode 2), and the plane prediction mode (prediction mode 3). The same four types of prediction are also used for intra prediction of an 8×8 chroma block.
FIG. 3 illustrates the labeling of the pixels used by the nine intra prediction modes shown in FIG. 2. The predicted block for the current block (the area containing pixels a to p) is generated from previously decoded samples A to M. When E, F, G, and H have not been decoded in advance, they can be generated virtually by copying D into their positions.
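Gathering the labelled samples of FIG. 3, including the substitution of D for unavailable E to H, can be sketched as follows. The reconstructed picture is assumed to be a list of rows, the block is assumed not to touch the top or left picture edge, and `neighbours_4x4` and its `upper_right_available` flag are hypothetical names introduced only for this example.

```python
from typing import Dict, List


def neighbours_4x4(recon: List[List[int]], x0: int, y0: int,
                   upper_right_available: bool) -> Dict[str, int]:
    """Gather the labelled neighbours A..M of the 4x4 block whose top-left
    pixel is recon[y0][x0]. Assumes x0 >= 1 and y0 >= 1 (not on a picture edge)."""
    above = recon[y0 - 1]
    s = {label: above[x0 + i] for i, label in enumerate("ABCD")}
    if upper_right_available:
        s.update({label: above[x0 + 4 + i] for i, label in enumerate("EFGH")})
    else:
        # E..H have not been decoded yet: substitute copies of D.
        s.update({label: s["D"] for label in "EFGH"})
    s.update({label: recon[y0 + i][x0 - 1] for i, label in enumerate("IJKL")})
    s["M"] = recon[y0 - 1][x0 - 1]  # upper-left corner sample
    return s
```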
FIG. 4 illustrates the nine prediction modes of FIG. 2 using the labeling of FIG. 3. Referring to FIG. 4, in prediction mode 0 all pixels in the same vertical line of the predicted block are predicted with the same value, taken from the nearest pixel of the reference block located above the predicted block. The reconstructed value of adjacent pixel A is used as the predicted value of pixels a, e, i, and m in the first column of the predicted block. In the same way, pixels b, f, j, and n in the second column are predicted from the reconstructed value of adjacent pixel B, pixels c, g, k, and o in the third column from that of adjacent pixel C, and pixels d, h, l, and p in the fourth column from that of adjacent pixel D. As a result, a predicted block whose columns take the values of pixels A, B, C, and D, respectively, is generated as shown in FIG. 5A.
Similarly, in prediction mode 1 all pixels in the same horizontal line of the predicted block are predicted with the same value, taken from the nearest pixel of the reference block located to the left of the predicted block. The reconstructed value of adjacent pixel I is used as the predicted value of pixels a, b, c, and d in the first row of the predicted block. In the same way, pixels e, f, g, and h in the second row are predicted from the reconstructed value of adjacent pixel J, pixels i, j, k, and l in the third row from that of adjacent pixel K, and pixels m, n, o, and p in the fourth row from that of adjacent pixel L. As a result, a predicted block whose rows take the values of pixels I, J, K, and L, respectively, is generated as shown in FIG. 5B.
In prediction mode 2, all pixels of the predicted block are replaced with the same value: the average of the upper pixels A, B, C, and D and the left pixels I, J, K, and L.
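Prediction modes 0, 1, and 2 can be sketched together, taking the labelled neighbours A to M as a dictionary. The DC value is computed as `(sum + 4) >> 3`, which is the rounded mean H.264 uses when all eight neighbours are available; the function name `predict_4x4` and the dictionary representation are assumptions for illustration.

```python
from typing import Dict, List


def predict_4x4(mode: int, s: Dict[str, int]) -> List[List[int]]:
    """Predicted 4x4 block (rows a-d, e-h, i-l, m-p) for modes 0-2 from the
    labelled neighbours s['A'..'M']."""
    top = [s[c] for c in "ABCD"]
    left = [s[c] for c in "IJKL"]
    if mode == 0:  # vertical: each column repeats A, B, C or D
        return [top[:] for _ in range(4)]
    if mode == 1:  # horizontal: each row repeats I, J, K or L
        return [[left[y]] * 4 for y in range(4)]
    if mode == 2:  # DC: rounded mean of the eight neighbours
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("this sketch covers prediction modes 0-2 only")
```

With A to D = 1, 2, 3, 4 and I to L = 10, 20, 30, 40, mode 0 yields four copies of the row [1, 2, 3, 4] (FIG. 5A), mode 1 yields rows of 10, 20, 30, and 40 (FIG. 5B), and mode 2 fills the block with (110 + 4) >> 3 = 14.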
In prediction mode 3, the pixels of the predicted block are interpolated in the lower-left direction at an angle of 45° between the lower-left and upper-right of the block, and in prediction mode 4 they are extrapolated in the lower-right direction at an angle of 45°. In prediction mode 5, the pixels are extrapolated in the lower-right direction at an angle of about 26.6° from the vertical (width/height = ½). In prediction mode 6, the pixels are extrapolated in the lower-right direction at about 26.6° from the horizontal; in prediction mode 7, they are extrapolated in the lower-left direction at about 26.6° from the vertical; and in prediction mode 8, they are interpolated in the upward direction at about 26.6° from the horizontal.
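The figure of about 26.6° follows directly from the stated slope: a displacement of one pixel across for every two pixels along (width/height = ½) gives

$$\theta = \arctan\left(\frac{1}{2}\right) \approx 26.57^\circ$$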
In prediction modes 3 to 8, the pixels of the predicted block can be generated as weighted averages of the previously decoded pixels A to M. For example, in prediction mode 4, pixel d at the upper-right corner of the predicted block can be estimated as in Equation (1), where round( ) rounds to the nearest whole number:

$$d = \mathrm{round}\left(\frac{B}{4} + \frac{C}{2} + \frac{D}{4}\right) \qquad (1)$$
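Equation (1) can be checked with a small sketch of the whole diagonal_down_right block. The integer form `(B + 2*C + D + 2) >> 2` used below is how H.264 expresses this rounded weighted average (halves round up), and every other pixel applies the same three-tap filter along the 45° diagonal. The dictionary of labelled neighbours and the function name `diagonal_down_right_4x4` are illustrative assumptions.

```python
from typing import Dict, List


def diagonal_down_right_4x4(s: Dict[str, int]) -> List[List[int]]:
    """Prediction mode 4 for a 4x4 block from labelled neighbours s['A'..'M']."""
    top = [s[c] for c in "ABCD"]
    left = [s[c] for c in "IJKL"]

    def p(x: int, y: int) -> int:
        """Neighbour at (x, y): y == -1 is the row above, x == -1 the column
        to the left, and (-1, -1) is the corner sample M."""
        if x == -1 and y == -1:
            return s["M"]
        if y == -1:
            return top[x]
        return left[y]

    pred = [[0] * 4 for _ in range(4)]
    for y in range(4):
        for x in range(4):
            if x > y:    # above the main diagonal: filter the top row
                a, b, c = p(x - y - 2, -1), p(x - y - 1, -1), p(x - y, -1)
            elif x < y:  # below the main diagonal: filter the left column
                a, b, c = p(-1, y - x - 2), p(-1, y - x - 1), p(-1, y - x)
            else:        # on the main diagonal: filter through the corner M
                a, b, c = p(0, -1), p(-1, -1), p(-1, 0)
            pred[y][x] = (a + 2 * b + c + 2) >> 2  # rounded (a/4 + b/2 + c/4)
    return pred
```

With A to D = 20, 30, 40, 50, I to L = 60, 70, 80, 90, and M = 10, pixel d (`pred[0][3]`) is (30 + 80 + 50 + 2) >> 2 = 40, which matches round(7.5 + 20 + 12.5) = 40 from Equation (1).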
Meanwhile, 16×16 prediction of luma components has four modes, prediction modes 0, 1, 2, and 3, as described above.
In prediction mode 0, the pixels of the predicted block are extrapolated from the upper pixels, and in prediction mode 1, they are extrapolated from the left pixels. In prediction mode 2, the pixels of the predicted block are calculated as the average of the upper and left pixels. Lastly, in prediction mode 3, a linear “plane” function fitted to the upper and left pixels is used; this mode is better suited to areas in which the luminance changes smoothly.
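The plane mode can be sketched as follows. The integer equations follow the 16×16 luma plane prediction of H.264 as we read it (gradients H and V measured on the neighbours, then a = 16·(p[−1,15] + p[15,−1]), b = (5H + 32) >> 6, c = (5V + 32) >> 6); treat this as a sketch to be checked against the standard. Samples are assumed to be 8-bit, and the function and parameter names are hypothetical.

```python
from typing import List


def plane_16x16(top: List[int], left: List[int], corner: int) -> List[List[int]]:
    """16x16 luma plane prediction (8-bit samples assumed).
    top[x] = p[x, -1] and left[y] = p[-1, y] for x, y in 0..15; corner = p[-1, -1]."""
    def above(x: int) -> int:
        return corner if x == -1 else top[x]

    def leftcol(y: int) -> int:
        return corner if y == -1 else left[y]

    # Horizontal and vertical gradients measured on the neighbouring samples.
    h = sum((i + 1) * (above(8 + i) - above(6 - i)) for i in range(8))
    v = sum((j + 1) * (leftcol(8 + j) - leftcol(6 - j)) for j in range(8))
    a = 16 * (left[15] + top[15])
    b = (5 * h + 32) >> 6
    c = (5 * v + 32) >> 6

    def clip(val: int) -> int:
        return max(0, min(255, val))

    # Linear plane centred on the block, rounded and clipped to 8 bits.
    return [[clip((a + b * (x - 7) + c * (y - 7) + 16) >> 5) for x in range(16)]
            for y in range(16)]
```

Flat neighbours reproduce a flat block, while a horizontal ramp in the upper row produces a block that ramps smoothly from left to right, which is why this mode suits smoothly varying luminance.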
As described above, in every H.264 prediction mode except the DC mode (and the plane mode), the pixel values of the predicted block are generated along the direction corresponding to that mode from the pixels adjacent to the block currently being encoded.
Meanwhile, the prediction error between the value predicted by each prediction mode and the current pixel value is transform-encoded using an integer transform based on the DCT (Discrete Cosine Transform). Depending on the block size, a 4×4 integer transform is applied when the 4×4 or 16×16 intra prediction mode is used, and an 8×8 integer transform is applied when the 8×8 intra prediction mode is used.
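The 4×4 integer transform can be sketched as the matrix product Y = C·X·Cᵀ with the H.264 forward core matrix C, which needs only additions, subtractions, and shifts by one in hardware. The post-scaling that H.264 folds into quantization is deliberately omitted here, so the outputs are unscaled coefficients; the helper names are assumptions for this example.

```python
from typing import List

Matrix = List[List[int]]

# Forward core transform matrix of the H.264 4x4 integer transform.
CF: Matrix = [[1, 1, 1, 1],
              [2, 1, -1, -2],
              [1, -1, -1, 1],
              [1, -2, 2, -1]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def forward_core_4x4(residual: Matrix) -> Matrix:
    """Y = CF . X . CF^T using integer arithmetic only; the scaling that H.264
    folds into quantisation is not applied here."""
    return matmul(matmul(CF, residual), transpose(CF))
```

A constant residual block of ones maps entirely to the DC coefficient (Y[0][0] = 16, all others 0), illustrating how the transform compacts a flat prediction error into a single coefficient.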
The VCEG of the ITU-T has recently been working to extend H.264 and further improve its predictive encoding performance. Specifically, in “Improvement of Bidirectional Intra Prediction”, ITU-T SG16/Q.6 Doc. VCEG-AG08, October 2007, by Shiodera Taichiro, Akiyuki Tanizawa, Takeshi Chujoh, and Tomoo Yamakage, predictive encoding performance is improved by increasing the number of intra prediction modes through more diverse directions for the pixel values used in intra prediction, and by introducing a scheme that takes a weighted sum of two intra prediction modes. However, this scheme greatly increases the amount of computation needed to find the optimal mode, because the number of modes to be considered increases up to fourfold, and it also increases the amount of side information needed to encode the additional prediction modes.
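The idea of weighting two intra predictions can be illustrated with a per-pixel weighted average of two predicted blocks. This is only a simplified sketch: the actual weights and mode pairs of the cited VCEG-AG08 proposal are not reproduced, equal weights are assumed by default, and `combine_predictions` with its `w1`/`w2` parameters is a hypothetical helper.

```python
from typing import List

Block = List[List[int]]


def combine_predictions(p1: Block, p2: Block, w1: int = 1, w2: int = 1) -> Block:
    """Per-pixel weighted average of two predicted blocks, rounded to the
    nearest integer for non-negative samples (weights are hypothetical)."""
    total = w1 + w2
    return [[(w1 * a + w2 * b + total // 2) // total for a, b in zip(r1, r2)]
            for r1, r2 in zip(p1, p2)]
```

For instance, averaging a vertical prediction whose rows are [1, 2, 3, 4] with a horizontal prediction whose rows are 10, 20, 30, and 40 gives (1 + 10 + 1) // 2 = 6 at the top-left pixel. Every additional pair of modes evaluated this way adds another candidate to the encoder's search, which is the cost noted above.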
Unlike conventional research that improves intra coding by making the intra prediction itself more accurate, “Improved Intra Coding”, ITU-T SG16/Q.6 Doc. VCEG-AG11, October 2007, by Yan Ye and Marta Karczewicz proposes a transform scheme that uses different direction-dependent KLT (Karhunen-Loeve Transform) bases. It is based on the observation that spatial redundancy still remains in the prediction error after intra prediction, and that this redundancy is highly correlated with the intra prediction direction. By using KLT bases trained on a number of test images, the scheme encodes the prediction error adaptively according to the intra prediction mode without any additional side information, and significantly improves intra coding performance. However, such trained transform bases cannot achieve optimal energy compaction for video sequences with differing characteristics, or for local regions within one sequence whose characteristics differ.
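How a KLT basis is trained, and why it compacts energy only for data resembling its training set, can be sketched with a generic 1-D KLT on 4-sample residual rows: the basis vectors are the eigenvectors of the sample correlation matrix, ordered by decreasing eigenvalue. This is an illustration under stated assumptions, not the procedure of the cited VCEG-AG11 proposal: residuals are assumed zero-mean (so the correlation matrix stands in for the covariance), a single basis is trained rather than one per intra mode, and all function names are hypothetical. A small cyclic Jacobi eigen-solver keeps the example dependency-free.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def jacobi_eigh(a: Matrix, max_sweeps: int = 50, tol: float = 1e-12) -> Tuple[List[float], Matrix]:
    """Eigen-decompose a small symmetric matrix with cyclic Jacobi rotations.
    Returns (eigenvalues, V) where column k of V is the k-th eigenvector."""
    n = len(a)
    a = [row[:] for row in a]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        if sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # A <- A J (columns p, q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p, q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # V <- V J accumulates the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v


def train_klt(samples: List[List[float]]) -> Matrix:
    """Rows of the result are KLT basis vectors, ordered by decreasing
    eigenvalue of the (zero-mean) sample correlation matrix."""
    n, m = len(samples[0]), len(samples)
    r = [[sum(x[i] * x[j] for x in samples) / m for j in range(n)] for i in range(n)]
    vals, vecs = jacobi_eigh(r)
    order = sorted(range(n), key=lambda k: vals[k], reverse=True)
    return [[vecs[i][k] for i in range(n)] for k in order]


def klt_forward(basis: Matrix, x: List[float]) -> List[float]:
    """Project a residual vector onto the trained basis."""
    return [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in basis]
```

On strongly correlated training rows, almost all of the energy lands in the first coefficient. Applied to residuals with different statistics, the same fixed basis loses that advantage, which is the limitation described above.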