A video encoder can be used to encode one or more frames of an image sequence into digital information. This digital information may then be transmitted to a receiver, where the image or the image sequence can be re-constructed (decoded). The transmission channel may be any of a number of possible channels. For example, it might be a radio channel or other means for wireless broadcast, a coaxial cable television line, a GSM mobile phone TDMA channel, a fixed line telephone link, or the Internet. This list of transmission means is only illustrative and is not meant to be all-inclusive.
Various international standards have been agreed upon for video encoding and transmission. In general, a standard provides rules for compressing and encoding data relating to frames of an image. These rules provide a way of compressing and encoding image data to transmit less data than the viewing camera originally provided about the image. This reduced volume of data then requires less channel bandwidth for transmission. A receiver can re-construct (or decode) the image from the transmitted data if it knows the rules that the transmitter used to perform the compression and encoding. The H.264 standard minimizes redundant transmission of parts of the image, by using motion compensated prediction of macroblocks from previous frames.
Video compression architectures and standards, such as MPEG-2 and JVT/H.264/MPEG-4 Part 10/AVC, encode each macroblock using either an intraframe (“intra”) coding method or an interframe (“inter”) coding method, but not both. For interframe motion estimation/compensation, a video frame to be encoded is partitioned into non-overlapping rectangular or, most commonly, square blocks of pixels. For each of these macroblocks, the best matching macroblock is searched for in a reference frame, within a predetermined search window and according to a predetermined matching error criterion. The matched macroblock is then used to predict the current macroblock, and the prediction error macroblock is further processed and transmitted to the decoder. The relative shifts in the horizontal and vertical directions of the reference macroblock with respect to the original macroblock are grouped and referred to as the motion vector (MV) of the original macroblock, which is also transmitted to the decoder. The main aim of motion estimation is to predict a macroblock such that the difference macroblock, obtained by taking the difference of the reference and current macroblocks, requires the lowest number of bits to encode.
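The block-matching search described above can be sketched as follows. This is a minimal illustration only, assuming an exhaustive (full) search and the sum of absolute differences (SAD) as the matching error criterion; the text does not prescribe either choice. The function names `sad`, `full_search` and `residual`, and the representation of a frame as a list of rows, are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (cx, cy) and the n x n block of `ref` at (rx, ry)."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
    return total


def full_search(cur, ref, cx, cy, n, search_range):
    """Exhaustively test every shift within +/- search_range pixels.

    Returns ((dx, dy), cost), where (dx, dy) is the shift of the best
    matching reference block relative to the current block position.
    SAD is an assumed criterion, not one mandated by the text.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block lies partly outside the frame
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


def residual(cur, ref, cx, cy, mv, n):
    """Prediction error block: current block minus the matched block."""
    dx, dy = mv
    return [[cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i]
             for i in range(n)]
            for j in range(n)]
```

In this sketch the encoder would transmit the returned motion vector together with the (further processed) residual block. A practical encoder would normally weigh the cost of coding the motion vector as well, which this illustration omits.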
For intra coding, a macroblock (MB) or a sub-macroblock within a picture is predicted using spatial prediction methods. For inter coding, temporal prediction methods (i.e., motion estimation/compensation) are used. Inter prediction (coding) methods are usually more efficient than intra coding methods. In the existing architectures/standards, specific picture or slice types are defined which specify or restrict the intra or inter MB types that can be encoded for transmission to a decoder. In intra (I) pictures or slices, only intra MB types can be encoded, while in Predictive (P) and Bi-predictive (B) pictures or slices, both intra and inter MB types may be encoded.
An I-picture or I-slice contains only intra coded macroblocks and does not use temporal prediction. The pixel values of the current macroblock are first spatially predicted from their neighboring pixel values. The residual information is then transformed using an N×N transform (e.g., a 4×4 or 8×8 DCT) and then quantized.
B-pictures or B-slices introduce the concept of bi-predictive (or, in a generalization, multiple-prediction) inter coded macroblock types, where a macroblock (MB) or sub-block is predicted by two (or more) interframe predictions. Due to bi-prediction, B-pictures tend to be more efficient in coding than both I and P pictures.
A P-picture or B-picture may contain different slice types, and macroblocks encoded by different methods. A slice can be of I (Intra), P (Predicted), B (Bi-predicted), SP (Switching P), and SI (Switching I) type.
Intra and inter prediction methods have been used separately within video coding architectures and standards such as MPEG-2 and H.264. For intra coded macroblocks, available spatial samples within the same frame or picture are used to predict the current macroblock, while in inter prediction, temporal samples within other pictures or frames are used instead. In the H.264 standard, two different intra coding modes exist: a 4×4 intra mode, which performs the prediction process for every 4×4 block within a macroblock; and a 16×16 intra mode, for which the prediction is performed for the entire macroblock in a single step.
Each frame of a video sequence is divided into so-called “macroblocks”, which comprise luminance (Y) information and associated (potentially spatially sub-sampled depending upon the color space) chrominance (U, V) information. Macroblocks are formed by representing a region of 16×16 image pixels in the original image as four 8×8 blocks of luminance (luma) information, each luminance block comprising an 8×8 array of luminance (Y) values; and two spatially corresponding chrominance components (U and V) which are sub-sampled by a factor of two in the horizontal and vertical directions to yield corresponding arrays of 8×8 chrominance (U, V) values.
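The macroblock layout described above can be illustrated with a short sketch. It assumes full-resolution 16×16 Y, U and V planes as input and assumes that the chroma planes are sub-sampled by averaging each 2×2 group of samples; the text states only the sub-sampling factor, not the filter. The function names and the dictionary layout are hypothetical.

```python
def split_luma(luma16):
    """Split a 16x16 luma array into four 8x8 blocks in raster order:
    top-left, top-right, bottom-left, bottom-right."""
    blocks = []
    for by in (0, 8):
        for bx in (0, 8):
            blocks.append([row[bx:bx + 8] for row in luma16[by:by + 8]])
    return blocks


def subsample_2x2(plane16):
    """Halve a 16x16 plane in both directions.

    Averaging each 2x2 group (with rounding) is an assumed filter; the
    text only says the chroma is sub-sampled by a factor of two.
    """
    return [[(plane16[2 * y][2 * x] + plane16[2 * y][2 * x + 1]
              + plane16[2 * y + 1][2 * x] + plane16[2 * y + 1][2 * x + 1]
              + 2) >> 2
             for x in range(8)]
            for y in range(8)]


def make_macroblock(y16, u16, v16):
    """Assemble one macroblock: four 8x8 luma blocks plus one 8x8 U
    block and one 8x8 V block."""
    return {
        "Y": split_luma(y16),
        "U": subsample_2x2(u16),
        "V": subsample_2x2(v16),
    }
```

Under these assumptions a macroblock carries 4 × 64 luma samples and 2 × 64 chroma samples, i.e. 384 samples for a 16×16 pixel region.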
In the 16×16 spatial (intra) prediction mode, the luma values of an entire 16×16 macroblock are predicted from the pixels around the edges of the MB. The 33 neighboring samples immediately above and/or to the left of the 16×16 luma block are used for the prediction of the current macroblock, and only four modes (0 vertical, 1 horizontal, 2 DC, and 3 plane prediction) are used.
FIG. 1 illustrates the intraframe (intra) prediction sampling method for the 4×4 intra mode in the H.264 standard of the related art. The samples of a 4×4 luma block 110 to be intra encoded, containing pixels “a” through “p” in FIG. 1, are predicted using nearby pixels “A” through “M” in FIG. 1 from neighboring blocks. In the decoder, samples “A” through “M” from previous macroblocks of the same picture/frame typically have already been decoded and can then be used for prediction of the current macroblock 110.
FIG. 2 illustrates, for the 4×4 luma block 110 of FIG. 1, the nine intra prediction modes, labeled 0 through 8. Mode 2 is the ‘DC-prediction’. The other modes (0, 1, 3, 4, 5, 6, 7, and 8) represent directions of prediction as indicated by the arrows in FIG. 2.
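As an illustration of 4×4 spatial prediction, the sketch below implements three of the modes: vertical (mode 0), horizontal (mode 1) and DC (mode 2), and selects among them by SAD. It assumes the common H.264 labeling in which samples A to D lie directly above the block and samples I to L lie directly to its left; FIG. 1 is not reproduced here, so this labeling is an assumption, as are the function names. The diagonal modes and the handling of unavailable neighbors are omitted.

```python
def predict_4x4(mode, above, left):
    """Return a 4x4 predicted block (list of rows).

    above = [A, B, C, D]  samples directly above the block (assumed labels)
    left  = [I, J, K, L]  samples directly to the left (assumed labels)
    """
    if mode == 0:  # vertical: each column repeats the sample above it
        return [list(above) for _ in range(4)]
    if mode == 1:  # horizontal: each row repeats the sample to its left
        return [[left[r]] * 4 for r in range(4)]
    if mode == 2:  # DC: rounded mean of the eight neighboring samples
        dc = (sum(above) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0, 1 and 2 are sketched here")


def block_sad(a, b):
    """Sum of absolute differences between two 4x4 blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def best_mode(block, above, left):
    """Pick the sketched mode with the lowest SAD (ties go to the lower
    mode number). SAD is an assumed cost measure."""
    costs = {m: block_sad(block, predict_4x4(m, above, left))
             for m in (0, 1, 2)}
    return min(costs, key=lambda m: (costs[m], m))
```

For a block whose columns are constant and equal to the samples above it, `best_mode` returns 0 (vertical), since the vertical predictor then reproduces the block exactly and the residual is zero.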
The intra macroblock types that are defined in the H.264 standard are as follows:
TABLE 1: Intra Macroblock types

mb_type  Name of mb_type  MbPartPredMode(mb_type, 0)  Intra16×16PredMode  CodedBlockPatternChroma  CodedBlockPatternLuma
0        I_4×4            Intra_4×4                   NA                  NA                       NA
1        I_16×16_0_0_0    Intra_16×16                 0                   0                        0
2        I_16×16_1_0_0    Intra_16×16                 1                   0                        0
3        I_16×16_2_0_0    Intra_16×16                 2                   0                        0
4        I_16×16_3_0_0    Intra_16×16                 3                   0                        0
5        I_16×16_0_1_0    Intra_16×16                 0                   1                        0
6        I_16×16_1_1_0    Intra_16×16                 1                   1                        0
7        I_16×16_2_1_0    Intra_16×16                 2                   1                        0
8        I_16×16_3_1_0    Intra_16×16                 3                   1                        0
9        I_16×16_0_2_0    Intra_16×16                 0                   2                        0
10       I_16×16_1_2_0    Intra_16×16                 1                   2                        0
11       I_16×16_2_2_0    Intra_16×16                 2                   2                        0
12       I_16×16_3_2_0    Intra_16×16                 3                   2                        0
13       I_16×16_0_0_1    Intra_16×16                 0                   0                        15
14       I_16×16_1_0_1    Intra_16×16                 1                   0                        15
15       I_16×16_2_0_1    Intra_16×16                 2                   0                        15
16       I_16×16_3_0_1    Intra_16×16                 3                   0                        15
17       I_16×16_0_1_1    Intra_16×16                 0                   1                        15
18       I_16×16_1_1_1    Intra_16×16                 1                   1                        15
19       I_16×16_2_1_1    Intra_16×16                 2                   1                        15
20       I_16×16_3_1_1    Intra_16×16                 3                   1                        15
21       I_16×16_0_2_1    Intra_16×16                 0                   2                        15
22       I_16×16_1_2_1    Intra_16×16                 1                   2                        15
23       I_16×16_2_2_1    Intra_16×16                 2                   2                        15
24       I_16×16_3_2_1    Intra_16×16                 3                   2                        15
25       I_PCM            NA                          NA                  NA                       NA
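The regular structure of Table 1 means the Intra_16×16 entries need not be stored explicitly: the three fields can be derived arithmetically from mb_type. The sketch below is one way to express that pattern, inferred from the table rows themselves rather than quoted from the standard. The function name and the returned dictionary keys are hypothetical, and the names use an ASCII "x" in place of "×".

```python
def decode_intra_mb_type(mb_type):
    """Map an intra mb_type (0..25) to the fields listed in Table 1.

    Fields that the table marks NA are returned as None.
    """
    if mb_type == 0:
        return {"name": "I_4x4", "part_pred_mode": "Intra_4x4",
                "pred_mode": None, "cbp_chroma": None, "cbp_luma": None}
    if mb_type == 25:
        return {"name": "I_PCM", "part_pred_mode": None,
                "pred_mode": None, "cbp_chroma": None, "cbp_luma": None}
    if not 1 <= mb_type <= 24:
        raise ValueError("intra mb_type must be in the range 0..25")

    k = mb_type - 1
    pred_mode = k % 4               # cycles 0, 1, 2, 3 down the table
    cbp_chroma = (k // 4) % 3       # steps 0, 1, 2 every four rows
    cbp_luma = 15 if k >= 12 else 0 # second half of the rows uses 15
    name = "I_16x16_%d_%d_%d" % (pred_mode, cbp_chroma, cbp_luma // 15)
    return {"name": name, "part_pred_mode": "Intra_16x16",
            "pred_mode": pred_mode, "cbp_chroma": cbp_chroma,
            "cbp_luma": cbp_luma}
```

For example, `decode_intra_mb_type(13)` yields the name `I_16x16_0_0_1` with prediction mode 0, chroma pattern 0 and luma pattern 15, matching row 13 of the table.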
FIG. 3 depicts a current macroblock 310 to be inter coded in a P-frame or P-slice using temporal prediction, instead of spatial prediction, by estimating a motion vector (MV) between the current block and its best match (BM) among the blocks of two pictures (301 and 302). In inter coding, a current block 310 in the current frame 301 is predicted from a displaced matching block (BM) in the previous frame 302. Every inter coded block (e.g., 310) is associated with a set of motion parameters (motion vectors and a reference index ref_idx), which provide to the decoder a corresponding location within the reference picture (302) associated with ref_idx, from which all pixels in the block 310 can be predicted. The difference between the original block (310) and its prediction (BM) is compressed and transmitted along with the displacement motion vectors (MV). Motion can be estimated independently for either a 16×16 macroblock or any of its macroblock or sub-macroblock partitions: 16×8, 8×16, 8×8, 8×4, 4×8, 4×4. An 8×8 macroblock partition is known as a sub-macroblock (or subblock). Hereinafter, the term “block” generally refers to a rectangular group of adjacent pixels of any dimensions, such as a whole 16×16 macroblock and/or a sub-macroblock partition. Only one motion vector (MV) per sub-macroblock partition is allowed. The motion can be estimated for each macroblock from different frames, either in the past or in the future, by associating the macroblock with the selected frame using the macroblock's ref_idx.
A P-slice may also contain intra coded macroblocks. The intra coded macroblocks within a P-slice are compressed in the same way as the intra coded macroblocks in an I-slice. Inter coded blocks are predicted using motion estimation and compensation strategies.
If all the macroblocks of an entire frame are encoded and transmitted using intra mode, it is referred to as transmission of an ‘INTRA frame’ (I-Frame or I-Picture). An INTRA frame therefore consists entirely of intra macroblocks. Typically, an INTRA frame must be transmitted at the start of an image transmission, when the receiver as yet holds no received macroblocks. If a frame is encoded and transmitted by encoding some or all of the macroblocks as inter macroblocks, then the frame is referred to as an ‘INTER frame’. Typically, an INTER frame comprises less data for transmission than an INTRA frame. However, the encoder decides whether a particular macroblock is transmitted as an intra coded macroblock or an inter coded macroblock, depending on which is most efficient.
Every 16×16 macroblock to be inter coded in a P-slice may be partitioned into 16×8, 8×16, or 8×8 partitions. A sub-macroblock may itself be partitioned into 8×4, 4×8, or 4×4 sub-macroblock partitions. Each macroblock partition or sub-macroblock partition in H.264 is assigned its own motion vector. Inter coded macroblocks and macroblock partitions have their own prediction modes and reference indices. The current H.264 standard does not allow inter and intra predictions to be selected and mixed together in different partitions of the same macroblock. In the H.264/AVC design adopted in February 2002, the partitioning scheme initially adopted from Wiegand et al. included support for switching between intra and inter on a sub-macroblock (8×8 luma with 4×4 chroma) basis. This capability was later removed in order to reduce decoding complexity.
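The partition hierarchy above can be made concrete with a small sketch that enumerates the rectangles of a partitioning and counts the resulting motion vectors for a P-slice macroblock, assuming one motion vector per partition as stated. The names `partition_rects` and `motion_vector_count` are hypothetical.

```python
# Partition shapes (width, height) named in the text.
MB_PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8)]
SUB_MB_PARTITIONS = [(8, 8), (8, 4), (4, 8), (4, 4)]


def partition_rects(width, height, part_w, part_h, x0=0, y0=0):
    """List the (x, y, w, h) rectangles that tile a width x height area
    with part_w x part_h partitions, in raster order."""
    return [(x0 + x, y0 + y, part_w, part_h)
            for y in range(0, height, part_h)
            for x in range(0, width, part_w)]


def motion_vector_count(mb_part, sub_parts=None):
    """Count motion vectors for one P-slice macroblock.

    mb_part   -- one of MB_PARTITIONS
    sub_parts -- for the 8x8 case, four entries from SUB_MB_PARTITIONS,
                 one per sub-macroblock
    """
    if mb_part not in MB_PARTITIONS:
        raise ValueError("unknown macroblock partition")
    if mb_part != (8, 8):
        return len(partition_rects(16, 16, *mb_part))
    if sub_parts is None or len(sub_parts) != 4:
        raise ValueError("the 8x8 case needs four sub-macroblock choices")
    return sum(len(partition_rects(8, 8, *sp)) for sp in sub_parts)
```

With every sub-macroblock split into 4×4 partitions, the count reaches 16 motion vectors for a single macroblock, which is the upper bound under these assumptions.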
In P-pictures and P-slices, the following additional block types are defined:
TABLE 2: Inter Macroblock types for P slices

mb_type   Name of mb_type  NumMbPart(mb_type)  MbPartPredMode(mb_type, 0)  MbPartPredMode(mb_type, 1)  MbPartWidth(mb_type)  MbPartHeight(mb_type)
0         P_L0_16×16       1                   Pred_L0                     NA                          16                    16
1         P_L0_L0_16×8     2                   Pred_L0                     Pred_L0                     16                    8
2         P_L0_L0_8×16     2                   Pred_L0                     Pred_L0                     8                     16
3         P_8×8            4                   NA                          NA                          8                     8
4         P_8×8ref0        4                   NA                          NA                          8                     8
Inferred  P_Skip           1                   Pred_L0                     NA                          16                    16
FIG. 4 illustrates the combination of two (temporal) predictions for inter coding a macroblock in a B-Picture or B-Slice.
As illustrated in FIG. 4, for a macroblock 410 to be inter coded within B-pictures or B-slices, instead of using only one “Best Match” (BM) predictor (prediction) for a current macroblock, two (temporal) predictions (BML0 and BML1) are used for the current macroblock 410, which can be averaged together to form a final prediction. In a B-picture or B-slice, up to two motion vectors (MVL0 and MVL1), representing two estimates of the motion, are allowed per sub-macroblock partition for temporal prediction. They can be from any reference pictures (List 0 Reference and List 1 Reference), subsequent or prior. The average of the pixel values in the Best Matched blocks (BML0 and BML1) in the (List 0 and List 1) reference pictures is used as the predictor. The standard also allows weighting the pixel values of each Best Matched block (BML0 and BML1) unequally, instead of averaging them. This is referred to as a Weighted Prediction mode and is useful in the presence of special video effects, such as fading. A B-slice also has a special mode, the Direct mode. The spatial methods used in the MotionCopy skip mode and in the Direct mode are restricted to the estimation of the motion parameters, not of the macroblocks (pixels) themselves, and no spatially adjacent samples are used. In Direct mode the motion vectors for a macroblock are not explicitly sent.
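The two ways of combining the List 0 and List 1 predictions can be sketched as follows. The averaging uses the usual rounded integer mean. The weighted form is a simplified illustration that omits the additive offsets and assumes 8-bit samples; the parameter names `w0`, `w1` and `log_wd` and both function names are hypothetical.

```python
def bi_predict(block_l0, block_l1):
    """Average two prediction blocks sample by sample, with rounding."""
    return [[(a + b + 1) >> 1 for a, b in zip(r0, r1)]
            for r0, r1 in zip(block_l0, block_l1)]


def weighted_bi_predict(block_l0, block_l1, w0, w1, log_wd=5):
    """Combine two prediction blocks with unequal integer weights.

    Each output sample is (a*w0 + b*w1 + 2**log_wd) >> (log_wd + 1),
    clipped to the assumed 8-bit range 0..255. Additive offsets are
    left out of this simplified sketch.
    """
    rounding = 1 << log_wd
    shift = log_wd + 1
    out = []
    for r0, r1 in zip(block_l0, block_l1):
        row = []
        for a, b in zip(r0, r1):
            value = (a * w0 + b * w1 + rounding) >> shift
            row.append(max(0, min(255, value)))
        out.append(row)
    return out
```

With `log_wd = 5`, equal weights `w0 = w1 = 32` reduce the weighted form to the plain rounded average, while a choice such as `w0 = 48`, `w1 = 16` favors the List 0 block, as might suit a fade in which the List 0 reference is closer in brightness to the current picture.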
The following macroblock types are defined for use in B-pictures and B-slices:
TABLE 3: Inter Macroblock types for B slices

mb_type   Name of mb_type  NumMbPart(mb_type)  MbPartPredMode(mb_type, 0)  MbPartPredMode(mb_type, 1)  MbPartWidth(mb_type)  MbPartHeight(mb_type)
0         B_Direct_16×16   NA                  Direct                      NA                          8                     8
1         B_L0_16×16       1                   Pred_L0                     NA                          16                    16
2         B_L1_16×16       1                   Pred_L1                     NA                          16                    16
3         B_Bi_16×16       1                   BiPred                      NA                          16                    16
4         B_L0_L0_16×8     2                   Pred_L0                     Pred_L0                     16                    8
5         B_L0_L0_8×16     2                   Pred_L0                     Pred_L0                     8                     16
6         B_L1_L1_16×8     2                   Pred_L1                     Pred_L1                     16                    8
7         B_L1_L1_8×16     2                   Pred_L1                     Pred_L1                     8                     16
8         B_L0_L1_16×8     2                   Pred_L0                     Pred_L1                     16                    8
9         B_L0_L1_8×16     2                   Pred_L0                     Pred_L1                     8                     16
10        B_L1_L0_16×8     2                   Pred_L1                     Pred_L0                     16                    8
11        B_L1_L0_8×16     2                   Pred_L1                     Pred_L0                     8                     16
12        B_L0_Bi_16×8     2                   Pred_L0                     BiPred                      16                    8
13        B_L0_Bi_8×16     2                   Pred_L0                     BiPred                      8                     16
14        B_L1_Bi_16×8     2                   Pred_L1                     BiPred                      16                    8
15        B_L1_Bi_8×16     2                   Pred_L1                     BiPred                      8                     16
16        B_Bi_L0_16×8     2                   BiPred                      Pred_L0                     16                    8
17        B_Bi_L0_8×16     2                   BiPred                      Pred_L0                     8                     16
18        B_Bi_L1_16×8     2                   BiPred                      Pred_L1                     16                    8
19        B_Bi_L1_8×16     2                   BiPred                      Pred_L1                     8                     16
20        B_Bi_Bi_16×8     2                   BiPred                      BiPred                      16                    8
21        B_Bi_Bi_8×16     2                   BiPred                      BiPred                      8                     16
22        B_8×8            4                   NA                          NA                          8                     8
inferred  B_Skip           NA                  Direct                      NA                          8                     8
In B-slices, as shown in the above table, the two temporal predictions are always restricted to using the same block type.
Deblocking filters and Overlapped Block Motion Compensation (OBMC) use some spatial correlation. According to these methods, the reconstructed pixels, after prediction and the addition of the associated residual, are spatially processed/filtered depending upon their mode (intra or inter), position (MB/block edges, internal pixels, etc.), motion information, associated residual, and the surrounding pixel difference. This process can considerably reduce blocking artifacts and improve quality, but it can also increase complexity considerably (especially within the decoder). It also may not always yield the best results, and it may itself introduce additional blurring on the edges.