H.264/AVC is the latest international video encoding standard jointly developed by ITU-T and ISO/IEC. Compared with previous video encoding standards, H.264/AVC is designated by the Joint Video Team (JVT) as the video encoding standard with the highest encoding efficiency and the strongest network adaptability. At the same bit rate, H.264/AVC achieves better image quality than the earlier standards. In particular, compared with MPEG-4, the encoding performance of H.264/AVC at a low bit rate is significantly improved, which makes it especially suitable for low-bandwidth, high-quality network video applications. To achieve this encoding efficiency, H.264/AVC employs various new technologies and therefore has higher computational complexity than the previous video encoding standards. Thus, real-time encoding in hardware and software becomes more difficult. On a mobile platform, where computing capacity and network bandwidth are limited, real-time video communication has progressed fairly slowly. Therefore, reducing encoding complexity and improving encoding efficiency are very important for real-time compression and transmission of videos on the mobile platform.
In video encoding, a video sequence is formed by consecutive Groups of Pictures (GOPs). One GOP is a group of consecutive pictures, which usually starts with an I frame (intra-encoded frame) followed by several P frames (predictive-encoded frames), with several B frames (bidirectional-encoded frames) inserted between them. The length of a GOP may be configured according to the encoding method. In a general video encoding technology, predictive encoding is first performed on a video sequence, and a difference signal between each image pixel and its predicted value is transmitted. Image compression is achieved by eliminating spatial correlation or temporal correlation. Predictive encoding includes intra-frame prediction encoding and inter-frame prediction encoding, where intra-frame prediction encoding predicts using pixel values within one frame, and inter-frame prediction encoding predicts using pixel values in adjacent frames.
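The idea of transmitting a difference signal instead of raw pixel values can be illustrated with a minimal sketch. The left-neighbor predictor and the function names below are assumptions chosen for illustration only; they are not the prediction modes defined by H.264/AVC.

```python
def encode_residuals(row):
    """Return the difference between each pixel and its predicted value.

    Assumption for illustration: each pixel is predicted by its left
    neighbor, and the first pixel is predicted as 0.
    """
    residuals = []
    prediction = 0
    for pixel in row:
        residuals.append(pixel - prediction)
        prediction = pixel  # the next pixel is predicted from this one
    return residuals


def decode_residuals(residuals):
    """Rebuild the pixel row by adding each residual to its prediction."""
    row = []
    prediction = 0
    for residual in residuals:
        pixel = prediction + residual
        row.append(pixel)
        prediction = pixel
    return row
```

Because neighboring pixels are usually similar, most residuals are small, which is what makes the subsequent transform, quantization and entropy encoding effective.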
In a standard encoding process of H.264/AVC, a currently input image is encoded using a macroblock (for example, 16×16 pixels) as an encoding unit. When intra-frame encoding is applied, a corresponding intra-frame prediction encoding mode is selected to perform intra-frame prediction, and the difference between the actual pixel values and the predicted pixel values is transformed, quantized and entropy encoded. The entropy-encoded bit stream is then transmitted to the communication channel. Meanwhile, the quantized data is inverse-quantized and inverse-transformed to reconstruct the residual image. The residual image is then added to the predicted pixel values, and the result is smoothed by a de-blocking filter and stored in a frame memory to be used as a reference image for encoding the next frame. When inter-frame encoding is applied, motion estimation is first performed on the input image with respect to a reference frame to obtain a motion vector. The motion-compensated residual image, along with the motion vector, is then transmitted to the communication channel after integer transformation, quantization and entropy encoding. Meanwhile, the image is reconstructed in the same way, passed through the de-blocking filter, and stored in the frame memory to be used as a reference image for encoding the next frame. In an inter-frame encoding mode, the reference objects are one or more frames reconstructed from previously encoded frames.
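The motion estimation step described above can be sketched as an exhaustive block-matching search. This is a minimal illustration under stated assumptions: the sum of absolute differences (SAD) is used as the matching criterion, the search is integer-pixel only, and the names `sad` and `full_search` are hypothetical. Practical encoders use faster search strategies and sub-pixel refinement.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block of the current frame
    at (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Return ((dx, dy), cost) of the best-matching reference block.

    Frames are lists of rows of integer pixel values. Candidates that
    would fall outside the reference frame are skipped.
    """
    height, width = len(ref), len(ref[0])
    best_mv = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, size)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= by + dy and by + dy + size <= height
                      and 0 <= bx + dx and bx + dx + size <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if cost < best_cost:  # keep the first candidate with the lowest cost
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

The returned displacement plays the role of the motion vector, and the block difference remaining at that displacement is the motion-compensated residual that goes on to transformation, quantization and entropy encoding.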
Under the H.264/AVC standard, an input image may be categorized as an I frame, a P frame or a B frame. In general, the I frame and the P frame are used as reference frames. During encoding, the P frame has only a forward prediction mode, while the B frame has a forward prediction mode, a backward prediction mode and a bidirectional prediction mode. The prediction modes of the I frame are all intra-frame prediction encoding modes, whereas the prediction modes of the P frame and the B frame include both intra-frame and inter-frame prediction encoding modes, of which the inter-frame prediction encoding modes form the majority.
Inter-frame prediction is a prediction mode that uses an encoded and reconstructed video frame and is based on motion compensation. The image frame containing the pixel currently being encoded is referred to as the current frame, and an image frame used for prediction is referred to as a reference frame. A 16×16-pixel encoding macroblock can be divided into sub-blocks of seven sizes (16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4) according to different dividing modes. One independent motion vector must be provided for each division area, and each motion vector and the dividing mode of the macroblock must be encoded and transmitted. When a dividing mode with a large sub-block size is selected, fewer bits are needed to represent the motion vectors and the dividing mode of the macroblock. However, in a detailed area of the image, the residual image after motion compensation using the large sub-block size may have more energy (i.e., error). When a dividing mode with a small sub-block size is selected, the image can be predicted more precisely, and the residual image after motion compensation may have less energy. However, more bits are needed to represent the motion vectors and the dividing mode of the macroblock.
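The bit-cost side of this trade-off can be made concrete by counting how many motion vectors a macroblock needs for each sub-block size. The sketch below assumes, for simplicity, that the whole macroblock is divided uniformly into sub-blocks of one size and that each division area carries a single motion vector; the names are hypothetical.

```python
MACROBLOCK_SIZE = 16

# The seven sub-block sizes (width, height) listed above.
SUB_BLOCK_SIZES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def motion_vectors_per_macroblock(width, height):
    """Number of division areas, and hence motion vectors, when a 16x16
    macroblock is divided uniformly into width x height sub-blocks."""
    return (MACROBLOCK_SIZE // width) * (MACROBLOCK_SIZE // height)


def motion_vector_counts():
    """Map each sub-block size to its motion vector count."""
    return {size: motion_vectors_per_macroblock(*size) for size in SUB_BLOCK_SIZES}
```

Under these assumptions, a 16×16 division needs one motion vector while a 4×4 division needs sixteen, which is why small sub-blocks cost more bits for motion information even though they may leave a smaller residual.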
The H.264/AVC encoding standard applies a Direct prediction mode in the B frame, in which a prediction motion vector obtained from already encoded information is directly used as the motion vector of the current macroblock, so that the motion vector of the macroblock does not need to be encoded. Because the B frame supports bidirectional prediction, two prediction motion vectors pointing to different reference frames may be obtained in the Direct mode. In a time domain Direct mode, the forward prediction motion vector and the backward prediction motion vector are computed from the motion vectors of the corresponding frames in temporal order. In a space domain Direct mode, the forward prediction motion vector and the backward prediction motion vector are computed from the motion vectors of the corresponding forward reference frame and backward reference frame in spatial order.
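As an illustration of the time domain Direct mode, the following simplified sketch scales the motion vector of the co-located block according to temporal distances. The symbol names `tb` (distance from the forward reference frame to the current frame) and `td` (distance from the forward reference frame to the backward reference frame), the function name, and the use of ordinary rounding are assumptions for illustration; the standard specifies fixed-point arithmetic that is not reproduced here.

```python
def temporal_direct_vectors(mv_col, tb, td):
    """Derive forward and backward motion vectors by temporal scaling.

    mv_col: (x, y) motion vector of the co-located block in the backward
            reference frame, pointing to the forward reference frame.
    tb:     temporal distance from the forward reference to the current frame.
    td:     temporal distance from the forward reference to the backward
            reference; must be non-zero.
    """
    if td == 0:
        raise ValueError("td must be non-zero")
    # Forward vector: the co-located vector scaled by tb / td.
    mv_forward = tuple(round(tb * component / td) for component in mv_col)
    # Backward vector: the remainder of the co-located vector.
    mv_backward = tuple(f - c for f, c in zip(mv_forward, mv_col))
    return mv_forward, mv_backward
```

For a B frame located midway between its two reference frames (tb = 1, td = 2), the forward vector is half of the co-located vector and the backward vector is its negative half, so neither vector has to be written into the bit stream.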
In a conventional video encoding method, the first frame in a GOP is generally encoded as an I frame, the second frame to the (1+n)th frame are set as B frames, and these n B frames are cached. The (n+2)th frame is set as a P frame and encoded. The cached B frames from the second frame to the (1+n)th frame are then successively encoded. This pattern repeats for the remaining frames, and the last frame of each GOP is encoded as a P frame. An example of an encoded sequence with a GOP of length 7 and n=1 in the conventional art is shown in FIG. 8, where an arrow represents a reference direction.
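The frame ordering described above can be sketched as follows. The function name and the 1-based frame indexing are assumptions for illustration, and the sketch only reproduces the ordering stated in the text; it does not model the reference relationships shown in FIG. 8.

```python
def gop_coding_order(gop_length, n):
    """Return a list of (display_index, frame_type) in encoding order.

    gop_length: number of frames in the GOP (display indices 1..gop_length).
    n:          number of B frames cached between consecutive reference frames.
    """
    if gop_length < 1 or n < 0:
        raise ValueError("gop_length must be >= 1 and n must be >= 0")
    order = [(1, "I")]  # the first frame of the GOP is the I frame
    prev_ref = 1
    while prev_ref < gop_length:
        # The next P frame follows n cached B frames; it is clamped so that
        # the last frame of the GOP is always encoded as a P frame.
        next_ref = min(prev_ref + n + 1, gop_length)
        order.append((next_ref, "P"))
        # The cached B frames are encoded only after that P frame.
        for b in range(prev_ref + 1, next_ref):
            order.append((b, "B"))
        prev_ref = next_ref
    return order
```

With a GOP of length 7 and n=1, the display order is I B P B P B P, while the encoding order produced by this sketch is frames 1, 3, 2, 5, 4, 7, 6.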
When an image frame is determined to be a B frame, an optimal block encoding mode needs to be determined for each macroblock during encoding. Specifically, it is first determined, according to a prediction motion vector of the current macroblock, whether the current macroblock meets a condition of the Direct mode. If so, it is further determined whether the current macroblock meets a condition of a Skip mode, and if so, the Skip mode is selected as the optimal encoding mode. In the Skip mode, the corresponding pixels of a reference frame are directly copied according to the prediction motion vector, and neither a motion vector difference nor a pixel residual is written into the bit stream. If the current macroblock meets the condition of the Direct mode but does not meet the condition of the Skip mode, the cost of encoding the macroblock in a Direct_16×16 mode is computed. If the macroblock does not meet the condition of the Direct mode, the computation of the cost in the Direct_16×16 mode is skipped. Further, motion estimation is performed for each macroblock dividing mode of the current macroblock, which includes computing the inter-frame prediction encoding costs in the macroblock dividing modes, and the intra-frame prediction encoding costs are computed for the different prediction directions in the macroblock dividing modes. The costs in all these modes are compared, and the mode with the smallest cost is selected as the optimal block encoding mode.
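The decision flow described above can be summarized in a short sketch. The function name, the mode labels used as dictionary keys, and the assumption that all costs are already available as numbers are hypothetical; in an actual encoder the inter-frame costs come from motion estimation and the intra-frame costs from prediction in each direction, which is where most of the computation lies.

```python
def select_b_macroblock_mode(meets_direct, meets_skip, direct_cost,
                             inter_costs, intra_costs):
    """Select the encoding mode of one B-frame macroblock.

    meets_direct: whether the macroblock meets the Direct mode condition.
    meets_skip:   whether it also meets the Skip mode condition.
    direct_cost:  cost of Direct_16x16; ignored unless meets_direct is True.
    inter_costs:  dict mapping inter-frame mode names to their costs.
    intra_costs:  dict mapping intra-frame mode names to their costs.
    At least one inter-frame or intra-frame cost is expected.
    """
    # Skip is chosen immediately when both conditions hold.
    if meets_direct and meets_skip:
        return "Skip"
    candidates = {}
    # Direct_16x16 is a candidate only when the Direct condition holds.
    if meets_direct:
        candidates["Direct_16x16"] = direct_cost
    candidates.update(inter_costs)
    candidates.update(intra_costs)
    # The mode with the smallest cost is the optimal block encoding mode.
    return min(candidates, key=candidates.get)
```

Only the Skip branch avoids the cost computations; every other path requires the full set of inter-frame and intra-frame costs before the comparison can be made.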
As described above, in the conventional video encoding method, when a mode is selected for encoding a B frame, determining the optimal block encoding mode of the current macroblock requires obtaining the motion vectors of the inter-frame modes through motion estimation and computing the cost values of the intra-frame modes through intra-frame prediction encoding in different prediction directions. The mode with the smallest cost value is selected as the optimal encoding mode by comparing the cost values of the modes. Finally, a motion vector residual, a pixel value residual, and a mode bit (a flag bit indicating the encoding mode of the current macroblock) are encoded together into a bit stream. Therefore, in the conventional video encoding method, mode selection in encoding a B frame involves very high computational complexity. In the entire encoding process, mode selection is the most time-consuming step, which results in very high computational complexity and a large amount of computation in the entire video encoding process, and reduces video encoding efficiency.