The H.264 video compression coding standard was established jointly by the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC).
Currently, H.264 has gradually become a predominant standard in multimedia communication, and numerous real-time multimedia communication products using the H.264 standard, e.g., video conferencing, video telephony, 3rd Generation (3G) mobile communication terminals, etc., as well as network streaming media products, have emerged successively. It can be said that whether H.264 is supported has become a crucial factor determining product competitiveness in the market. Especially with the emergence of 3G mobile communication systems and the rapid development of the Internet Protocol (IP), video network communication has gradually become one of the dominant communication services.
Components and a transport mechanism of a message under the H.264 standard will be described briefly below.
A layered mode is adopted in the H.264 standard to define a video coding layer (VCL) and a network abstraction layer (NAL). The NAL is designed specifically for network transmission and can be adapted to video transmission over different networks to further improve network friendliness. H.264 introduces an IP-packet-oriented coding mechanism, which is advantageous to packet transmission over a network, supports streaming media transmission of video over the network and robust error resilience, and in particular accommodates the requirements of wireless video transmission with a high packet loss ratio and serious interference. All H.264 data to be transmitted, including image data and other messages, is encapsulated into packets of a uniform format for transmission, i.e., Network Abstraction Layer Units (NALUs). Each NALU is a variable-length byte string of certain syntax elements and includes one byte of header information, which indicates the data type, followed by payload data of an integer number of bytes. A NALU can carry a coded slice, various types of data partitions, or a set of sequence or image parameters. In order to enhance the reliability of the data, each frame of an image is divided into several slices, each of which is carried in a NALU. A slice further consists of several smaller macroblocks, the macroblock being the minimal processing unit. Generally, slices at corresponding locations in successive frames are associated with each other, while slices at different locations are independent of each other, so that a code error can be prevented from diffusing between the slices.
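The one-byte NALU header mentioned above has a fixed layout defined by the H.264 specification: a one-bit forbidden_zero_bit, a two-bit nal_ref_idc, and a five-bit nal_unit_type. A minimal sketch in Python (the function name is illustrative):

```python
def parse_nalu_header(first_byte: int) -> dict:
    """Split the one-byte H.264 NALU header into its three fields."""
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x1,  # shall be 0
        "nal_ref_idc": (first_byte >> 5) & 0x3,         # 0 => not used for reference
        "nal_unit_type": first_byte & 0x1F,             # e.g. 5 = IDR slice, 7 = SPS, 8 = PPS
    }

# 0x67 = 0110 0111b: nal_ref_idc = 3, nal_unit_type = 7 (sequence parameter set)
print(parse_nalu_header(0x67))
```

The nal_unit_type field is how a receiver distinguishes, for example, a coded slice from a sequence or image parameter set carried in the same uniform packet format.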
H.264 data includes texture data of non-reference frames, sequence parameters, image parameters, Supplemental Enhancement Information (SEI), texture data of reference frames and so on. The SEI is a general designation of information playing an auxiliary role in H.264 video decoding, display and other aspects.
FIG. 1 illustrates an H.264 compression processing framework. The basic H.264 processing unit is a 16×16 macroblock 110, for which advanced techniques, such as multiple reference frames, intra-frame prediction 120, multiple macroblock types, 4×4 integer transform and quantization 130, loop filter 140, ¼-pel accuracy motion estimation and prediction, and entropy coding 150 with Context-based Adaptive Variable Length Coding (CAVLC) or Context-based Adaptive Binary Arithmetic Coding (CABAC), are adopted; therefore, H.264 compression efficiency can be improved to more than double that of MPEG-2, H.263 and MPEG-4 ASP.
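The 4×4 integer transform named above can be illustrated with the well-known H.264 forward core transform matrix Cf, computed as Y = Cf·X·Cfᵀ; the helper names below are illustrative, and the subsequent quantization step is omitted for brevity:

```python
# Forward 4x4 integer core transform of H.264 (Y = Cf * X * Cf^T).
CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def matmul(a, b):
    """Plain 4x4 integer matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_transform(x):
    ct = [list(row) for row in zip(*CF)]  # Cf transposed
    return matmul(matmul(CF, x), ct)

# A flat block (all pixels equal 16) concentrates all energy in the DC coefficient.
y = forward_transform([[16] * 4] * 4)
print(y[0][0])   # 256; all other coefficients are 0
```

Because the matrix contains only small integers, the transform can be computed exactly with additions and shifts, which is one reason it replaced the floating-point DCT of earlier standards.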
During establishment of the H.264 scalable coding standard, the Joint Video Team (JVT) made the basic layer compatible with the H.264 Main Profile and used an algorithm framework of Motion Compensated Temporal Filtering (MCTF), so that functions such as spatial scalability, temporal scalability, quality or SNR scalability, complexity scalability, etc., can be implemented very well.
The latest reference model of the Joint Video Team Scalable Video Coding (JVT SVC) is the Joint Scalable Video Model 3 (JSVM3). FIG. 2 illustrates a block diagram of the above SVC algorithm. Input video data (210) is received, 2-dimensional (2D) spatial sampling (220) is performed thereon, and operations such as temporal decomposition (230), motion coding (240), macroblock intra-frame prediction (250), transform/entropy coding (260), etc., are performed in a core encoder.
It shall be noted that a temporal decomposition process can adopt a B frame decomposition based method as illustrated in FIG. 3 or an MCTF decomposition based method as illustrated in FIG. 4, in which the frame rate at Layer 0 is the original frame rate, and those at Layers 1, 2, 3 are ½, ¼ and ⅛ of the original frame rate, respectively.
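The dyadic halving of the frame rate across temporal layers can be sketched as follows (a hypothetical helper; the 30 fps base rate is only an example):

```python
def layer_frame_rates(base_fps: float, num_layers: int) -> list:
    """Frame rate obtained when decoding only down to each temporal layer,
    assuming dyadic (halving) B-frame / MCTF decomposition:
    Layer 0 = full rate, each deeper layer carries half the previous rate."""
    return [base_fps / (2 ** k) for k in range(num_layers)]

print(layer_frame_rates(30.0, 4))  # [30.0, 15.0, 7.5, 3.75]
```

A decoder that stops at a shallower layer simply drops the frames contributed by the deeper layers, which is what makes temporal scalability essentially free at the bitstream level.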
In terms of intra-frame prediction, the H.264 intra-frame prediction method is adopted for the JSVM3 basic layer. For an enhanced layer, a prediction mode, I_BL, in which a macroblock at the present layer is predicted pixel by pixel from a corresponding macroblock at a lower layer, is added on top of the H.264 prediction modes. As illustrated in FIG. 5, macroblocks H1, H2, H3, etc., at Layer K+1 are predicted pixel by pixel from corresponding macroblocks H1, H2, H3, etc., at Layer K.
Furthermore, a macroblock residual image at an enhanced layer, i.e., a difference image after subtractive prediction, can be predicted from the residual image of a corresponding macroblock at a basic or lower layer in a similar way to the I_BL mode.
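Conceptually, both I_BL prediction and residual prediction reduce to a per-pixel subtraction of the (upsampled) lower-layer signal from the enhanced-layer signal; a minimal sketch with illustrative names, assuming the lower-layer block has already been upsampled to the enhanced-layer resolution:

```python
def inter_layer_predict(enh_block, base_block_upsampled):
    """Pixel-by-pixel inter-layer prediction: the encoder codes only the
    difference between the enhanced-layer block and the co-located
    (upsampled) lower-layer block."""
    return [[e - b for e, b in zip(enh_row, base_row)]
            for enh_row, base_row in zip(enh_block, base_block_upsampled)]

residual = inter_layer_predict([[7, 6], [5, 4]], [[5, 5], [5, 5]])
print(residual)  # [[2, 1], [0, -1]]
```

The decoder performs the inverse: it adds the transmitted residual back to the same upsampled lower-layer block to reconstruct the enhanced-layer pixels.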
For spatial scalable coding, the corresponding macroblock at the basic or lower layer has to be subjected to an upsampling process in I_BL prediction or residual image prediction at the enhanced layer. Upsampling is a kind of re-sampling, which scales a sampled signal up or down. Assuming that the original sampling points are located at integer coordinates (0, 1, 2 . . . ) and the distance between new sampling points after re-sampling is denoted by a, the process is referred to as down-sampling if a>1 and as upsampling if a<1.
In the prior art, the upsampling filter used in I_BL prediction is a relatively complex 6-tap [1 −5 20 20 −5 1]/32 filter, that used in residual image prediction is a [1 1]/2 filter, and the same filters are adopted for luminance and chrominance images.
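The two prior-art filters can be sketched as one-dimensional 2× upsamplers that keep the original samples and interpolate the half-positions; the border clamping and rounding below are illustrative choices for the sketch, not the normative JSVM edge handling:

```python
def upsample2x_6tap(samples):
    """2x upsampling with the 6-tap [1 -5 20 20 -5 1]/32 filter at the
    half-positions, clamping indices at the borders (an illustrative
    edge-handling choice).  Non-negative samples assumed for rounding."""
    taps = [1, -5, 20, 20, -5, 1]
    n = len(samples)
    out = []
    for i in range(n):
        out.append(samples[i])                       # original sample
        acc = sum(t * samples[min(max(i + k - 2, 0), n - 1)]
                  for k, t in enumerate(taps))       # taps span i-2 .. i+3
        out.append((acc + 16) // 32)                 # normalize and round
    return out

def upsample2x_bilinear(samples):
    """2x upsampling with the simple [1 1]/2 averaging filter used in
    residual image prediction."""
    n = len(samples)
    out = []
    for i in range(n):
        out.append(samples[i])
        out.append((samples[i] + samples[min(i + 1, n - 1)]) // 2)
    return out

print(upsample2x_bilinear([0, 10]))  # [0, 5, 10, 10]
```

The cost difference is visible directly in the code: the 6-tap filter needs six multiplies and a rounding step per interpolated sample, while the bilinear filter needs one addition and one shift, which frames the complexity/performance trade-off discussed below.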
In a practical application, the above solution has the following drawback: in the upsampling process in I_BL prediction, the same relatively complex 6-tap filter is adopted for a chrominance component as for a luminance component, and consequently the calculation complexity of the upsampling process for the chrominance component may be too high.
Furthermore, the [1 1]/2 filter adopted in the upsampling process in residual image prediction is too simple, and consequently may degrade coding performance.
That is, in the prior art, when up-sampling a spatially scalable coded video image, the difference between a chrominance component and a luminance component is not taken into account, and calculation complexity and coding performance are not balanced comprehensively; thus problems such as high calculation complexity or poor coding performance arise in spatial scalable coding of a video image.