Compressed digital video has been widely used in various applications such as video streaming over digital networks and video transmission over digital channels. Very often, a single video content may be delivered over networks with different characteristics. For example, a live sport event may be carried in a high-bandwidth streaming format over broadband networks for premium video service. In such applications, the compressed video usually preserves high resolution and high quality so that the video content is suited for high-definition devices such as an HDTV or a high resolution LCD display. The same content may also be carried through cellular data network so that the content can be watch on a portable device such as a smart phone or a network-connected portable media device. In such applications, due to the network bandwidth concerns as well as the typical low-resolution display on the smart phone or portable devices, the video content usually is compressed into lower resolution and lower bitrates. Therefore, for different network environment and for different applications, the video resolution and video quality requirements are quite different. Even for the same type of network, users may experience different available bandwidths due to different network infrastructure and network traffic condition. Therefore, a user may desire to receive the video at higher quality when the available bandwidth is high and receive a lower-quality, but smooth, video when the network congestion occurs. In another scenario, a high-end media player can handle high-resolution and high bitrate compressed video while a low-cost media player is only capable of handling low-resolution and low bitrate compressed video due to limited computational resources. Accordingly, it is desirable to construct the compressed video in a scalable manner so that videos at different spatial-temporal resolution and/or quality can be derived from the same compressed bitstream.
The joint video team (JVT) of ISO/IEC MPEG and ITU-T VCEG standardized a Scalable Video Coding (SVC) extension of the H.264/AVC standard. An H.264/AVC SVC bitstream can contain video information from low frame-rate, low resolution, and low quality to high frame rate, high definition, and high quality. This single bitstream can be adapted to various applications and displayed on devices with different configurations. Accordingly, H.264/AVC SVC is suitable for various video applications such as video broadcasting, video streaming, and video surveillance to adapt to network infrastructure, traffic condition, user preference, and etc.
In SVC, three types of scalabilities, i.e., temporal scalability, spatial scalability, and quality scalability, are provided. SVC uses multi-layer coding structure to realize the three dimensions of scalability. A main goal of SVC is to generate one scalable bitstream that can be easily and rapidly adapted to the bit-rate requirement associated with various transmission channels, diverse display capabilities, and different computational resources without trans-coding or re-encoding. An important feature of the SVC design is that the scalability is provided at a bitstream level. In other words, bitstreams for deriving video with a reduced spatial and/or temporal resolution can be simply obtained by extracting Network Abstraction Layer (NAL) units (or network packets) from a scalable bitstream that are required for decoding the intended video. NAL units for quality refinement can be additionally truncated in order to reduce the bit-rate and the associated video quality.
In SVC, spatial scalability is supported based on the pyramid coding scheme as shown in FIG. 1. In a SVC system with spatial scalability, the video sequence is first down-sampled to obtain smaller pictures at different spatial resolutions (layers). For example, picture 110 at the original resolution can be processed by spatial decimation 120 to obtain resolution-reduced picture 111. The resolution-reduced picture 111 can be further processed by spatial decimation 121 to obtain further resolution-reduced picture 112 as shown in FIG. 1. In addition to dyadic spatial resolution, where the spatial resolution is reduced to half in each level, SVC also supports arbitrary resolution ratios, which is called extended spatial scalability (ESS). The SVC system in FIG. 1 illustrates an example of spatial scalable system with three layers, where layer 0 corresponds to the pictures with lowest spatial resolution and layer 2 corresponds to the pictures with the highest resolution. The layer-0 pictures are coded without reference to other layers, i.e., single-layer coding. For example, the lowest layer picture 112 is coded using motion-compensated and Intra prediction 130.
The motion-compensated and Intra prediction 130 will generate syntax elements as well as coding related information such as motion information for further entropy coding 140. FIG. 1 actually illustrates a combined SVC system that provides spatial scalability as well as quality scalability (also called SNR scalability). The system may also provide temporal scalability, which is not explicitly shown. For each single-layer coding, the residual coding errors can be refined using SNR enhancement layer coding 150. The SNR enhancement layer in FIG. 1 may provide multiple quality levels (quality scalability). Each supported resolution layer can be coded by respective single-layer motion-compensated and Intra prediction like a non-scalable coding system. Each higher spatial layer may also be coded using inter-layer coding based on one or more lower spatial layers. For example, layer 1 video can be adaptively coded using inter-layer prediction based on layer 0 video or a single-layer coding on a macroblock by macroblock basis or other block unit. Similarly, layer 2 video can be adaptively coded using inter-layer prediction based on reconstructed layer 1 video or a single-layer coding. As shown in FIG. 1, layer-1 pictures 111 can be coded by motion-compensated and Intra prediction 131, base layer entropy coding 141 and SNR enhancement layer coding 151. As shown in FIG. 1, the reconstructed BL video data is also utilized by motion-compensated and Intra prediction 131, where a coding block in spatial layer 1 may use the reconstructed BL video data as an additional Intra prediction data (i.e., no motion compensation is involved). Similarly, layer-2 pictures 110 can be coded by motion-compensated and Intra prediction 132, base layer entropy coding 142 and SNR enhancement layer coding 152. The BL bitstreams and SNR enhancement layer bitstreams from all spatial layers are multiplexed by multiplexer 160 to generate a scalable bitstream. The coding efficiency can be improved due to inter-layer coding. Furthermore, the information required to code spatial layer 1 may depend on reconstructed layer 0 (inter-layer prediction). A higher layer in an SVC system is referred as an enhancement layer. The H.264 SVC provides three types of inter-layer prediction tools: inter-layer motion prediction, inter-layer Intra prediction, and inter-layer residual prediction.
In SVC, the enhancement layer (EL) can reuse the motion information in the base layer (BL) to reduce the inter-layer motion data redundancy. For example, the EL macroblock coding may use a flag, such as base_mode_flag before mb_type is determined to indicate whether the EL motion information is directly derived from the BL. If base_mode_flag is equal to 1, the partitioning data of the EL macroblock along with the associated reference indexes and motion vectors are derived from the corresponding data of the collocated 8×8 block in the BL. The reference picture index of the BL is directly used in the EL. The motion vectors of the EL are scaled from the data associated with the BL. Besides, the scaled BL motion vector can be used as an additional motion vector predictor for the EL.
Inter-layer residual prediction uses the up-sampled BL residual information to reduce the information required for coding the EL residuals. The collocated residual of the BL can be block-wise up-sampled using a bilinear filter and can be used as prediction for the residual of a corresponding macroblock in the EL. The up-sampling of the reference layer residual is done on transform block basis in order to ensure that no filtering is applied across transform block boundaries.
Similar to inter-layer residual prediction, the inter-layer Intra prediction reduces the redundant texture information of the EL. The prediction in the EL is generated by block-wise up-sampling the collocated BL reconstruction signal. In the inter-layer Intra prediction (ILIP, or so called inter-layer texture prediction) up-sampling procedure, 4-tap and 2-tap FIR filters are applied for luma and chroma components, respectively. Different from inter-layer residual prediction, filtering for the inter-layer Intra prediction is always performed across sub-block boundaries. For decoding simplicity, inter-layer Intra prediction can be applied only to the intra-coded macroblocks in the BL.
As shown in FIG. 1, reconstructed video at a lower layer is used for coding by a higher layer. The lower layer video corresponds to a lower spatial or temporal resolution, or lower quality (i.e., lower SNR). When the lower spatial resolution video in a lower layer is used by a higher layer coding, the lower spatial resolution video usually is up-sampled to match the spatial resolution of the higher layer. The up-sampling process artificially increases the spatial resolution. However, it also introduces undesirable artifacts. It is desirable to develop new techniques to use reconstructed video from a lower layer to improve the inter-layer coding efficiency.