Compressed digital video has been widely used in various applications such as video streaming over digital networks and video transmission over digital channels. Very often, a single video content may be delivered over networks with different characteristics. For example, a live sport event may be carried in a high-bandwidth streaming format over broadband networks for premium video service. In such applications, the compressed video usually preserves high resolution and high quality so that the video content is suited for high-definition devices such as an HDTV or a high-resolution LCD display. The same content may also be carried through a cellular data network so that the content can be watched on a portable device such as a smart phone or a network-connected portable media device. In such applications, due to network bandwidth concerns as well as the typically low-resolution display of the smart phone or portable device, the video content is usually compressed into lower resolution and lower bitrates. Therefore, for different network environments and for different applications, the video resolution and video quality requirements are quite different. Even for the same type of network, users may experience different available bandwidths due to different network infrastructures and network traffic conditions. Therefore, a user may desire to receive the video at higher quality when the available bandwidth is high and to receive lower-quality, but smooth, video when network congestion occurs. In another scenario, a high-end media player can handle high-resolution and high-bitrate compressed video, while a low-cost media player is only capable of handling low-resolution and low-bitrate compressed video due to limited computational resources. Accordingly, it is desirable to construct the compressed video in a scalable manner so that video at different spatial-temporal resolutions and/or qualities can be derived from the same compressed bitstream.
In the current H.264/AVC video standard, there is an extension of the H.264/AVC standard, named Scalable Video Coding (SVC). SVC provides temporal, spatial, and quality scalabilities based on a single bitstream. The SVC bitstream contains scalable video information ranging from low frame rate, low resolution, and low quality to high frame rate, high resolution, and high quality. Accordingly, SVC is suitable for various video applications, such as video broadcasting, video streaming, and video surveillance, to adapt to network infrastructure, traffic conditions, user preference, etc.
In SVC, three types of scalabilities, i.e., temporal scalability, spatial scalability, and quality scalability, are provided. SVC uses a multi-layer coding structure to realize the three dimensions of scalability. A main goal of SVC is to generate one scalable bitstream that can be easily and rapidly adapted to the bit-rate requirements associated with various transmission channels, diverse display capabilities, and different computational resources without trans-coding or re-encoding. An important feature of the SVC design is that the scalability is provided at the bitstream level. In other words, a bitstream for deriving video with a reduced spatial and/or temporal resolution can be obtained simply by extracting, from a scalable bitstream, the Network Abstraction Layer (NAL) units (or network packets) required for decoding the intended video. NAL units for quality refinement can additionally be truncated in order to reduce the bit-rate and the associated video quality.
For example, temporal scalability can be derived from a hierarchical coding structure based on B-pictures according to the H.264/AVC standard. FIG. 1 illustrates an example of a hierarchical B-picture structure with four temporal layers and a Group of Pictures (GOP) of eight pictures. Pictures 0 and 8 in FIG. 1 are called key pictures. Inter prediction of key pictures only uses previous key pictures as references. Other pictures between two key pictures are predicted hierarchically. Video having only the key pictures forms the coarsest temporal resolution of the scalable system. Temporal scalability is achieved by progressively refining a lower-level (coarser) video by adding more B-pictures corresponding to enhancement layers of the scalable system. In the example of FIG. 1, picture 4 is first bi-directionally predicted using the key pictures, i.e., pictures 0 and 8, after the two key pictures are coded. After picture 4 is processed, pictures 2 and 6 are processed. Picture 2 is bi-directionally predicted using pictures 0 and 4, and picture 6 is bi-directionally predicted using pictures 4 and 8. After pictures 2 and 6 are coded, the remaining pictures, i.e., pictures 1, 3, 5, and 7, are predicted bi-directionally using the two respective neighboring pictures, as shown in FIG. 1. Accordingly, the processing order for the GOP is 0, 8, 4, 2, 6, 1, 3, 5, and 7. The pictures processed according to the hierarchical process of FIG. 1 result in four hierarchical levels, where pictures 0 and 8 belong to the first temporal order, picture 4 belongs to the second temporal order, pictures 2 and 6 belong to the third temporal order, and pictures 1, 3, 5, and 7 belong to the fourth temporal order. Decoding the base-level pictures and adding the higher temporal-order pictures provides video at a higher level. For example, base-level pictures 0 and 8 can be combined with second temporal-order picture 4 to form the second-level pictures.
Further adding the third temporal-order pictures to the second-level video forms the third-level video. Similarly, adding the fourth temporal-order pictures to the third-level video forms the fourth-level video. Accordingly, temporal scalability is achieved. If the original video has a frame rate of 30 frames per second, the base-level video has a frame rate of 30/8 = 3.75 frames per second. The second-level, third-level, and fourth-level videos correspond to 7.5, 15, and 30 frames per second, respectively. The first temporal-order pictures are also called the base-level video or base-level pictures. The second through fourth temporal-order pictures are also called the enhancement-level video or enhancement-level pictures. In addition to enabling temporal scalability, the coding structure of hierarchical B-pictures also improves coding efficiency over the typical IBBP GOP structure, at the cost of increased encoding-decoding delay.
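The coding order and temporal-layer assignment described above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not code from any standard or reference encoder; the function name and the zero-based layer numbering (layer 0 corresponds to the "first temporal order" above) are assumptions made for the example.

```python
# Illustrative sketch: derive the hierarchical-B coding order and
# temporal layer for a GOP, mirroring the order 0, 8, 4, 2, 6, 1, 3,
# 5, 7 described above. Layer numbering here is zero-based.

def hierarchical_b_order(gop_size):
    """Return (coding_order, temporal_layer) for pictures 0..gop_size."""
    order = [0, gop_size]          # key pictures are coded first
    layer = {0: 0, gop_size: 0}    # layer 0 = base level (key pictures)
    step = gop_size // 2
    level = 1
    while step >= 1:
        # each pass inserts the midpoints of the previous level's gaps
        for pic in range(step, gop_size, 2 * step):
            order.append(pic)
            layer[pic] = level
        step //= 2
        level += 1
    return order, layer

order, layer = hierarchical_b_order(8)
print(order)                         # [0, 8, 4, 2, 6, 1, 3, 5, 7]
print(layer[4], layer[2], layer[1])  # 1 2 3
```

Dropping all pictures above a given layer yields the frame-rate reduction described above: keeping only layer 0 of a 30 fps sequence leaves 3.75 fps, and each added layer doubles the rate.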
In SVC, spatial scalability is supported based on the pyramid coding scheme shown in FIG. 2. In an SVC system with spatial scalability, the video sequence is first down-sampled to obtain smaller pictures at different spatial resolutions (layers). For example, picture 210 at the original resolution can be processed by spatial decimation 220 to obtain resolution-reduced picture 211. The resolution-reduced picture 211 can be further processed by spatial decimation 221 to obtain further resolution-reduced picture 212, as shown in FIG. 2. In addition to dyadic spatial resolution, where the spatial resolution is halved at each level, SVC also supports arbitrary resolution ratios, which is called extended spatial scalability (ESS). The SVC system in FIG. 2 illustrates an example of a spatial scalable system with three layers, where layer 0 corresponds to the pictures with the lowest spatial resolution and layer 2 corresponds to the pictures with the highest resolution. The layer-0 pictures are coded without reference to other layers, i.e., by single-layer coding. For example, the lowest-layer picture 212 is coded using motion-compensated and intra prediction 230.
The motion-compensated and intra prediction 230 generates syntax elements as well as coding-related information, such as motion information, for further entropy coding 240. FIG. 2 actually illustrates a combined SVC system that provides spatial scalability as well as quality scalability (also called SNR scalability). The system may also provide temporal scalability, which is not explicitly shown. For each single-layer coding, the residual coding errors can be refined using SNR enhancement layer coding 250. The SNR enhancement layer in FIG. 2 may provide multiple quality levels (quality scalability). Each supported resolution layer can be coded by respective single-layer motion-compensated and intra prediction, like a non-scalable coding system. Each higher spatial layer may also be coded using inter-layer coding based on one or more lower spatial layers. For example, layer-1 video can be adaptively coded using either inter-layer prediction based on layer-0 video or single-layer coding, on a macroblock-by-macroblock basis or another block unit. Similarly, layer-2 video can be adaptively coded using either inter-layer prediction based on reconstructed layer-1 video or single-layer coding. As shown in FIG. 2, layer-1 pictures 211 can be coded by motion-compensated and intra prediction 231, base layer entropy coding 241, and SNR enhancement layer coding 251. Similarly, layer-2 pictures 210 can be coded by motion-compensated and intra prediction 232, base layer entropy coding 242, and SNR enhancement layer coding 252. The coding efficiency can be improved by inter-layer coding. Furthermore, the information required to code spatial layer 1 may depend on reconstructed layer 0 (inter-layer prediction). The inter-layer differences are termed the enhancement layers. The H.264 SVC provides three types of inter-layer prediction tools: inter-layer motion prediction, inter-layer intra prediction, and inter-layer residual prediction.
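The pyramid construction described above can be illustrated with a short sketch. This is an assumption-laden toy example: a simple 2×2 averaging filter stands in for the encoder's actual decimation filter, and the picture is a small synthetic array rather than real video.

```python
# Illustrative sketch of the dyadic spatial pyramid of FIG. 2:
# each spatial decimation step halves width and height.
# (2x2 averaging is a stand-in for the real decimation filter.)

def decimate(picture):
    """Halve width and height by averaging each 2x2 block."""
    h, w = len(picture), len(picture[0])
    return [[(picture[2*y][2*x] + picture[2*y][2*x+1] +
              picture[2*y+1][2*x] + picture[2*y+1][2*x+1]) / 4.0
             for x in range(w // 2)]
            for y in range(h // 2)]

# Synthetic 8x8 "original" picture (stand-in for picture 210)
layer2 = [[float(x + y) for x in range(8)] for y in range(8)]
layer1 = decimate(layer2)   # resolution-reduced picture (cf. 211), 4x4
layer0 = decimate(layer1)   # further reduced picture (cf. 212), 2x2
# Layer 0 is coded first by single-layer coding; higher layers may
# additionally use inter-layer prediction from the layer below.
```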
In SVC, the enhancement layer (EL) can reuse the motion information of the base layer (BL) to reduce inter-layer motion data redundancy. For example, the EL macroblock coding may use a flag, such as base_mode_flag signaled before mb_type is determined, to indicate whether the EL motion information is directly derived from the BL. If base_mode_flag is equal to 1, the partitioning data of the EL macroblock, together with the associated reference indexes and motion vectors, are derived from the corresponding data of the co-located 8×8 block in the BL. The reference picture index of the BL is directly used in the EL. The motion vectors of the EL are scaled from the data associated with the BL. In addition, the scaled BL motion vector can be used as an additional motion vector predictor for the EL.
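The motion-vector scaling mentioned above can be sketched as follows. This is a hedged illustration, not the normative derivation from the SVC specification: the function name and the integer arithmetic are assumptions, and real codecs operate on fixed-point motion vectors with specified rounding.

```python
# Illustrative sketch of inter-layer motion derivation: when the EL
# reuses BL motion (e.g., base_mode_flag == 1), the BL motion vector
# is scaled by the spatial resolution ratio between the two layers.

def scale_bl_motion_vector(bl_mv, bl_size, el_size):
    """Scale a BL motion vector (e.g., in quarter-pel units) to the
    EL resolution. bl_size/el_size are (width, height) tuples."""
    mvx, mvy = bl_mv
    bl_w, bl_h = bl_size
    el_w, el_h = el_size
    return (mvx * el_w // bl_w, mvy * el_h // bl_h)

# Dyadic case: the EL is twice the BL resolution, so vectors double.
print(scale_bl_motion_vector((3, -5), (176, 144), (352, 288)))  # (6, -10)
```

The scaled vector can either be used directly or serve as an additional motion vector predictor for the EL, as noted above.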
Inter-layer residual prediction uses the up-sampled BL residual information to reduce the information needed to code the EL residuals. The co-located residual of the BL can be block-wise up-sampled using a bilinear filter and used as the prediction for the residual of a current macroblock in the EL. The up-sampling of the reference layer residual is performed on a transform block basis in order to ensure that no filtering is applied across transform block boundaries.
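The block-wise bilinear up-sampling can be sketched as below. This is an illustrative approximation, not the normative SVC filter: the half-sample phase mapping and edge clamping are assumptions chosen to show the key property, namely that only samples inside one transform block are used, so no filtering crosses block boundaries.

```python
# Illustrative sketch of per-transform-block bilinear 2x up-sampling
# of BL residuals. Samples outside the block are never touched;
# coordinates are clamped to the block's own edges.
import math

def upsample_block_2x(block):
    """Bilinearly up-sample one square transform block by 2x, using
    only samples inside the block (no cross-boundary filtering)."""
    n = len(block)
    out = []
    for y in range(2 * n):
        sy = max(0.0, min((y - 0.5) / 2.0, n - 1.0))  # source row
        y0 = int(math.floor(sy))
        y1 = min(y0 + 1, n - 1)
        fy = sy - y0
        row = []
        for x in range(2 * n):
            sx = max(0.0, min((x - 0.5) / 2.0, n - 1.0))  # source col
            x0 = int(math.floor(sx))
            x1 = min(x0 + 1, n - 1)
            fx = sx - x0
            top = block[y0][x0] * (1 - fx) + block[y0][x1] * fx
            bot = block[y1][x0] * (1 - fx) + block[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out
```

The up-sampled block then serves as the prediction for the co-located EL residual, and only the difference needs to be coded.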
Similar to inter-layer residual prediction, inter-layer intra prediction reduces the redundant texture information of the EL. The prediction in the EL is generated by block-wise up-sampling the co-located BL reconstruction signal. In the inter-layer intra prediction up-sampling procedure, 4-tap and 2-tap FIR filters are applied to the luma and chroma components, respectively. Unlike inter-layer residual prediction, filtering for inter-layer intra prediction is always performed across sub-block boundaries. For decoding simplicity, inter-layer intra prediction can be restricted to the case where the co-located macroblocks in the BL are intra-coded.
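The contrast with residual prediction can be shown with a short sketch of a 4-tap interpolation applied across a whole row of reconstructed samples. The tap values below are illustrative placeholders, not the normative SVC luma filter, and the edge clamping is an assumption; the point is that the filter window spans sub-block boundaries freely.

```python
# Illustrative sketch: a symmetric 4-tap half-sample interpolation
# applied across an entire row of BL reconstruction samples, so the
# filter window crosses sub-block boundaries (unlike residual
# prediction, which is confined to one transform block).
TAPS = (-1, 5, 5, -1)  # placeholder taps, normalized by 8 below

def interpolate_row_2x(row):
    """Return a 2x up-sampled row: integer-position samples are
    copied, half-sample positions are filtered over 4 neighbors
    (indices clamped at the row edges)."""
    n = len(row)
    out = []
    for i in range(n):
        out.append(row[i])
        half = sum(t * row[min(max(i - 1 + k, 0), n - 1)]
                   for k, t in enumerate(TAPS))
        out.append(half / 8.0)
    return out
```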
In SVC, quality scalability is realized by coding multiple quality ELs, which are composed of refinement coefficients. The scalable video bitstream can be easily truncated or extracted to provide different video bitstreams with different video qualities or bitstream sizes. In SVC, quality scalability (also called SNR scalability) can be provided via two strategies: coarse grain scalability (CGS) and medium grain scalability (MGS). CGS can be regarded as a special case of spatial scalability, where the spatial resolutions of the BL and the EL are the same. However, the quality of the EL is better (the QP of the EL is smaller than the QP of the BL). The same inter-layer prediction mechanism as in spatial scalable coding can be employed. However, no corresponding up-sampling or deblocking operations are performed. Furthermore, the inter-layer intra and residual prediction are performed directly in the transform domain. For the inter-layer prediction in CGS, a refinement of texture information is typically achieved by re-quantizing the residual signal in the EL with a smaller quantization step size than that used for the preceding CGS layer. CGS can provide multiple pre-defined quality points.
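The re-quantization refinement can be sketched with scalar quantization. This is a minimal toy model under stated assumptions: the uniform rounding quantizer, the step sizes (8 for the BL, 2 for the EL), and the sample coefficients are all illustrative, not the normative SVC quantizer design.

```python
# Illustrative sketch of CGS-style quality refinement: the EL
# re-quantizes the BL's quantization error with a smaller step size,
# so adding the EL refinement tightens the reconstruction.

def quantize(coeffs, step):
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    return [lv * step for lv in levels]

coeffs = [37.0, -12.0, 5.0, 2.0]      # toy transform coefficients
bl_levels = quantize(coeffs, 8)       # coarse BL, Qstep = 8
bl_recon = dequantize(bl_levels, 8)
residual = [c - r for c, r in zip(coeffs, bl_recon)]
el_levels = quantize(residual, 2)     # finer EL refinement, Qstep = 2
el_recon = [r + d for r, d in
            zip(bl_recon, dequantize(el_levels, 2))]
```

Decoding only the BL yields the coarse quality point; decoding the BL plus the EL refinement yields the higher quality point, matching the pre-defined quality points mentioned above.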
To provide finer bit-rate granularity while maintaining reasonable complexity for quality scalability, MGS is used by H.264 SVC. MGS can be considered an extension of CGS, where the quantized coefficients in one CGS slice can be divided into several MGS slices. The quantized coefficients in CGS are classified into 16 categories based on their scan positions in the zig-zag scan order. These 16 categories of coefficients can be distributed into different slices to provide more quality extraction points than CGS.
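The distribution of scan positions into slices can be sketched as follows. The 3-way split chosen below is an assumption for illustration; the standard allows other groupings of the 16 scan-position categories.

```python
# Illustrative sketch of MGS: the 16 zig-zag scan positions of a 4x4
# transform block are partitioned into MGS slices; a truncation point
# after any slice gives an extra quality extraction point.
SLICE_SPLITS = [range(0, 4), range(4, 8), range(8, 16)]  # assumed split

def split_mgs(levels_in_scan_order):
    """Distribute one CGS slice's 16 coefficients into MGS slices;
    positions outside a slice's range are carried as zero there."""
    return [[lv if i in positions else 0
             for i, lv in enumerate(levels_in_scan_order)]
            for positions in SLICE_SPLITS]

cgs = [9, 5, 4, 3, 2, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1]
mgs = split_mgs(cgs)
# Summing all MGS slices position-wise reconstructs the CGS slice.
assert [sum(col) for col in zip(*mgs)] == cgs
```

Dropping the last slice(s) keeps the low-frequency coefficients and discards only high-frequency refinement, which is how MGS offers finer extraction points than whole-layer CGS truncation.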
The current HEVC provides only single-layer coding based on the hierarchical-B coding structure, without any spatial scalability or quality scalability. It is desirable to provide the capability of spatial scalability and quality scalability for the current HEVC. Furthermore, it is desirable to provide an improved SVC over the H.264 SVC to achieve higher efficiency and/or more flexibility.