Compressed digital video has been widely used in various applications such as video streaming over digital networks and video transmission over digital channels. Very often, a single video content may be delivered over networks with different characteristics. For example, a live sport event may be carried in a high-bandwidth streaming format over broadband networks for premium video service. In such applications, the compressed video usually preserves high resolution and high quality so that the video content is suited for high-definition devices such as an HDTV or a high resolution LCD display. The same content may also be carried through cellular data network so that the content can be watch on a portable device such as a smart phone or a network-connected portable media device. In such applications, due to the network bandwidth concerns as well as the typical low-resolution display on the smart phone or portable devices, the video content usually is compressed into lower resolution and lower bitrates. Therefore, for different network environment and for different applications, the video resolution and video quality requirements are quite different. Even for the same type of network, users may experience different available bandwidths due to different network infrastructure and network traffic condition. Therefore, a user may desire to receive the video at higher quality when the available bandwidth is high and receive a lower-quality, but smooth, video when the network congestion occurs. In another scenario, a high-end media player can handle high-resolution and high bitrate compressed video while a low-cost media player is only capable of handling low-resolution and low bitrate compressed video due to limited computational resources. Accordingly, it is desirable to construct the compressed video in a scalable manner so that videos at different spatial-temporal resolution and/or quality can be derived from the same compressed bitstream.
The joint video team (JVT) of ISO/IEC MPEG and ITU-T VCEG standardized a Scalable Video Coding (SVC) extension of the H.264/AVC standard. An H.264/AVC SVC bitstream can contain video information from low frame-rate, low resolution, and low quality to high frame rate, high definition, and high quality. This single bitstream can be adapted to various applications and displayed on devices with different configurations. Accordingly, H.264/AVC SVC is suitable for various video applications such as video broadcasting, video streaming, and video surveillance to adapt to network infrastructure, traffic condition, user preference, and etc.
In SVC, three types of scalabilities, i.e., temporal scalability, spatial scalability, and quality scalability, are provided. SVC uses multi-layer coding structure to realize the three dimensions of scalability. A main goal of SVC is to generate one scalable bitstream that can be easily and rapidly adapted to the bit-rate requirement associated with various transmission channels, diverse display capabilities, and different computational resources without trans-coding or re-encoding. An important feature of the SVC design is that the scalability is provided at a bitstream level. In other words, bitstreams for deriving video with a reduced spatial and/or temporal resolution can be simply obtained by extracting Network Abstraction Layer (NAL) units (or network packets) from a scalable bitstream that are required for decoding the intended video. NAL units for quality refinement can be additionally truncated in order to reduce the bit-rate and the associated video quality.
In SVC, spatial scalability is supported based on the pyramid coding scheme as shown in FIG. 1. In a SVC system with spatial scalability, the video sequence is first down-sampled to obtain smaller pictures at different spatial resolutions (layers). For example, picture 110 at the original resolution can be processed by spatial decimation 120 to obtain resolution-reduced picture 111. The resolution-reduced picture 111 can be further processed by spatial decimation 121 to obtain further resolution-reduced picture 112 as shown in FIG. 1. In addition to dyadic spatial resolution, where the spatial resolution is reduced to half in each level, SVC also supports arbitrary resolution ratios, which is called extended spatial scalability (ESS). The SVC system in FIG. 1 illustrates an example of spatial scalable system with three layers, where layer 0 corresponds to the pictures with lowest spatial resolution and layer 2 corresponds to the pictures with the highest resolution. The layer-0 pictures are coded without reference to other layers, i.e., single-layer coding. For example, the lowest layer picture 112 is coded using motion-compensated and Intra prediction 130.
The motion-compensated and Intra prediction 130 will generate syntax elements as well as coding related information such as motion information for further entropy coding 140. FIG. 1 actually illustrates a combined SVC system that provides spatial scalability as well as quality scalability (also called SNR scalability). The system may also provide temporal scalability, which is not explicitly shown. For each single-layer coding, the residual coding errors can be refined using SNR enhancement layer coding 150. The SNR enhancement layer in FIG. 1 may provide multiple quality levels (quality scalability). Each supported resolution layer can be coded by respective single-layer motion-compensated and Intra prediction like a non-scalable coding system. Each higher spatial layer may also be coded using inter-layer coding based on one or more lower spatial layers. For example, layer 1 video can be adaptively coded using inter-layer prediction based on layer 0 video or a single-layer coding on a macroblock by macroblock basis or other block unit. Similarly, layer 2 video can be adaptively coded using inter-layer prediction based on reconstructed layer 1 video or a single-layer coding. As shown in FIG. 1, layer-1 pictures 111 can be coded by motion-compensated and Intra prediction 131, base layer entropy coding 141 and SNR enhancement layer coding 151. As shown in FIG. 1, the reconstructed BL video data is also utilized by motion-compensated and Intra prediction 131, where a coding block in spatial layer 1 may use the reconstructed BL video data as an additional Intra prediction data (i.e., no motion compensation is involved). Similarly, layer-2 pictures 110 can be coded by motion-compensated and Intra prediction 132, base layer entropy coding 142 and SNR enhancement layer coding 152. The BL bitstreams and SNR enhancement layer bitstreams from all spatial layers are multiplexed by multiplexer 160 to generate a scalable bitstream. The coding efficiency can be improved due to inter-layer coding. Furthermore, the information required to code spatial layer 1 may depend on reconstructed layer 0 (inter-layer prediction). A higher layer in an SVC system is referred as an enhancement layer. The H.264 SVC provides three types of inter-layer prediction tools: inter-layer motion prediction, inter-layer Intra prediction, and inter-layer residual prediction.
In SVC, the enhancement layer (EL) can reuse the motion information in the base layer (BL) to reduce the inter-layer motion data redundancy. For example, the EL macroblock coding may use a flag, such as base_mode_flag before mb_type is determined to indicate whether the EL motion information is directly derived from the BL. If base_mode_flag is equal to 1, the partitioning data of the EL macroblock along with the associated reference indexes and motion vectors are derived from the corresponding data of the collocated 8×8 block in the BL. The reference picture index of the BL is directly used in the EL. The motion vectors of the EL are scaled from the data associated with the BL. Besides, the scaled BL motion vector can be used as an additional motion vector predictor for the EL.
Inter-layer residual prediction uses the up-sampled BL residual information to reduce the information required for coding the EL residuals. The collocated residual of the BL can be block-wise up-sampled using a bilinear filter and can be used as prediction for the residual of a corresponding macroblock in the EL. The up-sampling of the reference layer residual is done on transform block basis in order to ensure that no filtering is applied across transform block boundaries.
Similar to inter-layer residual prediction, the inter-layer Intra prediction reduces the redundant texture information of the EL. The prediction in the EL is generated by block-wise up-sampling the collocated BL reconstruction signal. In the inter-layer Intra prediction up-sampling procedure, 4-tap and 2-tap FIR filters are applied for luma and chroma components, respectively. Different from inter-layer residual prediction, filtering for the inter-layer Intra prediction is always performed across sub-block boundaries. For decoding simplicity, inter-layer Intra prediction can be applied only to the intra-coded macroblocks in the BL.
As shown in FIG. 1, reconstructed video at a lower layer is used for coding by a higher layer. The lower layer video corresponds to lower spatial or temporal resolution, or lower quality (i.e., lower SNR). When the lower spatial resolution video in a lower layer is used by a higher layer coding, the lower spatial resolution video usually is up-sampled to match the spatial resolution of the higher layer. The up-sampling process artificially increases the spatial resolution. However, it also introduces undesirable artifacts. It is desirable to develop new techniques to use reconstructed video at a layer with lower spatial resolution to improve the inter-layer coding efficiency.
In HEVC, two new in-loop filters have been developed to improve the coding efficiency. One of the in-loop filters is called sample adaptive offset (SAO), which performs pixel classification on the reconstructed pictures and then derive an offset for each group of pixels to reduce the distortion between original pictures and reconstructed pictures. The other in-loop filter is called adaptive loop filter (ALF), which is applied as a set of Wiener filters to minimize the mean-squared error (MSE) between the reconstructed and original pictures. FIG. 2 illustrates an exemplary system diagram for an HEVC encoder, where SAO 231 and ALF 232 are applied to reconstructed video data processed by deblocking filter (DF) 230.
FIG. 2 illustrates an exemplary adaptive inter/Intra video encoding system incorporating in-loop processing. For inter-prediction, Motion Estimation (ME)/Motion Compensation (MC) 212 is used to provide prediction data based on reconstructed video data corresponding to other picture frames. Mode decision 214 selects Intra Prediction 210 or inter-prediction (i.e., ME/MC processed) data 212 and the selected prediction data is supplied to Adder 216 to form prediction errors, also called residues. The prediction error is then processed by Transformation (T) 218 followed by Quantization (Q) 220. The transformed and quantized residues are then coded by Entropy Encoder 222 to form a video bitstream corresponding to the compressed video data. The bitstream associated with the transform coefficients is then packed with side information such as motion, mode, and other information associated with the image area. The side information may also be processed by entropy coding to reduce required bandwidth. Accordingly, the data associated with the side information of SAO and ALF are also provided to Entropy Encoder 222 as shown in FIG. 2. When an inter-prediction mode is used, a reference picture or pictures have to be reconstructed at the encoder end as well. Consequently, the transformed and quantized residues are processed by Inverse Quantization (IQ) 224 and Inverse Transformation (IT) 226 to recover the residues. The residues are then added back to prediction data 236 at Reconstruction (REC) 228 to reconstruct the video data. The reconstructed video data may be stored in Reference Picture Buffer 234 and used for prediction of other frames.
As shown in FIG. 2, incoming video data undergoes a series of processing in the encoding system. The reconstructed video data from REC 228 may be subject to various impairments due to a series of processing. Accordingly, various in-loop processing is applied to the reconstructed video data before the reconstructed video data are stored in the Reference Picture Buffer 234 in order to improve video quality. In the High Efficiency Video Coding (HEVC) standard being developed, Deblocking Filter (DF) 230 and Sample Adaptive Offset (SAO) 231 have been developed to enhance picture quality. The filter operations of SAO 231 and ALF 232 are adaptive and filter information may have to be incorporated in the bitstream so that a decoder can properly recover the required information. Therefore, loop filter information from SAO 231 and ALF 232 is provided to Entropy Encoder 222 for incorporation into the bitstream. In FIG. 2, DF 230 is applied to the reconstructed video first and SAO 231 and ALF 232 are then applied to DF-processed video.
The coding process in HEVC is applied to each image region named Largest Coding Unit (LCU). The LCU is adaptively partitioned into coding units using quadtree. The LCU is also called Coding Tree Block (CTB). For each leaf CU, DF is performed for each 8×8 block in HEVC. For each 8×8 block, horizontal filtering across vertical block boundaries is first applied, and then vertical filtering across horizontal block boundaries is applied.
In SAO, pixel classification is first done to classify pixels into different categories. Upon the classification of all pixels in a picture or a region, one offset is derived and transmitted for each group of pixels. For SAO, one picture is divided into multiple LCU-aligned regions. Each region can select one SAO type among two Band Offset (BO) types, four Edge Offset (EO) types, and no processing (OFF). For each to-be-processed (also called to-be-filtered) pixel, BO uses the pixel intensity to classify the pixel into a band. As for EO, it uses two neighboring pixels of a to-be-processed pixel to classify the pixel into a category. The four EO types correspond to 0°, 90°, 135°, and 45° as shown in FIG. 3. Similar to BO, one offset value is derived for all pixels of each category except for category 0, where Category 0 is forced to use zero offset. Table 1 shows the EO pixel classification, where “C” denotes the pixel to be classified. Therefore, four offset values are coded for each coding tree block (CTB) or Largest Coding Unit (LCU) when EO types are used.
TABLE 1CategCondition1C < two neighbors2C < one neighbor && C == one3C > one neighbor && C == one4C > two neighbors0None of the above
In a single layer coding system such as HEVC, the coding efficiency has been benefited from the use of in-loop filters. It is desirable to apply in-loop filters to scalable video system to improve the coding efficiency.