Video streaming has become a mainstream means of video delivery. Supported by ubiquitous high-speed internet and mobile networks, video content can be delivered to end users for viewing on different platforms at different qualities. To fulfill the requirements of various video streaming applications, a video source may have to be processed or stored at different resolutions, frame rates, and/or qualities. This would result in a fairly complicated system and require high overall bandwidth or large overall storage space. One solution that satisfies requirements for different resolutions, frame rates, qualities and/or bitrates is scalable video coding. Besides various proprietary development efforts to address this problem, there is also an existing video standard for scalable video coding. The Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG has standardized a Scalable Video Coding (SVC) extension to the H.264/AVC standard. An H.264/AVC SVC bitstream can contain video information ranging from low frame rate, low resolution and low quality to high frame rate, high definition and high quality. This single bitstream can be adapted to a specific application by properly configuring the scalability of the bitstream. For example, the complete bitstream corresponding to a high-definition video can be delivered over high-speed networks to provide full quality for viewing on a large-screen TV. A portion of the bitstream corresponding to a low-resolution version of the high-definition video can be delivered over legacy cellular networks for viewing on handheld/mobile devices. Accordingly, a bitstream generated using H.264/AVC SVC is suitable for various video applications such as video broadcasting, video streaming, and surveillance.
In SVC, three types of scalabilities, i.e., temporal scalability, spatial scalability, and quality scalability are provided. SVC uses a multi-layer coding structure to render three dimensions of scalability. The concept of SVC is to generate one scalable bitstream that can be easily and quickly adapted to fit the bit-rate of various transmission channels, diverse display capabilities, and/or different computational resources without the need of transcoding or re-encoding. An important feature of SVC design is to provide scalability at the bitstream level. Bitstreams for a reduced spatial and/or temporal resolution can be simply obtained by discarding NAL units (or network packets) that are not required for decoding the target resolution. NAL units for quality refinement can be additionally truncated in order to reduce the bit-rate and/or the corresponding video quality.
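The bitstream-level adaptation described above can be sketched as a simple filter over NAL units. The following is a minimal illustration only: the NalUnit class and the extract_substream function are hypothetical names, and the three identifier fields are a simplified stand-in for the layer identifiers carried in an SVC NAL unit header. A real extractor would also have to respect inter-layer dependencies signalled in the bitstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnit:
    # Hypothetical, simplified stand-in for an SVC NAL unit header.
    dependency_id: int  # spatial layer index
    temporal_id: int    # temporal layer index
    quality_id: int     # quality (SNR) refinement level
    payload: bytes = b""


def extract_substream(nal_units, max_dependency_id, max_temporal_id, max_quality_id):
    """Keep only the NAL units needed for the target operating point.

    Units above the requested spatial, temporal, or quality level are
    discarded; nothing is transcoded or re-encoded.
    """
    return [
        nal for nal in nal_units
        if nal.dependency_id <= max_dependency_id
        and nal.temporal_id <= max_temporal_id
        and nal.quality_id <= max_quality_id
    ]


# Toy stream: 3 spatial layers x 3 temporal layers x 2 quality levels.
stream = [NalUnit(d, t, q) for d in range(3) for t in range(3) for q in range(2)]

# Lowest resolution, reduced frame rate, base quality only.
low = extract_substream(stream, max_dependency_id=0, max_temporal_id=1, max_quality_id=0)
```

In this toy stream, the low operating point retains only the units of spatial layer 0 at the two lowest temporal levels and base quality.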
In the H.264/AVC SVC extension, spatial scalability is supported based on the pyramid coding. First, the video sequence is down-sampled to smaller pictures with different spatial resolutions (layers). The lowest layer (i.e., the layer with lowest spatial resolution) is called a base layer (BL). Any layer above the base layer is called an enhancement layer (EL). In addition to dyadic spatial resolution, the H.264/AVC SVC extension also supports arbitrary resolution ratios, which is called extended spatial scalability (ESS). In order to improve the coding efficiency of the enhancement layers (video layers with larger resolutions), various inter-layer prediction schemes have been disclosed in the literature. Three inter-layer prediction tools have been adopted in SVC, including inter-layer motion prediction, inter-layer Intra prediction and inter-layer residual prediction (e.g., C. Andrew Segall and Gary J. Sullivan, “Spatial Scalability Within the H.264/AVC Scalable Video Coding Extension”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 17, No. 9, Pages 1121-1135, September 2007).
In SVC, spatial scalability is supported based on the pyramid coding scheme as shown in FIG. 1. In an SVC system with spatial scalability, the video sequence is first down-sampled to obtain smaller pictures at different spatial resolutions (layers). For example, picture 110 at the original resolution can be processed by spatial decimation 120 to obtain resolution-reduced picture 111. The resolution-reduced picture 111 can be further processed by spatial decimation 121 to obtain further resolution-reduced picture 112 as shown in FIG. 1. In addition to dyadic spatial resolution, where the spatial resolution is reduced by half at each level, SVC also supports arbitrary resolution ratios, which is called extended spatial scalability (ESS). The SVC system in FIG. 1 illustrates an example of a spatially scalable system with three layers, where layer 0 corresponds to the pictures with the lowest spatial resolution and layer 2 corresponds to the pictures with the highest resolution. The layer-0 pictures are coded without reference to other layers, i.e., using single-layer coding. For example, the lowest-layer picture 112 is coded using motion-compensated and Intra prediction 130.
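The dyadic three-layer pyramid described above can be sketched as follows. This is an illustration under stated assumptions: the function names decimate_by_two and build_pyramid are hypothetical, and the 2×2 averaging used for spatial decimation is a simple stand-in, since the text does not specify the down-sampling filter.

```python
def decimate_by_two(picture):
    """Halve width and height by averaging each 2x2 block (assumed filter)."""
    height, width = len(picture), len(picture[0])
    return [
        [
            (picture[y][x] + picture[y][x + 1]
             + picture[y + 1][x] + picture[y + 1][x + 1] + 2) // 4
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


def build_pyramid(picture, num_layers=3):
    """Return [layer 0, ..., layer N-1], lowest resolution first."""
    layers = [picture]
    for _ in range(num_layers - 1):
        layers.append(decimate_by_two(layers[-1]))
    return layers[::-1]


# Toy 8x8 "picture" standing in for picture 110 at the original resolution.
original = [[(x + y) % 256 for x in range(8)] for y in range(8)]
pyramid = build_pyramid(original)

# (height, width) per layer: layer 0 is the smallest, layer 2 the original.
sizes = [(len(p), len(p[0])) for p in pyramid]
```

Here pyramid[2] plays the role of picture 110, pyramid[1] of picture 111, and pyramid[0] of picture 112.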
The motion-compensated and Intra prediction 130 will generate syntax elements as well as coding related information such as motion information for further entropy coding 140. FIG. 1 actually illustrates a combined SVC system that provides spatial scalability as well as quality scalability (also called SNR scalability). The system may also provide temporal scalability, which is not explicitly shown. For each single-layer coding, the residual coding errors can be refined using SNR enhancement layer coding 150. The SNR enhancement layer in FIG. 1 may provide multiple quality levels (quality scalability). Each supported resolution layer can be coded by respective single-layer motion-compensated and Intra prediction like a non-scalable coding system. Each higher spatial layer may also be coded using inter-layer coding based on one or more lower spatial layers. For example, layer 1 video can be adaptively coded using inter-layer prediction based on layer 0 video or a single-layer coding on a macroblock by macroblock basis or other block unit. Similarly, layer 2 video can be adaptively coded using inter-layer prediction based on reconstructed layer 1 video or a single-layer coding. As shown in FIG. 1, layer-1 pictures 111 can be coded by motion-compensated and Intra prediction 131, base layer entropy coding 141 and SNR enhancement layer coding 151. Similarly, layer-2 pictures 110 can be coded by motion-compensated and Intra prediction 132, base layer entropy coding 142 and SNR enhancement layer coding 152. The coding efficiency can be improved due to inter-layer coding. Furthermore, the information required to code spatial layer 1 may depend on reconstructed layer 0 (inter-layer prediction). Higher spatial resolution layers such as layer 1 and layer 2 are termed as the enhancement layers (EL). The H.264 SVC provides three types of inter-layer prediction tools: inter-layer motion prediction, inter-layer Intra prediction, and inter-layer residual prediction.
In SVC, the enhancement layer (EL) can reuse the motion information in the base layer (BL) to reduce the inter-layer motion data redundancy. For example, the EL macroblock coding may use a flag, such as base_mode_flag before mb_type is determined to indicate whether the EL motion information is directly derived from the base layer (BL). If base_mode_flag is equal to 1, the partitioning data of the EL macroblock together with the associated reference indexes and motion vectors are derived from the corresponding data of the collocated 8×8 block in the BL. The reference picture index of the BL is directly used in EL. The motion vectors of EL are scaled from the data associated with the BL. Besides, the scaled BL motion vector can be used as an additional motion vector predictor for the EL.
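The derivation of EL motion data from the BL when base_mode_flag is equal to 1 can be sketched as below. The helper names (round_div, scale_motion_vector, derive_el_motion) are hypothetical, and the rounding rule is an assumption chosen for illustration rather than the normative SVC derivation. The reference index is passed through unchanged and the motion vector is scaled by the resolution ratio between the layers.

```python
def round_div(num, den):
    """Integer division rounded to nearest, ties away from zero (den > 0)."""
    sign = -1 if num < 0 else 1
    return sign * ((abs(num) * 2 + den) // (2 * den))


def scale_motion_vector(bl_mv, bl_size, el_size):
    """Scale a BL motion vector (mvx, mvy) by the EL/BL resolution ratio."""
    bl_w, bl_h = bl_size
    el_w, el_h = el_size
    mvx, mvy = bl_mv
    return (round_div(mvx * el_w, bl_w), round_div(mvy * el_h, bl_h))


def derive_el_motion(base_mode_flag, bl_ref_idx, bl_mv, bl_size, el_size):
    """Return (ref_idx, mv) inherited from the BL, or None if not inherited."""
    if base_mode_flag != 1:
        return None  # EL motion is signalled explicitly (not modelled here)
    # The BL reference index is reused directly; only the vector is scaled.
    return bl_ref_idx, scale_motion_vector(bl_mv, bl_size, el_size)


# Dyadic case: 960x540 base layer, 1920x1080 enhancement layer.
dyadic = derive_el_motion(1, 0, (-6, 3), (960, 540), (1920, 1080))

# Non-dyadic (ESS-style) case: 1280x720 base layer, ratio 1.5.
ess = derive_el_motion(1, 2, (5, -3), (1280, 720), (1920, 1080))
```

In the dyadic case the vector components simply double; in the 1.5× case the scaled components fall on half values and are rounded by the assumed rule.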
Inter-layer residual prediction uses the up-sampled BL residual information to reduce the information of EL residuals. The collocated residual of BL can be block-wise up-sampled using a bilinear filter and can be used as prediction for the residual of a current macroblock in the EL. The up-sampling of the reference layer residual is done on a transform block basis in order to ensure that no filtering is applied across transform block boundaries.
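The restriction that no filtering crosses transform block boundaries can be illustrated with a one-dimensional sketch. The function names, the half-sample phase alignment, and the rounding are all assumptions made for this example; only the block-wise bilinear up-sampling with clamping at block edges reflects the description above.

```python
def upsample_block_2x(block):
    """Bilinear 2x up-sampling of one 1-D residual block, edges clamped."""
    n = len(block)
    out = []
    for i in range(2 * n):
        # Source position in quarter-sample units (assumed phase alignment),
        # clamped so that only samples inside this block are referenced.
        pos4 = min(max(2 * i - 1, 0), 4 * (n - 1))
        idx, frac = pos4 // 4, pos4 % 4
        nxt = min(idx + 1, n - 1)
        out.append(((4 - frac) * block[idx] + frac * block[nxt] + 2) >> 2)
    return out


def upsample_residual_row(row, block_size):
    """Up-sample a residual row one transform block at a time."""
    out = []
    for start in range(0, len(row), block_size):
        out.extend(upsample_block_2x(row[start:start + block_size]))
    return out


# Two transform blocks of width 2: [8, 0] and [-4, 4].
bl_residual = [8, 0, -4, 4]
el_prediction = upsample_residual_row(bl_residual, block_size=2)
```

Because each block is interpolated independently, the samples 0 and -4 on either side of the block boundary are never blended with each other.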
Similar to inter-layer residual prediction, the inter-layer Intra prediction reduces the redundant texture information of the EL. The prediction in the EL is generated by block-wise up-sampling the collocated BL reconstruction signal. In the inter-layer Intra prediction up-sampling procedure, 4-tap and 2-tap FIR filters are applied for luma and chroma components, respectively. Different from inter-layer residual prediction, filtering for the inter-layer Intra prediction is always performed across sub-block boundaries. For decoding simplicity, inter-layer Intra prediction can be restricted to only Intra-coded macroblocks in the BL.
In the emerging High Efficiency Video Coding (HEVC) standard, Intra prediction has more modes to use. For example, 35 Intra modes (mode 0 to mode 34) are used for 8×8, 16×16, 32×32 and 64×64 prediction units (PUs) of the luma component. The 35 Intra prediction modes include DC mode, Planar mode, and 33 directional prediction modes. For the chroma component, a new Intra prediction mode, called LM mode, is used. The rationale of LM mode is that there is usually some correlation between the luma component and the chroma component. Accordingly, the LM mode uses a reconstructed Intra-coded luma block to form the prediction for a chroma block. Furthermore, previously reconstructed surrounding pixels of a collocated luma block and previously reconstructed surrounding pixels of a current chroma block are used to derive parameters of the LM Intra prediction based on a least-squares criterion. The LM mode predictor for the chroma pixel at pixel location (x, y) is denoted as $\mathrm{Pred}_C(x,y)$. The collocated reconstructed luma pixel is denoted as $\mathrm{Rec}'_L(x,y)$. The relationship between $\mathrm{Pred}_C(x,y)$ and $\mathrm{Rec}'_L(x,y)$ is described by equation (1):

$$\mathrm{Pred}_C(x,y) = \alpha \cdot \mathrm{Rec}'_L(x,y) + \beta, \tag{1}$$

where α and β are the parameters of the LM Intra prediction and can be derived by using a least-squares method.
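The least-squares derivation of α and β and the application of equation (1) can be sketched as follows. This is an illustrative floating-point version using the standard closed-form solution of a linear least-squares fit; the function names, the fallback for flat luma neighbors, and the clipping to an 8-bit range are assumptions for the example and not part of the description above. A practical codec would typically use integer arithmetic.

```python
def derive_lm_parameters(luma_neighbors, chroma_neighbors):
    """Fit chroma ~ alpha * luma + beta over neighboring reconstructed pixels."""
    n = len(luma_neighbors)
    if n == 0 or n != len(chroma_neighbors):
        raise ValueError("need equally sized, non-empty neighbor sets")
    sum_l = sum(luma_neighbors)
    sum_c = sum(chroma_neighbors)
    sum_ll = sum(l * l for l in luma_neighbors)
    sum_lc = sum(l * c for l, c in zip(luma_neighbors, chroma_neighbors))
    denom = n * sum_ll - sum_l * sum_l
    if denom == 0:
        # Flat luma neighborhood: fall back to the mean chroma (assumption).
        return 0.0, sum_c / n
    alpha = (n * sum_lc - sum_l * sum_c) / denom
    beta = (sum_c - alpha * sum_l) / n
    return alpha, beta


def predict_chroma(rec_luma_block, alpha, beta, bit_depth=8):
    """Apply equation (1) per pixel and clip to the valid sample range."""
    max_val = (1 << bit_depth) - 1
    return [
        [min(max_val, max(0, round(alpha * v + beta))) for v in row]
        for row in rec_luma_block
    ]


# Toy neighbors that follow chroma = 0.5 * luma + 10 exactly.
luma_nb = [100, 120, 140, 160]
chroma_nb = [60, 70, 80, 90]
alpha, beta = derive_lm_parameters(luma_nb, chroma_nb)

# Collocated reconstructed luma samples (already at chroma resolution).
pred = predict_chroma([[110, 130], [150, 170]], alpha, beta)
```

With these toy neighbors the fit recovers α = 0.5 and β = 10, so each predicted chroma sample is half the collocated luma value plus 10.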
FIGS. 2A-B illustrate an example of the LM prediction process. First, the neighboring reconstructed pixels 212 of a collocated luma block 210 in FIG. 2A and the neighboring reconstructed pixels 222 of a chroma block 220 in FIG. 2B are used to evaluate the correlation between the blocks. The parameters of the LM Intra prediction are derived accordingly. Then, the predicted pixels of the chroma block are generated from the reconstructed pixels of the luma block (i.e., a luma prediction unit or luma PU) using the derived parameters. In the parameters derivation, the first above reconstructed pixel row and the second left reconstructed pixel column of the current luma block are used. The specific row and column of the luma block are used in order to match the 4:2:0 sampling format of the chroma components. The following illustration is based on 4:2:0 sampling format. LM-mode chroma Intra prediction for other sampling formats can be derived similarly. The collocated boundary pixels 212 of the luma PU 210, as shown in FIG. 2A, are collected and boundary pixels 222 of the chroma PU 220, as shown in FIG. 2B, are also collected. By using a least-squares method, the linear relationship between these two sets can be derived. In other words, α and β can be derived based on luma and chroma boundary pixels. Besides, two short-tap sampling filters, [1,2,1] and [1,1], may be applied to those sampled horizontal and vertical luma boundary pixels, respectively.
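The two short-tap filters applied to the luma boundary pixels for the 4:2:0 format can be sketched as below. The sample alignment, the edge clamping, and the rounding offsets are assumptions for illustration, and the function names are hypothetical; only the filter taps [1,2,1] for the horizontal boundary and [1,1] for the vertical boundary come from the description above.

```python
def downsample_above_row(luma_row):
    """[1,2,1] filter centred on even luma positions; edges are clamped."""
    last = len(luma_row) - 1
    out = []
    for i in range(0, len(luma_row), 2):
        left = luma_row[max(i - 1, 0)]
        right = luma_row[min(i + 1, last)]
        out.append((left + 2 * luma_row[i] + right + 2) >> 2)
    return out


def downsample_left_column(luma_col):
    """[1,1] filter averaging each vertical pair of luma samples."""
    return [
        (luma_col[j] + luma_col[j + 1] + 1) >> 1
        for j in range(0, len(luma_col) - 1, 2)
    ]


# Boundary pixels of an 8x8 luma block collocated with a 4x4 chroma block.
above = [10, 20, 30, 40, 50, 60, 70, 80]
left = [12, 14, 20, 22, 40, 44, 90, 91]

above_ds = downsample_above_row(above)   # 4 samples, matches chroma width
left_ds = downsample_left_column(left)   # 4 samples, matches chroma height
```

The resulting down-sampled luma boundary samples line up one-to-one with the chroma boundary pixels, so the two sets can be fed into the least-squares fit for α and β.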
Conventional Intra prediction does not address Intra prediction across layers. It is desirable to extend Intra prediction to scalable video coding to improve the coding performance.
Three-dimensional (3D) television has been a technology trend in recent years that aims to bring viewers a sensational viewing experience. Various technologies have been developed to enable 3D viewing. Among them, multi-view video is a key technology for 3DTV applications. Traditional video is a two-dimensional (2D) medium that provides viewers with only a single view of a scene from the perspective of the camera. Multi-view video, however, is capable of offering arbitrary viewpoints of dynamic scenes and provides viewers with a sensation of realism.
The multi-view video is typically created by capturing a scene using multiple cameras simultaneously, where the multiple cameras are properly located so that each camera captures the scene from one viewpoint. Accordingly, the multiple cameras will capture multiple video sequences corresponding to multiple views. In order to provide more views, more cameras have been used to generate multi-view video with a large number of video sequences associated with the views. Accordingly, the multi-view video will require a large storage space to store and/or a high bandwidth to transmit. Therefore, multi-view video coding techniques have been developed in the field to reduce the required storage space or the transmission bandwidth.
A straightforward approach may be to simply apply conventional video coding techniques to each single-view video sequence independently and disregard any correlation among different views. Such a coding system would be very inefficient. To improve the efficiency of multi-view video coding, typical multi-view video coding exploits inter-view redundancy. Therefore, most 3D Video Coding (3DVC) systems take into account the correlation of video data associated with multiple views and depth maps.
FIG. 3 illustrates a generic prediction structure for 3D video coding. The incoming 3D video data consists of images (310-0, 310-1, 310-2, . . . ) corresponding to multiple views. The images collected for each view form an image sequence for the corresponding view. Usually, the image sequence 310-0 corresponding to a base view (also called an independent view) is coded independently by a video coder 330-0 conforming to a video coding standard such as H.264/AVC or HEVC (High Efficiency Video Coding). The video coders (330-1, 330-2, . . . ) for image sequences associated with dependent views (i.e., views 1, 2, . . . ), which are also called enhancement views, further utilize inter-view prediction in addition to temporal prediction. The inter-view predictions are indicated by the short-dashed lines in FIG. 3.
In order to support interactive applications, depth maps (320-0, 320-1, 320-2, . . . ) associated with a scene at respective views are also included in the video bitstream. In order to reduce data associated with the depth maps, the depth maps are compressed using depth map coder (340-0, 340-1, 340-2 . . . ) and the compressed depth map data is included in the bit stream as shown in FIG. 3. A multiplexer 350 is used to combine compressed data from image coders and depth map coders. The depth information can be used for synthesizing virtual views at selected intermediate viewpoints.
As shown in FIG. 3, the conventional 3D video coding system does not take inter-view correlation into consideration for Intra prediction. It is desirable to extend Intra prediction to 3D video coding to improve the coding performance.