Compressed digital video has been widely used in various applications such as video streaming over digital networks and video transmission over digital channels. Very often, a single video content may be delivered over networks with different characteristics. For example, a live sport event may be carried in a high-bandwidth streaming format over broadband networks for premium video service. In such applications, the compressed video usually preserves high resolution and high quality so that the video content is suited for high-definition devices such as an HDTV or a high-resolution LCD display. The same content may also be carried through a cellular data network so that the content can be watched on a portable device such as a smart phone or a network-connected portable media device. In such applications, due to network bandwidth concerns as well as the typically low-resolution display of the smart phone or portable device, the video content is usually compressed to a lower resolution and a lower bitrate. Therefore, the video resolution and video quality requirements differ considerably across network environments and applications. Even for the same type of network, users may experience different available bandwidths due to different network infrastructure and network traffic conditions. Therefore, a user may desire to receive the video at higher quality when the available bandwidth is high and to receive a lower-quality, but smooth, video when network congestion occurs. In another scenario, a high-end media player can handle high-resolution and high-bitrate compressed video, while a low-cost media player is only capable of handling low-resolution and low-bitrate compressed video due to limited computational resources. Accordingly, it is desirable to construct the compressed video in a scalable manner so that videos at different spatial-temporal resolutions and/or quality levels can be derived from the same compressed bitstream.
The Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG standardized a Scalable Video Coding (SVC) extension of the H.264/AVC standard. An H.264/AVC SVC bitstream can contain video information ranging from low frame rate, low resolution, and low quality to high frame rate, high definition, and high quality. This single bitstream can be adapted to various applications and displayed on devices with different configurations. Accordingly, H.264/AVC SVC is suitable for various video applications such as video broadcasting, video streaming, and video surveillance, where it can adapt to network infrastructure, traffic conditions, user preferences, etc.
In SVC, three types of scalability, i.e., temporal scalability, spatial scalability, and quality scalability, are provided. SVC uses a multi-layer coding structure to realize the three dimensions of scalability. A main goal of SVC is to generate one scalable bitstream that can be easily and rapidly adapted to the bit-rate requirements associated with various transmission channels, diverse display capabilities, and different computational resources without trans-coding or re-encoding. An important feature of the SVC design is that the scalability is provided at the bitstream level. In other words, bitstreams for deriving video with a reduced spatial and/or temporal resolution can be obtained simply by extracting from a scalable bitstream the Network Abstraction Layer (NAL) units (or network packets) that are required for decoding the intended video. NAL units for quality refinement can additionally be truncated in order to reduce the bit-rate and the associated video quality. In SVC, temporal scalability is provided by using the hierarchical B-picture coding structure. SNR scalability is realized by coding higher-quality Enhancement Layers (ELs), which comprise refinement coefficients.
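The bitstream-level extraction described above can be sketched as a simple filter over NAL units. This is a minimal illustration, not the normative extraction process: the NalUnit class and the extract_substream function are hypothetical names, and the three identifiers are assumed to be the dependency_id, temporal_id, and quality_id fields carried in the SVC NAL unit header extension. A real extractor must also honor inter-layer dependencies that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnit:
    """Hypothetical container for the layer identifiers of one NAL unit."""
    dependency_id: int   # spatial layer
    temporal_id: int     # temporal layer
    quality_id: int      # quality (SNR) layer
    payload: bytes = b""


def extract_substream(nal_units, max_dependency_id, max_temporal_id, max_quality_id):
    """Keep only the NAL units at or below the requested operating point."""
    return [
        nal for nal in nal_units
        if nal.dependency_id <= max_dependency_id
        and nal.temporal_id <= max_temporal_id
        and nal.quality_id <= max_quality_id
    ]
```

Dropping units this way requires no trans-coding: the remaining units are passed to the decoder unchanged.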
In SVC, spatial scalability is supported based on the pyramid coding scheme as shown in FIG. 1. In an SVC system with spatial scalability, the video sequence is first down-sampled to obtain smaller pictures at different spatial resolutions (layers). For example, picture 110 at the original resolution can be processed by spatial decimation 120 to obtain resolution-reduced picture 111. The resolution-reduced picture 111 can be further processed by spatial decimation 121 to obtain further resolution-reduced picture 112 as shown in FIG. 1. In addition to dyadic spatial resolution, where the spatial resolution is reduced to half in each level, SVC also supports arbitrary resolution ratios, which is called extended spatial scalability (ESS). The SVC system in FIG. 1 illustrates an example of a spatially scalable system with three layers, where layer 0 corresponds to the pictures with the lowest spatial resolution and layer 2 corresponds to the pictures with the highest resolution. The layer-0 pictures are coded without reference to other layers, i.e., single-layer coding. For example, the lowest layer picture 112 is coded using motion-compensated and Intra prediction 130.
The motion-compensated and Intra prediction 130 will generate syntax elements as well as coding-related information such as motion information for further entropy coding 140. FIG. 1 actually illustrates a combined SVC system that provides spatial scalability as well as quality scalability (also called SNR (Signal to Noise Ratio) scalability). The system may also provide temporal scalability, which is not explicitly shown. For each single-layer coding, the residual coding errors can be refined using SNR enhancement layer coding 150. The SNR enhancement layer in FIG. 1 may provide multiple quality levels (quality scalability). Each supported resolution layer can be coded by respective single-layer motion-compensated and Intra prediction like a non-scalable coding system. Each higher spatial layer may also be coded using inter-layer coding based on one or more lower spatial layers. For example, layer 1 video can be adaptively coded using either inter-layer prediction based on layer 0 video or single-layer coding, on a macroblock-by-macroblock basis or per other block unit. Similarly, layer 2 video can be adaptively coded using either inter-layer prediction based on reconstructed layer 1 video or single-layer coding. As shown in FIG. 1, layer-1 pictures 111 can be coded by motion-compensated and Intra prediction 131, base layer entropy coding 141 and SNR enhancement layer coding 151. As shown in FIG. 1, the reconstructed base layer (BL) video data is also utilized by motion-compensated and Intra prediction 131, where a coding block in spatial layer 1 may use the reconstructed BL video data as additional Intra prediction data (i.e., no motion compensation is involved). Similarly, layer-2 pictures 110 can be coded by motion-compensated and Intra prediction 132, base layer entropy coding 142 and SNR enhancement layer coding 152. The BL bitstreams and SNR enhancement layer bitstreams from all spatial layers are multiplexed by multiplexer 160 to generate a scalable bitstream.
The coding efficiency can be improved due to inter-layer coding. Furthermore, the information required to code spatial layer 1 may depend on reconstructed layer 0 (inter-layer prediction). A higher layer in an SVC system is referred to as an enhancement layer. H.264 SVC provides three types of inter-layer prediction tools: inter-layer motion prediction, inter-layer texture prediction (also called inter-layer Intra prediction), and inter-layer residual prediction.
In SVC, the enhancement layer (EL) can reuse the motion information in the base layer (BL) to reduce the inter-layer motion data redundancy. For example, the EL macroblock coding may use a flag, such as base_mode_flag, before mb_type is determined to indicate whether the EL motion information is directly derived from the BL. If base_mode_flag is equal to 1, the partitioning data of the EL macroblock along with the associated reference indexes and motion vectors are derived from the corresponding data of the collocated 8×8 block in the BL. The reference picture index of the BL is directly used in the EL. The motion vectors of the EL are scaled from the data associated with the BL. In addition, the scaled BL motion vector can be used as an additional motion vector predictor for the EL.
Inter-layer residual prediction uses the up-sampled BL residual information to reduce the information required for coding the EL residuals. The collocated residual of the BL can be block-wise up-sampled using a bilinear filter and can be used as prediction for the residual of a corresponding macroblock in the EL. The up-sampling of the reference layer residual is done on a transform-block basis in order to ensure that no filtering is applied across transform block boundaries.
The inter-layer texture prediction reduces the redundant texture information of the EL. The prediction in the EL is generated by block-wise up-sampling the collocated BL reconstruction signal. In the inter-layer texture prediction up-sampling procedure, 4-tap and 2-tap FIR filters are applied for luma and chroma components, respectively. Unlike inter-layer residual prediction, filtering for the inter-layer Intra prediction is always performed across sub-block boundaries. For decoding simplicity, inter-layer Intra prediction can be applied only to the intra-coded macroblocks in the BL.
In SVC, the motion information of a block in the EL may use the motion information within the corresponding block in the BL. For example, the motion information associated with locations a-h in the collocated block in the BL as shown in FIG. 2 can be used to derive inter-layer prediction. In FIG. 2, block 210 corresponds to a current block in the EL and block 220 is the corresponding block in the BL. The motion information at a, b, g, and h in the BL corresponds to the motion information at A, B, G, and H in the EL. The motion information at c, d, e, and f in the BL corresponds to the motion information at C, D, E, and F in the EL. Locations A, B, G, and H are the four corner pixels of the current block in the EL and locations C, D, E, and F are the four center pixels of the current block in the EL.
Not only the motion information of the corresponding block in the BL, but also the motion information of neighboring blocks of the corresponding block in the BL can be utilized as inter-layer candidates for the EL to include in the Merge/AMVP candidate list. As shown in FIG. 2, the neighboring candidates in the BL, including the t (bottom-right), a0 (bottom-left), a1 (left), b0 (upper-right), b1 (top), and b2 (upper-left) neighboring BL blocks, can be used as candidates for the EL to include in the Merge/AMVP candidate derivation. The collocated EL neighboring blocks correspond to the T (bottom-right), A0 (bottom-left), A1 (left), B0 (upper-right), B1 (top), and B2 (upper-left) neighboring EL blocks, respectively.
High-Efficiency Video Coding (HEVC) is a new international video coding standard being developed by the Joint Collaborative Team on Video Coding (JCT-VC). The scalable extension to HEVC (i.e., SHVC) is also being developed. In HEVC, motion information of neighboring blocks in the spatial and temporal domains is used to derive the Merge and MVP (motion vector prediction) candidates. The motion information includes Inter prediction direction (inter_pred_idc), reference indexes (refIdx), motion vectors (MVs), motion vector predictors (MVPs), MVP indexes, Merge indexes, Merge candidates, etc. In the derivation process for the spatial MVPs, the MVP can be derived from the MV pointing to the same reference picture as the target reference picture, or from the MV pointing to a different reference picture. When the MVP is derived from an MV pointing to a different reference picture, the MV is scaled to the target reference picture and used as the final MVP. In the derivation process for the spatial and temporal MVPs, a division is required to scale the motion vector. The scaling factor is calculated based on the ratio of the distance between the current picture and the target reference picture to the distance between the collocated picture and the reference picture for the collocated block. In the MV scaling process, the scaling factor is defined by equation (1):
    ScalingFactor = (POCcurr − POCref)/(POCcol − POCcol_ref) = tb/td,  (1)
where td is the POC (picture order count) distance between the collocated picture and the reference picture pointed to by the MV of the collocated block, and tb is the POC distance between the current picture and the target reference picture. The scaling factor for spatial MVP derivation is derived similarly. In HEVC, the scaling factor is calculated as follows:
    X = (2^14 + |td/2|)/td, and  (2)
    ScalingFactor = clip(−4096, 4095, (tb×X + 32) >> 6).  (3)
The scaled MV is then derived as follows:
    ScaledMV = sign(ScalingFactor×MV) × ((abs(ScalingFactor×MV) + 127) >> 8).  (4)
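The fixed-point computation in equations (2) to (4) can be sketched as follows. This is an illustration under stated assumptions rather than reference decoder code: the function names are hypothetical, the division is assumed to truncate toward zero as in C (so |td/2| equals |td|/2), and the right shift in equation (4) is applied to the magnitude before the sign is restored.

```python
def div_trunc(a: int, b: int) -> int:
    """Integer division truncating toward zero (C semantics, assumed here)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def clip(lo: int, hi: int, v: int) -> int:
    """Clamp v to the inclusive range [lo, hi]."""
    return max(lo, min(hi, v))


def scaling_factor(tb: int, td: int) -> int:
    """Equations (2) and (3): tb/td as a fixed-point value with 8 fractional bits."""
    x = div_trunc(2 ** 14 + abs(td) // 2, td)          # equation (2)
    return clip(-4096, 4095, (tb * x + 32) >> 6)       # equation (3)


def scale_mv(mv: int, tb: int, td: int) -> int:
    """Equation (4): scale one MV component and round its magnitude."""
    prod = scaling_factor(tb, td) * mv
    sign = -1 if prod < 0 else 1
    return sign * ((abs(prod) + 127) >> 8)
```

For example, with tb = 1 and td = 2 the factor is 128 (0.5 with 8 fractional bits), so scale_mv(10, 1, 2) returns 5. The only true division is the one in equation (2); the rest is multiplication, addition, and shifting.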
In SHVC Test Model 1.0 (SHM-1.0), the inter-layer texture prediction can be implemented in two schemes. The first scheme uses CU-level signaling to indicate whether the predictor of the CU is from the up-sampled BL texture or not, where Intra_BL mode is used for signaling the selection. The second scheme incorporates the up-sampled BL texture into the reference frame list. In other words, the reference picture associated with the up-sampled BL texture is assigned a reference picture index, i.e., RefIdx. This scheme is referred to as RefIdx mode. Motion information associated with a reference picture is also stored and used for Inter prediction. Accordingly, for the up-scaled BL reference, the associated MVs have to be up-scaled as well. RefIdx mode has the least impact on the existing HEVC syntax.
In SHM-1.0 Intra_BL mode, the center MV of the corresponding block in the BL (i.e., the MV at position “f” in FIG. 2) is scaled and set to the first Merge candidate in the EL Merge candidate list as an inter-layer Merge candidate. The MV scaling process for the inter-layer Merge candidate is different from the MV scaling process in HEVC. In SHM-1.0, the base layer (BL) MV is scaled based on the ratio of video resolution between the enhancement layer (EL) and the BL. The scaled MV is derived as follows:
    mvEL_X = (mvBL_X×picEL_W + (picBL_W/2 − 1)×sign(mvBL_X))/picBL_W, and  (5)
    mvEL_Y = (mvBL_Y×picEL_H + (picBL_H/2 − 1)×sign(mvBL_Y))/picBL_H,  (6)
where (mvEL_X, mvEL_Y) is the scaled MV in the EL, (mvBL_X, mvBL_Y) is the center MV of the corresponding block in the BL, picEL_W and picEL_H are the picture width and height of the EL picture, and picBL_W and picBL_H are the picture width and height of the BL picture.
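Equations (5) and (6) can be sketched as below. The function names are hypothetical, and the division is assumed to truncate toward zero as in C, which Python's // operator does not do for negative operands, hence the helper.

```python
def div_trunc(a: int, b: int) -> int:
    """Integer division truncating toward zero (C semantics, assumed here)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def sign(v: int) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def scale_component(mv_bl: int, size_el: int, size_bl: int) -> int:
    """One component of equations (5)/(6): scale by size_el/size_bl with rounding."""
    rounding = (size_bl // 2 - 1) * sign(mv_bl)
    return div_trunc(mv_bl * size_el + rounding, size_bl)


def scale_bl_mv(mv_bl, pic_el, pic_bl):
    """Scale a BL motion vector (x, y) to the EL; pictures are (width, height)."""
    return (
        scale_component(mv_bl[0], pic_el[0], pic_bl[0]),   # equation (5)
        scale_component(mv_bl[1], pic_el[1], pic_bl[1]),   # equation (6)
    )
```

For 2× scalability (BL 960×540, EL 1920×1080), scale_bl_mv((3, -5), (1920, 1080), (960, 540)) returns (6, -10). Each call performs two divisions by the BL picture dimensions, one per component.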
In SHM-1.0, for a pixel in the EL with the pixel position equal to (xEL, yEL), the pixel position mapping used to derive the reference pixel in the BL of the corresponding EL pixel can be illustrated as follows:
    xBL = (xEL×picBL_W + picEL_W/2)/picEL_W, and  (7)
    yBL = (yEL×picBL_H + picEL_H/2)/picEL_H,  (8)
where (xBL, yBL) is the pixel position of the reference pixel in the BL, picEL_W and picEL_H are the picture width and height of the EL picture, and picBL_W and picBL_H are the picture width and height of the BL picture.
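A minimal sketch of equations (7) and (8) follows. The function name is hypothetical, and pixel positions are assumed to be non-negative, so Python's floor division matches the integer division in the equations.

```python
def map_el_to_bl(x_el: int, y_el: int, pic_el, pic_bl):
    """Map an EL pixel position to its BL reference position.

    pic_el and pic_bl are (width, height) tuples; positions are assumed
    to be non-negative integers.
    """
    el_w, el_h = pic_el
    bl_w, bl_h = pic_bl
    x_bl = (x_el * bl_w + el_w // 2) // el_w    # equation (7)
    y_bl = (y_el * bl_h + el_h // 2) // el_h    # equation (8)
    return x_bl, y_bl
```

With a 1920×1080 EL and a 960×540 BL, map_el_to_bl(100, 50, (1920, 1080), (960, 540)) returns (50, 25). As with the MV scaling, every mapped position costs two divisions.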
In SHM-1.0 texture up-sampling, the BL pixel position of the reference pixel in the BL is output in units of 1/16 sample. The derivation of the BL pixel position in units of 1/16 sample is illustrated as follows.
The variable xBL16 is derived as follows:
    xBL16 = (xEL×picBL_W×16 + picEL_W/2)/picEL_W.
The variable yBL16 is derived as follows:
    If cIdx is equal to 0, the variable yBL16 is derived as follows:
        yBL16 = (yEL×picBL_H×16 + picEL_H/2)/picEL_H,
    otherwise, the variable yBL16 is derived as follows:
        yBL16 = (yEL×picBL_H×16 + picEL_H/2)/picEL_H − offset,
where cIdx is the color component index, and offset is derived as follows:
    if (picEL_H == picBL_H)
        offset = 0;
    otherwise if (picEL_H == 1.5*picBL_H)
        offset = 1; and
    otherwise if (picEL_H == 2.0*picBL_H)
        offset = 2.
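The 1/16-sample derivation above can be sketched as follows. The function names are hypothetical, positions are assumed to be non-negative, and the 1.5× test is written as an integer comparison to avoid floating point. The text defines the offset only for the 1×, 1.5×, and 2× ratios, so raising an error for any other ratio is an assumption of this sketch.

```python
def chroma_offset(el_h: int, bl_h: int) -> int:
    """Vertical offset applied to non-luma components, per the three listed ratios."""
    if el_h == bl_h:
        return 0
    if 2 * el_h == 3 * bl_h:       # picEL_H == 1.5 * picBL_H
        return 1
    if el_h == 2 * bl_h:           # picEL_H == 2.0 * picBL_H
        return 2
    raise ValueError("offset is not defined for this ratio (assumption of this sketch)")


def bl_position_16(x_el: int, y_el: int, pic_el, pic_bl, c_idx: int = 0):
    """Return (xBL16, yBL16), the BL position in units of 1/16 sample."""
    el_w, el_h = pic_el
    bl_w, bl_h = pic_bl
    x_bl16 = (x_el * bl_w * 16 + el_w // 2) // el_w
    y_bl16 = (y_el * bl_h * 16 + el_h // 2) // el_h
    if c_idx != 0:                 # c_idx == 0 is taken to mean luma
        y_bl16 -= chroma_offset(el_h, bl_h)
    return x_bl16, y_bl16
```

The integer sample position and the 1/16 fractional phase can then be separated as value >> 4 and value & 15. For instance, an xBL16 of 808 corresponds to sample 50 with phase 8.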
In the RefIdx mode as supported by SHVC Test Model 1.0 (SHM-1.0), the decoded BL picture is up-sampled and incorporated into the long-term reference picture list as the inter-layer reference picture. For this inter-layer reference picture, not only is the texture up-sampled from the BL picture, but the motion field is also up-sampled and mapped from the BL picture according to the spatial ratio of the EL and the BL. FIG. 3 shows an example of the motion field mapping with 1.5× spatial scalability. In this example, four smallest PUs (SPUs) in the BL (i.e., b0-b3) are mapped into nine SPUs in the EL (i.e., e0-e8). The motion fields of the nine SPUs in the EL can be derived from the motion fields of the BL.
To reduce the size of the motion data buffer, the motion field in the EL is compressed with the unit size of 16×16 block after the motion mapping. In SHM-1.0, the center motion vector (as indicated by C) of a 16×16 block is used to represent the motion vector of this 16×16 block after compression, as shown in FIG. 4.
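The compression step can be sketched as below. Several details are assumptions for illustration only: the motion field is stored as a 2-D list with one entry per 4×4 unit, and the "center" of a 16×16 block is taken to be the unit containing pixel offset (8, 8) within the block. The exact position marked C in FIG. 4 may differ, and the function name is hypothetical.

```python
def compress_motion_field(field, unit=4, block=16):
    """Replace every MV in each block x block region by one representative MV.

    field is a 2-D list indexed [row][col], one entry per unit x unit area.
    The representative is the entry covering the block's center pixel
    (an assumed reading of "center motion vector").
    """
    per_block = block // unit            # units along one side of a block
    centre = (block // 2) // unit        # unit index containing pixel offset 8
    rows, cols = len(field), len(field[0])
    out = [row[:] for row in field]
    for by in range(0, rows, per_block):
        for bx in range(0, cols, per_block):
            # Clamp so partial blocks at the picture edge still find a unit.
            cy = min(by + centre, rows - 1)
            cx = min(bx + centre, cols - 1)
            representative = field[cy][cx]
            for y in range(by, min(by + per_block, rows)):
                for x in range(bx, min(bx + per_block, cols)):
                    out[y][x] = representative
    return out
```

After compression only one motion vector per 16×16 block needs to be stored, which is the source of the buffer reduction.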
As shown in equations (5) and (6), the motion vector scaling involves quite a few operations for each motion vector. Among them, the division operation is the most time-consuming and the most complex. The same is true of the inter-layer pixel position mapping shown in equations (7) and (8). Therefore, it is desirable to develop methods to simplify motion vector scaling and pixel position mapping for inter-layer coding.