In advanced video coding, such as High Efficiency Video Coding (HEVC), temporal motion parameters (e.g., motion vectors (MVs), reference indices, and prediction modes) are used for MV prediction. Therefore, the motion parameters from previous pictures need to be stored in a motion parameter buffer. However, the size of the motion parameter buffer may become quite significant since the granularity of motion representation can be as small as 4×4. Furthermore, two motion vectors per prediction unit (PU) need to be stored for B-slices (bi-predicted slices). As the picture size continues to grow, the memory issue becomes even worse, since not only do more motion vectors need to be stored, but more bits are also needed to represent each motion vector. For example, the estimated storage for MVs is approximately 26 Mbits/picture for video with a picture size of 4k×2k; the actual size depends on the MV precision and the maximum MV range supported.
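The rough figure above can be sanity-checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the 4096×2048 picture size, 13 bits per MV component, two MVs per 4×4 block, and the convention 1 Mbit = 2^20 bits are assumptions chosen to reproduce the quoted estimate, not values mandated by HEVC.

```python
def mv_storage_mbits(width, height, bits_per_component=13,
                     mvs_per_block=2, block=4):
    """Estimate per-picture MV storage in Mbits (1 Mbit = 2**20 bits).

    Assumes two MVs per 4x4 block (B-slice) and two components (x, y)
    per MV; the 13-bit component width is an illustrative assumption.
    """
    num_blocks = (width // block) * (height // block)
    total_bits = num_blocks * mvs_per_block * 2 * bits_per_component
    return total_bits / 2**20

print(mv_storage_mbits(4096, 2048))  # → 26.0
```

Under these assumptions, a 4k×2k picture holds 1024×512 = 524,288 4×4 blocks, which yields the approximately 26 Mbits/picture noted above.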
In order to reduce the size of the motion parameter buffer, a motion parameter compression technique is used in HEVC-based systems, which stores the coded motion information from previous pictures at a lower spatial resolution. The technique uses decimation to reduce the number of motion vectors to be stored, so that each stored motion vector is associated with a larger granularity than 4×4. The compression process replaces the coded motion vector buffer with a reduced buffer that stores motion vectors corresponding to the lower spatial resolution (i.e., larger granularity). Each compressed vector is obtained by component-wise decimation.
In HEVC, motion information compression is achieved using a decimation method as shown in FIG. 1, where each small square block consists of 4×4 pixels. In this example, the motion information compression is performed for each region consisting of 16×16 pixels (as indicated by a thick box). A representative block, indicated by a shaded area, is selected, and all the blocks within each 16×16 region share the motion vectors, reference picture indices and prediction mode of the representative block. In FIG. 1, the top-left 4×4 block is used as the representative block for the whole 16×16 region. In other words, all 16 blocks share the same motion information. Accordingly, a 16:1 motion information compression is achieved in this example.
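The decimation described above can be sketched as follows (illustrative Python, not the normative HEVC process; function names are hypothetical): the motion field is stored at 16×16 granularity by keeping only the top-left 4×4 block of each region, and every lookup inside a region returns the representative block's motion information.

```python
def compress_motion_field(mv_grid, ratio=4):
    """Decimate a per-4x4-block motion field: keep the top-left block of
    every ratio-by-ratio group (a 16x16-pixel region for ratio=4),
    giving a 16:1 reduction in stored entries."""
    return [row[::ratio] for row in mv_grid[::ratio]]

def lookup_mv(compressed, blk_x, blk_y, ratio=4):
    """All 4x4 blocks within a 16x16 region share the representative MV."""
    return compressed[blk_y // ratio][blk_x // ratio]
```

For example, an 8×8 grid of 4×4 blocks (a 32×32-pixel area) compresses to 4 stored entries instead of 64, and any 4×4 block address maps back to its region's representative entry.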
Three-dimensional (3D) video coding has been developed for encoding/decoding video of multiple views simultaneously captured by cameras corresponding to different views. Since all cameras capture the same scene from different viewpoints, a multi-view video contains a large amount of inter-view redundancy. In order to share the previously encoded texture information of adjacent views, disparity-compensated prediction (DCP) has been added as an alternative to motion-compensated prediction (MCP). MCP refers to inter-picture prediction that uses previously coded pictures of the same view, while DCP refers to inter-picture prediction that uses previously coded pictures of other views in the same access unit. FIG. 2 illustrates an example of a 3D video coding system incorporating MCP and DCP. The vector (210) used for DCP is termed the disparity vector (DV), which is analogous to the motion vector (MV) used in MCP. FIG. 2 also illustrates an example of three MVs (220, 230 and 240) associated with MCP. Furthermore, the DV of a DCP block can also be predicted by a disparity vector predictor (DVP) candidate derived from neighboring blocks or from temporally collocated blocks that also use inter-view reference pictures. In HTM3.1 (HEVC-based test model version 3.1 for 3D video coding), when deriving an inter-view Merge candidate for Merge/Skip modes, if the motion information of the corresponding block is not available or not valid, the inter-view Merge candidate is replaced by a DV.
To share the previously coded residual information of adjacent views, the residual signal of the current block (PU) can be predicted by the residual signals of the corresponding blocks in the inter-view pictures, as shown in FIG. 3. The corresponding blocks can be located by their respective DVs. The video pictures and depth maps corresponding to a particular camera position are indicated by a view identifier (i.e., V0, V1 and V2 in FIG. 3). All video pictures and depth maps that belong to the same camera position are associated with the same viewId (i.e., view identifier). The view identifiers are used for specifying the coding order within the access units and for detecting missing views in error-prone environments. An access unit includes all video pictures and depth maps corresponding to the same time instant. Inside an access unit, the video picture and any associated depth map having viewId equal to 0 are coded first, followed by the video picture and depth map having viewId equal to 1, etc. The view with viewId equal to 0 (i.e., V0 in FIG. 3) is also referred to as the base view or the independent view. The base-view video pictures can be coded using a conventional HEVC video coder without dependence on any other view.
As can be seen in FIG. 3, the motion vector predictor (MVP)/disparity vector predictor (DVP) for the current block can be derived from the inter-view blocks in the inter-view pictures. In the following, inter-view blocks in inter-view pictures may be abbreviated as inter-view blocks. The derived candidates are termed inter-view candidates, which can be inter-view MVPs or DVPs. Furthermore, a corresponding block in a neighboring view is termed an inter-view block, and the inter-view block is located using a disparity vector derived from the depth information of the current block in the current picture.
As described above, the DV is critical in 3D video coding for disparity vector prediction, inter-view motion prediction, inter-view residual prediction, disparity-compensated prediction (DCP), and any other coding tool that needs to indicate the correspondence between inter-view pictures.
Compressed digital video has been widely used in various applications such as video streaming over digital networks and video transmission over digital channels. Very often, a single video content may be delivered over networks with different characteristics. For example, a live sport event may be carried in a high-bandwidth streaming format over broadband networks for premium video services. In such applications, the compressed video usually preserves high resolution and high quality so that the video content is suited for high-definition devices such as an HDTV or a high-resolution LCD display. The same content may also be carried through a cellular data network so that it can be watched on a portable device such as a smart phone or a network-connected portable media device. In such applications, due to network bandwidth concerns as well as the typically lower-resolution display of the smart phone or portable device, the video content is usually compressed to lower resolution and lower bitrates. Therefore, for different network environments and different applications, the video resolution and video quality requirements are quite different. Even for the same type of network, users may experience different available bandwidths due to different network infrastructures and network traffic conditions. Therefore, a user may desire to receive the video at higher quality when the available bandwidth is high, and to receive lower-quality, but smooth, video when network congestion occurs. In another scenario, a high-end media player can handle high-resolution and high-bitrate compressed video, while a low-cost media player can only handle low-resolution and low-bitrate compressed video due to limited computational resources. Accordingly, it is desirable to construct the compressed video in a scalable manner so that videos at different spatial-temporal resolutions and/or qualities can be derived from the same compressed bitstream.
The Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG standardized a Scalable Video Coding (SVC) extension of the H.264/AVC standard. An H.264/AVC SVC bitstream may contain video information ranging from low frame rate, low resolution, and low quality to high frame rate, high definition, and high quality. This single bitstream can be adapted to various applications and displayed on devices with different configurations. Accordingly, H.264/AVC SVC is suitable for various video applications, such as video broadcasting, video streaming, and video surveillance, to adapt to the network infrastructure, traffic conditions, user preferences, etc.
In SVC, three types of scalability, i.e., temporal scalability, spatial scalability, and quality scalability, are provided. SVC uses a multi-layer coding structure to realize the three dimensions of scalability. A main goal of SVC is to generate one scalable bitstream that can be easily and rapidly adapted to the bit-rate requirements associated with various transmission channels, diverse display capabilities, and different computational resources without transcoding or re-encoding. An important feature of the SVC design is that scalability is provided at the bitstream level. In other words, bitstreams for deriving video with a reduced spatial and/or temporal resolution can be obtained simply by extracting Network Abstraction Layer (NAL) units (or network packets) from a scalable bitstream. NAL units for quality refinement can additionally be truncated in order to reduce the bit-rate and the associated video quality.
For temporal scalability, a video sequence can be hierarchically coded in the temporal domain. For example, temporal scalability can be achieved using a hierarchical coding structure based on B-pictures according to the H.264/AVC standard. FIG. 4 illustrates an example of a hierarchical B-picture structure with 4 temporal layers, where the Group of Pictures (GOP) includes eight pictures. Pictures 0 and 8 in FIG. 4 are called key pictures. Inter prediction of key pictures only uses previous key pictures as references. Other pictures between two key pictures are predicted hierarchically. The video having only the key pictures forms the coarsest temporal resolution of the scalable system. Temporal scalability is achieved by progressively refining a lower-level (coarser) video by adding more B-pictures corresponding to enhancement layers of the scalable system. In the example of FIG. 4, picture 4 (in display order) is bi-directionally predicted using the key pictures (i.e., pictures 0 and 8) after the two key pictures are coded. After picture 4 is processed, pictures 2 and 6 are processed. Picture 2 is bi-directionally predicted using pictures 0 and 4, and picture 6 is bi-directionally predicted using pictures 4 and 8. After pictures 2 and 6 are coded, the remaining pictures, i.e., pictures 1, 3, 5 and 7, are bi-directionally predicted using their two respective neighboring pictures as shown in FIG. 4. Accordingly, the processing order for the GOP is 0, 8, 4, 2, 6, 1, 3, 5, and 7. The pictures processed according to the hierarchical process of FIG. 4 form a four-level hierarchy, where pictures 0 and 8 belong to the first temporal order, picture 4 belongs to the second temporal order, pictures 2 and 6 belong to the third temporal order, and pictures 1, 3, 5, and 7 belong to the fourth temporal order. Decoding the base-level pictures and adding the pictures of higher temporal orders provides video of a higher level.
For example, base-level pictures 0 and 8 can be combined with second temporal-order picture 4 to form the second-level video. Adding the third temporal-order pictures to the second-level video forms the third-level video. Similarly, adding the fourth temporal-order pictures to the third-level video forms the fourth-level video. Accordingly, temporal scalability is achieved. If the original video has a frame rate of 30 frames per second, the base-level video has a frame rate of 30/8=3.75 frames per second. The second-level, third-level and fourth-level videos correspond to 7.5, 15, and 30 frames per second, respectively. The first temporal-order pictures are also called base-level video or base-level pictures. The second through fourth temporal-order pictures are also called enhancement-level video or enhancement-level pictures. In addition to enabling temporal scalability, the coding structure of hierarchical B-pictures also improves the coding efficiency over the typical IBBP GOP structure at the cost of increased encoding-decoding delay.
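The coding order and per-level frame rates described above can be reproduced with a short sketch (illustrative Python; the function names are hypothetical and not from any standard):

```python
def gop_coding_order(gop_size=8):
    """Hierarchical-B coding order: both key pictures first, then the
    midpoints of successively finer temporal levels."""
    order = [0, gop_size]
    step = gop_size // 2
    while step > 0:
        order.extend(range(step, gop_size, 2 * step))
        step //= 2
    return order

def level_frame_rates(full_rate=30.0, gop_size=8, levels=4):
    """Frame rate available at each temporal level, coarsest first."""
    return [full_rate / gop_size * 2**k for k in range(levels)]

print(gop_coding_order(8))   # → [0, 8, 4, 2, 6, 1, 3, 5, 7]
print(level_frame_rates())   # → [3.75, 7.5, 15.0, 30.0]
```

The output matches the processing order 0, 8, 4, 2, 6, 1, 3, 5, 7 and the 3.75/7.5/15/30 frames-per-second levels given above for a GOP of eight pictures.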
In SVC, spatial scalability is supported based on a pyramid coding scheme. First, the video sequence is down-sampled to smaller pictures with coarser spatial resolutions (i.e., lower layers). In addition to dyadic spatial resolution, SVC also supports arbitrary resolution ratios, which is called extended spatial scalability (ESS). In order to improve the coding efficiency of the enhancement layers (the video layers with finer resolutions), inter-layer prediction schemes are introduced. Three inter-layer prediction tools are adopted in SVC, namely inter-layer motion prediction, inter-layer intra prediction, and inter-layer residual prediction.
The inter-layer prediction process comprises identifying the collocated block in a lower layer (e.g., the base layer, BL) based on the location of a corresponding enhancement-layer (EL) block. The collocated lower-layer block is then interpolated to generate prediction samples for the EL as shown in FIG. 5. In scalable video coding, the interpolation process is used for inter-layer prediction by applying predefined coefficients to lower-layer pixels to generate the prediction samples for the EL. The example in FIG. 5 consists of two layers; however, an SVC system may consist of more than two layers. The BL picture is formed by applying spatial decimation 510 to the input picture (i.e., an EL picture in this example). The BL processing comprises BL prediction 520. The BL input is predicted by BL prediction 520, where subtractor 522 forms the difference between the BL input data and the BL prediction. The output of subtractor 522 corresponds to the BL prediction residues, which are processed by transform/quantization (T/Q) 530 and entropy coding 570 to generate a compressed bitstream for the BL. Reconstructed BL data has to be generated at the BL in order to form the BL prediction. Accordingly, inverse transform/inverse quantization (IT/IQ) 540 is used to recover the BL residues. The recovered BL residues and the BL prediction data are combined using reconstruction 550 to form the reconstructed BL data. The reconstructed BL data is processed by in-loop filter 560 before it is stored in buffers inside the BL prediction. In the BL, BL prediction 520 uses Inter/Intra prediction 521. The EL processing consists of similar processing modules: EL prediction 525, subtractor 528, T/Q 535, entropy coding 575, IT/IQ 545, reconstruction 555 and in-loop filter 565. However, the EL prediction also utilizes the reconstructed BL data as part of inter-layer prediction.
Accordingly, EL prediction 525 comprises inter-layer prediction 527 in addition to Inter/Intra prediction 526. The reconstructed BL data is interpolated using interpolation 512 before it is used for inter-layer prediction. The compressed bitstreams from the BL and the EL are combined using multiplexer 580 to form a scalable bitstream.
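The inter-layer texture path, in which reconstructed BL samples are upsampled before serving as an EL prediction, can be sketched as follows. This is a simplified nearest-neighbor 2× upsampler for illustration only; SVC actually uses predefined interpolation filter coefficients, and the function names here are hypothetical.

```python
def upsample_2x(bl_block):
    """Simplified 2x upsampling of a reconstructed BL block
    (nearest-neighbor for illustration; SVC defines specific
    interpolation filters)."""
    out = []
    for row in bl_block:
        up_row = [p for p in row for _ in (0, 1)]  # repeat each pixel horizontally
        out.append(up_row)
        out.append(list(up_row))                   # repeat each row vertically
    return out

def inter_layer_residual(el_block, bl_block):
    """EL residual = EL input minus the upsampled BL prediction."""
    pred = upsample_2x(bl_block)
    return [[e - p for e, p in zip(er, pr)] for er, pr in zip(el_block, pred)]
```

The EL encoder then transforms, quantizes, and entropy-codes this residual, mirroring the BL path described above.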
When coding an EL coding unit, a flag can be coded to indicate whether the EL motion information is derived directly from the BL. If the flag is equal to 1, the partitioning data of the EL coding unit, together with the associated reference indices and motion vectors, are derived from the corresponding data of the collocated block in the BL. The reference picture index of the BL is directly used in the EL. The coding unit partitioning and motion vectors of the EL correspond to the scaled coding unit partitioning and scaled motion vectors of the BL, respectively. In addition, the scaled BL motion vector can be used as an additional motion vector predictor for the EL.
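The MV scaling mentioned above can be sketched as below (illustrative Python; the rounding convention is an assumption, not the normative SVC derivation):

```python
def scale_bl_mv(bl_mv, el_size, bl_size):
    """Scale a BL motion vector to EL resolution by the spatial ratio
    between the layers (dyadic 2x scalability doubles each component)."""
    mvx, mvy = bl_mv
    (el_w, el_h), (bl_w, bl_h) = el_size, bl_size
    return (round(mvx * el_w / bl_w), round(mvy * el_h / bl_h))

print(scale_bl_mv((3, -5), (3840, 2160), (1920, 1080)))  # → (6, -10)
```

The same per-component scaling applies to the coding unit partitioning boundaries, so that the EL inherits a spatially consistent version of the BL motion data.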
As illustrated in the above discussion, the DV and MV information is used for inter-view predictive coding in three-dimensional video coding systems and inter-layer predictive coding in scalable video coding systems. The DV and MV information may have to be stored for one or more reference pictures or depth maps. Therefore, the amount of storage may be substantial. It is desirable to reduce the required MV/DV information storage.