Three-dimensional (3D) television has been a technology trend in recent years that aims to bring viewers a sensational viewing experience. Various technologies have been developed to enable 3D viewing, and among them multi-view video is a key technology for 3DTV applications. Traditional video is a two-dimensional (2D) medium that only provides viewers a single view of a scene from the perspective of the camera. Multi-view video, in contrast, is capable of offering arbitrary viewpoints of dynamic scenes and provides viewers a sensation of realism.
Multi-view video is typically created by capturing a scene with multiple cameras simultaneously, where the cameras are properly located so that each one captures the scene from a different viewpoint. Accordingly, the cameras capture multiple video sequences corresponding to multiple views. In order to provide more views, more cameras are used, generating multi-view video with a large number of video sequences associated with the views. Such multi-view video requires large storage space and/or high transmission bandwidth. Therefore, multi-view video coding techniques have been developed in the field to reduce the required storage space or transmission bandwidth.
Various techniques to improve the coding efficiency of 3D video coding have been disclosed in the field. There are also development activities to standardize the coding techniques. For example, a working group, ISO/IEC JTC1/SC29/WG11 within ISO (International Organization for Standardization), is developing an HEVC (High Efficiency Video Coding) based 3D video coding standard. In HEVC, temporal motion parameters (e.g., motion vectors (MVs), reference indices and prediction mode) can be used for MV prediction. Therefore, the motion parameters from previous pictures need to be stored in a motion parameters buffer. However, the size of the motion parameters buffer may become quite significant because the granularity of motion representation is the 4×4 block size, and each prediction unit (PU) in a B-slice (bi-predicted slice) may have two motion vectors. In order to reduce the size of the motion parameters buffer, a motion compression process, named motion data storage reduction (MDSR), is utilized to store the decoded motion information from previous pictures at a lower resolution. During the encoding or decoding process, the decoded motion information associated with a current frame is used to reconstruct the current frame. After the current frame is reconstructed, the motion information is stored at a coarser granularity for other frames to reference.
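The scale of the buffering problem can be illustrated with a rough, back-of-the-envelope estimate. The field sizes below (bytes per motion vector, per reference index, picture padded to 1920×1088) are illustrative assumptions, not figures taken from the HEVC specification:

```python
def motion_buffer_bytes(width, height, granularity=4,
                        bytes_per_mv=4, mvs_per_block=2,
                        bytes_per_ref_idx=1):
    """Rough size of the motion parameters buffer for one B-slice picture.

    One motion record (MV plus reference index, times two for
    bi-prediction) is kept per granularity x granularity block.
    Sizes of the individual fields are illustrative assumptions.
    """
    blocks = (width // granularity) * (height // granularity)
    per_block = mvs_per_block * (bytes_per_mv + bytes_per_ref_idx)
    return blocks * per_block

# 1080p picture padded to a multiple of 16 (1920x1088):
full = motion_buffer_bytes(1920, 1088)                   # 4x4 granularity
reduced = motion_buffer_bytes(1920, 1088, granularity=16)  # after MDSR
print(full, reduced)  # the 16x16 granularity holds 1/16 of the data
```

Storing motion at 16×16 rather than 4×4 granularity reduces the buffer by a factor of 16, which is the motivation for the decimation scheme described next.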
In HEVC, the reduction of the motion information buffer is achieved by a decimation method. FIG. 1 shows an example of motion data storage reduction based on decimation. In this example, the motion data compression is conducted for each 16×16 block. All 4×4 blocks within the 16×16 block share the motion vectors, reference picture indices and prediction mode of a representative block. In the HEVC standard, the top-left 4×4 block (i.e., block 0) is used as the representative block for the whole 16×16 block. For convenience, each 16×16 block is referred to as a motion sharing area in this disclosure, since all the smallest blocks within the 16×16 block share the same motion parameters. While a 16×16 block size is used in the HEVC standard, the motion sharing area may also have other block sizes.
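The decimation step above can be sketched in a few lines. This is a minimal illustration of the principle (keep only the top-left 4×4 block of each 16×16 motion sharing area), not the HEVC reference software; the function names are assumptions:

```python
def compress_motion_field(mv_field):
    """Decimate a per-4x4-block motion field to one entry per 16x16 area.

    mv_field is a 2-D list whose [y][x] entry holds the motion
    parameters of the 4x4 block at that position. Keeping every
    fourth row and column retains exactly the top-left 4x4 block
    of each 16x16 motion sharing area.
    """
    step = 16 // 4  # four 4x4 blocks per side of a 16x16 area
    return [row[::step] for row in mv_field[::step]]

def lookup_motion(compressed, block_x, block_y):
    """Fetch the shared parameters for the 4x4 block at (block_x, block_y)."""
    step = 16 // 4
    return compressed[block_y // step][block_x // step]

# Example: a 32x32-pixel region is an 8x8 grid of 4x4 blocks.
field = [[(y, x) for x in range(8)] for y in range(8)]
compressed = compress_motion_field(field)   # 2x2 grid, one per 16x16 area
```

After compression, every 4×4 block maps back to the representative entry of its motion sharing area, e.g. `lookup_motion(compressed, 5, 6)` returns the parameters stored for block 0 of the bottom-right 16×16 area.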
In the international coding standard development, three-dimensional video coding and scalable video coding are two possible extensions to the conventional two-dimensional HEVC video coding standard. FIG. 2 shows an exemplary prediction structure used in the HEVC-based 3D video coding Version 4.0 (HTM-4.0). The video pictures (210A) and depth maps (210B) corresponding to a particular camera position are indicated by a view identifier (viewID). For example, video pictures and depth maps associated with three views (i.e., V0, V1 and V2) are shown in FIG. 2. All video pictures and depth maps that belong to the same camera position are associated with the same viewID. The video pictures and, when present, the depth maps are coded access unit (AU) by access unit, as shown in FIG. 2. An AU (220) includes all video pictures and depth maps corresponding to the same time instant. In HTM-4.0, the motion data compression is performed for each picture after all the pictures (both texture and depth) within the same AU are coded. In this case, for each AU, the reconstruction process for pictures within the AU can rely on full-resolution motion data associated with the current AU. The motion data compression will only affect the reconstruction process of other AUs that refer to the compressed motion data associated with the current AU.
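The HTM-4.0 ordering described above can be sketched as follows. This is a hedged illustration of the control flow only (function and variable names are assumptions, not identifiers from the HTM software): pictures within an AU are coded against full-resolution motion, and decimation happens once per AU, after the last picture:

```python
def code_access_unit(pictures, code_picture, compress_motion):
    """Code all texture pictures and depth maps of one AU, then compress.

    code_picture(pic, motion_so_far) codes one picture and returns its
    full-resolution motion data; within the AU every picture may refer
    to the uncompressed motion of previously coded pictures.
    compress_motion decimates one picture's motion field.
    """
    full_res_motion = {}
    for pic in pictures:  # texture and depth, all views, same time instant
        full_res_motion[pic] = code_picture(pic, full_res_motion)
    # Only now decimate; later AUs see only the compressed data.
    return {pic: compress_motion(m) for pic, m in full_res_motion.items()}

# Toy example: motion data is a list of 16 entries per picture,
# and "compression" keeps one entry in 16.
au0 = ["T0", "T1", "T2", "D0", "D1", "D2"]
kept = code_access_unit(au0,
                        lambda pic, motion: [pic] * 16,
                        lambda m: m[::16])
```

The key property is that `full_res_motion` is held in its entirety until the whole AU is coded, which is exactly the buffering cost discussed later in this disclosure.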
As for scalable video coding (SVC), three types of scalability, including temporal scalability, spatial scalability and quality scalability, are being considered for the scalable extension of HEVC. SVC uses a multi-layer coding structure to realize the three dimensions of scalability. The prediction structure can be similar to that for 3D video coding, where inter-view prediction (i.e., prediction in the view direction) is replaced by inter-layer prediction (i.e., prediction in the layer direction). Furthermore, in SVC, only texture information is involved and there is no depth map.
FIG. 3 illustrates an exemplary three-layer SVC system, where the video sequence is first down-sampled to obtain smaller pictures at different spatial resolutions (layers). For example, picture 310 at the original resolution can be processed by spatial decimation 320 to obtain resolution-reduced picture 311. The resolution-reduced picture 311 can be further processed by spatial decimation 321 to obtain further resolution-reduced picture 312, as shown in FIG. 3. The SVC system in FIG. 3 illustrates an example of a spatial scalable system with three layers, where layer 0 corresponds to the pictures with the lowest spatial resolution and layer 2 corresponds to the pictures with the highest resolution. The layer-0 pictures are coded without reference to other layers, i.e., single-layer coding. For example, the lowest-layer picture 312 is coded using motion-compensated and Intra prediction 330. In FIG. 3, while spatial scalability is achieved using spatial decimation, quality scalability is achieved by using SNR (Signal-to-Noise Ratio) enhancement. Temporal scalability can be achieved using techniques such as a hierarchical temporal picture structure.
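The layer pyramid of FIG. 3 can be sketched with repeated 2:1 decimation. Plain subsampling is used here for clarity; a practical encoder would apply a low-pass filter before subsampling, and the 2:1 ratio is only one possible choice:

```python
def decimate(picture):
    """2:1 spatial decimation: keep every other row and column."""
    return [row[::2] for row in picture[::2]]

# A toy 16x16 "picture" whose sample value encodes its position.
layer2 = [[x + 16 * y for x in range(16)] for y in range(16)]  # original, layer 2
layer1 = decimate(layer2)   # half resolution, layer 1
layer0 = decimate(layer1)   # quarter resolution, layer 0
```

Layer 0 is then coded on its own, while each higher layer may additionally predict from the reconstructed layer below it, as described next.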
The motion-compensated and Intra prediction 330 will generate syntax elements as well as coding related information such as motion information for further entropy coding 340. FIG. 3 actually illustrates a combined SVC system that provides spatial scalability as well as quality scalability (also called SNR scalability). For each single-layer coding, the residual coding errors can be refined using SNR enhancement layer coding 350. The SNR enhancement layer in FIG. 3 may provide multiple quality levels (quality scalability). Each supported resolution layer can be coded by a respective single-layer motion-compensated and Intra prediction, similar to a non-scalable coding system. Each higher spatial layer may also be coded using inter-layer coding based on one or more lower spatial layers. For example, spatial layer-1 video can be adaptively coded using either inter-layer prediction based on layer-0 video or single-layer coding. Similarly, spatial layer-2 video can be adaptively coded using either inter-layer prediction based on reconstructed spatial layer-1 video or single-layer coding. As shown in FIG. 3, spatial layer-1 pictures 311 can be coded by motion-compensated and Intra prediction 331, base layer entropy coding 341 and SNR enhancement layer coding 351. As shown in FIG. 3, the reconstructed BL video data is also utilized by motion-compensated and Intra prediction 331, where a coding block in spatial layer 1 may use the reconstructed BL video data as additional Intra prediction data (i.e., no motion compensation is involved). Similarly, layer-2 pictures 310 can be coded by motion-compensated and Intra prediction 332, base layer entropy coding 342 and SNR enhancement layer coding 352. The BL bitstreams and SNR enhancement layer bitstreams from all spatial layers are multiplexed by multiplexer 360 to generate a scalable bitstream.
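The adaptive choice above, where an enhancement-layer block may use either normal motion-compensated/Intra prediction or the reconstructed base-layer data as an additional Intra predictor, can be illustrated with a simple per-block mode decision. The cost function (sum of absolute differences) and all names here are illustrative assumptions; a real encoder would use a rate-distortion cost:

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def choose_predictor(block, candidates, cost=sad):
    """Pick the candidate predictor (by mode name) with the lowest cost.

    candidates maps a mode name (e.g. normal Intra prediction vs.
    upsampled reconstructed base-layer block) to its predicted samples.
    """
    return min(candidates, key=lambda mode: cost(block, candidates[mode]))

# Toy example: the upsampled base-layer block matches the source better
# than the plain Intra predictor, so inter-layer prediction is chosen.
block = [[5, 5], [5, 5]]
candidates = {"intra": [[0, 0], [0, 0]],
              "inter_layer": [[5, 4], [5, 5]]}
```

Here `choose_predictor(block, candidates)` selects `"inter_layer"`, since its SAD of 1 beats the Intra predictor's SAD of 20.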
As mentioned before, the motion vector compression in HTM-4.0 is performed for each picture after all pictures (both texture and depth) within the same AU are coded. Therefore, the motion information associated with all pictures (both texture and depth) within the same AU has to be buffered temporarily before motion vector compression is performed. FIG. 4 illustrates the motion data buffer requirement according to HTM-4.0. The video pictures (T0, T1 and T2) and depth maps (D0, D1 and D2) are associated with AU 0 (410). The full-resolution motion information is stored in motion data buffer 420, where block 420A corresponds to motion data associated with picture T0 and block 420B corresponds to motion data associated with depth map D0. After all texture pictures and depth maps in AU 0 are coded, the full-resolution motion information is compressed to 1/16-resolution motion data (430), where block 430A corresponds to compressed motion data associated with picture T0 and block 430B corresponds to compressed motion data associated with depth map D0. When a 3D sequence involves a large number of views, the required motion data buffer may be quite sizeable. Therefore, it is desirable to develop techniques for 3DVC to reduce the motion data buffer requirement. Similarly, it is desirable to reduce the required motion data buffer for SVC with only a minor coding performance drop compared to storing the motion data at full resolution. For SVC, a set of images across all layers can be considered as an equivalent AU in 3DVC. For example, a set of pyramid images associated with a time instant can be considered as an AU in order to unify the discussion in the disclosure.
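The AU-level buffering cost scales linearly with the number of views. The sketch below makes that scaling concrete; the per-picture figure is carried over from the earlier illustrative estimate for a 1080p picture and is an assumption, not a normative value:

```python
def au_motion_buffer(views, per_picture_bytes):
    """Temporary vs. retained motion buffer size for one AU in HTM-4.0.

    Every view contributes one texture picture and one depth map, all of
    whose full-resolution motion must be buffered until the whole AU is
    coded; afterwards 1/16 of the data is retained (16x16 decimation).
    """
    pictures = 2 * views                  # texture + depth per view
    full = pictures * per_picture_bytes   # buffered while coding the AU
    compressed = full // 16               # retained after decimation
    return full, compressed

# Three views (V0, V1, V2), ~1.3 MB of full-resolution motion per picture:
full, kept = au_motion_buffer(3, 1_305_600)
```

With three views, roughly 7.8 MB of full-resolution motion must be held per AU even though only about 0.5 MB survives compression, which illustrates why reducing this temporary buffer is desirable for 3DVC and, analogously, for SVC.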