Three-dimensional (3D) television has been a technology trend in recent years, aiming to bring viewers a more engaging viewing experience. Various technologies have been developed to enable 3D viewing, and among them multi-view video is a key technology for 3DTV applications. Traditional video is a two-dimensional (2D) medium that only provides viewers a single view of a scene from the perspective of the camera. Multi-view video, in contrast, is capable of offering arbitrary viewpoints of dynamic scenes and provides viewers the sensation of realism.
Multi-view video is typically created by capturing a scene with multiple cameras simultaneously, where the cameras are positioned so that each one captures the scene from a different viewpoint. The cameras thus capture multiple video sequences corresponding to multiple views. To provide more views, more cameras are used, generating multi-view video with a large number of video sequences associated with the views. Consequently, multi-view video requires a large storage space and/or a high transmission bandwidth. Multi-view video coding techniques have therefore been developed in the field to reduce the required storage space or transmission bandwidth.
A straightforward approach is to apply conventional video coding techniques to each single-view video sequence independently, disregarding any correlation among different views. Such a coding system would be very inefficient. To improve the efficiency of multi-view video coding, typical multi-view video coding exploits inter-view redundancy. Accordingly, most 3D Video Coding (3DVC) systems take into account the correlation of video data associated with multiple views and depth maps.
In the reference software for HEVC-based 3D video coding (3D-HTM), an inter-view candidate is added as a motion vector (MV) or disparity vector (DV) candidate for the Inter, Merge and Skip modes in order to re-use previously coded motion information of adjacent views. In 3D-HTM, the basic unit for compression, termed a coding unit (CU), is a 2N×2N square block. Each CU can be recursively split into four smaller CUs until a predefined minimum size is reached. Each CU contains one or more prediction units (PUs).
To share the previously coded texture information of adjacent views, a technique known as Disparity-Compensated Prediction (DCP) has been included in 3D-HTM as an alternative coding tool to motion-compensated prediction (MCP). MCP refers to an inter-picture prediction that uses previously coded pictures of the same view, while DCP refers to an inter-picture prediction that uses previously coded pictures of other views in the same access unit. FIG. 1 illustrates an example of a 3D video coding system incorporating MCP and DCP. The vector (110) used for DCP is termed a disparity vector (DV), which is analogous to the motion vector (MV) used in MCP. FIG. 1 illustrates three MVs (120, 130 and 140) associated with MCP. Moreover, the DV of a DCP block can also be predicted by the disparity vector predictor (DVP) candidate derived from neighboring blocks or temporal collocated blocks that also use inter-view reference pictures. In 3D-HTM, when deriving an inter-view Merge candidate for the Merge/Skip modes, if the motion information of the corresponding block is not available or not valid, the inter-view Merge candidate is replaced by a DV.
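The distinction between the two prediction types can be sketched as follows. This is a minimal illustrative model of our own (the class and function names are not from the 3D-HTM source): a reference picture of the same view at a different time instant yields MCP, while a reference picture of another view in the same access unit yields DCP.

```python
from dataclasses import dataclass

@dataclass
class RefPicture:
    view_id: int  # camera view the reference picture belongs to
    poc: int      # picture order count (time instant / access unit)

def prediction_type(cur_view: int, cur_poc: int, ref: RefPicture) -> str:
    """MCP uses a previously coded picture of the same view; DCP uses a
    previously coded picture of another view in the same access unit."""
    if ref.view_id == cur_view and ref.poc != cur_poc:
        return "MCP"  # the associated vector is a motion vector (MV)
    if ref.view_id != cur_view and ref.poc == cur_poc:
        return "DCP"  # the associated vector is a disparity vector (DV)
    return "invalid"
```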
Inter-view residual prediction is another coding tool used in 3D-HTM. To share the previously coded residual information of adjacent views, the residual signal of the current prediction block (i.e., PU) can be predicted by the residual signals of the corresponding blocks in the inter-view pictures as shown in FIG. 2. The corresponding blocks can be located by respective DVs. The video pictures and depth maps corresponding to a particular camera position are indicated by a view identifier (i.e., V0, V1 and V2 in FIG. 2). All video pictures and depth maps that belong to the same camera position are associated with the same viewId (i.e., view identifier). The view identifiers are used for specifying the coding order within the access units and detecting missing views in error-prone environments. An access unit includes all video pictures and depth maps corresponding to the same time instant. Inside an access unit, the video picture and, when present, the associated depth map having viewId equal to 0 are coded first, followed by the video picture and depth map having viewId equal to 1, etc. The view with viewId equal to 0 (i.e., V0 in FIG. 2) is also referred to as the base view or the independent view. The base view video pictures can be coded using a conventional HEVC video coder without dependence on other views.
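The effect of inter-view residual prediction can be sketched as below: instead of coding the current PU's residual directly, only the difference between that residual and the corresponding block's residual (located by a DV) is coded. This is a hedged sketch using plain 2-D lists as hypothetical containers, not 3D-HTM code.

```python
def residual_after_interview_prediction(cur_residual, ref_residual):
    """Predict the residual of the current PU from the residual of the
    corresponding block in the inter-view picture; only the element-wise
    difference remains to be coded. Both inputs are equal-size 2-D lists."""
    return [[c - r for c, r in zip(crow, rrow)]
            for crow, rrow in zip(cur_residual, ref_residual)]
```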
As can be seen in FIG. 2, for the current block, a motion vector predictor (MVP) or disparity vector predictor (DVP) can be derived from the inter-view blocks in the inter-view pictures. In the following, inter-view blocks in an inter-view picture may be abbreviated as inter-view blocks. The derived candidates are termed inter-view candidates, which can be inter-view MVPs or DVPs. A coding tool that codes the motion information of a current block (e.g., a current prediction unit, PU) based on previously coded motion information in other views is termed inter-view motion parameter prediction. Furthermore, a corresponding block in a neighboring view is termed an inter-view block, and the inter-view block is located using the disparity vector derived from the depth information associated with the current block in the current picture.
The example shown in FIG. 2 corresponds to a view coding order from V0 (i.e., base view) to V1, and followed by V2. The current block in the current picture being coded is in V2. According to 3D-HTM, all the MVs of reference blocks in the previously coded views can be considered as an inter-view candidate. In FIG. 2, frames 210, 220 and 230 correspond to a video picture or a depth map from views V0, V1 and V2 at time t1 respectively. Block 232 is the current block in the current view, and blocks 212 and 222 are the current blocks in V0 and V1 respectively. For current block 212 in V0, a disparity vector (216) is used to locate the inter-view collocated block (214). Similarly, for current block 222 in V1, a disparity vector (226) is used to locate the inter-view collocated block (224). According to 3D-HTM, the motion vectors or disparity vectors associated with inter-view collocated blocks from any coded views can be included in the inter-view candidates.
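The core operation of locating an inter-view collocated block can be sketched as a position shift by the disparity vector. This is a simplified illustration of our own; the actual 3D-HTM derivation additionally clips the resulting position to the picture boundary and aligns it to a block grid, which is omitted here.

```python
def locate_interview_block(cur_x, cur_y, dv):
    """Locate the corresponding (inter-view collocated) block in a
    previously coded view by shifting the current block position by the
    disparity vector (dv_x, dv_y)."""
    dv_x, dv_y = dv
    return (cur_x + dv_x, cur_y + dv_y)
```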
In 3D-HTM, a disparity vector can be used as a DVP candidate for Inter mode or as a Merge candidate for the Merge/Skip modes. A derived disparity vector can also be used as an offset vector for inter-view motion prediction and inter-view residual prediction. When used as an offset vector, the DV is derived from spatial and temporal neighboring blocks as shown in FIG. 3A and FIG. 3B. Multiple spatial and temporal neighboring blocks are determined, and the DV availability of the spatial and temporal neighboring blocks is checked according to a pre-determined order. This coding tool for DV derivation based on neighboring (spatial and temporal) blocks is termed Neighboring Block DV (NBDV). As shown in FIG. 3A, the spatial neighboring block set includes the location diagonally across from the lower-left corner of the current block (i.e., lower-left block, A0), the location next to the left-bottom side of the current block (i.e., left-bottom block, A1), the location diagonally across from the upper-left corner of the current block (i.e., upper-left block, B2), the location diagonally across from the upper-right corner of the current block (i.e., upper-right block, B0), and the location next to the top-right side of the current block (i.e., top-right block, B1). As shown in FIG. 3B, the temporal neighboring block set includes the location at the center of the current block (i.e., BCTR) and the location diagonally across from the right-bottom corner of the current block (i.e., right-bottom block, RB) in a temporal reference picture. As shown in FIG. 3B, the current block is located at the upper-left of the center point P. Instead of the center location, other locations (e.g., a lower-right block) within the current block in the temporal reference picture may also be used. In other words, any block collocated with the current block can be included in the temporal block set. Once a block is identified as having a DV, the checking process is terminated.
An exemplary search order for the spatial neighboring blocks in FIG. 3A is (A1, B1, B0, A0, B2). An exemplary search order for the temporal neighboring blocks in FIG. 3B is (RB, BCTR). The spatial and temporal neighboring blocks are the same as the spatial and temporal neighboring blocks of the Inter (AMVP) and Merge modes in HEVC.
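The NBDV checking process above can be sketched as follows. The dictionaries are hypothetical containers mapping position names to a DV tuple (or None for blocks that are not DCP-coded); spatial blocks are checked before temporal blocks here for illustration, with the exact interleaving fixed by the 3D-HTM implementation.

```python
def nbdv(spatial_dvs, temporal_dvs):
    """Return the first available DV following the exemplary search orders;
    the checking process terminates as soon as a DCP-coded block is found."""
    for pos in ("A1", "B1", "B0", "A0", "B2"):   # spatial order
        if spatial_dvs.get(pos) is not None:
            return spatial_dvs[pos]
    for pos in ("RB", "BCTR"):                   # temporal order
        if temporal_dvs.get(pos) is not None:
            return temporal_dvs[pos]
    return None  # no DCP block found; fall back (e.g., to DV-MCP blocks)
```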
If a DCP coded block is not found in the neighboring block set (i.e., the spatial and temporal neighboring blocks shown in FIGS. 3A and 3B), the disparity information can be obtained from another coding tool (DV-MCP). In this case, when a spatial neighboring block is an MCP coded block and its motion is predicted by inter-view motion prediction, as shown in FIG. 4, the disparity vector used for the inter-view motion prediction represents a motion correspondence between the current and the inter-view reference picture. This type of motion vector is referred to as an inter-view predicted motion vector, and such blocks are referred to as DV-MCP blocks. FIG. 4 illustrates an example of a DV-MCP block, where the motion information of the DV-MCP block (410) is predicted from a corresponding block (420) in the inter-view reference picture. The location of the corresponding block (420) is specified by a disparity vector (430). The disparity vector used in the DV-MCP block represents a motion correspondence between the current and the inter-view reference picture. The motion information (422) of the corresponding block (420) is used to predict the motion information (412) of the current block (410) in the current view.
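The DV-MCP fallback can be sketched as a second scan over the neighbors. The list-of-dicts representation and key names below are our own illustrative choices, not 3D-HTM data structures.

```python
def dv_from_dvmcp_blocks(neighbors):
    """When no DCP-coded neighbor was found, scan the neighboring blocks for
    a DV-MCP block (an MCP-coded block whose motion was inter-view
    predicted) and reuse the disparity vector it stored."""
    for blk in neighbors:
        if blk.get("is_dv_mcp"):
            return blk["stored_dv"]
    return None  # e.g., fall back to a zero disparity vector
```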
A method to enhance NBDV by extracting a more accurate disparity vector (referred to as a refined DV in this disclosure) from the depth map is utilized in current 3D-HEVC. A depth block from a coded depth map in the same access unit is first retrieved and used as a virtual depth of the current block. This coding tool for DV derivation is termed Depth-oriented NBDV (DoNBDV). While coding the texture in view 1 and view 2 under the common test condition, the depth map in view 0 is already available. Therefore, the coding of texture in view 1 and view 2 can benefit from the depth map in view 0. An estimated disparity vector can be extracted from the virtual depth shown in FIG. 5. The overall flow is as follows:
1. Use an estimated disparity vector, which is the NBDV in current 3D-HTM, to locate the corresponding block in the coded texture view.
2. Use the collocated depth in the coded view for the current block (coding unit) as the virtual depth.
3. Extract a disparity vector (i.e., a refined DV) for inter-view motion prediction from the maximum value in the virtual depth retrieved in the previous step.
In the example illustrated in FIG. 5, the coded depth map in view 0 is used to derive the DV for the texture frame in view 1 to be coded. A corresponding depth block (530) in the coded D0 is retrieved for the current block (CB, 510) according to the estimated disparity vector (540) and the location (520) of the current block in the coded depth map in view 0. The retrieved block (530) is then used as the virtual depth block (530′) for the current block to derive the DV. The maximum value in the virtual depth block (530′) is used to extract a disparity vector for inter-view motion prediction.
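The last DoNBDV step can be sketched as below. We assume a simple linear depth-to-disparity model with hypothetical `scale` and `offset` parameters; the actual codec derives the conversion from camera parameters, and only the horizontal disparity component is non-zero.

```python
def refined_dv(virtual_depth, scale, offset):
    """Take the MAXIMUM sample of the retrieved virtual depth block and
    convert it to a horizontal disparity (vertical component is 0)."""
    d_max = max(max(row) for row in virtual_depth)
    return (scale * d_max + offset, 0)
```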
In 3D-HEVC, a basic unit for compression, termed a coding tree unit (CTU) or largest coding unit (LCU), is a 2N×2N square block, and each CTU can be recursively split into four smaller CUs until the predefined minimum size is reached. To determine the best CU size, rate-distortion optimization (RDO) is often used, which is well known in the field of video coding. When encoding a CU, the rate-distortion (RD) costs for different PU types, including Inter/Merge/Skip 2N×2N, Inter/Merge 2N×N, Inter/Merge N×2N, Inter/Merge N×N, Inter/Merge 2N×nU, Inter/Merge 2N×nD, Inter/Merge nL×2N, Inter/Merge nR×2N, Intra 2N×2N and Intra N×N, are examined. The RD costs for Inter/Merge N×N and Intra N×N are examined only for an 8×8 CU. For each Inter PU type, motion estimation and motion compensation have to be performed to derive the motion-compensated residues for RD cost evaluation. For Merge mode, the motion information is determined from the motion information of neighboring blocks. Merge mode is therefore more computationally efficient, since motion estimation, which is well known in video coding to be very computationally intensive, is not performed. An exemplary encoding process for a texture CU in 3D-HTM is shown in FIG. 6, where the RD performance for various coding modes is checked in steps 612 through 632. As mentioned earlier, the RD costs for Inter/Merge N×N and Intra N×N are examined only for an 8×8 CU. In other words, steps 614 and 630 are performed only for N=8. After a best mode is selected for a given CU, the final CUs (i.e., leaf CUs) are compressed using one of the compress-CU processes (640a-d).
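The mode decision described above can be sketched as a minimization of the RD cost J = D + λ·R over the candidate PU types. The callback `rd_evaluate` is a hypothetical stand-in for the encoder's distortion/rate evaluation (motion estimation, compensation, entropy coding), which this sketch does not model.

```python
def best_pu_type(cu_size, rd_evaluate, lam):
    """Evaluate J = D + lam * R for each candidate PU type and keep the
    minimum. rd_evaluate(mode) returns (distortion, rate_bits)."""
    modes = ["Inter/Merge/Skip 2Nx2N", "Inter/Merge 2NxN", "Inter/Merge Nx2N",
             "Inter/Merge 2NxnU", "Inter/Merge 2NxnD", "Inter/Merge nLx2N",
             "Inter/Merge nRx2N", "Intra 2Nx2N"]
    if cu_size == 8:  # NxN partitions are examined only for 8x8 CUs
        modes += ["Inter/Merge NxN", "Intra NxN"]
    best, best_cost = None, float("inf")
    for m in modes:
        dist, rate = rd_evaluate(m)
        cost = dist + lam * rate
        if cost < best_cost:
            best, best_cost = m, cost
    return best
```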
A CU split quadtree (QT) is determined for the texture at the encoder. Likewise, a splitting QT is determined for the depth at the encoder. The structure of the QTs has to be incorporated in the bitstream so that a decoder can recover it. In order to reduce bits and encoding runtime, the current HTM adopts an approach in which the depth QT uses the texture QT as a predictor. For a given CTU, the quadtree of the depth is linked to the collocated quadtree of the texture, so that a given CU of the depth cannot be split more than its collocated CU in the texture. One example is illustrated in FIG. 7, where block 710 corresponds to a QT for the texture CTU and block 720 corresponds to a depth CTU. As shown in FIG. 7, some partitions in the texture CTU are not performed for the depth CTU (indicated by 722 and 724). Simplification of rectangular partitions can also be performed. For example, when a texture block is partitioned into 2N×N or N×2N, the corresponding depth block does not allow 2N×N, N×2N or N×N partitioning. With this additional constraint on depth partitioning, the depth partition of FIG. 7 is illustrated in block 820 of FIG. 8, where the partition 822 is not allowed.
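The two constraints linking the depth quadtree to the texture quadtree can be sketched as below. The function names and the rule set are our own illustration of the constraints described above, not the HTM implementation; partition names follow the HEVC conventions used in this disclosure.

```python
def depth_split_allowed(texture_cu_was_split: bool) -> bool:
    """A depth CU may be quad-split only if its collocated texture CU was
    itself split; the depth QT is thus a pruned copy of the texture QT."""
    return texture_cu_was_split

def allowed_depth_partitions(texture_partition: str):
    """Rectangular-partition constraint: when the collocated texture block
    uses 2NxN or Nx2N, the depth block may not use 2NxN, Nx2N or NxN."""
    if texture_partition in ("2NxN", "Nx2N"):
        return ["2Nx2N"]
    return ["2Nx2N", "2NxN", "Nx2N", "NxN"]
```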
As shown in FIG. 6, the current encoding scheme is fairly computationally intensive. It is desirable to develop techniques to reduce the encoding complexity associated with mode decision and coding tree partitioning.