Three-dimensional (3D) television has been a technology trend in recent years, intended to bring viewers a sensational viewing experience. Various technologies have been developed to enable 3D viewing, and multi-view video is a key technology for 3DTV applications among them. Traditional video is a two-dimensional (2D) medium that provides viewers only a single view of a scene from the perspective of the camera. Multi-view video, in contrast, is capable of offering arbitrary viewpoints of dynamic scenes and provides viewers with a sensation of realism.
Multi-view video is typically created by capturing a scene using multiple cameras simultaneously, where the cameras are positioned so that each camera captures the scene from one viewpoint. Accordingly, the multiple cameras capture multiple video sequences corresponding to multiple views. In order to provide more views, more cameras have been used, generating multi-view video with a large number of video sequences associated with the views. Such multi-view video requires a large storage space to store and/or a high bandwidth to transmit. Therefore, multi-view video coding techniques have been developed in the field to reduce the required storage space or transmission bandwidth.
A straightforward approach may be to simply apply conventional video coding techniques to each single-view video sequence independently and disregard any correlation among different views. Such a coding system would be very inefficient. In order to improve the efficiency of multi-view video coding, typical multi-view video coding exploits inter-view redundancy. Therefore, most 3D Video Coding (3DVC) systems take into account the correlation of video data associated with multiple views and depth maps. The standard development body, the Joint Video Team of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), extended H.264/MPEG-4 AVC to multi-view video coding (MVC) for stereo and multi-view videos.
MVC adopts both temporal and spatial predictions to improve compression efficiency. During the development of MVC, several macroblock-level coding tools were proposed to exploit the redundancy between multiple views, including illumination compensation, adaptive reference filtering, motion skip mode, and view synthesis prediction. Illumination compensation is intended to compensate for illumination variations between different views. Adaptive reference filtering is intended to reduce the variations due to focus mismatch among the cameras. Motion skip mode allows the motion vectors in the current view to be inferred from the other views. View synthesis prediction is applied to predict a picture of the current view from other views.
In MVC, however, the depth maps and camera parameters are not coded. In the recent standardization development of new-generation 3D Video Coding (3DVC), the texture data, depth data, and camera parameters are all coded. For example, FIG. 1 illustrates a generic prediction structure for 3D video coding, where a standard-conforming video coder is used for the base-view video. The incoming 3D video data consists of images (110-0, 110-1, 110-2, . . . ) corresponding to multiple views. The images collected for each view form an image sequence for the corresponding view. Usually, the image sequence 110-0 corresponding to a base view (also called an independent view) is coded independently by a video coder 130-0 conforming to a video coding standard such as H.264/AVC or HEVC (High Efficiency Video Coding). The video coders (130-1, 130-2, . . . ) for image sequences associated with the dependent views (i.e., views 1, 2, . . . ) further utilize inter-view prediction in addition to temporal prediction. The inter-view predictions are indicated by the short-dashed lines in FIG. 1.
In order to support interactive applications, depth maps (120-0, 120-1, 120-2, . . . ) associated with a scene at respective views are also included in the video bitstream. In order to reduce the data associated with the depth maps, the depth maps are compressed using depth map coders (140-0, 140-1, 140-2, . . . ) and the compressed depth map data is included in the bitstream as shown in FIG. 1. A multiplexer 150 is used to combine compressed data from the image coders and depth map coders. The depth information can be used for synthesizing virtual views at selected intermediate viewpoints. An image corresponding to a selected view may be coded using inter-view prediction based on an image corresponding to another view. In this case, the selected view is referred to as a dependent view.
Inter-view motion prediction and inter-view residual prediction are two major coding tools in 3DV-HTM, in addition to inter-view texture prediction (namely disparity compensated prediction, i.e., DCP). 3DV-HTM is a platform for three-dimensional video coding based on the HEVC Test Model. Both inter-view motion prediction and inter-view residual prediction need a disparity vector to locate a reference block for motion prediction or residual prediction, respectively. For inter-view motion prediction, the disparity vector can also be directly used as a candidate disparity vector for DCP. In the current 3DV-HTM, the disparity vector is derived based on an estimated depth map of the view. There are two methods to generate the estimated depth maps.
FIG. 2A illustrates an example of the first method to generate estimated depth maps, where the method does not use coded depth maps. In FIG. 2A, a random access unit (i.e., POC (Picture Order Count)=0) contains texture pictures (T0-T2) and depth maps (D0-D2) of three views. The circled numbers in FIG. 2A indicate the processing order. In steps 1 and 2, the texture picture of the base view (T0) is coded and the depth map of the base view (D0) is coded. In step 3, the texture picture of a first dependent view (T1) is coded without inter-view motion prediction or inter-view residual prediction. In step 4, an estimated depth map of the first dependent view (PrdD1) is generated by using coded disparity vectors of the texture picture of the first dependent view (T1). In step 5, an estimated depth map of the base view (PrdD0) is generated by warping the estimated depth map of the first dependent view (PrdD1). In steps 6 and 7, the depth map of the first dependent view (D1) is coded and an estimated depth map of a second dependent view (PrdD2) is generated by warping the estimated depth map of the base view (PrdD0). In step 8, the texture picture of the second dependent view (T2) is coded with inter-view motion prediction or inter-view residual prediction using the estimated depth map of the second dependent view (PrdD2), as indicated by the dashed arrow. In step 8.5, the estimated depth map of the second dependent view (PrdD2) is updated by using coded disparity vectors of the texture picture of the second dependent view (T2). Since the estimated depth map of the second dependent view (PrdD2) will not be referenced any more, step 8.5 is unnecessary in this example. In step 9, the depth map of the second dependent view (D2) is coded.
FIG. 2B illustrates an example of the first method to generate estimated depth maps for the case of POC not equal to 0. In step 10, the texture picture of the base view (T0) is coded. In step 11, an estimated depth map of the base view (PrdD0) is generated by using coded motion vectors of the texture picture of the base view and the estimated depth map of the base view of the previous access unit. In step 12, the depth map of the base view (D0) is coded. In step 13, an estimated depth map of a first dependent view (PrdD1) is generated by warping the estimated depth map of the base view (PrdD0). In step 14, the texture picture of the first dependent view (T1) is coded with inter-view motion prediction and/or inter-view residual prediction using the estimated depth map of the first dependent view (PrdD1). In step 14.5, the estimated depth map of the first dependent view (PrdD1) is updated by using coded disparity vectors of the texture picture of the first dependent view (T1). In step 14.7, the estimated depth map of the base view (PrdD0) is updated by warping the estimated depth map of the first dependent view (PrdD1). In step 15, the depth map of the first dependent view (D1) is coded. In step 16, an estimated depth map of a second dependent view (PrdD2) is generated by warping the estimated depth map of the base view (PrdD0). In step 17, the texture picture of the second dependent view (T2) is coded with inter-view motion prediction and/or inter-view residual prediction using the estimated depth map of the second dependent view (PrdD2). In step 17.5, the estimated depth map of the second dependent view (PrdD2) is updated by using coded disparity vectors of the texture picture of the second dependent view (T2). Since the estimated depth map of the second dependent view (PrdD2) will not be referenced any more, step 17.5 is unnecessary in this example. In step 18, the depth map of the second dependent view (D2) is coded.
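The processing orders described for FIG. 2A and FIG. 2B can be written down as data and checked for consistency. The following is a minimal sketch, not part of 3DV-HTM: it lists only the dependencies stated in the text above, omits the optional steps 8.5 and 17.5, and uses hypothetical names (`FIG_2A_STEPS`, `FIG_2B_STEPS`, `first_violation`, and the label `prev:PrdD0` for the estimated base-view depth map of the previous access unit).

```python
# Each step is (label, item written, items read). Only dependencies stated in
# the text are listed. T = texture, D = coded depth, PrdD = estimated depth.
FIG_2A_STEPS = [  # random access unit, POC = 0
    ("1", "T0", []),
    ("2", "D0", []),
    ("3", "T1", []),            # no inter-view motion/residual prediction
    ("4", "PrdD1", ["T1"]),     # from coded disparity vectors of T1
    ("5", "PrdD0", ["PrdD1"]),  # warped from PrdD1
    ("6", "D1", []),
    ("7", "PrdD2", ["PrdD0"]),  # warped from PrdD0
    ("8", "T2", ["PrdD2"]),
    ("9", "D2", []),
]

FIG_2B_STEPS = [  # POC not equal to 0
    ("10", "T0", []),
    ("11", "PrdD0", ["T0", "prev:PrdD0"]),  # motion vectors of T0 + previous PrdD0
    ("12", "D0", []),
    ("13", "PrdD1", ["PrdD0"]),
    ("14", "T1", ["PrdD1"]),
    ("14.5", "PrdD1", ["T1"]),     # update from disparity vectors of T1
    ("14.7", "PrdD0", ["PrdD1"]),  # update by warping PrdD1
    ("15", "D1", []),
    ("16", "PrdD2", ["PrdD0"]),
    ("17", "T2", ["PrdD2"]),
    ("18", "D2", []),
]


def first_violation(steps, initially_available=()):
    """Return the label of the first step that reads an item not yet produced,
    or None if every step only reads items that are already available."""
    available = set(initially_available)
    for label, written, reads in steps:
        if any(item not in available for item in reads):
            return label
        available.add(written)
    return None
```

Calling `first_violation(FIG_2B_STEPS)` without the previous access unit's estimated depth map reports step 11, which reflects why the random access case of FIG. 2A needs its own bootstrap order.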
The second method of generating estimated depth maps, which uses coded depth maps, is described as follows. Given an access unit of multiple views, regardless of whether the access unit is a random access unit or not, the texture picture of the base view (T0) and the depth map of the base view (D0) are coded first. An estimated depth map of a first dependent view (PrdD1) is then generated by warping the coded depth map of the base view (D0). The texture picture of the first dependent view (T1) is coded with inter-view motion prediction and/or inter-view residual prediction using the estimated depth map of the first dependent view (PrdD1). After the texture picture of the first dependent view (T1) is coded, the depth map of the first dependent view (D1) can be coded. The steps of generating an estimated depth map, coding a texture picture, and coding a depth map for a dependent view are repeated until all dependent views are processed.
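The repetition in the second method can be sketched as a short loop that emits the coding order for a given number of views. This is an illustrative sketch only; the function name `second_method_order` is hypothetical. The text states that PrdD1 is warped from the coded D0 but does not state which coded depth map is warped for later dependent views, so the sketch leaves that source unspecified.

```python
def second_method_order(num_views):
    """List the coding steps of the second method for `num_views` views
    (view 0 is the base view, views 1.. are dependent views)."""
    if num_views < 1:
        raise ValueError("at least the base view is required")
    steps = ["code T0", "code D0"]
    for view in range(1, num_views):
        # The text names D0 as the warping source only for the first dependent view.
        source = "coded D0" if view == 1 else "an already coded depth map"
        steps.append(f"generate PrdD{view} by warping {source}")
        steps.append(f"code T{view} using PrdD{view}")
        steps.append(f"code D{view}")
    return steps
```

Unlike the first method, the same order applies to random access units and to all other access units, because no estimated depth map is carried over from a previous access unit.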
After an estimated depth map is derived based on either the first or the second method, a disparity vector is derived for a current block associated with a depth block of the estimated depth map. According to the current 3DV-HTM (version 0.3), the depth value of the center sample of the associated depth block is converted to a disparity vector. A reference block for inter-view motion prediction is determined according to the converted disparity vector. If the reference block is coded using motion compensated prediction, the associated motion parameters can be used as candidate motion parameters for the current block of the current view. The converted disparity vector can also be directly used as a candidate disparity vector for DCP for inter-view motion prediction. For inter-view residual prediction, a residual block indicated by the converted disparity vector is used for predicting the residues of the current block.
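The text does not give the depth-to-disparity conversion itself. As a rough illustration, the sketch below assumes the commonly used convention in which a quantized depth value maps linearly to inverse depth between 1/z_far and 1/z_near, and disparity equals focal length times baseline divided by depth for horizontally aligned cameras. All parameter names, the floating-point arithmetic, and the choice of `rows // 2, cols // 2` as the center sample are assumptions of this sketch, not the 3DV-HTM implementation.

```python
def depth_to_disparity(depth_value, focal_length, baseline, z_near, z_far, bit_depth=8):
    """Convert a quantized depth value to a horizontal disparity.
    Assumed model: larger depth values represent nearer objects."""
    max_level = (1 << bit_depth) - 1
    inverse_z = (depth_value / max_level) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return focal_length * baseline * inverse_z


def center_sample_disparity(depth_block, **camera):
    """Disparity vector (horizontal, vertical) from the center depth sample.
    The vertical component is 0 under the assumed horizontal camera setup."""
    rows = len(depth_block)
    cols = len(depth_block[0])
    center = depth_block[rows // 2][cols // 2]  # assumed center position
    return (depth_to_disparity(center, **camera), 0.0)
```

For example, with the hypothetical parameters `focal_length=1000`, `baseline=0.05`, `z_near=1` and `z_far=100`, the largest depth value 255 yields a disparity of about 50 samples and the smallest value 0 yields about 0.5 samples, so nearer objects produce larger disparities.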
As mentioned earlier, the disparity vector is converted from the depth value of the center sample of the associated depth block in 3DV-HTM version 0.3. In 3DV-HTM version 3.1, the disparity vector is converted from the maximum depth value within the associated depth block and is used as an inter-view motion vector predictor in the advanced motion vector prediction (AMVP) scheme for Inter mode, as shown in FIG. 3. Picture 310 corresponds to a current picture in the reference view and picture 320 corresponds to a current picture in the current view. Block 322 represents a block to be processed in picture 320. The disparity vector (314) is derived based on the associated depth block (332) of the estimated depth map (330). As shown in FIG. 3, the disparity vector (314) points from a collocated block (322a) to a reference block (312) in reference picture 310. If there is a hole or an undefined sample (due to warping) in the associated depth block, the depth value of the left sample or the right sample is used. The disparity vector derived from the maximum depth value within the associated depth block is called the maximum disparity vector in this disclosure, where the maximum depth value corresponds to the nearest object. The inter-view motion vector predictor, which is indicated by the maximum disparity vector, is inserted at the first position in a candidate list of motion vector predictors in AMVP for Inter mode.
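The derivation of the maximum disparity vector and its placement in the AMVP candidate list can be sketched as follows. This is a simplified illustration under stated assumptions, not the 3DV-HTM code: `HOLE`, `maximum_depth_value`, and `insert_inter_view_predictor` are hypothetical names, the depth-to-disparity conversion is passed in as a caller-supplied function, and holes are skipped rather than filled, since filling a hole from a left or right neighbour inside the block would only repeat a value that is already scanned.

```python
HOLE = None  # hypothetical marker for undefined samples left by warping


def maximum_depth_value(depth_block):
    """Scan every defined sample of the depth block and return the largest."""
    samples = [value for row in depth_block for value in row if value is not HOLE]
    if not samples:
        raise ValueError("depth block contains no defined samples")
    return max(samples)


def insert_inter_view_predictor(candidates, depth_block, to_disparity):
    """Return a new candidate list with the inter-view motion vector predictor,
    derived from the maximum depth value, placed at the first position."""
    horizontal = to_disparity(maximum_depth_value(depth_block))
    predictor = (horizontal, 0)  # vertical component assumed to be 0
    return [predictor] + list(candidates)
```

A call such as `insert_inter_view_predictor([(1, 2), (3, 4)], [[10, HOLE], [200, 40]], lambda d: d // 4)` uses an arbitrary stand-in conversion and places the predictor derived from depth value 200 ahead of the two existing candidates.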
In 3DV-HTM version 3.1, the process of deriving the maximum disparity vector is rather computationally intensive. For example, the associated depth block may correspond to a 16×16 block, in which case determining the maximum disparity vector may require comparing 256 depth values. It is desirable to simplify the disparity vector derivation.
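To make the cost concrete, the sketch below performs an exhaustive maximum search and counts the work done. It is illustrative only; `max_depth_with_cost` is a hypothetical name, and the block is assumed to contain no holes.

```python
def max_depth_with_cost(depth_block):
    """Exhaustive maximum search over a depth block.
    Returns (maximum value, samples visited, pairwise comparisons)."""
    samples = [value for row in depth_block for value in row]
    best = samples[0]
    comparisons = 0
    for value in samples[1:]:
        comparisons += 1
        if value > best:
            best = value
    return best, len(samples), comparisons
```

For a 16×16 block the search visits all 256 depth samples and performs 255 pairwise comparisons, and this is repeated for every block that uses the inter-view motion vector predictor.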