Three-dimensional (3D) television has been a technology trend in recent years, intended to bring viewers a more immersive viewing experience. Various technologies have been developed to enable 3D viewing, and among them, multi-view video is a key technology for 3DTV applications. Traditional video is a two-dimensional (2D) medium that provides viewers only a single view of a scene from the perspective of the camera. Multi-view video, in contrast, is capable of offering arbitrary viewpoints of dynamic scenes and provides viewers a sensation of realism.
The multi-view video is typically created by capturing a scene using multiple cameras simultaneously, where the cameras are properly located so that each camera captures the scene from one viewpoint. Accordingly, the multiple cameras capture multiple video sequences corresponding to multiple views. In order to provide more views, more cameras have been used to generate multi-view video with a large number of video sequences associated with the views. Consequently, the multi-view video requires a large storage space and/or a high transmission bandwidth. Therefore, multi-view video coding techniques have been developed in the field to reduce the required storage space or transmission bandwidth.
A straightforward approach may be to simply apply conventional video coding techniques to each single-view video sequence independently and disregard any correlation among different views. Such a coding system would be very inefficient. In order to improve the efficiency of multi-view video coding, typical multi-view video coding exploits inter-view redundancy. Therefore, most 3D Video Coding (3DVC) systems take into account the correlation of video data associated with multiple views and depth maps. The standard development body, the Joint Video Team of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), extended H.264/MPEG-4 AVC to multi-view video coding (MVC) for stereo and multi-view videos.
The MVC adopts both temporal and spatial predictions to improve compression efficiency. During the development of MVC, some macroblock-level coding tools were proposed, including illumination compensation, adaptive reference filtering, motion skip mode, and view synthesis prediction. These coding tools were proposed to exploit the redundancy between multiple views. Illumination compensation is intended to compensate for the illumination variations between different views. Adaptive reference filtering is intended to reduce the variations due to focus mismatch among the cameras. Motion skip mode allows the motion vectors in the current view to be inferred from the other views. View synthesis prediction is applied to predict a picture of the current view from other views.
In the MVC, however, the depth maps and camera parameters are not coded. In the recent standardization development of new-generation 3D Video Coding (3DVC), the texture data, depth data, and camera parameters are all coded. For example, FIG. 1 illustrates a generic prediction structure for 3D video coding, where a standard-conforming video coder is used for the base-view video. The incoming 3D video data consists of images (110-0, 110-1, 110-2, . . . ) corresponding to multiple views. The images collected for each view form an image sequence for the corresponding view. Usually, the image sequence 110-0 corresponding to a base view (also called an independent view) is coded independently by a video coder 130-0 conforming to a video coding standard such as H.264/AVC or HEVC (High Efficiency Video Coding). The video coders (130-1, 130-2, . . . ) for image sequences associated with the dependent views (i.e., views 1, 2, . . . ) further utilize inter-view prediction in addition to temporal prediction. The inter-view predictions are indicated by the short-dashed lines in FIG. 1.
In order to support interactive applications, depth maps (120-0, 120-1, 120-2, . . . ) associated with a scene at respective views are also included in the video bitstream. In order to reduce the data associated with the depth maps, the depth maps are compressed using depth map coders (140-0, 140-1, 140-2, . . . ) and the compressed depth map data is included in the bitstream as shown in FIG. 1. A multiplexer 150 is used to combine compressed data from the image coders and depth map coders. The depth information can be used for synthesizing virtual views at selected intermediate viewpoints. An image corresponding to a selected view may be coded using inter-view prediction based on an image corresponding to another view. In this case, the image for the selected view is referred to as a dependent view.
In the reference software for HEVC-based 3D video coding version 3.1 (HTM3.1), an inter-view candidate is added as a motion vector (MV) or disparity vector (DV) candidate for Inter (i.e., Temporal), Merge and Skip modes in order to re-use previously coded motion information of adjacent views. In HTM3.1, the basic unit for compression, termed a coding unit (CU), is a 2N×2N square block. Each CU can be recursively split into four smaller CUs until a predefined minimum size is reached. Each CU contains one or more prediction units (PUs). In the 3DV-HTM, the inter-view candidate derivation process involves a pruning process, i.e., removing redundant candidates. The pruning process is only applied to the spatial candidates in Inter, Merge and Skip modes; it is applied to neither the temporal candidates nor the inter-view candidates. The Merge candidate derivation process is shown in FIG. 2.
As shown in FIG. 2, the pruning process involves a small number of parallel motion information comparisons between the spatial candidates. For example, spatial candidates 1-4 (211-214) are pruned to provide a reduced set of spatial candidates to the Merge candidate list (250) as shown in FIG. 2. The temporal and inter-view candidates are exempted from the pruning process in the Merge candidate derivation process. In other words, the inter-view candidate and the temporal candidate are always included in the pruned candidate list. The motion information of a spatial candidate is inserted into the Merge list depending on a specific condition on that spatial candidate. The pruning process always retains A1 (shown in FIG. 3) in the list if a motion vector is available for A1. The conditions for the other spatial candidates to be excluded from the Merge candidate list are as follows (shown in FIG. 3):
- B1: B1 has the same motion information as A1 (indicated by arrow 310).
- B0: B0 has the same motion information as B1 (indicated by arrow 320).
- A0: A0 has the same motion information as A1 (indicated by arrow 330).
- B2: B2 has the same motion information as A1 (indicated by arrow 340) or has the same motion information as B1 (indicated by arrow 350). B2 is checked only if any of A1, B1, B0 or A0 is excluded from the Merge list.
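The exclusion rules above can be sketched in code. This is a minimal illustrative sketch, not HTM3.1 source: motion information is modeled as a simple tuple, `None` denotes an unavailable neighbor, and the function name is an assumption. Block names follow FIG. 3.

```python
def prune_spatial_candidates(cands):
    """cands: dict mapping block name ("A1", "B1", "B0", "A0", "B2")
    to a motion-information tuple, or None if unavailable."""
    merge_list = []
    a1, b1, b0, a0, b2 = (cands.get(k) for k in ("A1", "B1", "B0", "A0", "B2"))

    # A1 is always retained when its motion vector is available.
    if a1 is not None:
        merge_list.append(("A1", a1))
    # B1 is excluded if it has the same motion information as A1.
    if b1 is not None and b1 != a1:
        merge_list.append(("B1", b1))
    # B0 is excluded if it has the same motion information as B1.
    if b0 is not None and b0 != b1:
        merge_list.append(("B0", b0))
    # A0 is excluded if it has the same motion information as A1.
    if a0 is not None and a0 != a1:
        merge_list.append(("A0", a0))
    # B2 is checked only if any of A1, B1, B0 or A0 was excluded
    # (i.e., fewer than four candidates survived), and is itself
    # excluded if it repeats A1 or B1.
    if b2 is not None and len(merge_list) < 4 and b2 != a1 and b2 != b1:
        merge_list.append(("B2", b2))
    return merge_list
```

For instance, if B1 and A0 carry the same motion information as A1, both are pruned; B2 is then checked and survives when it differs from both A1 and B1.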
The locations of the spatial neighboring blocks are shown in FIG. 3, where the spatial neighboring block set includes the location diagonally across from the lower-left corner of the current block (i.e., A0), the location next to the left-bottom side of the current block (i.e., A1), the location diagonally across from the upper-left corner of the current block (i.e., B2), the location diagonally across from the upper-right corner of the current block (i.e., B0), and the location next to the top-right side of the current block (i.e., B1). When a block designation (i.e., B0, B1, B2, A0 or A1) is mentioned above, it may refer, for convenience, to the motion vector or motion vector predictor associated with that block. For example, "A1 is available" implies "the motion vector of A1 is available". In HTM 3.1, the candidate set for the Inter mode includes one inter-view predictor (candidate), two spatial predictors (candidates) and one temporal predictor (candidate):
1. Inter-view predictor (candidate),
2. 1st spatial predictor (candidate),
3. 2nd spatial predictor (candidate), and
4. Temporal predictor (candidate)
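The neighbor locations described above can be expressed as sample coordinates relative to the current block. The following sketch assumes a top-left origin with x increasing rightward and y downward, and that (x, y) is the top-left sample of a current block of width w and height h; these conventions and the function name are illustrative assumptions, not taken from the standard text.

```python
def spatial_neighbor_positions(x, y, w, h):
    """Return the sample locations of the five spatial neighbors of
    the current block (names per FIG. 3)."""
    return {
        "A0": (x - 1, y + h),      # diagonally across from the lower-left corner
        "A1": (x - 1, y + h - 1),  # next to the left-bottom side
        "B0": (x + w, y - 1),      # diagonally across from the upper-right corner
        "B1": (x + w - 1, y - 1),  # next to the top-right side
        "B2": (x - 1, y - 1),      # diagonally across from the upper-left corner
    }
```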
The two spatial candidates in HTM 3.1 correspond to block A1 next to the left-bottom side of the current block and block B1 next to the top-right side of the current block. The inter-view predictor (candidate) is the motion vector of the corresponding block in the inter-view picture or the disparity vector derived from the depth map. A temporal predictor (candidate) is derived from a block (TBR or TCTR) located in a collocated picture. In HTM 3.1, the pruning process is applied only when the number of available inter-view and spatial predictors equals 2, in which case these two predictors are compared and the redundant one is removed. The temporal predictor is exempted from the pruning process. After the pruning process, only the first three available predictors are included in the candidate set. If the number of available predictors is smaller than 3, a zero predictor (240) is inserted as shown in FIG. 2.
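The Inter-mode candidate construction described above can be sketched as follows. This is a simplified illustration under stated assumptions: predictors are plain tuples, `None` denotes an unavailable predictor, and the function and constant names are invented for this sketch.

```python
ZERO_PREDICTOR = (0, 0)  # zero predictor inserted when fewer than 3 are available

def build_inter_candidate_set(inter_view, spatial1, spatial2, temporal):
    """Assemble the Inter-mode candidate set per the order:
    inter-view, 1st spatial, 2nd spatial, temporal."""
    head = [p for p in (inter_view, spatial1, spatial2) if p is not None]
    # Pruning applies only when exactly two inter-view/spatial predictors
    # are available: compare them and remove the redundant one.
    if len(head) == 2 and head[0] == head[1]:
        head = head[:1]
    if temporal is not None:
        head.append(temporal)  # the temporal predictor is never pruned
    head = head[:3]            # keep only the first three available predictors
    while len(head) < 3:
        head.append(ZERO_PREDICTOR)
    return head
```

For example, when the inter-view predictor duplicates the only available spatial predictor, the duplicate is removed and a zero predictor fills the third slot.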
If the total number of candidates in the Merge candidate list is less than the list size (e.g., 5), one or more combined motion vectors are added as additional candidates. A combined motion vector is generated from the pruned spatial candidates of pruning process 220 by using combined MVP 230. For example, a bi-predictive Merge candidate can be formed by combining an MV candidate pointing to a reference picture in List 0 with another MV candidate pointing to a reference picture in List 1.
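The formation of combined bi-predictive candidates can be sketched as below. This is an illustrative sketch only: each candidate is modeled as a dict with optional List 0 ("l0") and List 1 ("l1") motion fields, and the field and function names are assumptions rather than HTM3.1 identifiers.

```python
def combine_bi_predictive(cands, list_size=5):
    """Pad a Merge candidate list with combined bi-predictive candidates
    until it reaches list_size (or no new combinations remain)."""
    out = list(cands)
    for c0 in cands:
        for c1 in cands:
            if len(out) >= list_size:
                return out
            # Pair a List 0 motion vector from one candidate with a
            # List 1 motion vector from another candidate.
            if c0.get("l0") is not None and c1.get("l1") is not None:
                combined = {"l0": c0["l0"], "l1": c1["l1"]}
                if combined not in out:
                    out.append(combined)
    return out
```

For instance, a List 0-only candidate and a List 1-only candidate yield one new bi-predictive candidate carrying both motion vectors.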
It is desirable to develop a pruning process for the inter-view candidate, spatial candidates and temporal candidate that may lead to improved performance, such as a better RD-rate, reduced computation time, or reduced memory storage.
As illustrated in the above discussion, the candidate set derivation process involves various spatial and temporal neighboring blocks. It is desirable to reduce the complexity of the candidate set derivation without noticeable impact on system performance.