The present invention relates to an apparatus and a method for coding a bit stream representing a three-dimensional (3D) video.
3D video is a new technology that requires the transmission of depth data alongside the conventional 2D video data to allow for more flexibility at the receiver side. The additional depth information makes it possible to synthesize arbitrary viewpoints, which in turn enables adaptation of the perceived depth impression and the driving of multi-view auto-stereoscopic displays. Adding depth information to every transmitted view significantly increases the amount of data to be coded. In contrast to conventional natural video, depth maps are characterized by piecewise smooth regions bounded by sharp edges along depth discontinuities. Using conventional video coding approaches to compress depth maps results in strong ringing artifacts along these depth discontinuities, which lead to visually disturbing geometric distortions in the view synthesis process. Preserving the described signal characteristics of depth maps is therefore a crucial requirement for new depth coding algorithms.
Recent developments in the field of 3D display technologies, such as auto-stereoscopic displays or stereo displays that allow the depth impression to be adapted to the viewer's personal preference, require the synthesis of additional arbitrary views based on the limited number of available decoded views. To allow for this extent of flexibility, depth information needs to be available at the receiver side and consequently needs to be coded in addition to the conventional 2D video data. These additional depth maps show different signal characteristics compared to natural video data. Moreover, depth maps are never shown to the user themselves; they are used only to synthesize new views of the same scene, so distortions in depth maps have an indirect impact on the visual quality of the displayed video. Compressing depth maps with algorithms optimized for natural 2D videos results in strong ringing artifacts along depth discontinuities, which then produce geometric distortions in the synthesized views.
Previous work on compression of depth data regarded depth data as gray-scale video and compressed it with conventional transform-based video coding algorithms as found in H.264/AVC, e.g. P. Merkle, A. Smolic, K. Muller, and T. Wiegand, “Multi-view video plus depth representation and coding,” 14th IEEE International Conference on Image Processing (ICIP), IEEE, 2007, pp. 1201-1204. It was shown that these conventional coding tools yield relatively high compression efficiency in terms of PSNR, but at the same time introduce ringing artifacts along sharp edges in the original depth maps. These artifacts result in geometric distortions in the view synthesis stage. More recent depth compression algorithms approximate the depth map's signal characteristics by partitioning the depth map into triangular meshes, as described in M. Sarkis, W. Zia, and K. Diepold, “Fast depth map compression and meshing with compressed tritree,” Computer Vision – ACCV 2009, pp. 44-55, 2010, or into platelets, as described in Y. Morvan, P. de With, and D. Farin, “Platelet-based coding of depth maps for the transmission of multiview images,” in Proceedings of SPIE, Stereoscopic Displays and Applications, vol. 6055, 2006, pp. 93-100, and by modeling each segment with an appropriate 2D function. These pure model-based approaches can also be combined with conventional transform-based tools by introducing an additional coding mode, such as the sparse-dyadic mode described in S. Liu, P. Lai, D. Tian, C. Gomila, and C. Chen, “Sparse dyadic mode for depth map compression,” in 17th IEEE International Conference on Image Processing (ICIP), IEEE, 2010, pp. 3421-3424. Here, a sparse-dyadic-coded block is partitioned into two segments, which are described by two constant depth values.
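The idea of describing a block by two segments with one constant depth value each can be illustrated by the following minimal sketch. It is not the sparse-dyadic mode of the cited reference: the function name `two_segment_approx` is hypothetical, and the partition rule (thresholding at the block mean) is an assumption chosen only for brevity.

```python
def two_segment_approx(block):
    """Approximate a depth block by two segments of constant depth.

    Assumption: the partition is obtained by thresholding at the block
    mean. This rule is illustrative and is not taken from the cited work.
    """
    pixels = [v for row in block for v in row]
    threshold = sum(pixels) / len(pixels)

    # Segment 0 holds samples at or below the mean, segment 1 the rest.
    low = [v for v in pixels if v <= threshold]
    high = [v for v in pixels if v > threshold]

    # One constant depth value per segment (the rounded segment mean).
    low_value = round(sum(low) / len(low))
    high_value = round(sum(high) / len(high)) if high else low_value

    # The binary mask plus the two constants fully describe the block.
    mask = [[1 if v > threshold else 0 for v in row] for row in block]
    recon = [[high_value if m else low_value for m in row] for row in mask]
    return mask, (low_value, high_value), recon


# A 4x4 block containing a sharp edge between depth 40 and depth 200.
block = [
    [40, 40, 200, 200],
    [40, 40, 40, 200],
    [40, 40, 40, 40],
    [40, 40, 40, 40],
]
mask, values, recon = two_segment_approx(block)
```

For a block that really consists of two flat regions, the mask and the two constants reproduce the edge exactly. A block that contains a depth gradient within a segment is only approximated.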
As the preservation of depth discontinuities is most important when compressing depth maps, another approach is to losslessly compress the location of these discontinuities and to approximate the piecewise smooth regions, as previously proposed in F. Jager, “Contour-based segmentation and coding for depth map compression,” in Visual Communications and Image Processing (VCIP), 2011 IEEE, pp. 1-4. The disadvantage of this approach is its inability to reach low bitrates due to the lossless encoding of depth contours.
In summary, when depth maps are coded with conventional algorithms optimized for textured video data, ringing artifacts are introduced along depth discontinuities due to transform and quantization. Typical depth map characteristics, such as piecewise smooth regions bounded by strong edges, need to be coded differently to allow for high quality view synthesis at the receiver. Conventional coding algorithms use advanced prediction methods such as directional intra prediction and planar modes. These are able to approximate edges and gradients of depth maps to a certain extent. However, the directional prediction modes lack the ability to approximate edges that are not continued from the top-right of the current coding unit. Moreover, the already known planar mode is unable to represent coding units that are only partially characterized by a depth gradient because they contain two different depth segments.
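The effect of transform and quantization on a depth discontinuity can be reproduced with a short one-dimensional sketch. It is a simplified illustration and not the transform stage of any particular codec: an orthonormal DCT-II, a uniform quantizer, and the names `dct`, `idct` and `code_and_decode` are all assumptions made for this example.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1D signal."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        coeffs.append(scale * sum(
            x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, x in enumerate(signal)))
    return coeffs


def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        signal.append(sum(
            math.sqrt((1 if k == 0 else 2) / n) * c
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k, c in enumerate(coeffs)))
    return signal


def code_and_decode(signal, step):
    """Transform, quantize uniformly with the given step, reconstruct."""
    levels = [round(c / step) for c in dct(signal)]
    return idct([level * step for level in levels])


# One row of a depth map: two flat regions separated by a sharp edge.
edge = [50.0] * 4 + [200.0] * 4

# Coarse quantization removes the small high-frequency coefficients.
decoded = code_and_decode(edge, step=100.0)
print([round(v, 1) for v in decoded])
```

With the coarse step, the samples on each side of the edge no longer share a single depth value but oscillate around it, which is the ringing described above. A very small step reproduces the edge almost exactly, at the cost of a higher bitrate.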