The present invention relates to intermediate view synthesis and multi-view data signal extraction/construction.
3D video (3DV) provides the viewer with a depth perception of the observed scenery. This is also referred to as stereo, a term that is, however, too closely tied to the classical technology of using 2 videos. Recently, 3DV has gained rapidly increasing attention, spanning systems and applications from mobile phones to 3D cinema [25]. The technology is maturing and covers the whole processing chain from camera systems to 3D displays. Awareness and interest are growing on the consumer side, where viewers wish to experience the extended visual sensation, as well as on the business side, including content providers, equipment producers and distributors.
Creating a 3D depth impression entails that a viewer looking at a 3D display sees a different view with each eye. These views should correspond to images taken from different viewpoints separated by the human eye distance. In other words, providing the user with a natural depth impression of the observed scenery may involve using specific 3D viewing technology, which ensures that each eye sees only one image of a stereo pair presented simultaneously [17]. In the past, users had to wear specific glasses (anaglyph, polarization, shutter). Together with limited visual quality, this is regarded as the main obstacle to the wide success of 3DV systems in home user environments, while other types of applications such as 3D cinema are expected to grow rapidly over the next years due to their high visual quality. To be more precise, a 3D display emits two or more views at the same time and ensures that a viewer sees such a stereo pair from a certain viewpoint [17]. Specific glasses based on anaglyph, polarization, or shutter technology have been used to achieve this in the past and are still appropriate today for a wide range of applications. For instance, 3D cinema applications based on glasses (such as IMAX® theatres) are well established. In a cinema theatre the user sits in a chair without much possibility to move and usually pays almost full attention to the presented movie. Wearing glasses is widely accepted in such a scenario, and motion parallax is not a big issue. 3D cinema with display technology based on glasses is therefore expected to remain the standard over the next years. This market is expected to grow further, and more and more movies are produced in 2D for classical cinema as well as in a 3D version for 3D-enabled theatres. It is expected that this will broaden user awareness and, with it, acceptance, and will create demand for 3DV applications in the home.
In a living room environment, however, the user expectations are very different. The necessity to wear glasses is considered a main obstacle to the success of 3D video in home user environments. This obstacle is now overcome by multiview autostereoscopic displays [17]. Several images are emitted at the same time, but the technology ensures that users see only a stereo pair from a specific viewpoint. 3D displays are on the market today that are capable of showing 9 or more different images at the same time, of which only a stereo pair is visible from a specific viewpoint. This enables a multi-user 3D sensation without glasses, for instance in a living room. A group of people may enjoy a 3D movie in the familiar sofa-TV environment without glasses but with all the social interactions that we are used to. When a viewer moves around, a natural motion parallax impression can be supported if consecutive views are arranged properly as stereo pairs.
However, transmitting 9 or more views of the same 3D scenery from slightly different viewpoints to the home user is extremely inefficient. The transmission costs would not justify the additional value. Fortunately, alternative 3D video formats allow for reducing the raw data rate significantly. When using the multiview video plus depth (MVD) format, only a subset of M of the N display views is transmitted. For those M video streams, additional per-pixel depth data is transmitted as supplementary information. At the receiver, depth image based rendering (DIBR) is applied to interpolate all N display views from the transmitted MVD data [15].
Thus, a multiview video plus depth (MVD) format allows reducing the raw data rate for 3DV systems drastically. Only a subset of M of the N display views is transmitted. Additionally, depth data are transmitted for these M views. The non-transmitted views can be generated at the receiver by intermediate view interpolation from the transmitted data [17].
3DV systems are capable of supporting head motion parallax viewing by displaying multiple views at the same time. For instance, high-resolution LCD screens with slanted lenticular lens technology and 9 simultaneous views are, among many others, commercially available from Philips [28]. The principle of head motion parallax support with a 3D display is illustrated in FIG. 20. A user at position 1 sees only views 1 and 2, with the right and left eye respectively. Another user at position 3 sees views 6 and 7; hence multi-user 3D viewing is supported.
Assume a user moves from position 1 to position 2. Now views 2 and 3 are visible with the right and left eye respectively. If V1 and V2 form a stereo pair with a proper human eye distance baseline, and likewise V2 and V3 and so on, a user moving in front of such a 3D display system will perceive a 3D impression with dis-occlusions and occlusions of objects in the scenery depending on their depth. This motion parallax impression will not be seamless, however, and the number of different positions is restricted to N−1.
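The restriction to N−1 positions follows from pairing each view with its right-hand neighbor. The following minimal sketch enumerates these pairs; the function name and the consecutive zone numbering are illustrative assumptions and do not correspond to the position labels of FIG. 20.

```python
def stereo_pairs(num_views):
    """List the (right_eye_view, left_eye_view) pairs of adjacent views.

    Assumption: every viewing zone shows two consecutive views, as in the
    example where views 1 and 2 are followed by views 2 and 3.
    """
    if num_views < 2:
        raise ValueError("at least two views are needed for a stereo pair")
    return [(k, k + 1) for k in range(1, num_views)]


if __name__ == "__main__":
    pairs = stereo_pairs(9)
    print(len(pairs), "distinct viewing zones for a 9-view display")
    for right_eye, left_eye in pairs:
        print(f"V{right_eye} (right eye), V{left_eye} (left eye)")
```

Because the pairs change in discrete steps from one zone to the next, the parallax impression is stepwise rather than continuous.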
To be more precise, multiview autostereoscopic displays process N synchronized video signals showing the same 3D scene from slightly different viewpoints. Compared to normal 2D video this is a tremendous increase of raw data rate. It has been shown that specific multiview video coding (MVC) including inter-view prediction of video signals taken from neighboring viewpoints can reduce the overall bitrate by 20% [20], compared to independent coding of all video signals (simulcast). That is, the total is 20% lower than the single video bitrate multiplied by N. For a 9-view display, MVC therefore still necessitates 7.2 times the corresponding single video bitrate. Such an increase is clearly prohibitive for the success of 3DV applications. Further, it has also been shown in [20] that the total bitrate of MVC increases linearly with N. Future displays with more views would therefore necessitate even higher total bitrates. Finally, fixing the number of views in the transmission format, as done with MVC, does not provide sufficient flexibility to support all types of current and future 3D displays.
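The arithmetic behind the factor of 7.2 can be checked with a short sketch. The function names and the normalization of the single video bitrate to 1.0 are illustrative assumptions; only the 20% saving and the 9-view example are taken from the text.

```python
def simulcast_bitrate(num_views, single_rate=1.0):
    """Total rate when all N views are coded independently."""
    return num_views * single_rate


def mvc_bitrate(num_views, single_rate=1.0, saving=0.20):
    """Total rate with MVC, assuming a fixed fractional saving over simulcast.

    A constant saving is an assumption based on the 20% figure in the
    text; it makes the total grow linearly with the number of views.
    """
    return (1.0 - saving) * simulcast_bitrate(num_views, single_rate)


if __name__ == "__main__":
    for n in (2, 9, 16):
        print(f"N={n}: simulcast {simulcast_bitrate(n):.1f}x, "
              f"MVC {mvc_bitrate(n):.1f}x the single video bitrate")
```

For N=9 this gives 0.8 × 9 = 7.2 times the single video bitrate, and doubling N doubles the total, which is the linear growth noted above.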
For 2-view displays (or displays with a small number of views) a different approach has been demonstrated to provide both high compression efficiency and extended functionality. Instead of transmitting a stereo video pair, one video and an associated per-pixel depth map are used. The depth map assigns a scene depth value to each of the pixels of the video signal, and with that provides a 3D scene description. The depth map can be treated as a monochromatic video signal and coded using available video codecs. In this way, video plus depth (V+D) is defined as a 3DV data format [7]. A corresponding standard known as MPEG-C Part 3 has recently been released by MPEG [11], [12]. From decoded V+D a receiver can generate a second video as a stereo pair by DIBR. Experiments have shown that depth data can be compressed very efficiently in most cases. Only around 10-20% of the bitrate used for the corresponding color video is needed to compress depth at sufficient quality. This means that the final stereo pair rendered using this decoded depth is of the same visual quality as if the 2 video signals were transmitted instead. However, it is known that DIBR introduces artifacts. Generating virtual views necessitates extrapolation of image content to some extent. From a virtual viewpoint, parts of the 3D scene may become visible that are occluded behind foreground objects in the available original video. If the virtual viewpoint is close to the original camera position (e.g. corresponding to V1 and V2 in FIG. 20), masking of uncovered image regions works well with limited artifacts. Therefore V+D is an excellent concept for 3D displays with a small number of views. However, the extrapolation artifacts increase with increasing distance of the virtual viewpoint. The concept of V+D is therefore not suitable for 3DV systems with a large number of views and motion parallax support over a wide range.
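The uncovering effect can be illustrated with a minimal sketch of forward warping for a single image row. It is not the rendering method of any cited system: it assumes rectified, parallel cameras, so that a pixel at depth Z shifts horizontally by a disparity of focal_px × baseline / Z pixels, and all names, the sign convention and the example values are hypothetical.

```python
def warp_scanline(colors, depths, focal_px, baseline, hole=None):
    """Forward-warp one image row to a horizontally shifted virtual camera.

    Assumptions (not from the source): rectified parallel cameras, the
    virtual camera lies to the right of the original one, and disparities
    are rounded to whole pixels. Smaller depth means closer to the camera.
    """
    width = len(colors)
    out = [hole] * width
    out_depth = [float("inf")] * width
    for x, (color, z) in enumerate(zip(colors, depths)):
        shift = int(round(focal_px * baseline / z))
        xt = x - shift  # near pixels move further than far pixels
        # Depth test: if two pixels land on the same target, keep the closer.
        if 0 <= xt < width and z < out_depth[xt]:
            out[xt] = color
            out_depth[xt] = z
    return out


if __name__ == "__main__":
    colors = list("abcdefgh")              # 'e' and 'f' are foreground
    depths = [10, 10, 10, 10, 2, 2, 10, 10]
    print(warp_scanline(colors, depths, focal_px=4.0, baseline=1.0))
```

In this example the foreground pixels move two positions to the left and cover background pixels, while the two positions they vacate receive no data. These holes are the uncovered regions that V+D has to fill by extrapolation, and they widen as the baseline to the virtual viewpoint grows.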
In consequence, neither MVC nor V+D is useful for advanced 3D display systems with a large number of views. A solution is to extend and combine both into MVD, as illustrated in FIG. 20. 9 views V1-V9 are displayed. Direct encoding with MVC would be highly inefficient. Transmitting only one video with a depth map, e.g. V5+D5, would result in unacceptable quality of the outer views. Using the MVD format, a subset of M=3 views with depth maps is transmitted to the receiver. Intermediate views V2-V4 and V6-V8 are generated by DIBR. They are close enough to the available original views to minimize extrapolation errors. Further, they can be interpolated from 2 directions (left and right neighbor view), so the problem of uncovering can be largely minimized. For instance, regions to be generated for the virtual view that are occluded in the left view are very likely visible in the right view. However, there is still the possibility that parts are occluded in both original views and ultimately have to be extrapolated.
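The assignment of intermediate views to their transmitted neighbors can be sketched as follows. The choice of V1, V5 and V9 as transmitted views is inferred from the statement that V2-V4 and V6-V8 are generated; the function name and the returned structure are illustrative assumptions.

```python
def interpolation_plan(num_views, transmitted):
    """Map each non-transmitted view to its (left, right) transmitted neighbors.

    A neighbor of None means no transmitted view exists on that side, so
    that view would have to be extrapolated rather than interpolated.
    """
    sent = sorted(set(transmitted))
    plan = {}
    for view in range(1, num_views + 1):
        if view in sent:
            continue
        left = max((t for t in sent if t < view), default=None)
        right = min((t for t in sent if t > view), default=None)
        plan[view] = (left, right)
    return plan


if __name__ == "__main__":
    for view, (left, right) in interpolation_plan(9, [1, 5, 9]).items():
        print(f"V{view}: interpolate between V{left} and V{right}")
    # With V5 alone, every other view lacks a neighbor on one side.
    print(interpolation_plan(9, [5]))
```

With three transmitted views every generated view has an original on both sides, whereas transmitting only V5 leaves each generated view with a single reference, which corresponds to the V5+D5 case described above.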
This advanced 3DV system concept includes a number of sophisticated processing steps that are partially unresolved and still necessitate research. Acquisition systems still have to be developed and optimized, which includes multi-camera systems, possibly depth capture devices, as well as other types of sensors and sources of information that may play only a supporting role, such as structured light [8], [22]. Sender-side signal processing includes many advanced algorithms such as camera calibration, color correction, rectification, segmentation as well as depth estimation or generation. The latter is crucial for DIBR since any error of depth estimation results in reduced quality of the rendered output views. It is a topic widely studied in the computer vision literature, and may include semi-automatic processing as well [16], [18], [26], [29]. The optimum parameterization of the generic 3DV format still needs to be investigated, including the number of transmitted views with depth and their setting/spacing. The most efficient compression of the MVD data is still to be found, especially the optimum treatment of depth. As usual, transmission issues should be considered for different channels. Finally, after decoding, the N output views are rendered from the decoded MVD data. Here, high quality with few artifacts is crucial for the success of the whole concept.
Finally, high quality view interpolation with a minimum of noticeable artifacts is a crucial prerequisite for the success of 3DV systems. Interpolation artifacts occur especially along object boundaries with depth discontinuities. It would, therefore, be favorable to have an interpolation concept that allows for avoiding artifacts along such edges. Further, it would be favorable if the compression ratio for storing the data for 3DV could be reduced without significantly reducing, or even while maintaining, the obtainable 3DV result.