3D Video, or 3D TV, has gained momentum in recent years. A number of standardization bodies (ITU, EBU, SMPTE, MPEG, and DVB) and other international groups (e.g. DTG, SCTE) are working toward standards for 3D TV or Video. Quite a few broadcasters have launched, or are planning to launch, public Stereoscopic 3D TV broadcasting.
Several 3D video coding schemes have been proposed, as described by A. Smolic et al. in “An Overview of Available and Emerging 3D Video Formats and Depth Enhanced Stereo as Efficient Generic Solution”, Proceedings of 27th Picture Coding Symposium (PCS 2009), May 6-8, 2009, Chicago, Ill., USA. Among them are Video plus Depth (V+D), Multiview Video (MVV), Multiview Video plus Depth (MVD), Layered Depth Video (LDV), and Depth Enhanced Stereo (DES).
In multiview video (e.g. for autostereoscopic displays), a number of views are required at the receiver side. The trend seems to be that the more advanced the autostereoscopic technology becomes, the more views are used. Recently an autostereoscopic screen with 28 views has been released. Transmitting all of these views over a channel or network demands too much bandwidth (too high bit rates) to be practical. Therefore, it is desirable, see Video and Requirements Group, “Vision on 3D Video,” ISO/IEC JTC1/SC29/WG11 N10357, Lausanne, CH, February 2008 (http://www.chiariglione.org/mpeg/visions/3dv/index.htm), to send only a small number of views (e.g. 2 or 3) and to synthesize the other views at the receiver side. Similarly, in free viewpoint 3D TV or video, the number of views that need to be available at the receiver side is very large, since it depends on the position or viewing angle of the viewer relative to the display. It is therefore impossible to transmit all the possible views from the sender. The only sensible way is to synthesize many virtual views from a limited number of views that are sent over the channel/network.
An example of a view synthesis system 10, which is one of the key technologies involved in multiview or free viewpoint 3D video, is illustrated in FIG. 1. The system comprises a processor or CPU 12, a memory 14 and input/output interfacing circuitry 16. As input to the view synthesis system there are usually two or three reference images (I_i), the corresponding depth maps (D_i) and the camera parameters. From this data one may synthesize images from new viewpoints (I_new) using standard techniques to transfer pixels from one image to another.
A new view can in fact be synthesized from only one image and its corresponding depth map, and using less image information may lower the bitrate of the transmission. However, the synthesized image is then likely to contain areas for which no information is available. This typically happens where the background is occluded by a foreground object in the reference image but is visible in the synthesized new image, or along a side of the synthesized image 22, cf. the grey shaded pixels of the foreground object 20 in the left hand view of FIG. 2. To handle this problem, image and depth information is used from a second view in which the occluded area is visible. This is the reason why more than one image is usually used in the synthesis.
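How such holes arise can be illustrated with a minimal sketch of forward warping for a single image row. The sketch assumes rectified cameras related by a pure horizontal translation, so that a pixel at depth z shifts by a disparity of au*sH/z pixels, where au is the focal length and sH the horizontal translation. The function name, the shift direction and the sample values are hypothetical and chosen only for illustration; this is not a description of any particular synthesis system.

```python
def warp_row(texture_row, depth_row, au, sH):
    """Forward-warp one row to a virtual view; None marks a hole.

    Assumption: rectified cameras, pure horizontal translation,
    disparity = au * sH / z, pixels shift to the left.
    """
    width = len(texture_row)
    new_row = [None] * width
    new_depth = [float("inf")] * width  # z-buffer: nearest pixel wins
    for x in range(width):
        z = depth_row[x]
        disparity = int(round(au * sH / z))
        x_new = x - disparity
        if 0 <= x_new < width and z < new_depth[x_new]:
            new_row[x_new] = texture_row[x]
            new_depth[x_new] = z
    return new_row


# Foreground pixels F, G, H (depth 2) in front of background a..e (depth 4).
texture = ["a", "b", "F", "G", "H", "c", "d", "e"]
depth = [4.0, 4.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0]
print(warp_row(texture, depth, au=100.0, sH=0.04))
# ['F', 'G', 'H', None, 'c', 'd', 'e', None]
```

In this example the hole at index 3 is background that the foreground object hid in the reference row, and the hole at index 7 lies along the side of the synthesized row.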
In one approach, one full image (texture and depth) together with only parts of the other images and depths is needed in order to cope with the holes 26 in the synthesized image due to occlusions, cf. right hand side of FIG. 2. These are called sparse images 24 and depths, since they contain valid information only in certain areas of the image. When a new image is synthesized, all pixels in the full reference image are transferred to positions of the new image, and the areas containing the valid pixels from the sparse images are transferred to other positions of the new image. Together, the transferred pixels create a complete new image without holes.
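The combination of a full view and a sparse view can be sketched as follows for one row of the new image. The sketch assumes that both views have already been transferred to the new viewpoint and that positions without valid information are marked None; the function name and the sample data are hypothetical.

```python
def merge_rows(full_warped, sparse_warped):
    """Fill holes (None) in the warped full view with warped sparse pixels."""
    merged = []
    for full_pixel, sparse_pixel in zip(full_warped, sparse_warped):
        # Prefer the full reference view; fall back to the sparse view.
        merged.append(full_pixel if full_pixel is not None else sparse_pixel)
    return merged


full_warped = ["F", "G", "H", None, "c"]       # hole at index 3
sparse_warped = [None, None, None, "x", None]  # valid only where needed
print(merge_rows(full_warped, sparse_warped))
# ['F', 'G', 'H', 'x', 'c']
```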
The sparse representation approach is illustrated in the view synthesis system 10 of FIG. 3. The scheme illustrates a scenario with two input reference views, but it could easily be extended to N (e.g. 3) input reference views. The left depth map is encoded using a Multi View (MV) encoder 30 such as Multi View Codec (MVC), an extension to H.264. Future MV codecs based on, for instance, the High Efficiency Video Codec (HEVC) are also an option, as is simulcast coding. The reconstructed left depth map is fed into the disocclusion detection system, which outputs a disocclusion map with pixels marked as either disoccluded or not. The full and sparse views for texture and the sparse view for depth are then encoded, with the disocclusion map indicating which blocks need to be encoded and which blocks can be skipped.
At the decoder 32 side, texture and depth are decoded using standard MV decoders. The reconstructed left depth map is fed into the disocclusion detection system, which outputs a disocclusion map identical to the one used in the encoder 30. Since the disocclusion map is derived identically on the encoder 30 side and the decoder 32 side, without being sent explicitly, this approach is denoted “implicit disocclusion map signaling”.
An alternative to having the disocclusion detection in the decoder (i.e., implicit disocclusion map signaling) would be to signal the disocclusion map explicitly to the view synthesis system. An advantage of this would be that the disocclusion detection could be run using the uncompressed reference depth map as well as the uncompressed reference video. An advantage of the solution described in FIG. 3 is that no extra bits are needed for the disocclusion map.
Finally, the view synthesis system 10 takes the decoded texture and depth and the disocclusion map in order to create the required output views. The disocclusion map is here needed for the view synthesis system to know what parts of the sparse texture can be used for creating each output view.
In order to use sparse representation, the encoder 30 and decoder 32 need to know which blocks to encode and which blocks can be skipped. Blocks which are fully or partially disoccluded need to be encoded and sent. Disoccluded areas must thus be detected.
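The block-level decision can be sketched as follows: a block is marked for encoding as soon as at least one of its pixels is disoccluded, which covers both fully and partially disoccluded blocks. The function name, the square block shape and the sample map are hypothetical and do not reflect the block structure of any particular codec.

```python
def blocks_to_encode(disocclusion_map, block_size):
    """Return one boolean per block: True if any pixel in it is disoccluded."""
    height = len(disocclusion_map)
    width = len(disocclusion_map[0])
    rows = (height + block_size - 1) // block_size  # ceiling division
    cols = (width + block_size - 1) // block_size
    encode = [[False] * cols for _ in range(rows)]
    for y in range(height):
        for x in range(width):
            if disocclusion_map[y][x]:
                encode[y // block_size][x // block_size] = True
    return encode


# 4x4 pixel map with a single disoccluded pixel, split into 2x2 blocks.
pixel_map = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(blocks_to_encode(pixel_map, 2))
# [[False, True], [False, False]]
```

Only the top-right block needs to be encoded in this example; the other three can be skipped.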
As described in European patent application no. 10190368.0, disocclusion detection may be performed by utilizing only the depth maps of the corresponding views. Instead of searching for disoccluded areas in 2D images, this approach derives the disoccluded areas through 3D geometric calculations. The advantage of this solution is that the view synthesis can be more easily performed on the decoding side without the disocclusion map having to be transmitted explicitly. It is also less sensitive to texture noise than 2D image based approaches.
The key equation for disocclusion detection in European patent application no. 10190368.0 is given as follows.

1/z_0 − 1/z_1 > T/(a_u·s_H)

Here, z_0 and z_1 denote depth values associated with two neighboring pixels, a_u is the camera focal length, and s_H is the relative horizontal translation between the reference camera and the virtual camera. Both a_u and s_H are determined through the camera parameters. T is a threshold that is indicative of a lower boundary for a number of neighboring disoccluded pixels that are detected by the algorithm, i.e. the above condition is true if a hole of more than T pixels width is detected.
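The condition can be evaluated pairwise along an image row, as in the following minimal sketch. The parameters au and sH stand for the focal length and the horizontal translation, and T for the threshold. It is an assumption of the sketch, not stated in the text above, that z0 belongs to the left pixel and z1 to the right pixel of each pair and that sH is positive; the function name and the sample values are hypothetical.

```python
def detect_disocclusions(depth_row, au, sH, T):
    """Return each index i where pixels (i, i+1) satisfy
    1/z0 - 1/z1 > T / (au * sH).

    Assumption: z0 is the left pixel, z1 the right pixel, sH > 0.
    """
    threshold = T / (au * sH)
    flagged = []
    for i in range(len(depth_row) - 1):
        z0, z1 = depth_row[i], depth_row[i + 1]
        if 1.0 / z0 - 1.0 / z1 > threshold:
            flagged.append(i)
    return flagged


# Depth jumps from 2 to 10 between indices 2 and 3.
row = [2.0, 2.0, 2.0, 10.0, 10.0]
print(detect_disocclusions(row, au=1000.0, sH=0.01, T=2))  # [2]
print(detect_disocclusions(row, au=1000.0, sH=0.01, T=5))  # []
```

With au*sH = 10, the depth jump corresponds to a hole of 10*(1/2 − 1/10) = 4 pixels, so it is detected for T = 2 but not for T = 5.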