View synthesis (VS) is the process of synthesizing a video view at a virtual camera position from a video view (texture data) and an associated depth map at a reference camera position. VS can be used as part of a 3D video compression scheme, in which case it is denoted view synthesis prediction (VSP). In some 3D video compression schemes, such as those considered in the work of the Moving Picture Experts Group (MPEG), depth maps are coded at reduced resolution, i.e. at a lower resolution than the associated video data. Work is currently ongoing within the Joint Collaborative Team on 3D Video Coding Extension Development of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11.
However, conventional VS algorithms, and thus VSP algorithms, require depth information at the same resolution as the video data. The problem is how to perform VSP using low-resolution depth map data. VSP may be part of 3D video encoders and decoders but may also be performed externally. For instance, it may be applied for image rendering after 3D video decoding. The operation could be performed in a device such as a mobile phone, a tablet, a laptop, a PC, a set-top box, or a television set.
In more detail, coding and compression of 3D video involve reducing the amount of data in image sequences, e.g. stereoscopic (two cameras) or multiview (several cameras) image sequences, which, in addition to one or multiple video views containing texture data, contain one or multiple associated depth maps. That is, for one or several of the video views, an associated depth map is available, describing the depth of the scene from the same camera position as the associated video view. With the help of the depth maps, the contained video views can be used to generate additional video views, e.g. for positions in between or outside the positions of the contained video views. A process of generating additional views by view synthesis is illustrated in FIG. 1.
FIG. 1 depicts an example with two original camera positions, a first camera position and a second camera position. For these positions, both video views 102, 106 and associated depth maps 104, 108 exist. Using a single video view and a depth map, additional views at virtual camera positions can be generated using view synthesis as illustrated in FIG. 1 by a video view 112 at a virtual camera position 0. Alternatively, two or more pairs of video views and depth maps can be used to generate an additional view 114 at a virtual camera position between the first and the second camera positions as illustrated in FIG. 1 by virtual camera position 0.5.
Typically, in the view synthesis process, an associated depth map pixel exists for each pixel in a video view. The depth map pixel can be transformed into a disparity value using known techniques. The disparity value can be seen as a value that maps pixels between original and synthesized positions, e.g. how many pixels an image point in the original image “moves” in the horizontal direction when the synthesized image is created. The disparity value can be used to determine a target position of the associated video pixel with respect to the virtual camera position. Thus, the synthesized view may be formed by reusing the associated video pixels at the respective target positions.
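The depth-to-disparity mapping and the pixel reuse described above can be sketched for a single image row as follows. This is a minimal illustration under stated assumptions, not the method of any particular codec: it assumes rectified, horizontally aligned cameras, a commonly used inverse-depth quantization of the depth samples, and rounding of disparities to whole pixels. The parameter names `focal_px`, `baseline`, `z_near` and `z_far` are hypothetical, the sign of the shift depends on the camera arrangement, and the handling of colliding pixels (keep the one nearest to the camera) and of unfilled positions (left as `None`) is an addition for illustration.

```python
def depth_to_disparity(v, focal_px, baseline, z_near, z_far, bits=8):
    """Map a quantized depth sample v to a horizontal disparity in pixels.

    Assumes the depth sample encodes inverse depth linearly between
    1/z_far (v = 0) and 1/z_near (v = maximum value).
    """
    v_max = (1 << bits) - 1
    inv_z = (v / v_max) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return focal_px * baseline * inv_z


def warp_row(texture_row, depth_row, focal_px, baseline, z_near, z_far):
    """Forward-warp one row of texture pixels to the virtual camera position."""
    width = len(texture_row)
    out = [None] * width      # None marks positions that no pixel mapped to
    best = [-1.0] * width     # largest disparity written at each target so far
    for x, (pixel, v) in enumerate(zip(texture_row, depth_row)):
        d = depth_to_disparity(v, focal_px, baseline, z_near, z_far)
        tx = x - int(round(d))  # assumed shift direction (virtual camera to one side)
        # Keep the pixel closest to the camera (largest disparity) on collisions.
        if 0 <= tx < width and d > best[tx]:
            out[tx] = pixel
            best[tx] = d
    return out


row = warp_row([10, 20, 30, 40, 50], [0, 0, 255, 0, 0],
               focal_px=1.0, baseline=1.0, z_near=0.5, z_far=1.0)
print(row)
```

In this toy row the foreground pixel (depth sample 255) moves by two pixels and overwrites a background pixel that moves by one, while positions that receive no pixel remain empty and would have to be filled by a separate step.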
Traditional ‘mono’ (one camera) video sequences can be effectively compressed by predicting the pixel values of an image from previous images and coding only the differences after prediction (inter-frame video coding). For the case of 3D video with multiple views and both video and depth maps, additional prediction references can be generated by means of view synthesis. For instance, when two video views with associated depth maps are compressed, the video view and associated depth map for the first camera position can be used to generate an additional prediction reference to be utilized for coding of the second video view. This process is view synthesis prediction, which is illustrated in FIG. 2.
In FIG. 2, the video view 202 and associated depth map 204 at a first camera position are used to synthesize a reference picture in the form of a virtual view 214 at a virtual second camera position. The synthesized reference picture in the form of the virtual view 214 is then used as a prediction reference for coding of the video view 216 and a depth map 218 at the second camera position. Note that this prediction is conceptually similar to “inter-frame” prediction in conventional (mono) video coding. As in conventional video coding, the prediction is a normative (standardized) process that is performed on both the encoder side and the decoder side.
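The role of the synthesized view as an additional prediction reference can be illustrated with a toy example. The sketch below is an assumption-laden simplification: blocks are flat lists of sample values, the reference is chosen by the sum of absolute differences (SAD) rather than by the rate-distortion criteria a real encoder would use, and the names `temporal` and `synthesized` are hypothetical labels for a previous-frame reference and a view-synthesis reference.

```python
def sad(block, prediction):
    """Sum of absolute differences between a block and a candidate prediction."""
    return sum(abs(c - p) for c, p in zip(block, prediction))


def best_reference(block, candidates):
    """Return the name of the candidate prediction with the smallest SAD."""
    return min(candidates, key=lambda name: sad(block, candidates[name]))


def residual(block, prediction):
    """Differences that remain to be coded after prediction."""
    return [c - p for c, p in zip(block, prediction)]


def reconstruct(prediction, res):
    """Decoder side: prediction plus decoded residual gives the block back."""
    return [p + r for p, r in zip(prediction, res)]


block = [10, 12, 14, 16]                  # samples of the view being coded
candidates = {
    "temporal": [9, 9, 9, 9],             # from a previously coded picture
    "synthesized": [10, 12, 15, 16],      # from view synthesis (VSP)
}
choice = best_reference(block, candidates)
res = residual(block, candidates[choice])
assert reconstruct(candidates[choice], res) == block
print(choice, res)
```

Because the decoder must form the same prediction to reconstruct the block, the synthesis step has to be carried out identically on both sides, which is why the prediction is a normative process.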
In the “3DV” activity in MPEG standardization, VSP is considered as a potential coding tool. Moreover, the coding of depth maps at reduced resolution is considered. That is, the depth maps have a lower resolution than the associated video views. This is to reduce the amount of data to be coded and transmitted and to reduce the decoding complexity, i.e. the decoding time. On the other hand, view synthesis algorithms typically assume that the video view and the depth map have the same resolution. Thus, in the current test model/reference software in MPEG (an H.264/AVC-based MPEG 3D video encoding/decoding algorithm, under development), the depth map is upsampled to the full (video view) resolution before it is used in the VSP.
One approach to performing the depth map upsampling is to use “bilinear” filtering. Additionally, several algorithms have been proposed in MPEG to improve the quality of the upsampled depth map, and thus the accuracy of the VSP operation, with the aim of better 3D video coding efficiency.
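As an illustration of bilinear upsampling of a depth map, the following sketch enlarges a small depth map by an integer factor. It is not the filter of the MPEG test model: the sample alignment (output sample `x` maps to source position `x / factor`), the clamping at the right and bottom borders, and the rounding to integers are assumptions made for this example, and the function name `upsample_bilinear` is hypothetical.

```python
def upsample_bilinear(depth, factor):
    """Upsample a 2D depth map (list of rows) by an integer factor."""
    h, w = len(depth), len(depth[0])
    out = []
    for y in range(h * factor):
        sy = y / factor                   # vertical position in the source grid
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)           # clamp at the bottom border
        wy = sy - y0
        row = []
        for x in range(w * factor):
            sx = x / factor               # horizontal position in the source grid
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)       # clamp at the right border
            wx = sx - x0
            # Interpolate horizontally on the two neighbouring rows, then vertically.
            top = (1 - wx) * depth[y0][x0] + wx * depth[y0][x1]
            bottom = (1 - wx) * depth[y1][x0] + wx * depth[y1][x1]
            row.append(int(round((1 - wy) * top + wy * bottom)))
        out.append(row)
    return out


low_res = [[0, 100],
           [200, 100]]
full_res = upsample_bilinear(low_res, 2)
for r in full_res:
    print(r)
```

Note that the interpolation produces intermediate depth values at object boundaries (for example 50 between 0 and 100), which correspond to no real surface in the scene; this is one reason why improved upsampling algorithms are of interest.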
In the current test model in MPEG, the depth maps are coded at reduced resolution. The depth maps themselves are coded by means of inter-frame prediction. That is, low resolution depth maps need to be stored as reference frames for future prediction. Additionally, the depth maps are upsampled to full (video view) resolution, so as to be used for VSP. This means that, for a given moment in time, both the low resolution depth map and the upsampled depth map need to be stored (in the encoder and decoder). For example, assuming an 8-bit depth representation, a video resolution of W*H samples and a depth map resolution of W*H/4 samples, the amount of data to be stored for the low resolution depth map is W*H/4 bytes (around 500 kbytes for full HD resolution) and the additional amount of data to be stored for the full resolution depth map (to be used for VSP) is W*H bytes (around 2 Mbytes for full HD).
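The storage figures above can be checked with a short calculation. The sketch assumes that full HD means 1920 x 1080 samples and that the low resolution depth map has one quarter of the samples of the video view; the function name `depth_storage_bytes` is hypothetical.

```python
def depth_storage_bytes(width, height, bits_per_sample=8, sample_ratio=4):
    """Return (low_res_bytes, full_res_bytes) for one depth map.

    sample_ratio is the factor by which the number of depth samples is
    reduced relative to the video view (4 means W*H/4 samples).
    """
    full_res = width * height * bits_per_sample // 8
    low_res = full_res // sample_ratio
    return low_res, full_res


low, full = depth_storage_bytes(1920, 1080)
print(low, full)        # 518400 2073600
print(low + full)       # 2592000 bytes held at the same time
```

With these assumptions the low resolution depth map takes 518,400 bytes and the upsampled depth map a further 2,073,600 bytes, consistent with the approximate figures of 500 kbytes and 2 Mbytes.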
One drawback of such a scheme is the additional storage requirement for the upsampled depth map. A second drawback is the computational cost associated with the depth map upsampling.
These two drawbacks influence the compression efficiency of a 3D video coding system that utilizes such a scheme. That is, the 3D video coding is affected by the quality of the VS process used in the VSP, and thus by the depth maps used for the VS.