The present application is concerned with multi-view coding.
Multi-view video sequences are essentially captured as multiple single-view sequences. These single-view sequences are captured by multiple cameras simultaneously from different viewpoints of the same scene. Therefore, multi-view video sequences contain a large amount of inter-view redundancy.
A common technique to deal with these inter-view redundancies is inter-view prediction, which is analogous to the well-known temporal motion-compensated (inter-frame) prediction. In inter-view prediction, the reference frame is related to the frame to be coded not temporally but spatially (with regard to the camera position). Since these two kinds of prediction are conceptually the same, they can easily be combined by using the same reference lists for both kinds of prediction, i.e., a reference picture list can contain both temporal reference pictures and inter-view reference pictures.
Such a combination of temporal and inter-view prediction is used by the MVC extension to H.264/AVC.
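The idea of a single reference picture list holding both kinds of reference pictures may be illustrated by the following minimal sketch. It is not the MVC syntax or its list construction process; the names Picture, view, time and build_reference_list, as well as the chosen ordering (temporal references first, nearest in time first, followed by inter-view references of the same time instant), are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    view: int  # camera index (hypothetical field name)
    time: int  # time instant of the picture (hypothetical field name)


def build_reference_list(current, decoded):
    """Return one list with temporal and inter-view reference candidates.

    Assumed ordering: temporal references of the same view first (nearest
    in time first), then pictures of other views at the same time instant.
    """
    temporal = [p for p in decoded
                if p.view == current.view and p.time != current.time]
    inter_view = [p for p in decoded
                  if p.view != current.view and p.time == current.time]
    temporal.sort(key=lambda p: abs(p.time - current.time))
    return temporal + inter_view
```

In this sketch, a block of the current picture would simply select an entry of the returned list by index, irrespective of whether that entry is a temporal or an inter-view reference picture.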
An example of an effective prediction structure combining temporal and inter-view prediction is presented in FIG. 1. On the left side, a possible prediction structure is shown for the 3-view case; on the right side, an example for the 2-view case is given. In both cases, view V0 is the reference view that is used for inter-view prediction.
Inter-view prediction, as used in MVC, is a feasible technique for dealing with inter-view redundancies if only a few views are transmitted, e.g. in stereoscopic (two-view) video. The amount of data transmitted in MVC increases approximately linearly with the number of views. This makes MVC unsuitable for applications that demand a larger number of views, such as autostereoscopic displays, where 28 or more views are presented. In such a scenario, not all the views are transmitted, but only a few views, e.g. 3 views. The majority of the views are rendered at the decoder side using the transmitted views. In order to decrease the complexity of rendering, new approaches in multi-view coding encode not only texture (as in MVC), but also depth information in the form of depth maps plus camera parameters. This provides the receiver with 3D scene information and eases the interpolation (rendering) of intermediate views.
Due to disocclusions and pixel displacements that reach out of the image plane, not all regions of a frame can be rendered from another view.
FIG. 2 sketches the rendering for a scene that contains just a square 10 in front of a white background 12. The right view and the left view are transmitted; the intermediate view is rendered. The cross-hatched regions cannot be rendered from the right view, due to disocclusions (cross-hatched with dashed lines) and pixel displacements reaching out of the image plane (cross-hatched with continuous lines), while, analogously, the simply hatched regions cannot be rendered from the left view. The regions marked white, i.e. the background, and the object 10 in the rendered view are present in both (left and right) transmitted views.
FIG. 3 shows an example of the rendering process from a left view to a right view. The pixels of the image regions that cannot be rendered are set to black as shown at 14.
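The rendering from a left view to a right view, including the occurrence of non-renderable regions, may be illustrated for a single image row by the following minimal sketch. It is not the renderer of FIG. 3. It assumes rectified, parallel cameras, so that each sample is shifted horizontally by a non-negative integer disparity that has already been derived from the depth map and the camera parameters; the function name warp_row, the sign of the shift and the use of 0 for black are assumptions made for illustration only.

```python
def warp_row(texture, disparity):
    """Forward-warp one row of the left view towards the right view.

    texture:   sample values of the row in the left view
    disparity: non-negative integer shift per sample (assumed to be given)
    Returns (warped_row, hole_mask); holes are set to 0 (black).
    """
    width = len(texture)
    warped = [0] * width   # 0 = black for samples that cannot be rendered
    best = [-1] * width    # disparity of the sample stored at each target
    for x in range(width):
        target = x - disparity[x]  # assumed sign: content moves to the left
        # Samples shifted out of the image are dropped; on collisions the
        # sample with the larger disparity (closer to the camera) wins.
        if 0 <= target < width and disparity[x] > best[target]:
            warped[target] = texture[x]
            best[target] = disparity[x]
    hole_mask = [b < 0 for b in best]
    return warped, hole_mask
```

In this sketch, holes next to a shifted foreground object correspond to disocclusions, whereas holes at the image border correspond to content that lies outside the image plane of the source view.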
It is apparent that the transmitted views, i.e. the left and right views in FIGS. 2 and 3, have almost the same content for a large part of the image. Since the depth information 16 and camera parameters 18 are usually transmitted anyway in order to support the rendering at the decoder side, regions that can be rendered by renderer 20 from one transmitted view, such as the left view in FIG. 3, to another transmitted view, such as the right view in FIG. 3, need to be transmitted in the bitstream only once. Thus, conceptually, if picture regions that can efficiently be rendered from one view to another, i.e. all regions of the right view except regions 14, are transmitted only once, a significant amount of the overall bit rate can be saved.
However, even omitting the transmission of those portions of secondary transmitted views that are renderable from primary transmitted views does not lead to an optimally efficient compression of the multi-view data. Accordingly, it would be favorable to have a multi-view concept at hand which enables a more efficient coding of transmitted views.
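The share of a secondary view that still has to be transmitted may be illustrated by the following minimal sketch, which merely counts the non-renderable samples, i.e. those corresponding to regions 14. The function name samples_to_transmit and the representation of the non-renderable regions as a Boolean mask are assumptions made for illustration only; the fraction of samples is only a rough indication and does not equal the actual bit rate saving, which depends on the content and on the codec.

```python
def samples_to_transmit(hole_mask):
    """Count the samples of a secondary view that cannot be rendered.

    hole_mask: list of rows of booleans; True marks a sample that could
               not be rendered from the primary view.
    Returns (non_renderable, total, fraction).
    """
    total = sum(len(row) for row in hole_mask)
    non_renderable = sum(1 for row in hole_mask for hole in row if hole)
    fraction = non_renderable / total if total else 0.0
    return non_renderable, total, fraction
```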