The present invention is concerned with multi-view coding.
In multi-view video coding, two or more views of a video scene (which are simultaneously captured by multiple cameras) are coded in a single bitstream. The primary goal of multi-view video coding is to provide the end user with an advanced multimedia experience by offering a 3-d viewing impression. If two views are coded, the two reconstructed video sequences can be displayed on a conventional stereo display (with glasses). However, the need to wear glasses for conventional stereo displays is often annoying for the user. Enabling a high-quality stereo viewing impression without glasses is currently an important topic in research and development. A promising technique for such autostereoscopic displays is based on lenticular lens systems. In principle, an array of cylindrical lenses is mounted on a conventional display in such a way that multiple views of a video scene are displayed at the same time. Each view is displayed in a small cone, so that each eye of the user sees a different image; this effect creates the stereo impression without special glasses. However, such autostereoscopic displays typically require 10-30 views of the same video scene (even more views may be required if the technology is improved further). More than 2 views can also be used to let the user interactively select the viewpoint for a video scene. However, coding multiple views of a video scene drastically increases the required bit rate in comparison to conventional single-view (2-d) video. Typically, the required bit rate increases approximately linearly with the number of coded views. One concept for reducing the amount of transmitted data for autostereoscopic displays is to transmit only a small number of views (perhaps 2-5 views) together with so-called depth maps, which represent the depth (distance of the real-world object to the camera) of the image samples for one or more views. 
Given a small number of coded views with corresponding depth maps, high-quality intermediate views (virtual views that lie between the coded views), and to some extent also additional views beyond one or both ends of the camera array, can be created at the receiver side by suitable rendering techniques.
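As a rough illustration of how a depth map relates neighboring views, the following sketch converts depth into horizontal disparity for an idealized setup. It assumes rectified, parallel cameras with identical focal length, so that disparity is d = f·B/Z; this relation and all names (`focal_length_px`, `baseline`, `alpha`) are assumptions made for illustration, and the text above does not prescribe a particular rendering technique.

```python
def depth_to_disparity(depth, focal_length_px, baseline):
    """Horizontal disparity in pixels for a point at distance `depth`.

    Assumes rectified, parallel cameras (an idealization, not taken from
    the text); `baseline` and `depth` must use the same length unit.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * baseline / depth


def virtual_view_disparity(depth, focal_length_px, baseline, alpha):
    """Disparity toward a virtual camera located at fraction `alpha`
    (0 = first coded view, 1 = second coded view) along the baseline."""
    return alpha * depth_to_disparity(depth, focal_length_px, baseline)


# A near object shifts more between the views than a far object.
near = depth_to_disparity(2.0, focal_length_px=1000.0, baseline=0.1)
far = depth_to_disparity(20.0, focal_length_px=1000.0, baseline=0.1)
halfway = virtual_view_disparity(2.0, 1000.0, 0.1, alpha=0.5)
```

A renderer for an intermediate view would shift each sample by its scaled disparity and then fill the uncovered regions; that part is omitted here.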
For both stereo video coding and general multi-view video coding (with or without depth maps), it is important to exploit the interdependencies between the different views. Since all views represent the same video scene (from slightly different perspectives), there is a large amount of interdependency between the multiple views. The goal in designing a highly efficient multi-view video coding system is to exploit these interdependencies efficiently. In conventional approaches for multi-view video coding, as for example in the multi-view video coding (MVC) extension of ITU-T Rec. H.264|ISO/IEC 14496-10, the only technique that exploits view interdependencies is a disparity-compensated prediction of image samples from already coded views, which is conceptually similar to the motion-compensated prediction that is used in conventional 2-d video coding. However, typically only a small subset of image samples is predicted from already coded views, since the temporal motion-compensated prediction is often more effective (the similarity between two temporally successive images is larger than the similarity between neighboring views at the same time instant). In order to further improve the effectiveness of multi-view video coding, it is necessary to combine the efficient motion-compensated prediction with inter-view prediction techniques. One possibility is to re-use the motion data that are coded in one view for predicting the motion data of other views. Since all views represent the same video scene, the motion in one view is connected to the motion in other views based on the geometry of the real-world scene, which can be represented by depth maps and some camera parameters.
In state-of-the-art image and video coding, the pictures or particular sets of sample arrays for the pictures are usually decomposed into blocks, which are associated with particular coding parameters. The pictures usually consist of multiple sample arrays (luminance and chrominance). In addition, a picture may also be associated with additional auxiliary sample arrays, which may, for example, specify transparency information or depth maps. Each picture or sample array is usually decomposed into blocks. The blocks (or the corresponding blocks of sample arrays) are predicted by either inter-picture prediction or intra-picture prediction. The blocks can have different sizes and can be either square or rectangular. The partitioning of a picture into blocks can be either fixed by the syntax, or it can be (at least partly) signaled inside the bitstream. Often syntax elements are transmitted that signal the subdivision for blocks of predefined sizes. Such syntax elements may specify whether and how a block is subdivided into smaller blocks and associated with coding parameters, e.g. for the purpose of prediction. For all samples of a block (or the corresponding blocks of sample arrays) the decoding of the associated coding parameters is specified in a certain way. For example, all samples in a block are predicted using the same set of prediction parameters, such as reference indices (identifying a reference picture in the set of already coded pictures), motion parameters (specifying a measure for the movement of a block between a reference picture and the current picture), parameters for specifying the interpolation filter, intra prediction modes, etc. The motion parameters can be represented by displacement vectors with a horizontal and a vertical component or by higher-order motion parameters such as affine motion parameters consisting of six components. 
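The simplest case described above, a block predicted with a single displacement vector, can be sketched as follows. This is a minimal illustration only: it assumes integer-sample displacements and clamps coordinates at the picture border, and the function name `predict_block` and its parameters are hypothetical. Real codecs additionally use interpolation filters for fractional-sample positions.

```python
def predict_block(reference, top, left, height, width, mv):
    """Predict a height x width block by copying samples from `reference`,
    displaced by the vector mv = (dy, dx).

    Integer-sample displacement only; out-of-picture coordinates are
    clamped to the border (a simple padding assumption for this sketch).
    """
    dy, dx = mv
    rows, cols = len(reference), len(reference[0])
    block = []
    for y in range(top, top + height):
        ry = min(max(y + dy, 0), rows - 1)
        row = []
        for x in range(left, left + width):
            rx = min(max(x + dx, 0), cols - 1)
            row.append(reference[ry][rx])
        block.append(row)
    return block


# 6x6 reference picture whose sample value encodes its position (10*row + col).
reference = [[10 * r + c for c in range(6)] for r in range(6)]

# Block at (row 2, col 2), displaced one row up and one column to the right.
pred = predict_block(reference, top=2, left=2, height=2, width=2, mv=(-1, 1))
```

All samples of the block share the same displacement vector, which is what makes the motion parameters cheap to signal.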
It is also possible that more than one set of particular prediction parameters (such as reference indices and motion parameters) is associated with a single block. In that case, for each set of these particular prediction parameters, a single intermediate prediction signal for the block (or the corresponding blocks of sample arrays) is generated, and the final prediction signal is built by a combination that includes superimposing the intermediate prediction signals. The corresponding weighting parameters and potentially also a constant offset (which is added to the weighted sum) can either be fixed for a picture, or a reference picture, or a set of reference pictures, or they can be included in the set of prediction parameters for the corresponding block. The difference between the original blocks (or the corresponding blocks of sample arrays) and their prediction signals, also referred to as the residual signal, is usually transformed and quantized. Often, a two-dimensional transform is applied to the residual signal (or the corresponding sample arrays for the residual block). For transform coding, the blocks (or the corresponding blocks of sample arrays) for which a particular set of prediction parameters has been used can be further split before applying the transform. The transform blocks can be equal to or smaller than the blocks that are used for prediction. It is also possible that a transform block includes more than one of the blocks that are used for prediction. Different transform blocks can have different sizes, and the transform blocks can be square or rectangular. After the transform, the resulting transform coefficients are quantized and so-called transform coefficient levels are obtained. The transform coefficient levels as well as the prediction parameters and, if present, the subdivision information are entropy coded.
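The weighted superposition of intermediate prediction signals and the resulting residual can be sketched as follows. The function names, the equal default weights, and the sample values are assumptions chosen for illustration; the transform, quantization, and entropy coding stages described above are deliberately left out.

```python
def combine_predictions(pred_a, pred_b, w_a=0.5, w_b=0.5, offset=0):
    """Final prediction as a weighted sum of two intermediate prediction
    signals plus a constant offset."""
    return [
        [w_a * a + w_b * b + offset for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(pred_a, pred_b)
    ]


def residual(original, prediction):
    """Sample-wise difference between the original block and its prediction."""
    return [
        [o - p for o, p in zip(row_o, row_p)]
        for row_o, row_p in zip(original, prediction)
    ]


# Two intermediate prediction signals for a 2x2 block (illustrative values).
pred_a = [[100, 102], [104, 106]]
pred_b = [[110, 108], [100, 98]]
original = [[106, 104], [101, 103]]

final_pred = combine_predictions(pred_a, pred_b)
res = residual(original, final_pred)
```

With equal weights the combination is a plain average; unequal weights and a non-zero offset can model, for example, brightness changes between pictures.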
The state-of-the-art in multi-view video coding extends the 2-d video coding techniques in a straightforward way. Conceptually, two or more video sequences, which correspond to the different views, are coded (or decoded) in parallel. More specifically, for each access unit (or time instant), the pictures corresponding to the different views are coded in a given view order. An MVC bitstream contains a base view, which can be decoded without any reference to other views. This ensures backwards compatibility with the underlying 2-d video coding standard/scheme. The bitstream is usually constructed in such a way that the sub-bitstream corresponding to the base view (and, in addition, sub-bitstreams corresponding to particular subsets of the coded views) can be extracted in a simple way by discarding some packets of the entire bitstream. In order to exploit dependencies between views, pictures of already coded views of the current access unit can be used for the prediction of blocks of the current view. This prediction is often referred to as disparity-compensated prediction or inter-view prediction. It is basically identical to the motion-compensated prediction in conventional 2-d video coding; the only difference is that the reference picture represents a picture of a different view inside the current access unit (i.e., at the same time instant) and not a picture of the same view at a different time instant. For incorporating inter-view prediction in the design of the underlying 2-d video coding scheme, one or more reference picture lists are constructed for each picture. For the base view (independently decodable view), only conventional temporal reference pictures are inserted into the reference picture lists. However, for all other views, inter-view reference pictures can be inserted into a reference picture list in addition to (or instead of) temporal reference pictures. 
Which pictures are inserted into a reference picture list is determined by the video coding standard/scheme and/or signaled inside the bitstream (e.g., in a parameter set and/or slice header). Whether a temporal or inter-view reference picture is chosen for a particular block of the current view is then signaled by coding (or inferring) a reference picture index. That is, the inter-view reference pictures are used in exactly the same way as conventional temporal reference pictures; only the construction of the reference picture lists is slightly extended.
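The reference picture list mechanism described above can be illustrated with a small sketch. The ordering rule used here (nearest temporal references first, then inter-view references of the same access unit) and all names (`Picture`, `build_reference_list`, `base_view_id`, `max_temporal`) are assumptions for illustration; an actual standard defines its own construction and reordering rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    view_id: int  # which view (camera) the picture belongs to
    time: int     # access unit (time instant)


def build_reference_list(current, decoded, base_view_id=0, max_temporal=2):
    """Hypothetical list construction for one picture.

    Temporal references (same view, other time instant) come first, nearest
    in time first. For non-base views, inter-view references (other view,
    same time instant) are appended.
    """
    temporal = sorted(
        (p for p in decoded
         if p.view_id == current.view_id and p.time != current.time),
        key=lambda p: abs(current.time - p.time),
    )[:max_temporal]
    if current.view_id == base_view_id:
        return temporal  # base view: temporal references only
    inter_view = sorted(
        (p for p in decoded
         if p.time == current.time and p.view_id != current.view_id),
        key=lambda p: p.view_id,
    )
    return temporal + inter_view


def is_inter_view_reference(current, ref_list, ref_idx):
    """A reference index selects inter-view prediction when the referenced
    picture lies in the same access unit as the current picture."""
    return ref_list[ref_idx].time == current.time


decoded = [Picture(0, 0), Picture(1, 0), Picture(0, 1)]
current = Picture(view_id=1, time=1)
ref_list = build_reference_list(current, decoded)
```

The block-level syntax is unchanged: a block simply codes a reference index, and whether that index means temporal or inter-view prediction follows from the list contents.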
The current state-of-the-art in multi-view video coding is the Multi-view Video Coding (MVC) extension of ITU-T Rec. H.264|ISO/IEC JTC 1 [1][2]. MVC is a straightforward extension of ITU-T Rec. H.264|ISO/IEC JTC 1 towards multi-view video coding. Besides some extensions of the high-level syntax, the only tool that has been added is the disparity-compensated prediction as described above. However, it should be noted that disparity-compensated prediction is typically only used for a small percentage of blocks. Except for regions that are covered or uncovered due to the motion inside a scene, the temporal motion-compensated prediction typically provides a better prediction signal than the disparity-compensated prediction, in particular if the temporal distance between the current and the reference picture is small. The overall coding efficiency could be improved if the temporal motion-compensated prediction could be combined with suitable inter-view prediction techniques. There is a conceptually similar problem in scalable video coding, where two representations of the same video sequence with different resolutions or fidelities are coded in a single bitstream. For the enhancement layer, there are in principle two possibilities to predict a block of samples (if we ignore spatial intra prediction): using a temporal motion-compensated prediction from an already coded enhancement layer picture, or using an inter-layer prediction from the lower layer. In the Scalable Video Coding (SVC) extension [3], the conventional temporal motion-compensated prediction has been combined with an inter-layer prediction of motion parameters. For an enhancement layer block, it provides the possibility to re-use the motion data of the co-located base layer block, but apply it to the enhancement layer (i.e., use the enhancement layer reference picture with base layer motion data). 
In this way, the temporal motion-compensated prediction inside a layer is efficiently combined with an inter-layer prediction of motion data. The general idea behind this technique is that all layers in a scalable bitstream show the same content, and hence also the motion inside each layer is the same. However, this does not necessarily mean that the best motion parameters for one layer are also the best motion parameters for a following layer, due to the following effects: (1) The quantization of the reference pictures modifies the sample values, and since different layers are quantized differently, the motion parameters that give the smallest distortion can be different for different layers; (2) Since the layers are coded at different bit rates, a particular set of motion parameters usually corresponds to a different trade-off between rate and distortion. In rate-distortion optimized coding (which is for example achieved by minimizing the Lagrangian functional D+λR of the distortion D and the associated rate R), different motion parameters can be optimal in the rate-distortion sense for different layers (the operating point given by λ as well as the associated distortion or rate can be different). Nonetheless, the (optimal) motion parameters in base and enhancement layer are usually similar. It is typically very likely that a mode that re-uses the motion parameters of the base layer (and is therefore associated with a small rate R) leads to a smaller overall cost (D+λR) than the optimal mode that is independent of the base layer. In other words, it is likely that the distortion increase ΔD that is associated with choosing the mode with base layer motion data instead of the mode with optimal enhancement motion data is smaller than the cost that is associated with the decrease in rate (ΔD+λΔR<0, with ΔR<0).
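The rate-distortion argument can be made concrete with a small numeric sketch of a Lagrangian mode decision. The mode names and the distortion and rate figures below are invented purely for illustration; only the cost D+λR is taken from the text.

```python
def rd_cost(distortion, rate, lam):
    """Lagrangian cost J = D + lambda * R."""
    return distortion + lam * rate


def choose_mode(candidates, lam):
    """Return the name of the candidate with the smallest Lagrangian cost.

    `candidates` maps a mode name to a (distortion, rate) pair.
    """
    def cost(name):
        distortion, rate = candidates[name]
        return rd_cost(distortion, rate, lam)

    return min(candidates, key=cost)


# Hypothetical numbers: re-using base layer motion costs some distortion
# (delta D = +10) but saves rate (delta R = -28).
candidates = {
    "independent_motion": (100.0, 40.0),
    "base_layer_motion": (110.0, 12.0),
}

best_high_lambda = choose_mode(candidates, lam=0.5)  # 120.0 vs 116.0
best_low_lambda = choose_mode(candidates, lam=0.2)   # 108.0 vs 112.4
```

With λ = 0.5 the rate saving outweighs the distortion increase (10 + 0.5·(−28) < 0), so the mode that re-uses base layer motion wins; with λ = 0.2 it does not, which shows how the outcome depends on the operating point.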
Conceptually, an approach similar to that of SVC can also be used in multi-view video coding. The multiple cameras capture the same video scene from different perspectives. Consequently, if a real-world object moves in the scene, the motion parameters in the different captured views are not independent. But in contrast to scalable coding, where the position of an object is the same in all layers (a layer represents just a different resolution or a different quality of the same captured video), the interrelationship of the projected motion is more complicated and depends on several camera parameters as well as on the 3-d relationships in the real-world scene. But if all relevant camera parameters (such as focal length, distance of the cameras, and direction of the optical axis of the cameras) as well as the distance of the projected object points (depth map) are given, the motion inside a particular view can be derived based on the motion of another view. In general, for coding a video sequence or view, it is not necessary to know the exact motion of the object points; instead, simple parameters such as motion vectors for blocks of samples are sufficient. In this spirit, the relationship of the motion parameters between different views can also be simplified to some extent.
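To give a feel for how motion in one view could be derived from motion in another, the following sketch treats a strongly simplified case. It assumes rectified, parallel cameras with identical focal length, so that a point at horizontal position x in the first view appears at x − f·B/Z in the second view; this geometry, the sign convention, and all names are assumptions for illustration and are not the derivation used by any particular codec.

```python
def disparity(depth, focal_length_px, baseline):
    """Horizontal disparity in pixels (rectified, parallel cameras assumed)."""
    return focal_length_px * baseline / depth


def derive_motion_in_other_view(mv, depth_cur, depth_ref,
                                focal_length_px, baseline):
    """Map a motion vector from a source view to a second view.

    mv = (dx, dy) points from the current position to the position in the
    temporal reference picture of the source view. `depth_cur` and
    `depth_ref` are the depths of the object point at the two time instants.
    Under the assumed geometry only the horizontal component changes, by the
    difference of the two disparities.
    """
    dx, dy = mv
    d_cur = disparity(depth_cur, focal_length_px, baseline)
    d_ref = disparity(depth_ref, focal_length_px, baseline)
    return (dx - (d_ref - d_cur), dy)


# Object moves away from the cameras between the two pictures (4 -> 5 units).
mv_changing_depth = derive_motion_in_other_view(
    (3, -1), depth_cur=4.0, depth_ref=5.0, focal_length_px=1000.0, baseline=0.1
)

# Constant depth: the motion vector is identical in both views.
mv_constant_depth = derive_motion_in_other_view(
    (3, -1), depth_cur=4.0, depth_ref=4.0, focal_length_px=1000.0, baseline=0.1
)
```

In this idealized setup the motion vectors of the two views differ only when the depth of the object changes over time, which is one way the relationship between views simplifies.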