The invention relates to a method of encoding multiple views of an image into an image signal, such as for example a compressed television signal according to one of the MPEG standards.
The invention also relates to: an apparatus for generating such a signal, a receiver for receiving such a signal, a method of extracting the encoded information from the signal, so that it can be used for generating the multiple views, and the efficiently encoded signal itself.
There is currently work going on in the standardization of three-dimensional image information encoding. There are several ways of representing a three-dimensional object, for example as a set of voxels (popular e.g. in medical data display or industrial component inspection), or as a number of view images captured from different directions and intended to be viewed from different directions, for example by the two eyes of a single viewer or by multiple viewers, or a moving viewer, etc.
A popular format is the left/right format, in which one picture is captured by a camera on the left and another picture is captured by a camera on the right. These pictures may be displayed on different displays; for example, the left picture may be shown during a first set of time instances, and the right picture during an interleaved second set of time instances, the left and right eyes of the viewer being blocked synchronously with the displaying by shutter glasses. A projector with polarization means is another example of a display capable of generating a three-dimensional impression of a scene, or at least of rendering some of the three-dimensional information of the scene, namely what it approximately looks like from a certain direction (i.e. stereo).
Different qualities of approximation of the scene may be employed, e.g. the 3D scene may be represented as a set of flat layers behind each other. But these different qualities can be encoded by the existing formats.
Another popular display is the autostereoscopic display. This display is formed for example by placing an LCD behind a set of lenses, so that a group of pixels is projected to a region in space by a respective lens. In this way a number of cones is generated in space which two by two contain left and right images for a left and right eye, so that without glasses a user can position himself in a number of regions in space, and perceive 3D. However the data for these groups of pixels has to be generated from the left and right images. Another option is that a user can see an object from a number of intermediate directions between the left and right view of the stereo encoding, which intermediate views can be generated by calculating a disparity field between the left and the right picture, and subsequently interpolating.
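The interpolation step mentioned above can be illustrated with a minimal sketch for a single scanline. It assumes rectified views with purely horizontal disparity and the convention that a left-view pixel at column x appears at column x - d in the right view; the function name and the hole marker (None) are illustrative choices, not part of any standard.

```python
def interpolate_scanline(left_row, disparity_row, alpha):
    """Forward-warp one scanline of the left view towards the right view.

    alpha = 0.0 reproduces the left view; alpha = 1.0 approximates the
    right view. Assumed convention: a left-view pixel at column x appears
    at column x - disparity in the right view.
    """
    width = len(left_row)
    out = [None] * width      # None marks a hole (no source pixel landed here)
    best = [-1.0] * width     # largest disparity seen so far per target column
    for x, (value, d) in enumerate(zip(left_row, disparity_row)):
        target = x - int(round(alpha * d))
        # The pixel with the larger disparity is nearer and wins the column.
        if 0 <= target < width and d > best[target]:
            out[target] = value
            best[target] = d
    return out
```

For a foreground object of value 50 with disparity 2 in front of a background of value 10 with disparity 0, `interpolate_scanline([10, 10, 50, 50, 10, 10], [0, 0, 2, 2, 0, 0], 0.5)` yields `[10, 50, 50, None, 10, 10]`. The hole is the uncovered background region, which the left view alone cannot supply.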
It is a disadvantage of the prior art left/right encoding that considerable data is required to obtain the intermediate views, and that the results may still be somewhat disappointing. It is difficult to calculate a precisely matching disparity field, and mismatches lead to artifacts in the interpolations, such as parts of a background sticking to a foreground object. The desire that led to the technological embodiments presented below was to have a way of encoding which can lead to relatively accurate results when converting to different formats, such as to a set of views with intermediate views, yet which does not comprise an undue amount of data.
Such requirements are at least partially fulfilled by a method of encoding multiple view image information into an image signal (200) comprising:
adding to the image signal (200) a first image (220) of pixel values representing one or more objects (110, 112) captured by a first camera (101);
adding to the image signal (200) a map (222) comprising for respective sets of pixels of the first image (220) respective values representing a three-dimensional position in space of a region of the one or more objects (110, 112) represented by the respective set of pixels; and
adding to the image signal (200) a partial representation (223) of a second image (224) of pixel values representing the one or more objects (110, 112) captured by a second camera (102), the partial representation (223) comprising at least the majority of the pixels representing regions of the one or more objects (110, 112) not visible to the first camera (101),
and a signal obtained by the method or an apparatus allowing the performance of the method.
The inventors have realized that if one understands that for quality reasons it is best to add to the left and right images a map containing information on the three-dimensional structure of the scene, representing at least that part of the three-dimensional scene information which is required for enabling the particular application (with the desired quality), an interesting encoding format may be conceived. For view interpolation, the map may be e.g. an accurately segmented disparity map, the disparity vectors of which will lead to a good interpolation of intermediate views. It is important to notice that this map can be tuned optimally on the creation/transmission side according to its use on the receiving side, e.g. according to how the three-dimensional environment will be simulated on the display, which means that it will typically have different properties than if it were used to optimally predict regions of pixels in the left and right view.
The map may e.g. be fine-tuned, or even created, by a human operator, who may preview at his side how a number of intended displays would behave when receiving the signal. Nowadays, and in the future even more so, a part of the content is already computer-generated, such as e.g. a three-dimensional model of a dinosaur, or overlay graphics, which means that it is not too problematic to create at least for regions containing such man-made objects pixel accurate disparity maps, or depth maps, or similar maps.
This is certainly true for game applications, in which e.g. a user can move slightly compared to the scene, and may want to see the scene differently, but in the near future the invention may also become important for 3D television, captured with two cameras, or even generated on the basis of e.g. motion parallax. Already an increasing number of studios (e.g. for the BBC) are using e.g. virtual surroundings for the news.
This map may be encoded with little data overhead, e.g. as a grey-value image, compressed according to the MPEG-2 standard, and appended to the left/right image (or images for several time instants for moving video) already in the signal.
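As an illustration of representing the map as grey values, the following sketch quantizes a disparity value linearly to 8 bits and inverts the mapping. The linear mapping and the range parameters d_min and d_max are assumptions made for the example; the text does not prescribe a particular quantization, and the subsequent MPEG-2 compression step is not shown.

```python
def disparity_to_grey(disparity, d_min, d_max):
    """Linearly quantize a disparity value to an 8-bit grey level (0..255)."""
    if d_max <= d_min:
        raise ValueError("d_max must exceed d_min")
    clipped = min(max(disparity, d_min), d_max)   # clamp to the coded range
    return int(round(255 * (clipped - d_min) / (d_max - d_min)))


def grey_to_disparity(grey, d_min, d_max):
    """Invert the quantization, up to a rounding error of half a grey step."""
    return d_min + (d_max - d_min) * grey / 255
```

With this choice the reconstruction error before compression is at most half a quantization step, i.e. (d_max - d_min) / 510.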
However, the inventors realized that having this map allows a further reduction of the amount of data, because a part of the scene is imaged by both cameras. Although the doubly coded pixel information may be useful for bi-directional interpolation (e.g. specular reflections towards one of the cameras may be mitigated), in fact not much important information will be present in the doubly coded parts. Therefore, having available the map, it can be determined which parts of the second image (e.g. the right image) need to be encoded (and transmitted), and which parts are less relevant for the particular application. And on the receiving side a good quality reconstruction of the missing data can be realized.
E.g., in a simple scene approximation (capturing), with an object with an essentially flat face towards the cameras (which may be positioned parallel or under a small angle towards the scene), and not too close by, the part that is missing from the first (left) image but captured in the second (right) image consists of pixels of a background object (e.g. the elements of the scene at infinity).
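How the map can indicate which parts of the second image need to be encoded may be sketched as follows for one scanline. The sketch assumes rectified views, integer-rounded horizontal disparities defined on the left view, and the convention that a left-view pixel at column x appears at column x - d in the right view. The function names and the dictionary used for the partial representation are hypothetical.

```python
def uncovered_columns(disparity_row):
    """Return the right-view columns that no left-view pixel maps onto."""
    width = len(disparity_row)
    covered = [False] * width
    for x, d in enumerate(disparity_row):
        target = x - int(round(d))
        if 0 <= target < width:
            covered[target] = True
    return [x for x in range(width) if not covered[x]]


def partial_representation(right_row, disparity_row):
    """Keep only the right-view pixels that the left camera could not see."""
    return {x: right_row[x] for x in uncovered_columns(disparity_row)}
```

For a flat foreground object with disparity 2 in front of a background with disparity 0, `partial_representation([50, 50, 20, 30, 10, 10], [0, 0, 2, 2, 0, 0])` keeps only columns 2 and 3, which are the background pixels uncovered next to the object. A practical encoder would likely also have to deal with small gaps caused by rounding and by slanted surfaces, which this sketch treats as uncovered as well.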
An interesting embodiment involves the encoding of a partial second disparity or depth map, or a similar map. This partial map (e.g. a depth map) will substantially contain depth values of the region that could not be imaged by the first camera. From this depth data it can then be inferred on the receiving side which uncovered part belongs to a foreground object having a first depth (indicated by 130 in FIG. 1), and which part belongs to the background (132). This may allow better interpolation strategies, e.g. the amount of stretching and filling of holes may be fine-tuned, a pseudo-perspective rendering of an ear may be shown in the intermediate image instead of just background pixels, etc. Another example is that the trapezium distortion of angled cameras may be encoded in this second map for receiver side compensation.
In case of trapezium deformation from capturing with (typically slightly) converging cameras, there will in general be a vertical disparity in addition to a horizontal one. This vertical component can be encoded vectorially, or in a second map, as already envisioned e.g. in the “auxiliary data representation” proposals of the MPEG-4 subgroup Video-3DAV (e.g. ISO/IEC JTC1/SC29/WG11 Docs. MPEG2005/12603, 12602, 12600, 12595). The components of the disparity can be mapped to the luma and/or chrominance of an auxiliary picture, e.g. the horizontal disparity can be mapped at high resolution to the luma, and the vertical disparities can be mapped with a scheme to one or two chrominance components (so that some of the data is in the U component and, by a mathematical split, as much of the additional data in the V component).
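One conceivable way to split a disparity vector over the luma and chrominance samples of an auxiliary picture is sketched below. The particular split (horizontal component in Y, a signed 16-bit vertical component offset to unsigned and divided into a coarse part in U and a fine part in V) is purely an assumption for illustration; it is not the scheme of the cited proposals, and it ignores the effect of lossy compression and chroma subsampling on the packed values.

```python
def pack_disparity(horizontal, vertical):
    """Pack one disparity vector into hypothetical 8-bit (Y, U, V) samples.

    horizontal: integer 0..255, stored directly in Y.
    vertical:   signed integer -32768..32767 (e.g. in sub-pixel units),
                offset to unsigned, then split into a coarse part (U)
                and a fine part (V).
    """
    if not 0 <= horizontal <= 255:
        raise ValueError("horizontal disparity out of range")
    if not -32768 <= vertical <= 32767:
        raise ValueError("vertical disparity out of range")
    unsigned = vertical + 32768
    return horizontal, unsigned >> 8, unsigned & 0xFF


def unpack_disparity(y, u, v):
    """Recover (horizontal, vertical) from the packed samples."""
    return y, ((u << 8) | v) - 32768
```

In this sketch a zero vertical disparity maps to U = 128 and V = 0, and packing followed by unpacking returns the original pair exactly as long as the samples are not altered in between.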
Advantages of a partial left+right+“depth” format over e.g. first encoding to a center view+“depth”+bisidal occlusion data are the following. Transforming occlusion data to the center view, instead of storing it on an original camera capturing view, leads to processing inaccuracies (in particular if the depth map(s) is automatically derived and of lower quality/consistency, having temporal and spatial imperfections), and hence to coding inefficiency. Also, when calculating an intermediate view, further inaccuracies will come on top of this.