In the emerging art of three-dimensional (3D) video, various methods exist for encoding a third dimension into the video data signal. A popular approach for representing 3D video is to use one or more two-dimensional (2D) images plus a depth representation providing information about the third dimension. This approach also allows 2D images to be generated with viewpoints and viewing angles different from those of the 2D images included in the 3D image data. Such an approach provides a number of advantages, including allowing 3D views to be generated with relatively low complexity and providing an efficient data representation, thereby reducing, e.g., storage and communication resource requirements for 3D video signals.
When generating images with different viewpoints, the different depths of different objects cause existing object boundaries to be shifted, whereas new boundaries should be generated. The shifting of the object boundaries may cause undesired effects that reduce the image quality. When, e.g., the light reflection from an out-of-focus foreground object is mixed with an in-focus background object, unrealistic-looking boundaries may appear. This problem is usually solved using an alpha map which is transmitted in a separate layer of the video data signal.
The alpha map comprises alpha values indicating for each pixel whether it is a foreground pixel, a background pixel or a mixed pixel in which the color is partly determined by the foreground object and partly by the background object. The alpha value reflects the mix ratio. Henceforth these mixed pixels are also referred to as ‘uncertain’ pixels. This mixing of colors is also called blending. For encoding purposes, the alpha values may be retrieved from existing data, manually assigned or estimated. An alpha estimation algorithm typically uses spatially nearby samples from the foreground and background to estimate a value for alpha for all pixels in the ‘uncertain’ region. To facilitate this estimation process, a so-called trimap is first produced, indicating for each pixel whether it is foreground, background or uncertain. Multiple spatially nearby samples are then taken from the foreground and background in order to estimate the foreground value, background value and alpha value for a pixel in the uncertain region. When generating a new view, the shifted foreground pixel value is blended with the new background.
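The mix ratio described above is commonly written as the compositing relation C = αF + (1 − α)B, where F is the foreground color, B the background color and α the alpha value. The following is a minimal Python sketch of the trimap classification and the blending step, assuming alpha values normalized to the range 0 to 1 and colors given as RGB tuples. The function names `classify_pixel` and `blend_pixel` and the sample color values are hypothetical and chosen for illustration only.

```python
def classify_pixel(alpha):
    """Return the trimap label for an alpha value in the range 0 to 1."""
    if alpha >= 1.0:
        return "foreground"
    if alpha <= 0.0:
        return "background"
    return "uncertain"  # mixed pixel: color comes partly from each object


def blend_pixel(foreground, background, alpha):
    """Blend two RGB colors as alpha * foreground + (1 - alpha) * background."""
    return tuple(alpha * f + (1.0 - alpha) * b
                 for f, b in zip(foreground, background))


# Illustrative values only: a shifted foreground pixel is blended with the
# background that lies behind its new position in the generated view.
shifted_foreground = (200.0, 100.0, 0.0)
new_background = (0.0, 100.0, 200.0)
mixed = blend_pixel(shifted_foreground, new_background, 0.25)
```

With an alpha value of 0.25, one quarter of the resulting color is contributed by the foreground and three quarters by the background.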
Typically, an alpha map comprises relatively large areas with a value ‘1’ for foreground pixels or ‘0’ for background pixels. In between these areas, the alpha values make a quick transition from ‘0’ to ‘1’ or vice versa. This is, e.g., the case for object transitions where the foreground object is out of focus and for very thin objects such as hair, where it is convenient to use transparency as a mechanism for dealing with these objects. True transparency over larger regions, such as windows, does not occur very often in natural video. The spatially fast changes in alpha maps make them rather inefficient to compress and increase the transmission cost of the video data signal.
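The structure of such an alpha map can be illustrated with a single synthetic row of pixels. The sketch below is an assumption-laden illustration, not data from any real video: it assumes a linear ramp across the transition, and the names `make_alpha_row` and `count_uncertain` as well as the chosen width and edge position are hypothetical.

```python
def make_alpha_row(width, edge_start, edge_width):
    """Build one row of alpha values: 0 (background), a short ramp, then 1 (foreground)."""
    row = []
    for x in range(width):
        if x < edge_start:
            row.append(0.0)
        elif x >= edge_start + edge_width:
            row.append(1.0)
        else:
            # Assumed linear ramp, strictly between 0 and 1 inside the transition.
            row.append((x - edge_start + 1) / (edge_width + 1))
    return row


def count_uncertain(row):
    """Count pixels whose alpha value lies strictly between 0 and 1."""
    return sum(1 for a in row if 0.0 < a < 1.0)


# Illustrative values only: a 64-pixel row with a 4-pixel transition.
row = make_alpha_row(width=64, edge_start=30, edge_width=4)
uncertain = count_uncertain(row)
```

In this synthetic row only 4 of the 64 pixels carry an intermediate alpha value; all other pixels are exactly ‘0’ or ‘1’, and the full change from ‘0’ to ‘1’ takes place within those few pixels.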