Three-dimensional (3D) video technology continues to grow in popularity, and 3D technology capabilities have evolved rapidly in recent years. Production studios are now developing a number of titles for 3D cinema release each year, and 3D-enabled home cinema systems are widely available. Research in this sector continues to gain momentum, fuelled by the success of current 3D product offerings and supported by interest from industry, academia and consumers.
The term 3D is usually used to refer to a stereoscopic experience, in which an observer's eyes are provided with two slightly different images of a scene, which images are fused in the observer's brain to create an impression of depth. This effect is typically used in 3D films for cinema release and provides an excellent 3D experience to a stationary observer. However, stereoscopic technology is merely one technique for producing 3D video images. Free viewpoint television (FTV) is a new audiovisual system that allows observers to view 3D video content while freely changing position in front of a 3D video display. In contrast to stereoscopic technology, which requires the observer to remain stationary to experience the 3D content, FTV allows an observer to view a scene from many different angles, greatly enhancing the impression of being actually present within the scene.
The FTV functionality is enabled by capturing a scene using many different cameras which observe the scene from different angles or viewpoints. These cameras generate what is known as multiview video. Multiview video can be encoded relatively efficiently by exploiting both temporal and spatial similarities that exist in different views. However, even with multiview coding (MVC), the transmission cost for multiview video remains prohibitively high. To address this, current versions of FTV transmit only a subset of the captured views, typically two or three of the available views. To compensate for the missing information, depth and disparity maps are used to recreate the missing data. From the multiview video and depth/disparity information, virtual views can be generated at any arbitrary viewing position. Many techniques exist in the literature to achieve this, depth image-based rendering (DIBR) being one of the most prominent.
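The generation of a virtual view from texture and disparity information can be illustrated for a single image row. The following is a minimal sketch only, not a definitive DIBR implementation: the function name `warp_scanline`, the integer non-negative disparities and the sign convention (a virtual camera displaced to the right of the reference camera) are all assumptions made for illustration.

```python
def warp_scanline(texture, disparity, hole=None):
    """Forward-warp one image row from a reference view to a virtual view.

    texture   -- pixel values of the row in the reference view
    disparity -- non-negative integer shift (in pixels) for each pixel
    hole      -- marker for target pixels that no source pixel maps to
    """
    width = len(texture)
    target = [hole] * width
    target_disp = [-1] * width  # disparity of the pixel currently stored

    for x in range(width):
        d = disparity[x]
        # Assumed convention: the virtual camera lies to the right of the
        # reference camera, so scene points shift to the left in the image.
        x_new = x - d
        if 0 <= x_new < width and d > target_disp[x_new]:
            # A larger disparity means the point is closer to the camera,
            # so it overwrites any more distant point at the same position.
            target[x_new] = texture[x]
            target_disp[x_new] = d
    return target


# A foreground object "F" (disparity 2) in front of a background (disparity 0).
row = ["b0", "b1", "b2", "F", "F", "b5", "b6", "b7"]
disp = [0, 0, 0, 2, 2, 0, 0, 0]
virtual_row = warp_scanline(row, disp)
print(virtual_row)
```

In this example the foreground pixels move two positions to the left and cover part of the background, while the two positions they vacate receive no value and remain marked as holes.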
A depth map, as used in FTV, is simply a greyscale image of a scene in which each pixel indicates the distance between the corresponding pixel in a video object and the capturing camera. A disparity map is an intensity image conveying the apparent shift of a pixel which results from moving from one viewpoint to another. The link between depth and disparity can be appreciated by considering that the closer an object is to a capturing camera, the greater will be the apparent positional shift resulting from a change in viewpoint. A key advantage of depth and disparity maps is that they contain large smooth surfaces of constant grey levels, making them comparatively easy to compress for transmission using current video coding technology.
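The inverse relationship between depth and disparity can be made concrete with a short sketch. It assumes a pair of rectified, parallel cameras, for which disparity d = f * B / Z, where f is the focal length in pixels, B is the distance between the cameras (the baseline) and Z is the depth. The function name and the numerical values below are illustrative assumptions and do not describe any particular system.

```python
def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Return the disparity in pixels for a point at the given depth.

    Assumes rectified, parallel cameras, so that d = f * B / Z.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# Illustrative values only: 1000 px focal length, 0.1 m baseline.
near = depth_to_disparity(1.0, focal_px=1000.0, baseline_m=0.1)
far = depth_to_disparity(5.0, focal_px=1000.0, baseline_m=0.1)

# The closer object exhibits the greater apparent shift.
print(near > far)
```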
A depth map for a scene or environment may be compiled by measuring depth of objects within the environment using specialised cameras. Structured light and time of flight cameras are two examples of such specialised cameras which may be used to measure depth. In a structured light camera, a known pattern of pixels (often a grid or horizontal bars) is projected onto a scene. Deformation of the known light pattern on striking different objects is recorded and used to calculate depth information for the objects. In a time of flight camera, the round trip time of a projected pulse of light is recorded and used to calculate the required depth information. One drawback of such specialist cameras is that the range of depths that can be measured is limited. Objects that are too close to or too far away from the device cannot be sensed, and hence will have no depth information. A further drawback is the generation of disocclusion gaps. The specific configuration of depth sensing devices generally requires a transmitting device (such as an IR projector) and a recording device (such as an IR camera) to be positioned at a distance from one another. This arrangement results in occlusions in the background of the sensed environment resulting from the presence of foreground objects. Only the foreground objects receive, for example, the projected light pattern, meaning that no depth information can be obtained for the background area that is obscured by the foreground object. These obscured regions of no depth information are known as disocclusion gaps. An example of a disocclusion gap for a structured light depth sensing device is illustrated in FIG. 1. The foreground object obscures the background, preventing a projected light pattern and/or camera view from reaching the background over a limited area, and generating the shaded disocclusion gaps. One gap is created on the left of the object where projected light does not reach the background. A second gap is created on the right of the object where the camera is not able to view the background.
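The time of flight calculation, together with the effect of a limited sensing range, can be sketched as follows. Depth is half the round trip distance travelled by the light pulse. The range limits of 0.5 m and 5.0 m are hypothetical values chosen purely for illustration, as is the function name `tof_depth`.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0


def tof_depth(round_trip_s, min_range_m=0.5, max_range_m=5.0):
    """Return depth in metres from a round trip time in seconds.

    Returns None when the result lies outside the (hypothetical) sensing
    range, mirroring the absence of depth information for such objects.
    """
    depth = SPEED_OF_LIGHT_M_S * round_trip_s / 2.0
    if depth < min_range_m or depth > max_range_m:
        return None
    return depth


in_range = tof_depth(20e-9)   # roughly 3 m, inside the assumed range
too_far = tof_depth(100e-9)   # roughly 15 m, beyond the assumed range
too_near = tof_depth(1e-9)    # roughly 0.15 m, closer than the assumed range
print(in_range, too_far, too_near)
```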
Other issues such as non-reflective surfaces or different intrinsic camera parameters can also result in areas of a scene having missing depth values. FIG. 2a shows image texture of a scene recorded with an image camera. FIG. 2b shows a depth map of the same scene recorded using a depth sensing apparatus. The cross hatched area in the depth map indicates all the areas for which depth information could not be obtained as a result of range limitations, disocclusion gaps or other constraints. It can be seen that a significant proportion of the scene has no available depth data.
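The proportion of a depth map lacking depth data can be quantified with a simple count. The sketch below assumes that a depth map is held as a list of rows and that a value of 0 marks a pixel with no depth information; both the representation and the marker value are assumptions made for illustration, and the small map shown is invented example data rather than the map of FIG. 2b.

```python
MISSING = 0  # assumed marker for "no depth information"


def missing_fraction(depth_map):
    """Return the fraction of pixels in depth_map that have no depth value."""
    total = sum(len(row) for row in depth_map)
    if total == 0:
        raise ValueError("depth map is empty")
    missing = sum(1 for row in depth_map for value in row if value == MISSING)
    return missing / total


# Invented 3 x 4 example map; zeros stand for unavailable depth values.
example_map = [
    [0, 0, 120, 121],
    [0, 118, 119, 0],
    [117, 118, 0, 0],
]
fraction = missing_fraction(example_map)
print(fraction)
```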
A further limitation of depth sensing apparatus is accuracy. Not only is depth data unavailable for large areas of an environment, but those depth values that are available are often “noisy”. Significant post processing may be required to filter and optimise the sensed depth values in order to improve their quality.
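One common form of such post processing is a spatial median filter, sketched below. This is offered as an example of the general idea only and is not presented as the filtering used by any particular apparatus. The sketch assumes a list-of-rows depth map in which 0 marks a missing value; missing pixels are excluded from each 3 x 3 window and are left unfilled.

```python
from statistics import median

MISSING = 0  # assumed marker for "no depth information"


def median_filter_valid(depth_map):
    """Apply a 3 x 3 median filter to the valid pixels of depth_map.

    Missing pixels are ignored when forming each window and are copied
    to the output unchanged, so the filter reduces noise but fills no gaps.
    """
    rows = len(depth_map)
    cols = len(depth_map[0]) if rows else 0
    out = [row[:] for row in depth_map]
    for r in range(rows):
        for c in range(cols):
            if depth_map[r][c] == MISSING:
                continue
            window = [
                depth_map[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if depth_map[rr][cc] != MISSING
            ]
            out[r][c] = median(window)
    return out


# Invented example: one noisy value (180) and one missing value (0).
noisy = [
    [100, 100, 100],
    [100, 180, 100],
    [100, 0, 100],
]
filtered = median_filter_valid(noisy)
print(filtered)
```

In this example the outlying centre value is replaced by the median of its valid neighbours, while the missing pixel remains missing.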
The limitations of depth sensing apparatus may result in a depth map of a scene that is both incomplete and inaccurate, and this can become a significant issue when using the map to generate 3D video images. If depth data for an object is unavailable or inaccurate then the 3D screen displaying the video content will not render the object correctly in three dimensions. This causes significant eye strain for the viewer and degrades the overall 3D experience. Object depth data needs to be both available and as accurate as possible in order to enable 3D video technology to provide a credible reflection of reality.
Existing attempts to address issues with the availability of depth data remain unsatisfactory. Filter and optimisation based frameworks use an iteratively optimised model of a scene and discard the missing depth areas. These methods do not attempt to fill in the missing depth data but merely filter the raw depth data temporally and spatially. Alternative options propose the use of one or more additional cameras together with a stereo-based depth estimation method in order to merge the measured and estimated depth maps. The integration of an additional camera or cameras significantly increases both the computational burden and the practical complexities of these methods.