Stereoscopic or three-dimensional (3D) television (3D-TV) is expected to be a next step in the advancement of television. Stereoscopic images that are displayed on a 3D-TV are expected to increase visual impact and heighten the sense of presence for viewers. 3D-TV displays may also provide multiple stereoscopic views, offering motion parallax as well as stereoscopic information.
Successful adoption of 3D-TV by the general public will depend not only on technological advances in stereoscopic and multi-view 3D displays, but also on the availability of a wide variety of program content in 3D. One way to alleviate the likely lack of program material in the early stages of 3D-TV rollout is to find a way to convert two-dimensional (2D) still and video images into 3D images, which would also enable content providers to re-use their vast libraries of program material in 3D-TV.
In order to generate a 3D impression on a multi-view display device, images from different view points have to be presented. This requires either multiple input views consisting of camera-captured images or rendered images based on some 3D or depth information. This depth information can be recorded, generated from multiview camera systems, or generated from conventional 2D video material. In a technique called depth image based rendering (DIBR), images with new camera viewpoints are generated using information from an original monoscopic source image and its corresponding depth map containing depth values for each pixel or groups of pixels of the monoscopic source image. These new images can then be used for 3D or multiview imaging devices. The depth map can be viewed as a gray-scale image in which each pixel is assigned a depth value representing distance to the viewer, either relative or absolute. Alternatively, the depth value of a pixel may be understood as the distance of the point of the three-dimensional scene represented by the pixel from a reference plane that may, for example, coincide with the plane of the image during image capture or display. It is usually assumed that the higher the gray-value (lighter gray) associated with a pixel, the nearer it is situated to the viewer.
A depth map makes it possible to obtain from the starting image a second image that, together with the starting image, constitutes a stereoscopic pair providing a three-dimensional vision of the scene. The depth maps are first generated from information contained in the 2D color images and then both are used in depth image based rendering for creating stereoscopic image pairs or sets of stereoscopic image pairs for 3D viewing. In the rendering process, each depth map provides the depth information for modifying the pixels of its associated color image to create new images as if they were taken with a camera that is slightly shifted from its original and actual position. Examples of the DIBR technique are disclosed in the articles K. T. Kim, M. Siegel, & J. Y. Son, “Synthesis of a high-resolution 3D stereoscopic image pair from a high-resolution monoscopic image and a low-resolution depth map,” Proceedings of the SPIE: Stereoscopic Displays and Applications IX, Vol. 3295A, pp. 76-86, San Jose, Calif., U.S.A., 1998; J. Flack, P. Harman, & S. Fox, “Low bandwidth stereoscopic image encoding and transmission,” Proceedings of the SPIE: Stereoscopic Displays and Virtual Reality Systems X, Vol. 5006, pp. 206-214, Santa Clara, Calif., U.S.A., January 2003; and L. Zhang & W. J. Tam, “Stereoscopic image generation based on depth images for 3D TV,” IEEE Transactions on Broadcasting, Vol. 51, pp. 191-199, 2005.
Advantageously, based on information from the depth maps, DIBR permits the creation of a set of images as if they were captured with a camera from a range of viewpoints. This feature is particularly suited for multiview stereoscopic displays where several views are required.
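The rendering step described above can be illustrated with a minimal sketch. It is not the method of any of the cited articles; it assumes a grayscale image and an 8-bit depth map stored as lists of rows, a purely horizontal camera shift, and a naive hole-filling rule. The names render_shifted_view, fill_holes and max_disparity are hypothetical and chosen only for illustration.

```python
def render_shifted_view(image, depth, max_disparity=4):
    """Shift each pixel horizontally in proportion to its depth value.

    image, depth: lists of rows of ints with identical shape.
    depth uses the usual convention: 255 = nearest, 0 = farthest.
    Returns a new view in which disoccluded pixels (holes) are None.
    """
    height, width = len(image), len(image[0])
    new_view = [[None] * width for _ in range(height)]
    best_depth = [[-1] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d = depth[y][x]
            # Nearer pixels (larger d) move further than distant ones.
            tx = x + round(max_disparity * d / 255)
            # When two pixels land on the same target, the nearer one wins.
            if 0 <= tx < width and d > best_depth[y][tx]:
                new_view[y][tx] = image[y][x]
                best_depth[y][tx] = d
    return new_view


def fill_holes(view):
    """Naive hole filling: copy the closest valid pixel on the left,
    falling back to the closest valid pixel on the right."""
    filled = [row[:] for row in view]
    for row in filled:
        last = None
        for x in range(len(row)):            # left-to-right pass
            if row[x] is None:
                row[x] = last
            else:
                last = row[x]
        last = None
        for x in range(len(row) - 1, -1, -1):  # right-to-left pass
            if row[x] is None:
                row[x] = last
            else:
                last = row[x]
    return filled


image = [[10, 20, 30, 40, 50, 60]]
depth = [[0, 0, 255, 255, 0, 0]]   # the two middle pixels are nearest
warped = render_shifted_view(image, depth, max_disparity=2)
second_view = fill_holes(warped)
```

Repeating the call with different values of max_disparity (including negative ones) would give a set of views from a range of virtual camera positions, which is the property that makes DIBR attractive for multiview displays.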
One problem with conventional DIBR is that accurate depth maps are expensive or cumbersome to acquire either directly or from a 2D image. For example, a “true” depth map can be generated using a commercial depth camera such as the ZCam™ available from 3DV Systems, Israel, that measures the distance to objects in a scene using an infra-red (IR) pulsed light source and an IR sensor sensing the reflected light from the surface of each object. Depth maps can also be obtained by projecting a structured light pattern onto the scene so that the depths of the various objects can be recovered by analyzing distortions of the light pattern. Disadvantageously, these methods require highly specialized hardware and/or cumbersome recording procedures, and they impose restrictive scene lighting and limited scene depth.
Although many algorithms exist in the art for generating a depth map from a 2D image, they are typically computationally complex and often require manual or semi-automatic processing. For example, a typical step in the 2D-to-3D conversion process may be to generate depth maps by examining selected key frames in a video sequence and to manually mark regions that are foreground, mid-ground, and background. Specially designed computer software may then be used to track the regions in consecutive frames and to allocate the depth values according to the markings. This type of approach requires trained technicians, and the task can be quite laborious and time-consuming for a full-length movie. Examples of prior art methods of depth map generation which involve intensive human intervention are disclosed in U.S. Pat. Nos. 7,035,451 and 7,054,478 issued to Harman et al.
Another group of approaches to depth map generation relies on extracting depth from the level of sharpness, or blur, in different image areas. These approaches are based on the realization that there is a relationship between the depth of an object, i.e., its distance from the camera, and the amount of blur of that object in the image, and that the depth information in a visual scene may be obtained by modeling the effect that a camera's focal parameters have on the image. Attempts have also been made to generate depth maps from blur without knowledge of camera parameters by assuming a general monotonic relationship between blur and distance. However, extracting depth from blur may be a difficult and/or unreliable task, as the blur found in images can also arise from other factors, such as lens aberration, atmospheric interference, fuzzy objects, and motion. In addition, substantially the same degree of blur arises for objects that are farther away than the focal plane of the camera and for objects that are closer to the camera than the focal plane. Although methods to overcome some of these problems and to arrive at more accurate and precise depth values have been disclosed in the art, they typically require more than one exposure to obtain two or more images. A further disadvantage of this approach is that it does not provide a simple way to determine depth values for regions in which there is no edge or texture information and in which, therefore, no blur can be detected.
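The near/far ambiguity noted above can be made concrete with a short sketch. It assumes the standard thin-lens model, which is a simplification that the text itself does not prescribe; the function name blur_circle_diameter and the numeric camera settings are hypothetical values chosen for illustration only.

```python
import math


def blur_circle_diameter(object_dist, focus_dist, focal_length, aperture_diam):
    """Diameter of the blur circle on the sensor under the thin-lens model.

    All quantities share one unit (metres here). The blur is zero when the
    object lies exactly at the focus distance and grows on both sides of it.
    """
    return (aperture_diam * focal_length * abs(object_dist - focus_dist)
            / (object_dist * (focus_dist - focal_length)))


# Assumed camera settings: 50 mm lens, 25 mm aperture, focused at 2 m.
focal_length = 0.050
aperture_diam = 0.025
focus_dist = 2.0

near_blur = blur_circle_diameter(1.5, focus_dist, focal_length, aperture_diam)
far_blur = blur_circle_diameter(3.0, focus_dist, focal_length, aperture_diam)
in_focus = blur_circle_diameter(2.0, focus_dist, focal_length, aperture_diam)
```

Under these assumptions an object at 1.5 m and an object at 3.0 m produce the same blur, so a measured blur value alone cannot tell which side of the focal plane an object lies on.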
A recent U.S. patent application 2008/0247670, which is assigned to the assignee of the current application and is by the same inventors, discloses a method of generating surrogate depth maps based on one or more chrominance components of the image. Although these surrogate depth maps can have regions with incorrect depth values, the perceived depth of the rendered stereoscopic images using the surrogate depth maps has been judged to provide enhanced depth perception relative to the original monoscopic image when tested on groups of viewers. It was speculated that depth is enhanced because in the original color images, different objects are likely to have different hues. Each of the hues has its own associated gray level intensity when the image is separated into its component color images and these are used as surrogate depth maps. Thus, the color information provides an approximate segmentation of “objects” in the images, which are characterized by different levels of gray in the color component image. Hence the color information provides a degree of foreground-background separation. In addition, slightly different shades of a given hue would give rise to slightly different gray level intensities in the component images. Within an object region, these small changes would signal small changes in relative depth across the surface of the object, such as the undulating folds in clothing or in facial features. Because using color information to substitute for depth can lead to depth inaccuracies, in some cases the visual perception of 3D images generated using these surrogate depth maps can be further enhanced by modifying these depth maps by changing the depth values in selected areas.
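As a rough illustration of how a chrominance component can stand in for depth, the following sketch extracts one such component from an RGB image and uses it directly as a gray-scale map. The choice of the Cr component with ITU-R BT.601 full-range coefficients is an assumption made for this example, as are the function names; the text above states only that one or more chrominance components are used.

```python
def cr_component(r, g, b):
    """Cr (red-difference chroma) of an 8-bit RGB pixel, using BT.601
    full-range coefficients, rounded and clamped to 0..255."""
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return max(0, min(255, round(cr)))


def surrogate_depth_map(rgb_image):
    """Use the Cr value of every pixel as its surrogate depth value.

    rgb_image: list of rows of (r, g, b) tuples.
    Returns a list of rows of ints (higher = treated as nearer).
    """
    return [[cr_component(*pixel) for pixel in row] for row in rgb_image]


rgb_image = [[(255, 0, 0), (0, 255, 0), (0, 0, 255), (128, 128, 128)]]
surrogate = surrogate_depth_map(rgb_image)
```

With this choice, differently colored regions receive clearly different gray levels (reddish regions appear nearest, greenish regions farthest), which gives the approximate object segmentation described above but also shows why the resulting depth values can be wrong for a given scene.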
Generally, regardless of the method used, depth maps generated from 2D images can contain objects and/or regions with inaccurate depth information. For example, a tree in the foreground could be inaccurately depicted as being in the background. Although this can be corrected by a user with photo editing software by identifying and selecting objects or regions in the image and then changing the depth contained therein, this task can be tedious and time-consuming, especially when it has to be done for images in which there are many different minute objects or textures. In addition, the need to manually correct all similar frames in a video sequence can be daunting. Furthermore, even though commercially available software applications for generating depth maps from standard 2D images can be used for editing of depth maps, they typically involve complex computations and require long computational time. For example, one commercial software package allows for manual seeding of a depth value within an object of an image, followed by automatic expansion of the area of coverage by the software to cover the region considered to be within an “object,” such as the trunk of a tree or the sky; however, where and when to stop the region-growing is a computationally challenging task. Furthermore, for video clips the software has to track objects over consecutive frames, and this requires further complex computations.
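The seed-and-grow style of editing mentioned above can be sketched as follows. This is a generic flood-fill illustration, not the algorithm of any particular commercial product; the stopping rule (a fixed intensity tolerance relative to the seed pixel) and the names grow_region, set_region_depth and tolerance are assumptions made for this example.

```python
from collections import deque


def grow_region(gray, seed, tolerance):
    """Collect the 4-connected pixels whose intensity is within
    `tolerance` of the seed pixel's intensity.

    gray: list of rows of ints; seed: (row, column) tuple.
    Returns a set of (row, column) tuples.
    """
    height, width = len(gray), len(gray[0])
    seed_value = gray[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < height and 0 <= nx < width
                    and (ny, nx) not in region
                    and abs(gray[ny][nx] - seed_value) <= tolerance):
                region.add((ny, nx))
                queue.append((ny, nx))
    return region


def set_region_depth(depth, region, value):
    """Return a copy of the depth map with `value` written into the region."""
    return [[value if (y, x) in region else depth[y][x]
             for x in range(len(depth[0]))]
            for y in range(len(depth))]


gray = [[10, 12, 200],
        [11, 13, 210],
        [205, 198, 202]]
depth = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
region = grow_region(gray, seed=(0, 0), tolerance=5)
edited = set_region_depth(depth, region, 255)
```

Even in this toy form the difficulty is visible: the result depends entirely on the tolerance, and a value that isolates one object cleanly may leak into neighboring regions or stop short in another image or frame.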
Furthermore, having an efficient method and tools for modifying depth maps can be advantageous even when the original depth map sufficiently reflects the real depth of the actual scene from which the image or video was created, for example for creating striking visual effects. For example, just as a director might use sharpness to make a figure stand out from a blurred image of the background, a director might want to provide more depth to a figure to make it stand out from a receded background.
Accordingly, there is a need for efficient methods and systems for modifying existing depth maps in selected regions thereof.
In particular, there is a need to reduce computational time and complexity to enable the selection of pixels and regions to conform to object regions such that they can be isolated and their depth values adjusted, for improved contrast or accuracy. Being able to do that manually for one image frame and then automatically repeat the process for other image frames with similar contents is a challenge.
An object of the present invention is to provide a relatively simple and computationally efficient method and a graphical user interface for modifying existing depth maps in selected regions thereof for individual monoscopic images and monoscopic video sequences.