In general terms, estimating and displaying the shape of an object in the real three-dimensional world using one or more two-dimensional images is a fundamental question in the area of computer vision. Humans perceive the depth of a scene or an object mostly because the views obtained simultaneously by each eye can be combined to form a perception of distance. However, in some specific situations, humans can perceive the depth of a scene or an object with one eye when additional information is available, such as lighting, shading, interposition, pattern or relative size. That is why it is possible, for example, to estimate the depth of a scene or an object with a monocular camera.
New lenticular liquid crystal display (LCD) technology allows still and moving pictures to be displayed with a three-dimensional user perception without the use of, for example, stereo three-dimensional glasses. In a three-dimensional LCD, a sheet of cylindrical lenses (lenticulars) is placed on top of an LCD in such a way that the LCD image plane is located at the focal plane of the lenses. This means that the rays from the eye of an observer looking perpendicularly at the lenses are focused on the portion of the LCD that is in the middle under each lens. Similarly, the rays from an eye looking at the screen from a sideways angle are concentrated on the LCD off-centre underneath each lens. If the LCD underneath each lens is divided into different sub-pixels, then eyes looking at the screen from different angles see different pixels. Furthermore, if the correct image information is put on the different pixels (i.e., a stereo pair of images), then the observer will perceive the image three-dimensionally. Therefore, the image processing and LCD driving require that a depth map be provided together with the flat 2D pictures.
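The sub-pixel arrangement described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each lens covers exactly N adjacent sub-pixel columns and that N equally sized views are available. The function name `interleave_views` and the column-to-view mapping are hypothetical; real panels may use slanted lenses and a different mapping.

```python
def interleave_views(views):
    """Interleave N equally sized views column by column.

    Each view is a list of rows; each row is a list of pixel values.
    Output column c takes its value from view (c % N) at source column
    (c // N), so every group of N adjacent columns (assumed here to sit
    under one lens) holds one column from each view.
    """
    n = len(views)
    if n == 0:
        raise ValueError("at least one view is required")
    height, width = len(views[0]), len(views[0][0])
    for view in views:
        if len(view) != height or any(len(row) != width for row in view):
            raise ValueError("all views must have the same size")
    return [
        [views[c % n][r][c // n] for c in range(n * width)]
        for r in range(height)
    ]


# Two tiny 2x2 "views" standing in for a stereo pair.
left = [[1, 2], [3, 4]]
right = [[10, 20], [30, 40]]
panel = interleave_views([left, right])
```

With two views, the columns of `panel` alternate between the left and right image, which is the condition under which eyes at different angles would see different pixels.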
Although the three-dimensional display market continues to grow, not all video content can become “three-dimensional” at once. Therefore, there is a strong need and desire for developing three-dimensional techniques that give users the ability to interpret two-dimensional information in a three-dimensional sense. Reconstruction of three-dimensional images or models from two-dimensional video sequences has important ramifications in various areas, with applications to recognition, surveillance, site modelling, entertainment, multimedia, medical imaging, video communications, and a myriad of other useful technical applications. This pseudo-three-dimensional case consists of extracting the relevant depth information from flat video content. Specifically, depth extraction from flat two-dimensional content is an ongoing field of research and several techniques are known. For instance, there are known techniques specifically designed for generating depth maps based on the movements of the objects in question.
A common method of approaching this problem is the analysis of several images taken either at the same time from different viewpoints (for example, analysis of the disparity of a stereo pair) or from a single point at different times (for example, analysis of consecutive frames of a video sequence, extraction of motion, analysis of occluded areas, etc.). Yet other techniques use other depth cues, such as a defocus measure. Still other techniques combine several depth cues to obtain a reliable depth estimation.
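As an illustration of the stereo-disparity cue mentioned above, the standard pinhole relation for a rectified stereo pair gives depth as focal length times baseline divided by disparity. The sketch below assumes a rectified pair and known calibration. The function name and parameters are illustrative and are not taken from any of the techniques discussed here.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair (pinhole model).

    disparity_px: horizontal shift of a point between the two images, in pixels
    focal_px:     focal length expressed in pixels (assumed known)
    baseline_m:   distance between the two camera centres, in metres
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Illustrative values only: a larger disparity means a closer point.
near = depth_from_disparity(disparity_px=40, focal_px=800, baseline_m=0.25)
far = depth_from_disparity(disparity_px=10, focal_px=800, baseline_m=0.25)
```

Because depth varies inversely with disparity, small disparity errors on distant points translate into large depth errors, which is one reason several cues are sometimes combined.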
For example, EP 1 379 063 A1 to Konya discloses depth extraction from two-dimensional images based on image segmentation. In particular, it describes a mobile phone that includes a single camera for picking up two-dimensional still images of a person's head, neck and shoulders; a three-dimensional image creation section for providing the two-dimensional still image with parallax information to create a three-dimensional image; and a display unit for displaying the three-dimensional image.
However, the conventional techniques described above for three-dimensional design are often unsatisfactory due to a number of factors. Systems that propose to extract depth from two-dimensional video sequences are mostly based on temporal motion estimation, which generally assumes that a closer object will have the highest movement. This implies a very computationally intensive process. Moreover, systems based on defocus analysis fall short when there is no noticeable focussing disparity, which is the case when pictures are captured with very short focal length optics or poor quality optics, as is likely to occur in low-cost consumer devices. Systems combining several cues are very complex to implement and hardly compatible with a low-cost platform. As a result, a lack of quality and robustness, together with increased costs, contributes to the problems faced in these existing techniques. Therefore, it is desirable to generate depth perception for three-dimensional imaging from two-dimensional objects such as video and animated sequences of images using an improved method and system that avoids the above-mentioned problems and can be less costly and simpler to implement.
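To give a rough sense of why motion-based depth extraction is computationally heavy, the following sketch performs an exhaustive block-matching search using the sum of absolute differences (SAD) and counts the pixel comparisons such a search would need per frame. Block matching is used here only as one common, illustrative form of motion estimation. The function names, block size and search radius are assumptions and do not describe any particular system.

```python
def block_motion(prev, curr, top, left, size, radius):
    """Find the offset (dy, dx) into `prev` that best matches a block of `curr`.

    The block is the size x size region of `curr` whose top-left corner is
    (top, left). Every offset within +/- radius is tried, and the one with
    the lowest sum of absolute differences is returned.
    """
    h, w = len(prev), len(prev[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y0, x0 = top + dy, left + dx
            if y0 < 0 or x0 < 0 or y0 + size > h or x0 + size > w:
                continue  # candidate block falls outside the frame
            sad = sum(
                abs(curr[top + r][left + c] - prev[y0 + r][x0 + c])
                for r in range(size)
                for c in range(size)
            )
            if best is None or sad < best[0]:
                best = (sad, dy, dx)
    return best[1], best[2]


def search_cost(width, height, block, radius):
    """Pixel comparisons per frame for an exhaustive search (borders ignored)."""
    blocks = (width // block) * (height // block)
    candidates = (2 * radius + 1) ** 2
    return blocks * candidates * block * block


# A 2x2 bright patch moves two pixels to the right between frames.
prev = [[0] * 8 for _ in range(8)]
curr = [[0] * 8 for _ in range(8)]
for r in (2, 3):
    for c in (2, 3):
        prev[r][c] = 9
        curr[r][c + 2] = 9
offset = block_motion(prev, curr, top=2, left=4, size=2, radius=3)
comparisons = search_cost(width=720, height=576, block=8, radius=7)
```

Under the assumption stated in the paragraph above, a block with a larger offset magnitude would be treated as closer to the camera. The comparison count grows with the square of the search radius, which illustrates the cost concern for low-cost platforms.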