Stereoscopic imaging is the process of visually combining at least two images of a scene, taken from slightly different viewpoints, to produce the illusion of three-dimensional depth. This technique relies on the fact that human eyes are spaced some distance apart and do not, therefore, view exactly the same scene. Providing each eye with an image from a different perspective tricks the viewer into perceiving depth. Typically, where two distinct perspectives are provided, the component images are referred to as the “left” and “right” images, also known as a reference image and complementary image, respectively. However, those skilled in the art will recognize that more than two viewpoints may be combined to form a stereoscopic image.
In three-dimensional (3D) post-production, visual effects (VFX) workflows and 3D display applications, an important process is to infer a depth map from stereoscopic images consisting of a left eye view image and a right eye view image. For instance, recently commercialized autostereoscopic 3D displays require an image-plus-depth-map input format, so that the display can generate different 3D views to support multiple viewing angles.
The process of inferring the depth map from a stereo image pair is called stereo matching in the field of computer vision research, since pixel or block matching is used to find the corresponding points in the left eye and right eye view images. More recently, the process of inferring a depth map has also become known as depth extraction in the 3D display community. Depth values are inferred from the relative displacement (the disparity) between two pixels in the images that correspond to the same point in the scene.
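The relation between pixel displacement and depth can be illustrated with a minimal sketch. It assumes a rectified image pair captured by two parallel cameras, for which depth is commonly taken as the focal length times the baseline divided by the disparity (the horizontal offset between corresponding pixels). The function name `depth_from_disparity` and its parameters are hypothetical and are chosen only for illustration.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity in pixels to a depth in metres.

    Assumes a rectified stereo pair from two parallel cameras, so that
    depth = focal_length * baseline / disparity.
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point infinitely far away.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera: 1000-pixel focal length, 0.5 m baseline.
near = depth_from_disparity(20, focal_length_px=1000, baseline_m=0.5)
far = depth_from_disparity(10, focal_length_px=1000, baseline_m=0.5)
```

Under these assumptions a larger disparity yields a smaller depth, so `near` is less than `far`.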
Stereo matching of digital images is widely used in many computer vision applications (such as, for example, fast object modeling and prototyping for computer-aided drafting (CAD), object segmentation and detection for human-computer interaction (HCI), video compression, and visual surveillance) to provide 3D depth information. Stereo matching obtains images of a scene from two or more cameras positioned at different locations and orientations in the scene. These digital images are obtained from each camera at approximately the same time, and points in each of the images are matched that correspond to the same 3D point in space. In general, points from different images are matched by searching a portion of the images and using constraints (such as an epipolar constraint) to correlate a point in one image to a point in another image.
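The constrained search described above can be sketched as follows. This is an illustrative example only, not any particular prior-art method: it assumes rectified grayscale images stored as lists of rows, so that the epipolar constraint reduces the search to the same row, and it uses a sum of absolute differences (SAD) as the matching cost. The names `sad` and `match_pixel` and the parameters `max_disparity` and `half` are hypothetical.

```python
def sad(left, right, y, xl, xr, half):
    """Sum of absolute differences between two square windows."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += abs(left[y + dy][xl + dx] - right[y + dy][xr + dx])
    return total


def match_pixel(left, right, y, x, max_disparity, half=1):
    """Return the disparity d that minimises SAD between the window around
    (y, x) in the left image and the window around (y, x - d) in the right.

    The caller must keep (y, x) at least `half` pixels from the image border.
    """
    best_d, best_cost = 0, None
    for d in range(max_disparity + 1):
        xr = x - d  # epipolar constraint: search along the same row only
        if xr - half < 0:
            break  # the window would leave the right image
        cost = sad(left, right, y, x, xr, half)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic test pair: the left image is the right image shifted by 2 pixels.
width, height, shift = 10, 5, 2
right_img = [[x * x + 3 * y for x in range(width)] for y in range(height)]
left_img = [
    [right_img[y][x - shift] if x >= shift else 0 for x in range(width)]
    for y in range(height)
]
found = match_pixel(left_img, right_img, y=2, x=6, max_disparity=4)
```

Restricting the search to a single row is what keeps the cost of matching each pixel proportional to `max_disparity` rather than to the image area.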
There has been substantial work done on depth map extraction. Most of the prior work on depth extraction focuses on single stereoscopic image pairs rather than videos. However, videos, rather than still images, are the dominant media in the consumer electronics world. For videos, a sequence of stereoscopic image pairs is employed rather than a single image pair. In conventional technology, a static depth extraction algorithm is applied to each frame pair. In most cases, the quality of the output depth maps is sufficient for 3D playback. However, for frames with a large amount of texture, temporal jittering artifacts can be seen because the depth maps are not exactly aligned in the time direction, i.e., over a period of time for a sequence of image pairs. Conventional systems have proposed to stabilize the depth map extraction process along the time direction by enforcing smoothness constraints over the sequence of images. However, if there is large motion in the scene, the motion of objects has to be taken into account in order to accurately predict the depth maps along the time direction.
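As a simple illustration of smoothing along the time direction, the sketch below blends each per-frame depth map with the smoothed map of the previous frame. Exponential blending is used here only as a stand-in for a temporal smoothness constraint and is not taken from any specific conventional system; the function name `smooth_depth_sequence` and the weight `alpha` are hypothetical.

```python
def smooth_depth_sequence(depth_maps, alpha=0.5):
    """Blend each depth map with the smoothed map of the previous frame.

    depth_maps: list of frames, each a list of rows of depth values.
    alpha: weight given to the current frame (1.0 disables smoothing).
    """
    smoothed = []
    prev = None
    for depth in depth_maps:
        if prev is None:
            # The first frame has no predecessor, so it is copied unchanged.
            current = [row[:] for row in depth]
        else:
            current = [
                [alpha * d + (1 - alpha) * p for d, p in zip(row, prev_row)]
                for row, prev_row in zip(depth, prev)
            ]
        smoothed.append(current)
        prev = current
    return smoothed


# One-pixel frames whose estimated depth jitters between 10 and 14.
frames = [[[10.0]], [[14.0]], [[10.0]]]
stable = smooth_depth_sequence(frames, alpha=0.5)
```

Because this sketch blends values at the same pixel coordinate in consecutive frames, it reduces jitter only where the scene is static; when an object moves, depth values belonging to different surfaces are mixed together.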
Therefore, a need exists for techniques to stabilize the depth map extraction process along the time direction to reduce the temporal jittering artifacts. A further need exists for techniques for depth map extraction that take into consideration object motion over time or over a sequence of images.