Stereoscopic imaging is the process of visually combining at least two images of a scene, taken from slightly different viewpoints, to produce the illusion of three-dimensional (“3D”) depth. This technique relies on the fact that human eyes are spaced some distance apart and therefore do not view exactly the same scene. When each eye is provided with an image from a different perspective, the viewer perceives depth. Typically, where two distinct perspectives are provided, the component images are referred to as the “left” and “right” images, also known as a reference image and a complementary image, respectively. However, those skilled in the art will recognize that more than two viewpoints may be combined to form a stereoscopic image.
In 3D post-production, visual effects (“VFX”) workflows, and 3D display applications, an important process is to infer a depth map from stereoscopic images consisting of left eye view and right eye view images. For instance, recently commercialized autostereoscopic 3D displays require an image-plus-depth-map input format, so that the display can generate different 3D views to support multiple viewing angles.
The process of inferring the depth map from a stereo image pair is called stereo matching in the field of computer vision research, since pixel or block matching is used to find the corresponding points in the left eye and right eye view images. More recently, the process of inferring a depth map has also become known as depth extraction in the 3D display community. Depth values are inferred from the relative displacement (the disparity) between two pixels, one in each image, that correspond to the same point in the scene.
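To make the link between pixel displacement and depth concrete, the following is a minimal sketch that assumes a rectified pair from two parallel pinhole cameras, for which the standard relation is Z = f * B / d (depth equals focal length times baseline divided by disparity). The function name, parameter names, and camera values are hypothetical and chosen only for illustration.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in metres for a rectified stereo pair, using Z = f * B / d.

    Assumes parallel pinhole cameras; disparity and focal length are in
    pixels and the baseline (camera separation) is in metres.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera: 800 px focal length, 0.25 m baseline.
near = depth_from_disparity(40.0, 800.0, 0.25)  # larger disparity, closer point
far = depth_from_disparity(10.0, 800.0, 0.25)   # smaller disparity, farther point
```

Under this assumed geometry, depth is inversely proportional to disparity, so a matching error of one pixel changes the estimated depth far more for distant points than for near ones.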
Stereo matching of digital images is widely used in many computer vision applications (for example, fast object modeling and prototyping for computer-aided drafting (CAD), object segmentation and detection for human-computer interaction (HCI), video compression, and visual surveillance) to provide three-dimensional (3-D) depth information. Stereo matching obtains images of a scene from two or more cameras positioned at different locations and orientations in the scene. These digital images are obtained from each camera at approximately the same time, and points in each of the images are matched to the corresponding 3-D point in space. In general, points from different images are matched by searching a portion of the images and using constraints (such as an epipolar constraint) to correlate a point in one image to a point in another image.
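The search described above can be sketched as brute-force block matching. This is an illustrative sketch only, not any particular published algorithm: it assumes a rectified grayscale pair (so the epipolar constraint reduces the search to the same row), uses a sum-of-absolute-differences cost, and represents images as lists of rows. The function names and the `max_disparity` and `half_window` parameters are hypothetical.

```python
def sad_cost(left, right, y, x, d, half_window):
    """Sum of absolute differences between the block centred at (y, x) in
    the left image and the block centred at (y, x - d) in the right image."""
    total = 0
    for dy in range(-half_window, half_window + 1):
        for dx in range(-half_window, half_window + 1):
            total += abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
    return total


def block_match_disparity(left, right, max_disparity=16, half_window=2):
    """Return a per-pixel integer disparity map for the left image.

    Border pixels that cannot hold a full block are left at zero.
    """
    height, width = len(left), len(left[0])
    disparity = [[0] * width for _ in range(height)]
    for y in range(half_window, height - half_window):
        for x in range(half_window, width - half_window):
            best_cost, best_d = float("inf"), 0
            # Epipolar constraint for a rectified pair: candidates lie on
            # the same row, shifted left by d pixels in the right image.
            for d in range(0, min(max_disparity, x - half_window) + 1):
                cost = sad_cost(left, right, y, x, d, half_window)
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y][x] = best_d
    return disparity
```

Because each pixel is matched independently, small changes in image content can flip the winning candidate, which is one reason the results of such a search can vary from frame to frame in video.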
There has been substantial work on depth map extraction. Most of the work on depth extraction focuses on single stereoscopic image pairs rather than videos. However, video, rather than still images, is the dominant medium in the consumer electronics world. For videos, a sequence of stereoscopic image pairs is employed rather than a single image pair. In conventional technology, a static depth extraction algorithm is applied to each frame pair independently. In most cases, the quality of the output depth maps is sufficient for 3D playback. However, for frames with a large amount of texture, temporal jittering artifacts can be seen because the depth maps are not exactly aligned in the time direction, i.e., over a period of time for a sequence of image pairs.
Therefore, a need exists for techniques to stabilize the depth map extraction process along the time direction to reduce the temporal jittering artifacts.