Humans are capable of perceiving depth or distance in a three-dimensional world because they are equipped with binocular vision. Their eyes are separated horizontally by about 2.5 inches, and each eye perceives the world from a slightly different perspective. As a result, the images projected onto the retinas of the two eyes are slightly different, and this difference is referred to as binocular disparity. As part of the human visual system, the brain has the ability to interpret binocular disparity as depth through a process called stereopsis. The ability of the human visual system to perceive depth from binocular disparity is called stereoscopic vision.
The principles of stereopsis have long been used to record three-dimensional (3D) visual information by producing two stereoscopic 3D images as perceived by human eyes. When properly displayed, the stereoscopic 3D image pair recreates the illusion of depth in the eyes of a viewer. Stereoscopic 3D images are different from volumetric images or three-dimensional computer graphical images in that they only create the illusion of depth through stereoscopic vision while the latter contain true three-dimensional information. One common way of recording stereoscopic 3D images is to use a stereoscopic 3D camera equipped with a pair of horizontally separated lenses with an inter-ocular distance equal or similar to the human eye separation. Like human eyes, each camera lens records an image; by convention, these are called a left-eye image, or simply a left image, and a right-eye image, or simply a right image. Stereoscopic 3D images can also be produced by other types of 3D image capture devices or, more recently, by computer graphics technology based on the same principle of stereopsis.
When a pair of stereoscopic 3D images is displayed to a viewer, the illusion of depth is created in the brain when the left image is presented only to the viewer's left eye and the right image is presented only to the right eye. Special stereoscopic 3D display devices are used to ensure that each eye sees only its own image. Technologies used in those devices include polarizer filters, time-sequential shutter devices, wavelength notch filters, anaglyph filters and lenticular/parallax barrier devices. Despite the technology differences in those stereoscopic 3D display devices, the depth perceived by a viewer is mainly determined by binocular disparity information. Furthermore, the perceived size of an object in stereoscopic 3D images is inversely related to the perceived depth of the object, which means that the object appears smaller as it moves closer to the viewer. Finally, the inter-ocular distance of the 3D camera also changes the perceived size of the object in the resulting stereoscopic 3D images.
Stereoscopic 3D motion pictures are formed by a pair of stereoscopic 3D image sequences produced by stereoscopic 3D motion picture cameras or by computer graphics or a combination of both. In the following discussion, the term “3D” is used to mean “stereoscopic 3D,” which should not be confused with the same term used in describing volumetric images or computer graphical images that contain true depth information. Similarly, the term “disparity” is used to mean “binocular disparity.”
Producing a 3D motion picture is generally a more costly and more complex process than making a regular two-dimensional (2D) motion picture. A 3D motion picture camera is usually much bulkier and heavier than a regular 2D camera, and it is often more difficult to operate. Special expertise in 3D cinematography is required throughout the entire production process including capturing, VFX, rendering and editing in order to produce good 3D reality. To this day, there are only a relatively small number of 3D motion picture titles available in comparison with a vast library of 2D motion pictures.
An alternative approach to producing 3D motion pictures is to capture images in 2D and digitally convert the resulting footage into 3D images. The basic concept of this approach is that left and right images can be generated from an original 2D image if appropriate disparity values can be assigned to every pixel of the 2D image. The disparity values of an object can be directly calculated from its depth values. An object closer to the viewer produces a larger disparity value than a distant object does. The disparity approaches zero as an object moves away towards infinity. To create believable 3D illusions from a 2D image, correct depth information is needed for the entire image, which can either be computed in some cases or estimated based on a viewer's subjective interpretation of the scene. All depth values assigned to image pixels form an image referred to as a depth map, and the depth map is called dense if depth values are assigned to all pixels of the image. To convert an image sequence into 3D, dense depth maps are collected for all frames in the image sequence, and the resulting image sequence is a depth map sequence.
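The relationship between depth and disparity described above can be illustrated with a minimal sketch. It assumes a parallel-axis pinhole camera pair, for which disparity equals focal length times lens separation divided by depth; this model, the function names and the numeric values are illustrative assumptions and are not taken from any particular conversion system.

```python
def disparity_from_depth(depth, focal_length_px, interocular):
    """Disparity in pixels for a point at `depth` (same unit as `interocular`).

    Assumes a parallel-axis pinhole camera pair: d = f * B / Z.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * interocular / depth


def disparity_map(depth_map, focal_length_px, interocular):
    """Convert a dense depth map (list of rows) into a disparity map of the same shape."""
    return [[disparity_from_depth(z, focal_length_px, interocular) for z in row]
            for row in depth_map]


# Hypothetical values: 1000-pixel focal length, 0.065 m lens separation.
depth_map = [[2.0, 2.0, 50.0],
             [2.0, 50.0, float("inf")]]
for row in disparity_map(depth_map, 1000.0, 0.065):
    print(row)
```

Under this model, a nearby point yields a larger disparity than a distant one, and a point at infinite depth yields a disparity of zero.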
To directly estimate a depth map sequence closely matching the real-world scene captured in a 2D image sequence would be a very difficult task. Instead, it is common practice to indirectly estimate the depth maps by defining individual objects in a scene. An object is defined by its surface occupying a volume in a three-dimensional world, and it is also defined by its movement and deformation from one frame to the next. Software tools are available to facilitate the task of defining objects using solid modeling, animation and other techniques. However, due to the existence of motion in a scene, modeling and animating all objects in a 2D scene can be a time-consuming and labor-intensive process.
Modeling an object requires that the object first be separated from the rest of the image over every frame. The most common methods for object separation are rotoscoping and matting. A rotoscoping method separates an object by tracing the contour of the object in every frame. A matting method includes extracting object masks based on luminance, color, motion or even sharpness resulting from lens focus. Both rotoscoping and matting methods are usually performed manually using various types of interactive software tools. Although many software tools provide keyframing and motion tracking capability to speed up the operation, object separation remains labor-intensive and time-consuming.
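As one illustration of luminance-based matting, the following sketch extracts a binary object mask by keeping pixels whose luminance falls within a chosen range. The Rec. 709 luma weights, the function names and the threshold values are assumptions made for this example; production matting tools are interactive and considerably more elaborate.

```python
def luminance(rgb):
    """Luminance of an (r, g, b) pixel with components in [0, 1] (Rec. 709 weights)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def luma_key_mask(image, low, high):
    """Return a mask with 1 where low <= luminance <= high, and 0 elsewhere."""
    return [[1 if low <= luminance(px) <= high else 0 for px in row]
            for row in image]


# Hypothetical 2x2 frame: a bright object in the left column, dark background on the right.
frame = [[(0.9, 0.9, 0.8), (0.1, 0.1, 0.2)],
         [(0.8, 0.9, 0.9), (0.05, 0.1, 0.1)]]
mask = luma_key_mask(frame, 0.5, 1.0)
print(mask)
```

A mask of this kind only separates objects that differ clearly in brightness from their surroundings, which is one reason manual correction is usually still needed.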
A dense depth map sequence can be computed after all objects have been defined for every frame of the image sequence. The disparity values are then calculated directly from depth values and used to generate 3D images. However, a dense depth map does not guarantee “dense” results. The resulting 3D images inevitably contain “holes” called occlusion regions. An occlusion region is a portion of an object which is occluded by another foreground object. Pixels within an occlusion region have no disparity values because they have no corresponding pixels in the original 2D images. In general, occlusion regions accompany depth discontinuities. In some cases, an occlusion region may be filled with corresponding information about the background object revealed in other image frames. In other cases, the missing information needs to be “faked” or “cloned” in order to fill the holes. Improper occlusion region filling may result in visible artifacts in the 3D images.
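The appearance of occlusion holes can be seen in a minimal sketch that generates one scanline of a new view by shifting each pixel horizontally by its disparity. The function name, the leftward shift direction, the restriction to non-negative integer disparities and the sample data are assumptions made for illustration only.

```python
def warp_scanline(pixels, disparities):
    """Shift each pixel left by its non-negative integer disparity.

    When two source pixels land on the same target, the one with the larger
    disparity (the nearer one) wins. Returns (new_pixels, holes), where holes
    is True at every target position that received no source pixel.
    """
    width = len(pixels)
    out = [None] * width
    out_disp = [-1] * width  # disparity of the pixel stored at each target, -1 if empty
    for x, (value, d) in enumerate(zip(pixels, disparities)):
        tx = x - d
        if 0 <= tx < width and d > out_disp[tx]:
            out[tx] = value
            out_disp[tx] = d
    holes = [d < 0 for d in out_disp]
    return out, holes


# Background "b" at disparity 0, foreground "F" at disparity 2.
pixels      = ["b", "b", "b", "F", "F", "F", "b", "b"]
disparities = [0, 0, 0, 2, 2, 2, 0, 0]
new_pixels, holes = warp_scanline(pixels, disparities)
print(new_pixels)
print(holes)
```

In this example the foreground run moves two pixels to the left and covers part of the background, while the two positions it vacates (indices 4 and 5) receive no pixel at all. Those positions form the occlusion region, which sits at the depth discontinuity on the trailing edge of the foreground object.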
For a given 2D scene, the size and distribution of occlusion regions in the converted 3D images are determined by the choice of camera parameters used for computing disparity from depth. Key camera parameters typically include camera position, inter-ocular distance and lens focal length. Normally, the camera parameters are selected based on the desired 3D look, but minimizing occlusion regions may also be a consideration. The final 3D images are computed with a selected set of camera parameters and with all occlusion regions filled properly.
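The dependence of occlusion size on camera parameters can be sketched as follows. Assuming a parallel-axis pinhole camera pair, in which disparity equals focal length times inter-ocular distance divided by depth, the width of the hole at a depth discontinuity is approximately the difference between the foreground and background disparities. The model, the function name and the values below are illustrative assumptions.

```python
def occlusion_width(z_fg, z_bg, focal_length_px, interocular):
    """Approximate hole width in pixels at a foreground/background depth edge.

    Assumes a parallel-axis pinhole camera pair, so the width is the disparity
    difference f * B * (1 / z_fg - 1 / z_bg).
    """
    return focal_length_px * interocular * (1.0 / z_fg - 1.0 / z_bg)


# Hypothetical scene: foreground at 2 m, background at 10 m, 1000-pixel focal length.
for interocular in (0.03, 0.065, 0.13):
    print(interocular, occlusion_width(2.0, 10.0, 1000.0, interocular))
```

Under this assumption the hole width grows in proportion to both the inter-ocular distance and the focal length, so a more conservative camera setting reduces the amount of missing information that must be filled.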
A feature-length motion picture may contain thousands of image sequences called shots, and each shot may have up to hundreds of image frames. Converting such a motion picture to 3D is a complex production task that requires a supporting production infrastructure comprising computing hardware, software, a management system and a process workflow. The production infrastructure should be scalable to meet challenging production schedules and adaptable to constant version changes to the motion picture footage. Such scalability and adaptability are important to meet motion picture release schedules, especially for the critical day-and-date release schedules. Converting a motion picture that contains computer-generated (CG) scenes may be more efficient using the same production infrastructure because of the availability of depth information.