In human stereo vision, each eye captures a slightly different view of the scene being observed. This difference, or disparity, is due to the baseline distance between the left and right eye of the viewing subject, which results in a different viewing angle and a slightly different image of the scene captured by each eye. When these images are combined by the human visual system, these disparities (along with several other visual cues) allow the observer to gain a strong sense of depth in the observed scene.
Stereo image pairs (created either digitally, through animation or computer generated imagery (CGI), or by traditional photography) exploit the ability of the human brain to combine slightly different images into a perception of depth. In order to mimic this effect, each stereo image pair consists of a left eye image and a right eye image. The complementary images differ from each other in the same manner as the images captured by a human's left and right eyes would differ when viewing the same scene. By presenting the left eye image only to the left eye of a viewer, and the right eye image only to the right eye, the viewer's visual system will combine the images in a similar manner as though the viewer were presented with the scene itself. The result is a similar perception of depth.
Presenting the appropriate images to the left and right eye requires the use of a stereo apparatus, of which there are a number of variations. For viewing a film sequence of stereo images, however, a common setup includes a pair of left and right digital projectors, projecting the left eye image and the right eye image of the stereo pair, respectively, onto the same film screen space. Each projector has a lens which polarizes the light leaving the projector in a different manner. The viewer wears a pair of 3D eyeglasses, the viewing lenses of which have a special property. The left-eye viewing lens screens out light of the type of polarization being projected by the right eye projector, and vice versa. As a result, the left eye sees only the image being projected by the left eye projector, and the right eye sees only the image being projected by the right eye projector. The viewer's brain combines the images as mentioned above, and the stereo perception of depth is achieved. The projectors can be placed side by side, but are often stacked on top of one another in a fashion that minimizes the distance between the projection sources.
An alternative setup replaces the pair of digital projectors with a single projector which alternately displays left eye and right eye images above some minimum display rate. The projector has a synchronized lens which switches polarization in time with the alternating display of the images to keep the left eye and right eye images correctly polarized. Again, a pair of appropriately polarized 3D eyeglasses is worn by the viewer to ensure that each eye sees only the image intended for that eye. A similar approach is employed by the high speed synchronized left- and right-eye imaging found in modern 3D-capable digital televisions.
Although these setups may be suitable for the viewing of stereo image pairs, there are a number of variations on the apparatus that can achieve a stereo depth effect. Essentially, any apparatus that allows for the presentation of two corresponding different images, one to each eye, can potentially be used to achieve the stereo depth effect.
Capturing a stereo pair of images with the aim of reproducing the depth effect as described above is relatively simple. For example, a stereo camera rig can be set up with a pair of synchronized cameras that capture a scene simultaneously. The cameras are separated by a sufficient baseline to account for the distance between the eyes of an average human viewer. In this manner, the captured images will effectively mimic what each individual eye of the viewer would have seen if they were viewing the scene themselves.
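The relationship between baseline and disparity can be illustrated with a minimal sketch. It assumes an idealized pair of parallel pinhole cameras, for which the horizontal disparity of a scene point is the focal length times the baseline divided by the point's depth. The function name, the focal length, and the baseline value are hypothetical and chosen only for illustration; none of them come from the text above.

```python
def disparity_px(focal_length_px, baseline_m, depth_m):
    """Horizontal disparity, in pixels, of a point at depth_m metres.

    Assumes an idealized pair of parallel pinhole cameras; real camera
    rigs and human eyes deviate from this model.
    """
    if depth_m <= 0:
        raise ValueError("depth_m must be positive")
    return focal_length_px * baseline_m / depth_m


# Illustrative values only (assumptions, not taken from the text).
FOCAL_LENGTH_PX = 1000.0   # focal length expressed in pixels
BASELINE_M = 0.0625        # camera separation in metres

near = disparity_px(FOCAL_LENGTH_PX, BASELINE_M, 2.0)    # point 2 m away
far = disparity_px(FOCAL_LENGTH_PX, BASELINE_M, 10.0)    # point 10 m away
print(near, far)
```

Under this model, nearer points produce larger disparities than distant ones, which is the cue the visual system uses when the two images are combined.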
There exists, however, a substantial library of film (or “image streams”) in the industry that were captured by only a single camera. Thus, these image streams only contain two-dimensional information. Various methods have been attempted to convert these 2D image streams into three-dimensional image streams, most providing reasonable results only after expending significant effort and cost.
Creating a sequence of complementary stereo image pairs from a given sequence of one or more images, each captured with only a single camera, to induce the perception of three dimensional depth, has been a difficult problem. The pair must be constructed carefully to mimic the differences a human visual system would expect to observe in a stereo pair as described above, or the perception of depth will fail and the viewer will see an unpleasant jumble of scene elements. Not only must each image pair be correctly organized and/or reconstructed, but the sequence of image pairs must be organized and/or reconstructed consistently so that elements in the scene do not shift unnaturally in depth over the course of the sequence.
The present industry accepted approach to creating a sequence of stereo pairs from a sequence of single 2D images involves three very costly steps.
First, the image sequence of one of the images in the stereo pair is rotoscoped. Rotoscoping is a substantially manual and complicated process performed on image sequences, involving outlining every element in a frame and extending that outline over a filmed sequence, one frame at a time. This requires a human operator to manually process almost every frame of a sequence, tracing out the scene elements so that they can be selected and separately shifted in the image. Manually rotoscoping common elements in film can take hours or days for just a few seconds of a completed shot. Despite being a complex task, rotoscoping results in a rather limited, low-quality selection. For example, in order to separately select a subset of an actor's face so that each element can be modified separately, in addition to outlining the actor, each element would have to be outlined or traced frame by frame for the duration of the scene. Selecting elements at this level of detail is known as a form of segmentation. Segmentation refers to the selection of sub-selections, or parts, of an image (for example, the individual pieces of an actor's face) and keeping those parts separate for creative and technical control. In a more complex scene, with high-speed action, various levels of detail, and crossing objects, rotoscoping as a segmentation tool becomes extremely inefficient due to the increase in complexity of the scene itself. Rotoscoping thus becomes a very cost-intensive process, and is one of the reasons converting 2D to 3D has been so expensive and time consuming.
Close-up shots of actors are very common and present numerous problems for artists using rotoscoping and/or other outlining techniques to create a proper separation of the actor from even a simple background. For example, close-up camera shots appear frequently in feature films, and creating a conversion that successfully includes the fine hairs and other details on an actor's head in such a shot could take a competent artist between one and three days, depending on the segmentation detail required. The technique becomes substantially more difficult in a crowd scene.
Patents have issued for computer enhanced rotoscoping processes for use in converting 2D images into 3D images, such as that described by U.S. Pat. No. 6,208,348 to Kaye, incorporated herein by reference; however, these technologies have done little more than speed up the process of selecting and duplicating objects within the original image into a left-right stereo pair. Each object must still be manually chosen by an outlining mechanism, usually by an operator drawing around the object using a mouse or other computer selection device, and the objects then must be repositioned with object rendering tools in a complementary image and precisely aligned over the entire image sequence in order to create a coherent stereoscopic effect.
Second, for life-like 3D rendering of 2D film that approaches the quality of CGI or film shot by a true stereo 3D camera, the 3D geometry of the scene represented by the image must be virtually reconstructed. The geometry creation required for such a reconstruction is difficult to automate effectively, and each rotoscoped element must be assigned to its respective geometry in the scene. The geometry must then also be animated over the sequence to follow scene elements and produce the desired depth effect. The 2D to 3D conversion of Harry Potter and the Half-Blood Prince (2009) involved a similar technique. Each object in the original 2D scene was analyzed and selected by a graphic artist, 3D object models or renditions were created from their 2D counterparts, and the scene was completely or partially recreated in 3D to generate depth information appropriate to create a stereoscopic image. IMAX Corporation's computer system processed the information to generate the correct offset images in the complementary images of the stereo pair. See Lewis Wallace, Video: How IMAX Wizards Convert Harry Potter to 3-D, WIRED.COM, Aug. 6, 2009 (last visited Aug. 26, 2010), http://www.wired.com/underwire/2009/08/video-how-imax-wizards-convert-harry-potter-to-3-d. Significant drawbacks of recreating entire scenes in 3D include requiring a perfect camera track and solution for every shot, countless manual labor hours and/or artist oversight to create complex geometry to perfectly match and animate within the environment, and enormous processing power and/or time to render those elements together. Similarly, the approach of U.S. Pat. No. 6,208,348 to Kaye applies the curvature of simple shapes such as cylinders and spheres (as shown by FIGS. 12E, F, G, H of that patent) to the image to create a semblance of dimensionality, which is extremely limiting, and results in images that are not truly life-like.
Third, the elements of the scene are then shifted or moved horizontally and placed in the complementary image of the stereo pair. Shifting of scene elements is necessary in order to produce the disparities between the first and second eye that the human visual system would expect to observe in a stereo image pair. However, in captured images, the process of shifting 2D elements reveals ‘holes’ in areas that were previously occluded by the shifted elements. Essentially, no visual information exists for these areas once the occluding elements have moved. For example, in a digital image of a person standing in front of a store, the image of the person hides, or occludes, a portion of the store in the background. If this person is digitally shifted, no image information will remain where the person was originally positioned in the image. These image areas left blank by the process of shifting elements must be refilled. Whether the scene was reconstructed and re-imaged, or whether the rotoscoped elements were shifted manually in the image to produce the disparities required for depth perception, one or both images in the pair will have missing information. That is, occluding objects in the scene, once shifted in the reconstruction or otherwise, will reveal portions of the scene for which there is no information contained in the image. This missing information is, in general, very difficult to create automatically, and requires a human operator to manually fill in this information on virtually every frame. U.S. Pat. No. 6,208,348 to Kaye describes a method of pixel duplication to fill the holes by repeating an equivalent number of pixels horizontally in the opposite direction of the required directional placement. However, this “pixel repeat” results in a very unrealistic image, and thus manual painting of those holes frame by frame is usually required for an optimal result.
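The hole problem and a naive “pixel repeat” fill can be illustrated on a single scanline. The following is a minimal sketch under stated assumptions, not the method of the Kaye patent: the function names, the one-dimensional pixel list, and the boolean foreground mask are all hypothetical, and the fill simply copies the nearest known pixel from the side opposite the shift direction.

```python
def shift_element(row, mask, shift):
    """Move the masked (foreground) pixels of one scanline by `shift` pixels.

    Positions vacated by the foreground have no source data and are
    marked None (the 'holes').
    """
    width = len(row)
    # Keep background pixels; vacated foreground positions become holes.
    out = [None if mask[x] else row[x] for x in range(width)]
    # Paste the foreground at its shifted position, covering background there.
    for x in range(width):
        if mask[x] and 0 <= x + shift < width:
            out[x + shift] = row[x]
    return out


def fill_holes_by_repeat(row, shift):
    """Fill each hole by repeating the neighbouring pixel that lies on the
    side opposite the shift direction (a simplified 'pixel repeat')."""
    out = list(row)
    step = 1 if shift >= 0 else -1
    order = range(len(out)) if step == 1 else range(len(out) - 1, -1, -1)
    for x in order:
        src = x - step
        if out[x] is None and 0 <= src < len(out):
            out[x] = out[src]
    return out


# Toy scanline: 99 marks the foreground element, other values are background.
row = [10, 11, 99, 99, 14, 15]
mask = [False, False, True, True, False, False]

shifted = shift_element(row, mask, 1)
filled = fill_holes_by_repeat(shifted, 1)
print(shifted)  # [10, 11, None, 99, 99, 15]
print(filled)   # [10, 11, 11, 99, 99, 15]
```

In the toy output, the background value 12 or 13 that would actually sit behind the element is never recovered; the hole is filled with a copy of its neighbor, which hints at why repeated pixels look unrealistic on real imagery.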
Over the years, as described above, there has been a collective effort by those in the visual effects industry engaged in 2D to 3D conversion to create new visual material for the occlusions or blanks. How to create new occluded visual information was a primary topic of discussion at industry trade shows. It was thought that creating the occluded new visual information was the logical thing to do because it best simulates the experience with binocular vision.
What has been needed, and heretofore unavailable, is a system and process that avoids the need for the time and cost intensive practice of rotoscoping or manually processing each frame of a sequence by tracing out the scene elements, building or reconstructing 3D geometry, 3D scene tracking, as well as image reconstruction and mapping and high-quality rendering of image information, while at the same time providing a reliable system and process for rapidly transforming a 2D monocular sequence into a sequence of stereo image pairs, reducing human interaction, and improving fidelity and detail.