1. Technical Field
The present invention relates to the efficient representation and communication of synthesized perspective views of three-dimensional objects, and more specifically to paring down the number of Internet packets that must be sent in real-time at a network client's request to support interactive video sessions.
2. Description of the Prior Art
The limited bandwidth of Internet connections severely constrains the interactive real-time communication of graphical images, especially three-dimensional images. Ordinarily, dozens of video cameras would be trained on a single three-dimensional subject, each from a different perspective. A user could then pick one of the perspectives to view, or pick one that can be interpolated from several of the nearby perspectives. But sending all this information in parallel and computing all the interpolations that many users could request can overtax the server and its Internet pipe.
It is possible to represent solid objects with so-called “voxels”. These are the three-dimensional equivalents of pixels which are used to paint two-dimensional pictures. Each voxel has an x,y,z address in space, and a value that indicates whether the point is inside or outside the solid. The voxel map can be computed from the video images provided by a sufficient number of perspectives. The surface appearance of the solid can also be captured by each such camera. Interpolated intermediate images can be had by warping or morphing.
U.S. Pat. No. 5,613,048, issued to Chen and Williams, describes a first approach for interpolating solid structures. An offset map is developed between two neighboring images from correspondence maps. Such Patent is incorporated herein by reference.
A second approach uses the structural information about an object. Voxel information is derived from the video images provided by several cameras. Depth maps can be calculated for each camera's viewpoint, and are obtained from correspondences between surface points, e.g., by triangulation. Another technique intersects the silhouettes seen by the cameras. Once the voxels for a solid are determined, intermediate (virtual) views can be obtained from neighboring (real) views.
Prior art methods for the three-dimension reconstruction of remote environments consume enormous computational and communication resources, and require far too many sensors to be economically feasible. So real-time applications are practically impossible with conventional techniques for modeling and rendering object appearance.
Recent advances at the “Virtualized Reality” laboratory at Carnegie Mellon University (CMU) demonstrate that real-time three-dimension shape reconstruction is possible. Video-based view generation algorithms can produce high-quality results, albeit with small geometric errors.
Research in three-dimension reconstruction of remote environments has shown that it is possible to recover both object appearance and sounds in remote environments. The methods for modeling object appearance, however, consume enormous computational and communication resources, and require far too many sensors to be economically feasible. These traits make real-time applications nearly impossible without fundamental algorithmic improvements. We therefore focus our attention on techniques for modeling and rendering object appearance, which can loosely be divided into three groups: direct three-dimension, image-space, and video-based.
Direct methods of three-dimension reconstruction measure the time-of-flight or phase variations in active illumination reflected from the scene. These measurements are converted directly into measurements of three-dimension distances. Because these methods rely on active illumination, multiple sensors cannot coexist in the same environment. As a result, they are inappropriate for real-time three-dimension reconstruction of complete environments.
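As a hedged illustration of how such measurements become distances, the following Python sketch converts a round-trip time, or a phase shift, into a range. The function names and the assumption of a sinusoidally modulated light source with a known modulation frequency are illustrative only and are not taken from any of the systems cited herein.

```python
import math

# Speed of light in vacuum, in meters per second.
SPEED_OF_LIGHT = 299_792_458.0


def distance_from_time_of_flight(round_trip_s):
    """Range in meters from the round-trip travel time of a light pulse.

    The light travels to the surface and back, so the one-way
    distance is half of the total path length.
    """
    return SPEED_OF_LIGHT * round_trip_s / 2.0


def distance_from_phase(phase_rad, modulation_hz):
    """Range in meters from the phase shift of modulated light.

    Assumption (not from the source text): the illumination is
    amplitude-modulated at modulation_hz, and phase_rad lies in
    [0, 2*pi), so the result is unambiguous only up to
    SPEED_OF_LIGHT / (2 * modulation_hz).
    """
    return SPEED_OF_LIGHT * phase_rad / (4.0 * math.pi * modulation_hz)
```

For example, `distance_from_time_of_flight(2.0 / SPEED_OF_LIGHT)` corresponds to a surface about one meter from the sensor.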
Image-space methods create a database of all possible rays emanating from every object in all directions. To generate a new image, all the rays that pass through the desired viewpoint are projected on a plane. See, A. Katayama, K. Tanaka, T. Oshino, and H. Tamura, “A Viewpoint Dependent Stereoscopic Display Using Interpolation Of Multi-viewpoint Images”, SPIE Proc. Vol. 2409: Stereoscopic Displays and Virtual Reality Systems II, pp. 11-20, 1995. And see, M. Levoy and P. Hanrahan, “Light Field Rendering”, SIGGRAPH '96, August 1996. Also, S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, “The Lumigraph”, SIGGRAPH '96, 1996. Such references are all examples of image-space methods, and all can produce high-quality images. However, these techniques require thousands of viewpoints, making them impractical for real-time event capture.
Video-based modeling and rendering methods explicitly create three-dimension model structures and use real video images as models of scene appearance. A three-dimension model structure is extracted from a set of video images. New views are generated by projecting the original video images onto a three-dimension model, which can then be projected into the desired viewpoint.
Images from two viewpoints can be used to estimate the three-dimension structure in image-based stereo reconstruction. Given the positions, orientations, and focal lengths of the cameras, correspondences are used to triangulate the three-dimension position of each point of the observed surface. The output is called a depth image or range image, i.e., each pixel is described with a distance rather than a color. A recent survey of stereo algorithms is given by U. R. Dhond and J. K. Aggarwal, in “Structure From Stereo—A Review”, IEEE Trans. On Pattern Analysis and Machine Intelligence, pp. 1489-1510, 1989. While stereo methods can provide three-dimension structure estimates, they are so far unable to produce high-quality, high-accuracy results on a consistent basis across a reasonable variation in scene content.
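The triangulation step can be sketched for the simplest camera arrangement. The Python example below assumes a rectified stereo pair (parallel optical axes, identical focal lengths, purely horizontal offset between the cameras), which is a simplification of the general case described above. The function names, and the use of None to mark pixels without a correspondence, are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in meters of a point seen by a rectified stereo pair.

    focal_px     : focal length expressed in pixels
    baseline_m   : distance between the two camera centers, in meters
    disparity_px : horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


def depth_image(focal_px, baseline_m, disparity_rows):
    """Convert a disparity map into a depth (range) image.

    Pixels with no usable correspondence (None or non-positive
    disparity) are reported as None rather than as a distance.
    """
    return [
        [
            depth_from_disparity(focal_px, baseline_m, d)
            if d is not None and d > 0
            else None
            for d in row
        ]
        for row in disparity_rows
    ]
```

Because depth is inversely proportional to disparity, a one-pixel correspondence error matters far more for distant surfaces than for nearby ones, which is one reason stereo range images tend to be noisy.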
The recovery of complete three-dimension models of a scene requires multiple range images. This is because a single range image includes a three-dimension structure only for the visible surfaces. At the “Virtualized Reality” laboratory at Carnegie Mellon University, the present inventor, Takeo Kanade, has shown that formulating this problem as a volumetric reconstruction process yields high-quality, robust solutions even in the presence of the errors made in the stereo processes. See, P. W. Rander, P. J. Narayanan, and T. Kanade, “Recovery of Dynamic Scene Structure from Multiple Image Sequences”, Int'l Conf. On Multisensor Fusion and Integration for Intelligent Systems, 1996.
The volume containing the objects can be decomposed into small samples, e.g., voxels. Each voxel is then evaluated to determine whether it lies inside or outside the object. When neighboring voxels have different status (i.e., one inside and one outside), then the object surface must pass between them. Such property is used to extract the object surface, usually as a triangle mesh model, once all voxels have been evaluated. The technique is similar to the integration techniques used with direct three-dimension measurement, with some modifications to improve its robustness to errors in the stereo-computed range images. See, B. Curless and M. Levoy, “A Volumetric Method for Building Complex Models from Range Images”, SIGGRAPH '96, 1996. And see, A. Hilton, A. J. Stoddart, J. Illingworth, and T. Windeatt, “Reliable Surface Reconstruction From Multiple Range Images”, Proceedings of ECCV '96, pp. 117-126, April 1996. Also, M. Wheeler, “Automatic Modeling and Localization for Object Recognition”, Ph.D. thesis, Carnegie Mellon University, 1996.
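The inside/outside test on neighboring voxels can be illustrated with a minimal Python sketch. It only locates the voxel faces through which the surface must pass and does not build the triangle mesh itself. The function name, the representation of the solid as a set of integer (x, y, z) addresses, and the choice of six-connected neighbors are assumptions made for illustration.

```python
# Offsets to the six face-adjacent neighbors of a voxel.
NEIGHBOR_OFFSETS = (
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
)


def boundary_faces(inside):
    """Return (inside_voxel, outside_voxel) pairs that straddle the surface.

    inside is a set of (x, y, z) addresses of voxels classified as lying
    within the object. Any address not in the set is treated as outside,
    including addresses beyond the sampled volume.
    """
    faces = []
    for voxel in sorted(inside):
        x, y, z = voxel
        for dx, dy, dz in NEIGHBOR_OFFSETS:
            neighbor = (x + dx, y + dy, z + dz)
            if neighbor not in inside:
                # The two voxels have different status, so the object
                # surface must pass between them.
                faces.append((voxel, neighbor))
    return faces
```

A single isolated voxel yields six boundary faces, and a solid 2x2x2 block yields twenty-four, four on each side of the block.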
A principal limitation of these methods is the processing speed. For example, CMU clustered seventeen Intel Pentium II-based PCs and interconnected them with a 10BASE-T Ethernet network, and still needed more than 1000 seconds to process each second of video input.
Once a three-dimension structure is available, two methods can be used to generate arbitrary viewpoints. One method computes the “fundamental” appearance of the objects in the scene, independent of viewpoint. The result is a texture map for the three-dimension objects in the scene. This formulation maps well to modern hardware graphics accelerators because the core rendering primitive is a texture-mapped triangle. CMU and others have used this technique. See, A. Katkere, S. Moezzi, D. Y. Kuramura, and R. Jain, “Towards Video-based Immersive Environments”, MultiMedia Systems, vol. 5, no. 2, pp. 69-85, 1997; S. Moezzi, A. Katkere, D. Y. Kuramura, and R. Jain, “Reality Modeling and Visualization from Multiple Video Sequences”, IEEE Computer Graphics and Applications, vol. 16, no. 6, pp. 58-63, 1996; P. J. Narayanan, P. W. Rander, and T. Kanade, “Constructing Virtual Worlds Using Dense Stereo”, IEEE Int'l Conf. On Computer Vision, 1998; and, P. W. Rander, P. J. Narayanan, and T. Kanade, “Virtualized Reality: Constructing Time-Varying Virtual Worlds from Real World Events”, IEEE Visualization '97, 1997.
A second method skips the step of texture map creation. Instead, it maps the input images directly to the output image. Skipping the texture map creation helps avoid the quality degradations that might occur because of extra pixel transformations and any geometric errors in the three-dimension model. The three-dimension information, either a full three-dimension model or range images, is used to determine how each input pixel should map to the output. The input images are essentially projected onto a three-dimension scene structure, and the structure is then projected into the desired output image, all in a single operation.
It is possible to individually weight the contributions to the output image because each input image can be mapped separately to the output. For example, as the desired viewpoint approaches a real viewpoint, the weighting can emphasize the contribution of that real view while de-emphasizing the other real views. This technique has been explored by the present inventor, Takeo Kanade, and others. See, S. E. Chen and L. Williams, “View Interpolation For Image Synthesis”, SIGGRAPH '93, pp. 279-288, 1993. And see, B. Curless and M. Levoy, “A Volumetric Method For Building Complex Models From Range Images”, SIGGRAPH '96, 1996. Also, T. Kanade, P. J. Narayanan, and P. W. Rander, “Virtualized Reality: Concept And Early Results”, IEEE Workshop on the Representation of Visual Scenes, June 1995. And, P. W. Rander, P. J. Narayanan, and T. Kanade, “Virtualized Reality: Constructing Time-Varying Virtual Worlds From Real World Events”, IEEE Visualization '97, 1997; S. M. Seitz and C. R. Dyer, “Physically-Valid View Synthesis By Image Interpolation”, Proc. Workshop on Representation of Visual Scenes, pp. 18-25, 1995; and, S. M. Seitz and C. R. Dyer, “View Morphing”, SIGGRAPH '96, pp. 21-30, 1996.
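One possible weighting scheme can be sketched in Python as follows. The inverse-angle rule used here is an assumption chosen for illustration and is not asserted to be the rule used in any of the cited works. The function names are hypothetical, and the viewing directions are assumed to be non-zero three-component vectors.

```python
import math


def _angle_between(a, b):
    """Angle in radians between two non-zero direction vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def view_weights(desired_dir, real_dirs, eps=1e-9):
    """Weights for blending real views into a desired virtual view.

    Views whose direction is angularly closer to the desired direction
    receive larger weights, and the weights sum to one. If the desired
    direction coincides with a real view, that view receives all of
    the weight, so the output reproduces that input image.
    """
    angles = [_angle_between(desired_dir, d) for d in real_dirs]
    for index, angle in enumerate(angles):
        if angle < eps:
            return [1.0 if i == index else 0.0 for i in range(len(angles))]
    inverse = [1.0 / angle for angle in angles]
    total = sum(inverse)
    return [w / total for w in inverse]


def blend_pixel(values, weights):
    """Weighted combination of one pixel's value from each mapped input view."""
    return sum(v * w for v, w in zip(values, weights))
```

With this rule, a desired view lying exactly between two real views weights them equally, and the weighting shifts smoothly toward one of them as the desired viewpoint approaches it.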
With image-based methods, the view generation time is independent of scene complexity, so the rendering of purely virtual three-dimension content on low-end PCs can be sped up. When a desired viewpoint exactly matches an input viewpoint, the output image is exactly the input image. As a result, the output contains no error, regardless of any error in the underlying three-dimension structure. Recent analysis at the “Virtualized Reality” laboratory at CMU has shown that the most critical three-dimension information that is needed is the boundaries between regions of the images, especially those regions that correspond to surfaces at greatly different depths. See, H. Saito, S. Baba, M. Kimura, S. Vedula, and T. Kanade, “Appearance-Based Virtual View Generation Of Temporally-Varying Events From Multi-Camera Images In The Three-dimension Room”, Three-dimension Digital Imaging and Modeling (3DIM'99), October 1999 (also CMU-CS-99-127).
Such boundaries, often called silhouettes or occluding contours, provide powerful visual cues to human observers. Methods that do not accurately describe such boundaries cause glaring errors that are easy to spot. In such cases, any realism of the virtual image vanishes. See, H. Saito, S. Baba, M. Kimura, S. Vedula, and T. Kanade, “Appearance-Based Virtual View Generation Of Temporally-Varying Events From Multi-Camera Images In The Three-dimension Room”, Three-dimension Digital Imaging and Modeling (3DIM'99), October 1999 (also CMU-CS-99-127). In contrast, humans rarely detect inaccuracies of surface geometry because the human visual system is much less sensitive to this type of error.
Recent analysis has shown that identification of occluding contours is far more important than precise estimation of smooth surface structure. With this insight, recent efforts at CMU have focussed on recovering the three-dimension scene structure from the object silhouettes themselves. This process begins by extracting object silhouettes from the input images. These silhouettes are directly integrated in three-dimension to recover a three-dimension model of scene structure.
This process bears close resemblance to the earlier CMU work of using stereo to compute dense range images and then using integration to get a three-dimension model. In method embodiments of the present invention, a reconstruction process estimates correspondences only at silhouette boundaries. The correspondence estimation occurs directly in a volume, rather than using the intermediate representation of a range image. In method embodiments of the present invention, computational cost is greatly reduced, and generated views are more realistic.
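The integration of silhouettes directly in a volume can be illustrated with a simplified Python sketch. It keeps a voxel only when its projection falls inside the silhouette seen from every view. For brevity it assumes orthographic views along the coordinate axes, whereas a real system would use calibrated perspective cameras. The function name and the data layout are hypothetical, and the sketch is not asserted to be the reconstruction process of the present invention.

```python
# Assumed orthographic projections: looking along an axis drops that coordinate.
PROJECTIONS = {
    "x": lambda x, y, z: (y, z),
    "y": lambda x, y, z: (x, z),
    "z": lambda x, y, z: (x, y),
}


def carve_from_silhouettes(shape, silhouettes):
    """Return the set of voxels consistent with every silhouette.

    shape       : (nx, ny, nz) size of the voxel grid
    silhouettes : dict mapping a viewing axis ("x", "y" or "z") to the set
                  of 2-D pixel addresses covered by the object in that view
    """
    nx, ny, nz = shape
    kept = set()
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                # A voxel survives only if no view sees background
                # at the pixel onto which the voxel projects.
                if all(
                    PROJECTIONS[axis](x, y, z) in pixels
                    for axis, pixels in silhouettes.items()
                ):
                    kept.add((x, y, z))
    return kept
```

For example, on a 3x3x3 grid where each of the three views shows only the center pixel, `carve_from_silhouettes((3, 3, 3), {"x": {(1, 1)}, "y": {(1, 1)}, "z": {(1, 1)}})` leaves only the center voxel. Adding views can only remove voxels, so the result tightens as more cameras contribute.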
Useful three-dimension structures can be obtained without large computational expense. The video-based rendering techniques developed at CMU provide high-quality renderings that are immune to small geometric errors on continuous surfaces. These methods can be combined to create an interactive remote collaboration system. Reconstruction from several cameras at one end generates multiple video streams and a three-dimension model sequence. This information is then used to generate the novel viewpoints using video-based rendering techniques.
In constructing the system, several factors influence the overall design: the number of sites participating, the number of people at each site, the balance among computational resources, communication bandwidth, and communication latency.
For a two-site, one-person-per-site communication with relatively short communication latencies, it is possible to construct the three-dimension shape more efficiently than for a general number of viewers, because knowledge about the remote viewer location can guide the reconstruction process. Similarly, knowing the remote viewer location can be exploited to reduce communication bandwidth and to speed up the rendering process.
To further reduce the communication bandwidth needed, the transferred data can be compressed. For multi-camera video, each video could be encoded using MPEG algorithms. The three-dimension geometry could be reduced to a single bit per voxel, and could then be compressed using volumetric data structures, such as octrees, or using run-length encoding.
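A minimal Python sketch of the run-length option follows. It assumes the voxels have already been flattened into a one-dimensional sequence of 0 (outside) and 1 (inside) values in some fixed scan order. The function names and the (value, count) run format are illustrative choices and not a defined wire format.

```python
def rle_encode(bits):
    """Compress a sequence of 0/1 voxel flags into (value, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            # Same value as the current run: extend it.
            runs[-1][1] += 1
        else:
            # Value changed (or first element): start a new run.
            runs.append([bit, 1])
    return [(value, count) for value, count in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the flat 0/1 sequence."""
    bits = []
    for value, count in runs:
        bits.extend([value] * count)
    return bits
```

For example, `rle_encode([0, 0, 0, 1, 1, 0])` gives `[(0, 3), (1, 2), (0, 1)]`. Solid objects produce long runs of identical values along each scan line, which is why this kind of encoding suits occupancy data.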
Alternative methods for three-dimension modeling and rendering have been developed in recent years, e.g., direct three-dimension modeling and rendering, and also image-space modeling and rendering. The direct three-dimension methods estimate three-dimension structure directly from simple measurements of physical systems. The most common technique is to actively illuminate the scene with a laser, measure either the time of flight or the phase shift of the laser light reflected back to the source, convert this measurement to distance between sensor and illuminated surface, and then compute the surface's three-dimension position. The laser can then be scanned across the scene to capture many three-dimension points in the scene. These techniques are used in commercial products from K2T, Inc. (www.k2t.com) and Cyra Technologies, Inc., as well as in many custom systems. A modification of this technique is to scan a light stripe across the scene.
The other approach is to illuminate the entire scene several times and then to measure the returned light during precise intervals of time. Each illumination can yield another bit of depth resolution, so high resolution can be quickly achieved. This technology is incorporated in commercial products from 3DV Systems Ltd (www.3dvsystems.com).
Either method yields three-dimension information, but only for the surfaces visible from the viewpoint of the sensor. Several researchers have developed algorithms to merge these results into complete three-dimension models. The methods based on volumetric integration have proven most successful. See, B. Curless and M. Levoy, “A Volumetric Method For Building Complex Models From Range Images”, SIGGRAPH '96, 1996; A. Hilton, A. J. Stoddart, J. Illingworth, and T. Windeatt, “Reliable Surface Reconstruction From Multiple Range Images”, Proceedings of ECCV '96, pp. 117-126, April 1996; and M. Wheeler, “Automatic Modeling And Localization For Object Recognition”, Ph.D. thesis, Carnegie Mellon University, 1996.
Two limitations make direct three-dimension measurement impractical. First, multiple sensors cannot coexist in the same environment because of the active scene illumination. With multiple sensors working simultaneously, the illumination from one sensor would interfere with that of the others. In addition, eye safety is always an issue when using lasers around humans. Second, scanning the space with a laser means that three-dimension measurements are made at different times. Such an image, then, is actually a time-sequential sampling of shape, not a snapshot as is captured with a photograph. This sensor characteristic leads to apparent shape distortions for fast-moving objects.
Image-space modeling and rendering is an alternative to explicit three-dimension model recovery. It models all possible light rays emanating from the scene. An image can be considered as a two-dimension bundle of rays from this ray space, so the rendering in this case involves selecting the best rays to produce each pixel in the output image. A surprisingly simple version of this concept is the object movie in Apple QuickTime VR. Several hundred images are captured at precise positions around the object. A viewing program then lets the user manipulate the object, which looks to the user like three-dimension rotation of the real object. In fact, the viewing program is simply selecting the closest view from its database.
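The closest-view selection described above can be sketched in a few lines of Python. The sketch assumes the images were captured at known angles around a single rotation axis. The function name is hypothetical, and no actual QuickTime VR interface is used or implied.

```python
def closest_view(requested_deg, captured_degs):
    """Index of the captured image whose angle is nearest the requested angle.

    Angles are in degrees around one rotation axis and wrap at 360,
    so 350 degrees is only 10 degrees away from 0 degrees.
    """
    def angular_gap(a, b):
        difference = abs(a - b) % 360.0
        return min(difference, 360.0 - difference)

    return min(
        range(len(captured_degs)),
        key=lambda i: angular_gap(requested_deg, captured_degs[i]),
    )
```

With views captured at 0, 90, 180 and 270 degrees, a request for 350 degrees selects the view at index 0. No new image is synthesized, so the perceived rotation is only as smooth as the capture spacing allows.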
More sophisticated examples actually interpolate rays from the input images to synthesize new viewpoints. This approach was first presented in Katayama's work, A. Katayama, K. Tanaka, T. Oshino, and H. Tamura, “A Viewpoint Dependent Stereoscopic Display Using Interpolation Of Multi-viewpoint Images”, SPIE Proc. Vol. 2409: Stereoscopic Displays and Virtual Reality Systems II, pp. 11-20, 1995, which was recently extended into the light field, M. Levoy and P. Hanrahan, “Light Field Rendering”, SIGGRAPH '96, August 1996. Also see, S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, “The Lumigraph”, SIGGRAPH '96, 1996. In the light field method, cameras are precisely positioned to directly sample all of the rays in the space, thereby completely filling the ray space. In the Lumigraph, an algorithm is presented to extrapolate images from a set of arbitrarily-placed cameras to fill the ray space.