Tremendous progress in the computational capability of integrated electronics and increasing sophistication in the algorithms for smart video processing have led to special effects wizardry, which creates spectacular images and otherworldly fantasies. This progress is also bringing advanced video and image analysis applications into the mainstream. Furthermore, video cameras are becoming ubiquitous. Video CMOS cameras costing only a few dollars are already being built into cars, portable computers and even toys. Cameras are being embedded everywhere, in all varieties of products and systems, just as microprocessors are.
At the same time, increasing bandwidth on the Internet and other delivery media has brought widespread use of camera systems to provide live video imagery of remote locations. This has created a desire for an increasingly interactive and immersive tele-presence, a virtual representation capable of making a viewer feel that they are truly at the remote location. In order to provide coverage of a remote site for a remote tele-presence, representations of the environment need to be created to allow realistic viewer movement through the site. The environment consists of static parts (buildings, roads, trees, etc.) and dynamic parts (people, cars, etc.). The geometry of the static parts of the environment can be modeled offline using a number of well-established techniques. None of these techniques has yet provided a completely automatic solution for modeling relatively complex environments, but because the static parts do not change, offline, non-real-time, interactive modeling may suffice for some applications. A number of commercially available systems (GDIS, PhotoModeler, etc.) provide interactive tools for modeling environments and objects.
For general modeling of static scenes, site models provide a viable option. However, site models do not include appearance representations that capture the current and changing appearance of the scene. The dynamic components of a scene cannot, by definition, be modeled once and for all. Even for the static parts, the appearance of the scene changes due to varying illumination and shadows, and through modifications to the environment. For maintaining up-to-date appearance of the static parts of the scene, videos provide a cost-effective and viable source of current information about the scene.
U.S. Pat. No. 6,084,979, “Method for Creating Virtual Reality,” by T. Kanade, P. J. Narayanan, and P. Rander, describes a method of creating images from virtual viewpoints using a dynamically changing internal representation. This internal representation is a three-dimensional object-centered model of the scene, which is created in a two-step process from the images of 51 video cameras mounted in a hemispherical dome. Though the image quality of this system is generally high, the computational complexity of creating the necessary internal representation means that this system operates offline, which makes it unacceptable as an approach for tele-presence. Also, the vast amount of video data that needs to be handled for each frame has led the CMU group to reduce the frame rate to 6 Hz.
It has been previously demonstrated that current videos of a semi-urban environment can be aligned in near real-time to site models. The textured models can then be rendered using standard graphics pipelines. A visual metaphor for this process of combining models with videos is that of video flashlights. The multiple camera views at a given time instant can be considered as video flashlights capturing the scene appearance from their respective viewpoints. The multiple appearances are coherently combined with the model to provide multiple users the ability to navigate through the environment while viewing the current appearance from the video flashlights. A description of video flashlights is contained in “Pose Estimation, Model Refinement, and Enhanced Visualization using Video” by S. Hsu, S. Samarasekera, R. Kumar, and H. S. Sawhney, which appears in CVPR 2000.
While site models and the previously demonstrated video flashlights method provide for very impressive remote viewing systems, they fall somewhat short of the desired interactive tele-presence. For realistic tele-presence of dynamic objects such as human beings, not only the rendering, but also the modeling should be done in real-time. For example, as a person is moving around within a constrained environment such as a room or a courtyard, the users would like to virtually walk around the person under user control. In order to provide continuously changing viewpoints under user control, it is desirable for representations of the dynamic object to be continuously built and maintained.
In traditional graphics-pipeline-based rendering, scene and object models stored as polygonal models and scene graphs are rendered using z-buffering and texture mapping. The complexity of such rendering is dependent on the complexity of the scene. Standard graphics pipeline hardware has been optimized for high performance rendering.
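The dependence on scene complexity can be seen in a minimal sketch of z-buffering. The sketch below is illustrative only and is not taken from any particular graphics pipeline: the function name `zbuffer_render` and the `(x, y, depth, color)` fragment tuples are hypothetical stand-ins for the output of polygon rasterization, and texture mapping is reduced to a single color value per fragment.

```python
import math


def zbuffer_render(width, height, fragments, background=0):
    """Resolve per-pixel visibility with a depth buffer.

    fragments: iterable of (x, y, depth, color) tuples. This is a
    hypothetical stand-in for rasterized polygons; a smaller depth
    means the fragment is closer to the camera.
    """
    depth = [[math.inf] * width for _ in range(height)]
    color = [[background] * width for _ in range(height)]
    for x, y, z, c in fragments:
        # Keep a fragment only if it lies inside the image and is
        # closer than whatever has been drawn at that pixel so far.
        if 0 <= x < width and 0 <= y < height and z < depth[y][x]:
            depth[y][x] = z
            color[y][x] = c
    return color


# Three fragments compete for pixel (0, 0); the nearest one wins.
frame = zbuffer_render(
    2, 1,
    [(0, 0, 5.0, 1), (0, 0, 2.0, 2), (1, 0, 3.0, 3), (0, 0, 4.0, 4)],
)
```

The loop visits every fragment once, so the work grows with the number of polygons and their screen coverage rather than with the size of the output image alone.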
In tele-presence applications with dynamic scenes, however, both modeling and rendering are desirably performed online in real-time. The method used needs to be applicable to a wide variety of scenes that include human objects, yet should not preclude capture and rendering of other scenes. Therefore, the assumption that a geometric model may be available is unrealistic. For human forms, it may be argued that assuming a generic model of the body and then fitting that model to images may be a viable approach. Still, there are unsolved issues of model to image correspondence, initialization and optimization that may make the approach infeasible.
Image-based modeling and rendering, as set forth in “Plenoptic Modeling: An Image-Based Rendering System” by L. McMillan and G. Bishop in SIGGRAPH 1995, has emerged as a new framework for thinking about scene modeling and rendering. Image-based representations and rendering potentially provide a mix of high quality rendering with relatively scene-independent computational complexity. Image-based rendering techniques may be especially suitable for applications such as tele-presence, where there may not be a need to cover the complete volume of views in a scene at the same time, but only to provide coverage from a certain number of viewpoints within a small volume. Because the complexity of image-based rendering is of the order of the number of pixels rendered in a novel view, scene complexity does not have a significant effect on the computations.
For image-based modeling and rendering, multiple cameras are used to capture views of the dynamic object. The multiple views are synchronized at any given time instant and are updated continuously. The goal is to provide 360-degree coverage around the object at every time instant from any of the virtual viewpoints within a reasonable range around the object.
In order to provide control of zoom for many users at the same time, it is not feasible to use zoom lenses and cameras. Physical control of zoom through zoom lenses can be done for only one viewpoint at a time, and only by one user. Synthetic control of resolution based on real data can provide a limited control of resolution. Typically, such control may be able to provide at least 2× to 3× magnification without appreciable loss of quality.
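One simple form of synthetic zoom is to resample the captured image in software, which lets each user choose a magnification independently of the others. The following is a minimal sketch under stated assumptions, not the method of any particular system: the function name `digital_zoom` is hypothetical, the image is a grayscale list of rows, the zoom is centered on the image, and plain bilinear interpolation stands in for whatever resampling a real system might use.

```python
def digital_zoom(image, factor):
    """Magnify the center of a grayscale image by `factor` (>= 1).

    image: list of rows of numeric pixel values (assumed layout).
    The output has the same size as the input; each output pixel is
    bilinearly interpolated from the four nearest source pixels.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = []
    for y in range(h):
        sy = cy + (y - cy) / factor          # source row (fractional)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w):
            sx = cx + (x - cx) / factor      # source column (fractional)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# A horizontal intensity ramp magnified 2x about its center.
ramp = [[float(x) for x in range(5)] for _ in range(3)]
zoomed = digital_zoom(ramp, 2.0)
```

Because no new detail is created by interpolation, the usable magnification of such a scheme is bounded by the resolution of the captured data, which is consistent with the limited control described above.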
Between the real cameras, virtual viewpoints may be created by tweening images from the two nearest cameras. Optical flow methods are commonly used by themselves to create tweened images. Unfortunately, the use of only traditional optical flow methods can lead to several problems in creating a tweened image. Particularly difficult are the resolution of large motions, especially of thin structures, for example the swing of a baseball bat; and occlusions/deocclusions, for example between a person's hands and body.
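The occlusion and deocclusion difficulty can be illustrated with a deliberately naive flow-based tween. The sketch below is an assumption-laden illustration rather than any published method: the function name `naive_tween` is hypothetical, the flow field is assumed to be given (estimating it is the hard part and is not shown), pixels are splatted to the nearest integer position, and averaging colliding contributions is an arbitrary choice made only to expose the artifact.

```python
def naive_tween(img_a, img_b, flow, t):
    """Forward-warp img_a a fraction t (0..1) of the way toward img_b.

    flow[y][x] = (dx, dy) is the assumed-known displacement of pixel
    (x, y) in img_a to its match in img_b.
    Returns (image, holes, collisions). Holes are target pixels that no
    source pixel reached (deocclusions); collisions are target pixels
    reached by more than one source pixel (occlusions).
    """
    h, w = len(img_a), len(img_a[0])
    hits = [[[] for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            tx, ty = round(x + t * dx), round(y + t * dy)
            if not (0 <= tx < w and 0 <= ty < h):
                continue  # the pixel leaves the frame at time t
            bx, by = round(x + dx), round(y + dy)
            a = img_a[y][x]
            # Fall back to the source value if the match is off-frame.
            b = img_b[by][bx] if (0 <= bx < w and 0 <= by < h) else a
            hits[ty][tx].append((1 - t) * a + t * b)
    image = [[sum(p) / len(p) if p else None for p in row] for row in hits]
    holes = [(x, y) for y in range(h) for x in range(w) if not hits[y][x]]
    collisions = [(x, y) for y in range(h) for x in range(w)
                  if len(hits[y][x]) > 1]
    return image, holes, collisions


# A bright one-pixel object moves two pixels right over a dark background.
a = [[0, 0, 10, 0, 0]]
b = [[0, 0, 0, 0, 10]]
flow = [[(0, 0), (0, 0), (2, 0), (0, 0), (0, 0)]]
image, holes, collisions = naive_tween(a, b, flow, 0.5)
```

In this toy case the halfway image has a hole where the object used to be, a collision where the object lands on background that is also mapped there, and a ghost of the object at its final position, so even a perfect flow field does not by itself yield a clean tweened image.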