Many applications, such as video editing, computer games, computer graphics for entertainment, and multimedia authoring, are based on the synthesis of video from a wide variety of sources of input data. In computer games, for example, the rendering of texture-mapped 3-D models at video rates is key to the realism of the game. In video editing applications, what differentiates on-line systems from off-line systems is the ability to composite multiple video streams in real time. Video synthesis from still images can be found in multimedia CDs based on Apple Computer's QuickTime VR™, where virtual camera motion is synthesized using cylindrical panoramic mosaics constructed from sets of still images. Currently, such applications use highly specialized video synthesis techniques that are usually restricted to a single input data type and, generally, are based either on the use of 3-D models or on 2-D video representations that cannot support a full range of geometrically correct 3-D effects.
The effort involved in constructing 3-D graphical models presents a significant barrier to their widespread use. Many modern computer graphics systems, for example, are based on rendering texture-mapped polygons. Often, however, large numbers of polygons are required for visual realism, making 3-D model creation difficult and time-consuming. The technical challenge in producing 3-D models from images is even greater when integrating information from multiple views into a single 3-D representation.
Video editing systems, on the other hand, can combine multiple video streams using 2-D techniques, such as alpha blending, to generate video output without specifying complex 3-D models. An example of such a technique can be found in Kurtze et al., U.S. Pat. No. 5,644,364. Video editing systems can typically support image operations such as translation, zooming, and planar warps. However, such systems lack a complete representation of the geometry of the scenes described by the video sequences. As a result, the types of 3-D effects that they can provide are extremely limited. For example, video editing systems cannot simulate a virtual change in the camera position in a manner that is guaranteed to be geometrically correct. Further, although these systems can handle occlusions by organizing several video streams into layers, they cannot handle self-occlusions within a layer, or more complex occlusion relations between layers. Consequently, most 3-D effects are rendered off-line and then mixed in. Mosaic-based systems, such as Apple Computer's QuickTime VR (i.e., video synthesis that uses still images of a scene taken at different camera positions), can accurately simulate camera rotation and zooming, but cannot simulate virtual camera views with arbitrary translations because of limitations in the mosaic representation of scene geometry.
Image-based rendering (IBR) presents a compelling approach to image synthesis. IBR provides an alternative to the difficult process of building 3-D models from images, allowing the synthesis of new images of a static scene directly from a set of images. The 3-D geometric information is computed as needed while rendering a particular virtual view. This computation can operate at any desired level of detail and can therefore be adapted to the needs of the application. Moreover, IBR can produce high-quality images even when the number of available sample images for a scene is small. While this dearth of image samples could frustrate the construction of a 3-D model, IBR can still produce new viewpoints in the vicinity of the image samples.
Although image-based rendering is a compelling approach to image synthesis, limitations in the current state of the art prevent its widespread application to video synthesis. The standard approach to image-based rendering, as described for example in "Novel View Synthesis in Tensor Space", by Avidan et al. in Conference on Computer Vision and Pattern Recognition, pp. 1034-1040, San Juan, Puerto Rico, June 1997, assumes that the motion in a set of input images results solely from the motion of the camera with respect to a static scene. In practice, however, there may be multiple rigid objects in a scene, each moving independently with respect to the camera. Moreover, some of these objects may even be articulated with non-rigid, kinematically-controlled motion. Thus, the standard IBR methods would be unable to synthesize such scenes.
There remains a need, therefore, for a method and apparatus that provide the advantages of IBR over current video synthesis techniques, such as 3-D modeling, video editing, and mosaic-based rendering, but are not limited to scenes with only a single rigid body in motion.