This invention relates generally to the field of three-dimensional virtual reality environments and models, and, more particularly, to building virtual reality world models from multiple-viewpoint real world images of scenes.
In the field of computer graphics, there is a need to build realistic three-dimensional (3D) models and environments that can be used in virtual reality walk-throughs, animation, solid modeling, visualization, and multimedia. Virtual reality environments are increasingly available in a wide variety of applications such as marketing, education, simulation, entertainment, interior and architectural design, fashion design, games and the Internet to name but a few.
Many navigable virtual environments with embedded interactive models tend to be very simplistic due to the large amount of effort that is required to generate realistic 3D virtual models that behave in a realistic manner. Generating a quality virtual reality scene requires sophisticated computer systems and a considerable amount of hand-tooling. The manual 3D reconstruction of real objects using CAD tools is usually time consuming and costly.
The Massachusetts Institute of Technology, the University of Regina in Canada, and Apple Computer, Inc. jointly created the "Virtual Museum Project," which is a computer-based rendering of a museum which contains various objects of interest.
As the user moves through the virtual museum, individual objects can be approached and viewed from a variety of perspectives.
Apple Computer has also developed the QuickTime VR™ system, which allows a user to navigate within a virtual reality scene generated from digitized overlapping photographs or video images. However, warping can distort the images so that straight lines appear curved, and it is not possible to place 3D models in the scene.
Three-dimensional digitizers are frequently used to generate models from real world objects. Considerations of resolution, repeatability, accuracy, reliability, speed, and ease of use, as well as overall system cost, are central to the construction of any digitizing system. Often, the design of a digitizing system involves a series of trade-offs between quality and performance.
Traditional 3D digitizers have focused on geometric quality measures for evaluating system performance. While such measures are objective, they are only indirectly related to an overall goal of a high quality rendition. In most 3D digitizer systems, the rendering quality is largely an indirect result of range accuracy in combination with a small number of photographs used for textures.
Prior art digitizers include contact digitizers, active structured-light range-imaging systems, and passive stereo depth-extraction. For a survey, see Besl, "Active Optical Range Imaging Sensors," Advances in Machine Vision, Springer-Verlag, pp. 1-63, 1989.
Laser triangulation and time-of-flight point digitizers are other popular active digitizing approaches. Laser ranging systems often require a separate position-registration step to align separately acquired scanned range images. Because active digitizers emit light onto the object being digitized, it is difficult to capture both texture and shape information simultaneously. This introduces the problem of registering the range images with textures.
In other systems, multiple narrow-band illuminants, e.g., red, green, and blue lasers, are used to acquire a surface color estimate along lines-of-sight. However, this is not useful for capturing objects in realistic illumination environments.
Passive digitizers can be based on single cameras or stereo cameras. Passive digitizers have the advantage that the same source images can be used to acquire both structure and texture, unless the object has insufficient texture.
Image-based rendering systems can also be used, see Nishino, K., Y. Sato, and K. Ikeuchi, "Eigen-Texture Method: Appearance Compression based on 3D Model," Proc. of Computer Vision and Pattern Recognition, 1:618-624, 1999, and Pulli, K., M. Cohen, T. Duchamp, H. Hoppe, L. Shapiro, and W. Stuetzle, "View-based Rendering: Visualizing Real Objects from Scanned Range and Color Data," Proceedings of the 8th Eurographics Workshop on Rendering, pp. 23-34, 1997. In these systems, images and geometry are acquired separately with no explicit consistency guarantees.
In image-based vision systems, there are two inherent and somewhat related problems. The first problem has to do with deducing the camera's intrinsic parameters. Explicit calibration of intrinsic parameters can be circumvented in specialized processes but is common in many existing systems. The second problem is concerned with estimating the camera's extrinsic parameters, i.e., camera position/motion relative to the environment or relative to the object of interest. Estimating the camera positions is an essential preliminary step before the images can be assembled into a virtual environment.
The terms 'camera position' and 'camera motion' are used interchangeably herein, with the term 'camera position' emphasizing the location and the orientation of a camera, and the term 'camera motion' indicating a sequence of camera positions as obtained, for example, from a sequence of images.
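The distinction between intrinsic and extrinsic parameters can be illustrated with a minimal sketch of the standard pinhole projection model. This is an illustration only, not the method of the invention: the function name `project_point`, the parameter names `fx`, `fy`, `cx`, `cy`, and the numeric values are all assumptions chosen for the example, and lens distortion is ignored.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    rotation (3x3 nested lists) and translation (3 values) are the extrinsic
    parameters; fx, fy (focal lengths in pixels) and cx, cy (principal point)
    are the intrinsic parameters. All names are hypothetical.
    """
    # Extrinsic step: world coordinates -> camera coordinates (R * X + t).
    x_cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if x_cam[2] <= 0.0:
        raise ValueError("point is behind the camera")
    # Intrinsic step: perspective division, then scaling to pixel units.
    u = fx * x_cam[0] / x_cam[2] + cx
    v = fy * x_cam[1] / x_cam[2] + cy
    return u, v


# Example with assumed values: camera at the world origin, looking along +Z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point([0.5, -0.25, 2.0], identity, [0.0, 0.0, 0.0],
                      fx=800.0, fy=800.0, cx=320.0, cy=240.0)
print(pixel)
```

Calibration recovers `fx`, `fy`, `cx`, and `cy`, whereas position estimation recovers `rotation` and `translation` for each image.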
The first problem, of calibrating a camera's intrinsic parameters, is well studied. Solutions for calibrating a single camera are too numerous to enumerate. Solutions for calibrating stereo cameras are also well known. There, the simple requirement is to have some overlap in the images acquired by the stereo cameras. Calibrating rigid multi-camera systems where there is no overlap of the viewed scene in the different cameras has, however, not been the subject of previous work.
In the prior art, the second problem, of estimating camera position, can be solved in a number of ways. For generating a 3D model of a portable object, one method rigidly fixes the cameras at known locations, and rotates the object on a turntable through precise angular intervals while taking a sequence of images. Great care must be taken in setting up and maintaining the alignment of the cameras, object, and turntable. Therefore, this type of modeling is usually done in a studio setting, and is of no use for hand-held systems.
Another method for generating a 3D model of an object of interest assumes a known "position-registration pattern" somewhere in the field of view. The term "position-registration pattern" is used here to indicate a calibration pattern that enables computation of the camera position relative to the pattern, in a fixed coordinate frame defined by the pattern. For example, a checkerboard pattern is placed behind the object while images are acquired. However, this method for computing camera position also has limitations. First, it is difficult to view the object from all directions, unless the position-registration pattern is relocated and the system is re-calibrated. Second, the presence of the pattern makes it more difficult to identify the boundary of the object, as a precursor to further processing for building a 3D model, than would be the case with a bland, low-texture background.
Obviously, the two techniques above are not practical for imaging large-scale, walk-through environments. In that case, the varying position of architectural details in the image, as the camera is moved, can be used to determine camera motion. However, these scenes often include a large amount of extraneous movement or clutter, such as people, making it difficult to track image features between successive images, and hence making it difficult to extract camera position.
Motion parameters are easier to resolve when the camera has a wide field of view, because more features in a scene are likely to be visible for use in the motion computation, and the motion computations are inherently more stable when features with a wide angular spacing relative to the observer are used. Computation of camera position/motion is also easier when working from images of rigid structure, or known geometry. However, high-quality color images may not be necessary.
But wide field of view images contain distortion which is usually too great for the images to be of use with applications that are concerned with geometric accuracy. Methods exist for removing the distortion, but the corrected images generally suffer from a loss of quality. Furthermore, a wide field of view means that there are fewer pixels viewed for any given angle of view.
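The loss of angular resolution with a wide field of view can be estimated with a short calculation. The sketch below assumes an ideal pinhole camera without lens distortion, which is a simplification since wide-angle images do contain distortion. The function names and the 640-pixel sensor width are hypothetical values chosen only for illustration.

```python
import math


def focal_length_px(image_width_px, horizontal_fov_deg):
    """Focal length in pixels of an ideal pinhole camera (assumed model)."""
    if not 0.0 < horizontal_fov_deg < 180.0:
        raise ValueError("field of view must be between 0 and 180 degrees")
    half_angle = math.radians(horizontal_fov_deg) / 2.0
    return (image_width_px / 2.0) / math.tan(half_angle)


def central_pixels_per_degree(image_width_px, horizontal_fov_deg):
    """Approximate pixels covering one degree of view near the image center."""
    # Near the optical axis, a small angle d_theta spans about f * d_theta pixels.
    return focal_length_px(image_width_px, horizontal_fov_deg) * math.radians(1.0)


# Same hypothetical 640-pixel-wide sensor behind a wide and a narrow lens.
for fov in (120.0, 30.0):
    print(fov, central_pixels_per_degree(640, fov))
```

With the sensor width held fixed, the narrower lens devotes several times more pixels to each degree of the scene, which is the trade-off described above.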
Thus the desire for computing camera position/motion conflicts with the desire for generating a 3D model of an object or environment. Computing camera motion works best with a wide field of view, known scene characteristics such as rigidity or known geometry, and the absence of scene characteristics such as specular or transparent surfaces, which can adversely affect the recovery of position. But 3D scanning can involve applications where the environment is uncontrolled, e.g., independent motion can impact the ability to collect reliable estimates of camera position/motion, or there may be a lot of clutter that makes automatic analysis difficult. A narrow field of view camera is preferable for acquiring the images used to make a 3D model, because this type of camera allows more detailed information to be collected per image than a wide-angle camera at the same distance.
Further, computing camera position can work acceptably with cheaper monochrome cameras, whereas scanning a 3D model generally requires high-quality color imagery to acquire details of surface texture.
Therefore, when building a 3D virtual model from images, there are conflicting requirements which are not being met by traditional vision systems.
The invention provides a method for constructing a 3D model of a scene, by acquiring first images of a first scene having unknown characteristics with a first camera. Corresponding second images of a second scene having known characteristics are acquired by a second camera. The first and second cameras have a fixed physical relationship to each other. Only the second images are analyzed to determine corresponding positions of the second camera while acquiring the first images, and only the first images are assembled into a 3D model using the determined corresponding positions and the fixed physical relationship of the first and second cameras. The 3D model can then be rendered on a display device. The known characteristics of the second scene can include a rigid structure, a position-registration pattern, or a fixed set of visual beacons, each with a unique distinguishable identity. Furthermore, the images can be obtained without constraining motions of the first and second cameras. In one embodiment, the first camera is oriented at right angles with respect to the second camera, and the first and second cameras each have a field of view less than 90°.
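One way to picture how the fixed physical relationship can be used is as a composition of rigid transforms: once the position of the second camera is known, the position of the first camera follows by one matrix multiplication. The sketch below is a hedged illustration under assumed conventions, not the claimed implementation. Poses are 4x4 homogeneous matrices, `world_from_second` maps second-camera coordinates to world coordinates, and the 90° rotation, the 0.1 offset, and all function names are hypothetical.

```python
import math


def rot_y(angle_deg):
    """3x3 rotation about the Y axis."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous rigid transform from a rotation and translation."""
    rows = [list(rotation[i]) + [translation[i]] for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]


def compose(a, b):
    """Matrix product a * b of two 4x4 transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def first_camera_pose(world_from_second, second_from_first):
    """Pose of the first camera, given the second camera's pose and the rig."""
    return compose(world_from_second, second_from_first)


# Fixed rig (assumed values): first camera turned 90 degrees from the second.
second_from_first = make_pose(rot_y(90.0), [0.1, 0.0, 0.0])
# Pose of the second camera for one image, e.g. estimated from a known pattern.
world_from_second = make_pose(rot_y(0.0), [1.0, 2.0, 3.0])
world_from_first = first_camera_pose(world_from_second, second_from_first)
print([row[3] for row in world_from_first[:3]])  # first camera position
```

Because `second_from_first` is constant for a rigid rig, it would need to be determined only once, while `world_from_second` changes with every acquired image.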