An attempt has been made to sense a real space by an image sensing apparatus mounted on a mobile object, and to express the sensed real space as a virtual space using a computer on the basis of the sensed photo-realistic image data (see, e.g., Endo, Katayama, Tamura, Hirose, Watanabe, & Tanikawa: “Method of Generating Image-Based Cybercities By Using Vehicle-Mounted Cameras” (IEICE Society, PA-3-4, pp. 276-277, 1997), or Hirose, Watanabe, Tanikawa, Endo, Katayama, & Tamura: “Building Image-Based Cybercities By Using Vehicle-Mounted Cameras(2)-Generation of Wide-Range Virtual Environment by Using Photo-realistic Images-” (Proc. of the Virtual Reality Society of Japan, Vol.2, pp. 67-70, 1997), etc.).
As a method of expressing a sensed real space as a virtual space on the basis of photo-realistic image data sensed by an image sensing apparatus mounted on a mobile object, a method is known that reconstructs a geometric model of the real space on the basis of the photo-realistic image data and expresses the virtual space using a conventional CG technique. However, this method has limits in terms of the accuracy, exactitude, and reality of the model. On the other hand, an Image-Based Rendering (IBR) technique that expresses a virtual space using photo-realistic images without any reconstruction using a model has attracted attention. The IBR technique composes an image viewed from an arbitrary viewpoint on the basis of a plurality of photo-realistic images. Since the IBR technique is based on photo-realistic images, it can express a realistic virtual space.
In order to create a virtual space that allows walkthrough using such an IBR technique, an image must be composed and presented in correspondence with the user's position in the virtual space. For this reason, in such systems, respective frames of photo-realistic image data and positions in the virtual space are saved in correspondence with each other, and a corresponding frame is acquired and reproduced on the basis of the user's position and visual axis direction in the virtual space.
As a method of acquiring position data in a real space, a positioning system using artificial satellites, such as GPS (Global Positioning System) as used in car navigation systems or the like, is generally used. As a method of determining correspondence between position data obtained from the GPS or the like and photo-realistic image data, a method using a time code has been proposed (Japanese Patent Laid-Open No. 11-168754). With this method, the correspondence between the respective frame data of photo-realistic image data and the position data is determined by matching the time data contained in the position data against the time codes appended to the respective frame data.
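The time-code correspondence described above can be illustrated by the following minimal sketch. It is not the method of the cited publication. It assumes that the position data is a list of (time, x, y) samples with strictly increasing times, that the time codes are expressed on the same clock, and that linear interpolation between neighboring samples is acceptable. The names `interpolate_position` and `match_frames_to_positions` are hypothetical.

```python
from bisect import bisect_left


def interpolate_position(gps_samples, t):
    """Return the (x, y) position at time t by linear interpolation.

    gps_samples is a list of (time, x, y) tuples with strictly
    increasing times (an assumption of this sketch). Times outside
    the sampled range are clamped to the nearest sample.
    """
    times = [sample[0] for sample in gps_samples]
    i = bisect_left(times, t)
    if i == 0:
        return gps_samples[0][1:]
    if i == len(gps_samples):
        return gps_samples[-1][1:]
    t0, x0, y0 = gps_samples[i - 1]
    t1, x1, y1 = gps_samples[i]
    w = (t - t0) / (t1 - t0)  # interpolation weight in (0, 1]
    return (x0 + w * (x1 - x0), y0 + w * (y1 - y0))


def match_frames_to_positions(frame_time_codes, gps_samples):
    """Map each frame index to a position using the frame's time code."""
    return {
        index: interpolate_position(gps_samples, t)
        for index, t in enumerate(frame_time_codes)
    }
```

In this sketch, a frame whose time code falls between two position samples receives a position on the line segment between them, and a frame sensed before the first sample or after the last sample receives the nearest sampled position.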
The walkthrough process in such a virtual space allows the user to view a desired direction at each position. For this purpose, images at respective viewpoint positions may be saved as a panoramic image that covers a broader range than the field angle used upon reproduction, a partial image to be reproduced may be extracted from the panoramic image on the basis of the user's position and visual axis direction in the virtual space, and the extracted partial image may be displayed.
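The extraction of a partial image from a panoramic image can be sketched as follows. This is a simplified illustration. It assumes a panoramic image that covers a full 360 degrees horizontally, with the column index proportional to the azimuth, and it crops columns without any perspective correction. The function name `extract_view` and its parameters are hypothetical.

```python
def extract_view(panorama, heading_deg, fov_deg):
    """Crop the columns of a 360-degree panorama around a heading.

    panorama is a list of rows, each row a list of pixel values.
    heading_deg is the visual axis direction, and fov_deg is the
    horizontal field angle used upon reproduction.
    """
    width = len(panorama[0])
    view_width = max(1, round(width * fov_deg / 360.0))
    center = round((heading_deg % 360.0) / 360.0 * width)
    start = center - view_width // 2
    # Wrap around the seam so that headings near 0 degrees still work.
    columns = [(start + k) % width for k in range(view_width)]
    return [[row[c] for c in columns] for row in panorama]
```

Because the column indices are taken modulo the panorama width, a visual axis direction near the seam of the panoramic image yields a partial image that joins the right and left ends of the panorama.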
When the image sensing apparatus is shaken, the panoramic image is also shaken. In such a case, when shakiness of the image sensing apparatus is prevented by physical means such as a special vibration isolation device, a rail, or the like, the image sensing apparatus cannot be freely moved, and the image sensing conditions are restricted. Moreover, a method using such physical means cannot, in principle, reduce shakiness of already sensed video.
When a video image process is used, shakiness of already sensed video can be reduced. For example, when feature points in an image are detected and traced across a plurality of frames, the position and posture of a camera can be estimated on the basis of a set of the traced feature points by geometric calculations such as factorization or the like. Conventionally, such estimation of the position and posture of the camera can be implemented using commercially available match moving software. If the position and posture of the camera in each frame of the video can be estimated, shakiness of the video can be reduced on the basis of the obtained estimated values of the position and posture of the camera.
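One possible way of using the estimated values to reduce shakiness is sketched below. The text above does not specify the reduction method, so the approach here is an assumption made for illustration: a single posture angle per frame (for example, yaw in degrees) is smoothed with a moving average, and each frame is to be rotated by the difference between the smoothed value and the estimated value. The angles are assumed to be unwrapped, that is, without jumps at 360 degrees. The names `moving_average` and `shake_corrections` are hypothetical.

```python
def moving_average(values, radius):
    """Smooth a sequence with a centered window of up to 2*radius+1 samples.

    Near both ends the window is truncated to the available samples.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def shake_corrections(estimated_angles_deg, radius=2):
    """Return the per-frame rotation that moves each frame onto the smoothed path.

    A correction of zero means the frame already lies on the smoothed path.
    """
    smooth = moving_average(estimated_angles_deg, radius)
    return [s - e for s, e in zip(smooth, estimated_angles_deg)]
```

With this sketch, a camera whose estimated posture is constant yields corrections of zero, whereas a posture that alternates from frame to frame yields corrections of alternating sign that pull each frame toward the smoothed path.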
The video image process using match moving software or the like, however, cannot simultaneously estimate the positions and postures of a plurality of cameras. Also, the estimated values of the position and posture of the camera calculated by the video image process contain errors. For this reason, when shakiness of images sensed by a plurality of cameras is reduced by the video image process for each camera, and the processed images are stitched to form a single panoramic image, the degree of overlapping at the seams between neighboring images varies from frame to frame.