The present invention relates to an image processing technique required for acquiring depth information of a real space in real time without any delay. The present invention also relates to an image merging technique required for providing a consistent augmented reality or mixed reality to the observer. The present invention further relates to a storage medium of a program for image processing.
For example, in an augmented or mixed reality presentation system using an optical see-through HMD (head mounted display) or the like, when a real world and a virtual world are merged in a three-dimensionally matched form, the depth ordering (i.e., which is in front and which is behind) of real objects and virtual objects must be correctly recognized so that the virtual objects are rendered in a form that does not conflict with that depth ordering. For this purpose, depth information (three-dimensional information) of the real world must be acquired, and that acquisition must be done at a rate close to real time.
Since the time required for forming or acquiring a depth image is not negligible, time lag or latency is produced between a real world and a video world presented to the observer on the basis of that depth image obtained a predetermined time ago. The observer finds this latency or time lag disturbing.
In order to remove such latency, conventionally, an attempt is made to minimize the delay time by high-speed processing. For example, in “CMU Video-Rate Stereo Machine”, Mobile Mapping Symposium, May 24-26, 1995, Columbus, Ohio, images from five cameras are pipeline-processed to attain high-speed processing.
However, even by such high-speed processing, a delay time around several frames is produced. As a depth image obtained with a delay time of several frames does not reflect a change in real world that has taken place during that delay time (movement of an object or the observer), it does not accurately represent the actual (i.e., current) real world. Therefore, when the depth ordering of the real and virtual worlds is discriminated using this depth image, it produces inconsistency or conflict, and the observer experiences intolerable incoherence. In addition, high-speed pipeline processing is limited, and the delay time cannot be reduced to zero in principle.
This problem will be explained in detail below using FIG. 1. Assume that the observer observes the real world at the same viewpoint as that of a camera for the sake of simplicity.
Referring to FIG. 1, reference numeral 400 denotes an object (e.g., a triangular prism-shaped block) in a real space. In this example, an augmented reality presentation system (not shown) presents to the observer an augmented reality image in which a virtual object 410 (e.g., a columnar block) is merged at a position behind the real object 400. The augmented reality presentation system generates a depth image of the real object 400 from images taken by a camera that moves together with the observer, and discriminates the depth ordering of the real object 400 and virtual object 410 on the basis of this depth image upon presenting an image of the virtual object 410.
Assume that the observer has moved his or her viewpoint to P1, P2, P3, and P4 in turn, and is currently at a viewpoint P5. At the viewpoint P5, the observer must be observing a scene 5005.
If a depth image of the scene 5005 (a depth image 5105 of the scene 5005 obtained by observation from the viewpoint P5) is obtained, the augmented reality presentation system can generate a virtual image 4105 with an occluded portion 600, and can render these images in a correct occlusion relationship, i.e., can render a scene 5205 (FIG. 3) in which the virtual object 410 is partially occluded by the object 400.
However, since this augmented reality presentation system requires a time Δt for its internal processing, the depth image available at the viewpoint P5 for augmented reality presentation is the one obtained at an old viewpoint, Δt before the viewpoint P5 (the viewpoint P2 in FIG. 1 will be used to express this old position for the sake of simplicity). That is, at the current time (i.e., the time of the viewpoint P5 in FIG. 1), only a depth image 5102 corresponding to a scene 5002 at the viewpoint P2, Δt before the current time, can be obtained.
At the viewpoint P2, the object 400 would be observed at a position further to the right than in the scene 5005, and the depth image 5102 corresponds to that scene 5002. Hence, when the depth ordering of the real and virtual worlds at the viewpoint P5 is discriminated in accordance with this old depth image 5102, a virtual image 4102 with an occluded portion 610 is generated, as shown in FIG. 4. As a result, as shown in FIG. 5, the image of the real object 400 in front is presented to the observer as if it were occluded by the virtual image 4102 of the virtual object 410. Conversely, the image of the virtual object 410 presented to the observer has a portion 610 which ought not to be occluded but is in fact occluded, and a portion 620 which ought to be occluded but is in fact not occluded.
In this way, if augmented reality is presented while ignoring the time Δt required for generating a depth image, an unnatural, contradictory world is presented.
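The depth-ordering discrimination discussed above can be illustrated as a per-pixel comparison between the depth image of the real world and the depth of the virtual object. The following is only a minimal sketch under assumed conventions: depth images are lists of rows, `None` marks a pixel with no virtual object, and the function name `virtual_visibility_mask` is hypothetical and not taken from this description.

```python
def virtual_visibility_mask(real_depth, virtual_depth):
    """Return a mask that is True where the virtual object should be drawn.

    Assumed conventions (illustrative only): both inputs are lists of rows
    of equal size, smaller values are nearer to the viewpoint, and None in
    virtual_depth means the virtual object does not cover that pixel.
    """
    mask = []
    for real_row, virtual_row in zip(real_depth, virtual_depth):
        mask_row = []
        for z_real, z_virtual in zip(real_row, virtual_row):
            # Draw the virtual pixel only if it lies in front of the real surface.
            mask_row.append(z_virtual is not None and z_virtual < z_real)
        mask.append(mask_row)
    return mask


# One image row: the real object (depth 1.0) stands in front of the
# virtual object (depth 3.0); the background is at depth 5.0.
virtual = [[3.0, None, 3.0]]
current_real = [[1.0, 1.0, 5.0]]  # real object as seen from the current viewpoint
stale_real = [[5.0, 1.0, 1.0]]    # real object as seen from an old viewpoint

correct_mask = virtual_visibility_mask(current_real, virtual)
stale_mask = virtual_visibility_mask(stale_real, virtual)
```

With the current depth image the leftmost virtual pixel is hidden and the rightmost is drawn, whereas the stale depth image reverses both decisions, which mirrors the portions 610 and 620 described above.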
As a prior art that points out problems with real-time stereo processing based on high-speed processing implemented by hardware, Yasuyuki Sugawa and Yuichi Ota, “Proposal of Real-time Delay-free Stereo for Augmented Reality” is known.
This article proposes predicting a future depth image. That is, this article proposes an algorithm that reduces system latency from input to output as much as possible by executing, in parallel with disparity estimation by stereo, high-speed disparity estimation that uses the stereo processing result of previous images and exploits time correlation.
However, this article is premised on the use of a stationary camera, and cannot cope with a situation where the camera itself (i.e., the position/posture of the viewpoint of the observer) moves.
The present invention has been made to solve the conventional problems, and has as its object to provide a depth image measurement apparatus and method that can acquire a depth image of a real world in real time without any delay.
It is another object of the present invention to provide an image processing apparatus and method, which can present a three-dimensionally matched augmented reality image even when the viewpoint of the observer moves, and to provide an augmented reality presentation system and method.
It is still another object of the present invention to provide an image processing apparatus and method, which can present a three-dimensionally matched augmented reality image, continuously in particular, and to provide an augmented reality presentation system and method.
According to a preferred aspect of the present invention, the second viewpoint position at which the second depth image is to be generated is that of the image input means at the second time, to which the image input means has moved over a time elapsed from the first time at which the image input means input the stereo image.
According to a preferred aspect of the present invention, the second time is a time elapsed from the first time by a known first processing time required for depth image processing in the calculation means, and a second processing time required for depth image warping processing by the warping means.
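The relationship between the first time and the second time can be sketched as follows. The function names are hypothetical, and the linear extrapolation of the viewpoint position is merely one assumed motion model chosen for illustration; this passage does not specify how the viewpoint position at the second time is obtained.

```python
def second_time(first_time, depth_processing_time, warp_processing_time):
    """Time at which the warped depth image becomes available.

    All arguments share one unit (milliseconds in the example below).
    """
    return first_time + depth_processing_time + warp_processing_time


def extrapolate_position(p_prev, t_prev, p_curr, t_curr, t_target):
    """Linearly extrapolate a 3-D viewpoint position to t_target.

    Assumption for illustration: the viewpoint keeps the constant velocity
    observed between the two most recent position samples.
    """
    dt = t_curr - t_prev
    if dt <= 0:
        raise ValueError("t_curr must be later than t_prev")
    velocity = [(c - p) / dt for p, c in zip(p_prev, p_curr)]
    return [c + v * (t_target - t_curr) for c, v in zip(p_curr, velocity)]


# Stereo image input at t = 1000 ms, 50 ms for depth image processing,
# 10 ms for depth image warping.
t2 = second_time(1000, 50, 10)

# Two position samples, 4 time units apart, extrapolated 4 units further.
p2 = extrapolate_position([0.0, 0.0, 0.0], 0.0, [2.0, 4.0, 0.0], 4.0, 8.0)
```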
According to a preferred aspect of the present invention, the image input means (or step) inputs a stereo image from stereo cameras.
According to a preferred aspect of the present invention, the depth image generation means (or step) generates the stereo image or first depth image by triangulation measurement.
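As an illustration of triangulation measurement, the sketch below converts a stereo disparity into a depth value. It assumes a rectified, parallel stereo pair with a pinhole camera model, which this passage does not state; the function and parameter names are hypothetical.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline):
    """Depth of a point from its disparity in a rectified stereo pair.

    Assumed model (not taken from this description): parallel pinhole
    cameras, so that depth = focal_length * baseline / disparity.
    Returns None when the disparity is not positive (no valid match).
    """
    if disparity_px <= 0:
        return None
    return focal_length_px * baseline / disparity_px


# Focal length 400 pixels, baseline 0.5 (scene units), disparity 8 pixels.
z = depth_from_disparity(8.0, 400.0, 0.5)
```

Under this model a larger disparity corresponds to a nearer point, and halving the disparity doubles the computed depth.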
The viewpoint position can be detected based on an image input by the image input means without any dedicated three-dimensional position/posture sensor. According to a preferred aspect of the present invention, the position information estimation means (or step) estimates changes in viewpoint position on the basis of the stereo image input from the stereo cameras attached to the observer.
The viewpoints can be accurately detected using a dedicated position/posture sensor. According to a preferred aspect of the present invention, the position information estimation means (or step) receives a signal from a three-dimensional position/posture sensor attached to the camera, and estimates changes in viewpoints on the basis of the signal.
According to a preferred aspect of the present invention, the depth image warping means (or step) calculates a coordinate value and depth value of one point on the second depth image, which corresponds to each point on the first depth image, by three-dimensional coordinate transformation on the basis of the viewpoint position/posture information.
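The three-dimensional coordinate transformation described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it assumes a pinhole camera with focal length `f` and principal point `(cx, cy)`, a rigid transformation `(R, t)` that maps camera coordinates at the first viewpoint to camera coordinates at the second viewpoint, rounding to the nearest pixel, and a nearest-depth rule when several points land on the same pixel. All names are hypothetical.

```python
def warp_depth_image(depth, f, cx, cy, R, t):
    """Warp a depth image from a first viewpoint to a second viewpoint.

    depth: list of rows of depth values; None marks a missing measurement.
    R: 3x3 rotation (nested lists), t: 3-vector. Together they map camera
    coordinates at the first viewpoint to those at the second viewpoint.
    """
    height, width = len(depth), len(depth[0])
    warped = [[None] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            z = depth[v][u]
            if z is None:
                continue
            # Back-project the pixel to a 3-D point in first-view coordinates.
            p = ((u - cx) * z / f, (v - cy) * z / f, z)
            # Rigid transformation into second-view coordinates.
            q = [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]
            if q[2] <= 0:
                continue  # the point lies behind the second viewpoint
            # Project onto the second image plane (nearest pixel).
            u2 = round(f * q[0] / q[2] + cx)
            v2 = round(f * q[1] / q[2] + cy)
            if 0 <= u2 < width and 0 <= v2 < height:
                # Keep the nearest surface if several points hit one pixel.
                if warped[v2][u2] is None or q[2] < warped[v2][u2]:
                    warped[v2][u2] = q[2]
    return warped


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
first_depth = [[2.0, 2.0, None]]

# No viewpoint change: the depth image is reproduced unchanged.
unchanged = warp_depth_image(first_depth, 1.0, 1.0, 0.0, identity, [0.0, 0.0, 0.0])

# Points shifted by +2 along x in second-view coordinates: at depth 2 and
# f = 1 this moves every measurement one pixel to the right.
shifted = warp_depth_image(first_depth, 1.0, 1.0, 0.0, identity, [2.0, 0.0, 0.0])
```

Pixels of the second depth image that receive no point remain `None`; how such holes are filled is outside the scope of this sketch.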
Other features and advantages of the present invention will be apparent from the following description taken in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the figures thereof.