The present invention relates to an image recognition apparatus and method for recognizing the shape and/or movement of an image on the basis of a captured range image or range image stream.
Conventionally, to recognize three-dimensional motions such as those of a person's hand or face, the object to be recognized is sensed from its front side using an image sensing apparatus such as a video camera. Recognition is then made by estimating three-dimensional motion from the limited changes in two-dimensional motion (which carries no depth information) that appear in the sensed image, together with various other kinds of knowledge.
Some recognition methods will be explained below.
The first method estimates motion using feature points of the object to be recognized. In this method, some feature points are set in advance on the object to be recognized, and motion is estimated using a change in the positional relationship between the feature points. For example, to recognize a horizontal shake (horizontal rotation) of the face, several feature points are set at the eyes, nose, and the like, and a clockwise shake of the face is estimated from changes that accompany the movement of the face: the feature points at the positions of the eyes have moved horizontally, the spacing between the feature points at the two eyes has decreased, the feature point at the right eye has disappeared (since the right eye has moved to a position that cannot be seen from the camera), and so forth.
However, when this method is used, markers and the like must be pasted at the positions of the feature points of the face to stably obtain the corresponding points in a camera image, so the environments in which this method can be used are limited. In some cases, no markers are used. In such cases, however, feature points cannot be stably extracted, and a high computation cost is required to obtain them.
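As an illustration of the feature-point approach, the following minimal Python sketch estimates the magnitude of a horizontal face rotation from the change in spacing between the two eye feature points. It assumes an orthographic projection and a frontal first frame, so that the projected spacing shrinks as the cosine of the rotation angle. The function name `estimate_yaw_deg` and the `(x, y)` coordinate format are hypothetical and are introduced only for this example.

```python
import math

def estimate_yaw_deg(eyes_before, eyes_after):
    """Estimate horizontal rotation from two eye feature points.

    Each argument is ((left_x, left_y), (right_x, right_y)) in pixels.
    Assumption: orthographic projection and a frontal face in the first
    frame, so projected eye spacing = frontal spacing * cos(yaw).
    Returns (yaw magnitude in degrees, horizontal shift of the eye midpoint).
    """
    (lx0, _), (rx0, _) = eyes_before
    (lx1, _), (rx1, _) = eyes_after

    spacing_before = abs(rx0 - lx0)
    spacing_after = abs(rx1 - lx1)
    if spacing_before == 0:
        raise ValueError("eye feature points coincide in the first frame")

    # Clamp to 1.0 so that noise which widens the spacing does not break acos.
    ratio = min(1.0, spacing_after / spacing_before)
    yaw_magnitude = math.degrees(math.acos(ratio))

    # The spacing alone cannot give the direction of rotation; the sign of
    # the midpoint shift is returned as a hint for the caller to interpret.
    midpoint_shift = ((lx1 + rx1) - (lx0 + rx0)) / 2
    return yaw_magnitude, midpoint_shift


if __name__ == "__main__":
    yaw, shift = estimate_yaw_deg(((100, 50), (160, 50)), ((120, 50), (150, 50)))
    print(f"yaw magnitude: {yaw:.1f} deg, midpoint shift: {shift:+.1f} px")
```

The sketch takes the feature-point coordinates as given. Obtaining them reliably in every frame is precisely the difficulty noted above.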
Another method estimates motion by obtaining changes in motion moment. This method exploits the fact that when a hand is rotated about a vertical axis, the forward projection area of the hand in the horizontal direction changes dramatically, but it does not change much in the vertical direction. In such a case, rotation of the hand about the vertical axis is estimated because only the motion moment of the hand in the horizontal direction changes considerably.
This method can estimate three-dimensional motion. However, since the shape of the object that can be used in recognition is limited, and different two-dimensional motions can hardly be distinguished from each other, recognition errors readily occur.
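The idea behind the motion-moment method can be sketched roughly as follows. This Python example does not compute a motion moment as such; as a simplified stand-in, it compares the horizontal and vertical extents of a binary silhouette in two frames and infers the rotation axis when only one extent changes considerably. The names `extents` and `classify_rotation` and the threshold of 0.3 are assumptions made for illustration.

```python
def extents(mask):
    """Return (width, height) of the foreground in a binary mask.

    mask is a list of rows, each a list of 0/1 values. Width is the number
    of columns containing foreground; height is the number of such rows.
    """
    height = sum(1 for row in mask if any(row))
    width = sum(1 for col in zip(*mask) if any(col))
    return width, height


def classify_rotation(mask_before, mask_after, threshold=0.3):
    """Guess the rotation axis from the relative change in silhouette extent.

    threshold is a hypothetical tuning value: a relative change above it
    counts as "considerable".
    """
    w0, h0 = extents(mask_before)
    w1, h1 = extents(mask_after)
    if w0 == 0 or h0 == 0:
        raise ValueError("no foreground in the first frame")

    change_w = abs(w1 - w0) / w0
    change_h = abs(h1 - h0) / h0

    if change_w > threshold and change_h <= threshold:
        return "vertical-axis"    # only the horizontal extent changed
    if change_h > threshold and change_w <= threshold:
        return "horizontal-axis"  # only the vertical extent changed
    return "undetermined"


if __name__ == "__main__":
    # An open hand seen from the front, then seen edge-on.
    front = [[1 if 1 <= r <= 4 and 1 <= c <= 4 else 0 for c in range(6)] for r in range(6)]
    edge = [[1 if 1 <= r <= 4 and c == 2 else 0 for c in range(6)] for r in range(6)]
    print(classify_rotation(front, edge))
```

A silhouette that narrows for any other reason receives the same label, which is one way the recognition errors mentioned above can arise.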
Also, a method of estimating motion from the geometric shape of the object to be recognized is known. For example, when three-dimensional motion of a die is to be recognized, it is estimated that the die has been cast when the face with one pip is seen via the camera at a given time and the visible face then changes to the one with three pips. Since this method exploits knowledge about geometric stereoscopic information of the object to be recognized, three-dimensional motion can be estimated relatively reliably. However, the objects that can be recognized are limited. In addition, geometric knowledge about each object is required, resulting in poor versatility.
Various other methods are also available. However, since these methods estimate three-dimensional motion from an image that has only two-dimensional information, it is difficult to stably recognize three-dimensional motion with high precision. When a camera captures a three-dimensional object as two-dimensional information, a large amount of important information is lost.
To avoid these problems, a method is also available in which an object is simultaneously sensed by a plurality of video cameras at several positions, corresponding points among the cameras are obtained to compute stereoscopic information from the plurality of sensed images, and three-dimensional motion is obtained using the computed information.
In this method, since the stereoscopic information is actually derived from a plurality of sensed images, the problems posed when three-dimensional information is estimated from two-dimensional information can be solved. However, since computing the corresponding points used to stereoscopically combine images from the plurality of cameras requires much time, this method is not suitable for a real-time process. Moreover, since camera position information is required to obtain corresponding points, the camera positions are restricted and the cameras must be calibrated.
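To make the cost of the corresponding-point computation concrete, the following Python sketch searches one scanline by brute force and then converts the resulting disparity to depth with the standard relation Z = f·B/d. It assumes two calibrated, rectified cameras, so that corresponding points lie on the same image row. The function names, the window size, the sum-of-absolute-differences cost, and the numeric values are illustrative assumptions rather than part of any particular conventional system.

```python
def best_disparity(left_row, right_row, x, half_window=1, max_disparity=8):
    """Find the disparity of pixel x on one rectified scanline.

    Compares a small window around x in the left row against every
    candidate position in the right row, using the sum of absolute
    differences (SAD) as the matching cost.
    """
    lo, hi = x - half_window, x + half_window + 1
    if lo < 0 or hi > len(left_row):
        raise ValueError("window falls outside the scanline")
    patch = left_row[lo:hi]

    best_d, best_cost = None, None
    for d in range(max_disparity + 1):
        if lo - d < 0:
            break  # candidate window would leave the image
        candidate = right_row[lo - d:hi - d]
        cost = sum(abs(a - b) for a, b in zip(patch, candidate))
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for rectified cameras: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    left = [0, 0, 0, 0, 0, 9, 5, 9, 0, 0]
    right = [0, 0, 9, 5, 9, 0, 0, 0, 0, 0]  # same pattern shifted by 3 pixels
    d = best_disparity(left, right, x=6)
    z = depth_from_disparity(d, focal_px=800.0, baseline_m=0.1)
    print(f"disparity: {d} px, depth: {z:.2f} m")
```

The search above runs for a single pixel. Repeating it for every pixel of every frame multiplies the work by the image size and the disparity range, which illustrates why the correspondence step is hard to perform in real time. The focal length and baseline must come from calibration.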
As described above, the conventional methods for recognizing three-dimensional motion from an image suffer from various problems.
In the conventional methods, since the object to be recognized is captured using, e.g., a video camera, as an image having only two-dimensional information, three-dimensional motion must be recognized based on only the two-dimensional information, and it is hard to stably recognize three-dimensional motion with high precision.
Also, the object to be recognized must be prepared in advance as a template or a recognition dictionary, resulting in cumbersome operations. In addition, the templates and recognition dictionary must be modified in correspondence with the object to be recognized, resulting in high cost.
Furthermore, matching with a huge number of templates is required upon recognition, and a long recognition time is required.