The field of computer vision includes the computer analysis of scenes projected into an electronic camera. The camera generates images of the scenes, and the computer analyzes these images and draws useful conclusions.
In particular, an active branch of computer vision is devoted to computing the position and orientation in space of an object, also called object pose, by detecting several features of the object in a single image from a single camera or in two images from two cameras.
Implementations using two cameras apply well-known stereometric techniques, in which the position of each feature in 3D can be obtained by triangulation from the positions of the projection of this feature in each of the two images. For more details on stereometric techniques, see the book titled "Robot Vision", by Berthold K. P. Horn, MIT Press. This type of technique has several drawbacks. First, it requires two cameras, which increases system cost. Second, calibrating the relative positions of the two cameras is difficult, and the system output is very sensitive to calibration errors. Third, generating the rotation matrix for an object requires lengthy trigonometric computations, and combining data from more than 3 object points requires matrix inversion computations. This results in increased hardware cost in situations where real-time system response is needed.
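The triangulation step can be illustrated with a minimal sketch. It assumes the simplest stereo geometry, which the text above does not specify: two identical, rectified cameras with parallel optical axes, the right camera displaced from the left by a known baseline along the x axis, and image coordinates measured from each image center in the same units as the focal length. The function name and parameters are hypothetical.

```python
def triangulate_rectified(x_left, x_right, y, focal_length, baseline):
    """Recover (X, Y, Z) of one feature in the left camera frame.

    Assumes a rectified pair: the right camera sits at (baseline, 0, 0)
    in the left camera frame and both cameras share the focal length.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_length * baseline / disparity  # depth from similar triangles
    x = x_left * z / focal_length            # back-project the left image point
    y3d = y * z / focal_length
    return (x, y3d, z)
```

Because the depth is inversely proportional to the disparity, a small error in the assumed baseline or in the image positions produces a large depth error for distant points, which is one way to see the sensitivity to calibration mentioned above.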
In stereometric techniques, the position of each object feature in space is found individually, without making use of additional information such as the relative positions of the features in the object. When this information is available, other techniques are preferable, because they can recover the position and orientation of the object from a single image. For example, if 3 points of an object are detected in a single image and the distances between these features in the object are known, it is possible to recover the pose of the object. However, a polynomial equation must be solved, and 2 or 4 solutions for the object pose are found. See for example "New Exact and Approximate Solutions of the Three-Point Perspective Problem", by Daniel DeMenthon and Larry Davis, 1990 International Conference on Robotics and Automation, Cincinnati, pp. 40-45. If more than 3 points are used, the solution becomes unique, but the formulas become more complicated, and would be practical in real-time use only with costly hardware. See for example "An Analytical Solution for the Perspective-4-Point Problem", by Radu Horaud, Bernard Conio and Olivier Leboulleux, Computer Vision, Graphics, and Image Processing, vol. 47, pp. 33-44, 1989. One would like to choose 5 points or more to increase the reliability of the object pose results, but is then faced with highly difficult mathematical computations.
An alternative approach that uses much simpler computations assumes well-known approximations to perspective projection, called orthographic projection and scaled orthographic projection. Scaled orthographic projection is an improved version of orthographic projection in which changes of scale due to the distance between the object and the camera are accounted for. Such an approach is taken, for example, by Ullman and Basri in "Recognition by Linear Combinations of Models", A.I. Memo no. 1152, August 1989, Massachusetts Institute of Technology Artificial Intelligence Laboratory. These authors find 3 precomputed projections of the points of the object by orthographic projection in 3 known spatial orientations. Then they approximate a new image of the points of the object as a scaled orthographic projection. They show that any new projected image can be expressed as a linear combination of the 3 precomputed projections. The coefficients of the linear combination are recovered using the image and a precomputed matrix based on the 3 precomputed projections. Then these coefficients can be used for combining the 3 rotation matrices used in the 3 precomputed poses to obtain the rotation matrix of the object. The translation of the object can also be recovered easily by computing a scaling factor. Finding the rotation and translation of an object is not explicitly taught by the authors, because their final goal is the recognition of an object from images instead of its pose, but it can be easily deduced from their explanations. An advantage of this method is that the rotation matrix is obtained directly without any trigonometric operation on angles such as Euler angles. However, the computation requires combining several images of the object, which is a more complex and less reliable procedure when compared with the inventive features disclosed below.
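The difference between perspective projection and scaled orthographic projection can be sketched as follows. This is an illustration under stated assumptions, not code from any of the cited works: points are given in the camera frame with Z along the optical axis, the focal length is known, and the hypothetical parameter `reference_depth` is the depth of a reference point of the object.

```python
def perspective(point, focal_length):
    """Exact perspective projection: each point is divided by its own depth."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


def scaled_orthographic(point, focal_length, reference_depth):
    """Scaled orthographic approximation: every point of the object is
    divided by the same depth, that of a chosen reference point."""
    x, y, _ = point
    return (focal_length * x / reference_depth, focal_length * y / reference_depth)
```

For a point at depth 101 on an object whose reference point is at depth 100, the two models differ by about 1 percent, and the difference vanishes as the depth extent of the object becomes small relative to its distance from the camera.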
In an approach related to that of Ullman and Basri, Carlo Tomasi and Takeo Kanade use orthographic projection to write a system of equations from a sequence of images. This approach is presented in "Shape and Motion from Image Streams: A Factorization Method. 2. Point Features in 3D Motion", Technical Report CMU-CS-91-105, Carnegie Mellon University, January 1991. By this method, the structure of the object can be recovered as well as the rotation matrix of the object for each of the images of the sequence. Disadvantages of this system are that (1) large matrices have to be inverted at run time and (2) the translation of the object is not recovered.
In contrast, according to this invention, the orientation and translation of the object can be obtained in a very direct and fast way from a single image of the object by:
(a) Multiplying a precomputed matrix depending only on the relative positions of the points of the object by two vectors depending only on the positions of the projections of the features in the image;
(b) Normalizing the two resulting vectors to obtain their norms and the first two rows of the rotation matrix;
(c) Taking the cross-product of these two vectors to obtain the third row; and
(d) Multiplying a known vector by one of the norms to obtain the translation vector.
The rotation matrix and the translation vector are a very good approximation to the actual rotation matrix and translation vector, provided the distances between the points of the object being used are small compared to their distances to the camera. Also, many points can be used for the object for improved reliability without any changes in the steps above.
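Steps (a) through (d) can be sketched in a few lines of Python with NumPy. This is a minimal illustration under several assumptions that are not stated in the text above: the precomputed matrix is taken to be the pseudoinverse of the matrix whose rows are the object vectors from a reference point (the first point) to the other points; at least four non-coplanar points are used; image coordinates are measured from the image center in the same units as the focal length; the "known vector" is the vector from the center of projection to the image of the reference point; and the norm is read as the scale factor of the projection, so that the translation is this vector divided by the scale (equivalently, multiplied by its inverse). The function names are hypothetical.

```python
import numpy as np


def precompute_object_matrix(object_points):
    """Done once per object: pseudoinverse of the object vectors M_i - M_0."""
    pts = np.asarray(object_points, dtype=float)
    object_vectors = pts[1:] - pts[0]      # one row per non-reference point
    return np.linalg.pinv(object_vectors)  # shape (3, N - 1)


def pose_from_image(object_matrix, image_points, focal_length):
    """Approximate rotation matrix and translation from a single image."""
    img = np.asarray(image_points, dtype=float)
    x0, y0 = img[0]                        # image of the reference point
    # (a) multiply the precomputed matrix by the two image vectors
    I = object_matrix @ (img[1:, 0] - x0)
    J = object_matrix @ (img[1:, 1] - y0)
    # (b) normalize: the norms give the scale, the unit vectors give two rows
    norm_i, norm_j = np.linalg.norm(I), np.linalg.norm(J)
    row1, row2 = I / norm_i, J / norm_j
    # (c) the cross-product gives the third row
    row3 = np.cross(row1, row2)
    rotation = np.vstack([row1, row2, row3])
    # (d) scale the known vector (x0, y0, f); averaging the two norms is a
    # choice made for this sketch, either norm alone could be used
    scale = 0.5 * (norm_i + norm_j)
    translation = np.array([x0, y0, focal_length]) / scale
    return rotation, translation
```

Only `pose_from_image` has to run for each new image, and it consists of two matrix-vector products, two normalizations, and one cross-product. Adding object points enlarges the precomputed matrix but leaves the steps unchanged. With noisy image data the first two rows are not exactly perpendicular, so the result is an approximation to a rotation matrix rather than an exact one.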
One embodiment of this invention is a system for measuring the motions of an operator, for example measuring the displacement of his head or his hand. A virtual scene of virtual objects is modified according to these measurements. With this system the operator can interact with a virtual scene displayed in front of his eyes by using the motions of his head or his hand. The operator may want to observe a part of the virtual scene out of his present field of view; the system detects the rotation of his head and generates the part of the virtual scene corresponding to the new field of view. Also, in another embodiment the operator may hold a specially designed object in his hand. The system detects the motions of this object and displays a corresponding virtual object in the virtual scene. This virtual object may be used as a pointing cursor and more generally as a tool to interact with the other virtual objects of the scene.
An early implementation of such concepts using a mechanical mouse was popularized by the Macintosh computer; the operator's displacements in two dimensions are sensed by the mouse and are translated into the motion of a cursor in a two-dimensional (2D) virtual world of documents, files, and folders. Interaction of the 2D cursor with the objects of this 2D world allows the operator to drag files into folders, scroll pages, drop documents into a trash can, etc.
However, in more and more applications, a virtual world of three-dimensional (3D) objects is represented on a display, or a pair of displays providing stereo vision, and the operator must be able to translate and rotate these objects. Some attempts have been made to decompose 3D motions into a sequence of 2D motions, so that a 2D input device could be used to manipulate 3D objects. However, producing 3D motions of objects with this decomposition method is time-consuming, nonintuitive and frustrating. Furthermore, if the operator decides to bring back an object to its original position, he must remember the sequence of motions and follow it in exact reverse order.
To solve these problems, several devices which sense 3D motions of the operator have been proposed. Transducers measure these displacements and transmit them to the computer. For example, U.S. Pat. No. 4,988,981 to Zimmerman and Lanier, 1991, entitled "Computer Data Entry and Manipulation Apparatus and Method", describes a glove worn by the operator, on which translation and orientation sensors are attached. Translation is detected by use of an ultrasonic transmitter attached to the glove and three ultrasonic receivers positioned around the display. Orientation is detected by a low frequency magnetic field transmitter attached to the glove and a field detection system next to the display. The measured translation and rotation parameters are used to position a hand-shaped cursor on the display screen of the host computer according to the position and orientation of the operator's hand in space. Flex sensors are also provided for measuring the degree of flex of fingers. Fingers may be represented with similar flex on the hand-shaped cursor, and may allow refined communication methods with the virtual world.
Instead of being mounted on a glove, orientation and translation sensors may be enclosed in a box or a pen that the operator holds in his hand and displaces in space. One such system, called the Bird, is made by Exos, Inc., Burlington, Mass. Other systems were described by Paul McAvinney in "Telltale Gestures--3-D applications need 3-D input", Byte, July 1990, pp. 237-240. These systems apply triangulation techniques as well, between several transmitters, either ultrasonic or magnetic, and several receivers. They require relatively complex hardware and are relatively expensive.
Optical techniques have been applied instead of magnetic and ultrasonic techniques for operator interaction with computer-generated displays. An example of a computer vision system is set forth in U.S. Pat. No. 4,649,504 to Krouglicof, 1987, entitled "Optical Position and Orientation Techniques". This patent discloses a system for monitoring the position and orientation of a pilot's helmet, in which the features that are detected optically are light emitting diodes (LEDs). The LEDs are turned on and off in succession, and each illuminated LED is detected by two light sensors equipped with camera lenses. Triangulation provides the corresponding 3D position of each considered LED. With 3 or more LEDs, the corresponding position and orientation of the helmet in space can be uniquely determined. The method described in this patent is essentially a stereometric technique, with all the related drawbacks discussed above.
In U.S. Pat. No. 4,891,630 to Friedman, 1990, entitled "Computer Vision System with Improved Object Orientation Technique", a system is described using a single camera for monitoring the head motion of an operator for eye-tracking purposes. A camera takes images of a patch which is attached to the cheek of the operator. The patch has 4 small flat reflective elements at its corners and a large hemispheric reflective element at its center. Reflections of a light source on these elements are detected in images taken by the camera. Reflections from the small flat elements are point-like reflections from locations which are fixed with respect to the patch, whereas reflections from the surface of the large hemispheric element may come from various locations on this surface, depending on the orientation of the patch. Therefore, when the operator moves his head, these reflections move differently in the image depending on whether they come from the flat elements or from the hemispheric element, and formulas for head angle changes using these reflection differences are provided. However, these formulations can provide only qualitative angle changes, and are valid only for very small angle changes. They are sufficient for the specific application described in the patent, but would provide incorrect results if they were applied to tracking the large displacements of an object held in the hand of an operator. In contrast, the apparatus in the present disclosure gives valid results for large displacements of an object.
An example of display cursor control by optical techniques is presented in U.S. Pat. No. 4,565,999 to King et al., 1986, entitled "Light Pencil". A device fixed to the head of the operator comprises 4 LEDs. A photodetector placed above the computer display senses the variations of intensity of the LEDs and a processor relates these variations to changes in orientation of the LEDs with respect to the photodetector. However, this system is intended for the control of horizontal displacement of a cursor on the display by the operator's vertical and horizontal rotations. It does not provide a way to detect other motions such as translations or roll, and therefore cannot be applied to general 3D object pose monitoring.