As will be appreciated, there are a variety of hand-operated computer interface devices, namely the mouse, the pen, the joystick, the trackball and, more recently, the data glove. While these devices are presently satisfactory for many applications, some systems require more flexibility for convenient computer control.
By way of example, data gloves, which fit over the human hand and are connected to the computer by an umbilical line, control the motion of icons such as flying figures which move through a virtual reality scene. The use of such a data glove is both cumbersome and expensive in view of the number of internal sensors within the data glove and the trouble of having to put it on and take it off. As a result, researchers have searched for computer control systems which are not so hardware dependent. Gesture recognition is one such class of systems.
The detection of gestures is important because not only does the orientation of the hand give valuable information, but so does hand movement. Thus, while a static thumbs-up gesture may indicate approval, the same gesture when moving can indicate "thumbing", that is, the request for a ride. Likewise, although the attitude of a hand is detectable, it is the detection of its dynamic movement which more accurately defines the gesture.
In the past, as reported in the March 1992 IEEE conference proceedings, IEEE catalog number 92CH3168-2, in a paper entitled "Recognizing Human Action In Time Sequential Images Using Hidden Markov Model" by Yamato, Ohya and Ishii of the NTT Human Interface Laboratories in Yokosuka, Japan, a hand gesture recognition system has been described that takes static pictures of some action and utilizes the Hidden Markov Model to infer which of a set of possible gestures a given video input corresponds to. However, such an approach, which was originally developed for speech recognition purposes, can be computationally intensive. Another problem with this approach to gesture recognition is that it measures motion only inferentially, because the motion between the individual pictures is never represented or calculated.
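By way of illustration only, the kind of Hidden Markov Model classification referred to above can be sketched as follows. This is a minimal sketch and not the implementation of the cited paper: it assumes that each static picture has already been quantized to a discrete symbol, and the function names, the two toy gesture models ("wave" and "hold") and all probabilities are hypothetical values chosen for the example.

```python
def forward_likelihood(obs, start_p, trans_p, emit_p):
    """Probability of a symbol sequence under one HMM (forward algorithm)."""
    n_states = len(start_p)
    # alpha[s]: probability of the symbols seen so far, ending in state s
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans_p[p][s] for p in range(n_states))
            * emit_p[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def classify(obs, models):
    """Return the name of the model that gives the sequence the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Hypothetical two-state models: (start, transition, emission) probabilities.
# State 0 mostly emits symbol 0 and state 1 mostly emits symbol 1.
EMIT = [[0.9, 0.1], [0.1, 0.9]]
MODELS = {
    "wave": ([1.0, 0.0], [[0.1, 0.9], [0.9, 0.1]], EMIT),  # states alternate
    "hold": ([1.0, 0.0], [[0.9, 0.1], [0.1, 0.9]], EMIT),  # states persist
}

print(classify([0, 1, 0, 1], MODELS))
print(classify([0, 0, 0, 0], MODELS))
```

It will be noted that the input to such a classifier is a sequence of per-picture symbols only; the motion between pictures does not appear anywhere in the computation.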
As described in a paper delivered at the Imagina '93 Conference entitled "A Human Motion Image Synthesizing By Model-Based Recognition From Stereo Images" authored by Ishii, Mochizuki and Kishino, another approach to vision-based hand gesture recognition employs a stereo camera method. Here a model of the human figure is employed, with the model being fit to stereo range data in order to infer the angles between the joints and therefore the orientation of the arm or hand.
The most serious problem with such a system is that it is model based: if one wishes to have it work on other than a human figure, a new model must be introduced. As will be appreciated, this system is not a "low level" system because it relies on high level models in the recognition process.
Additionally, as described in MIT Media Laboratory Vision And Modeling Group Technical Report No. 197 entitled "Recognition of Space Time Gestures Using a Distributed Representation" by Trevor J. Darrell and Alex P. Pentland, gestures are detected from a series of templates, not unlike a series of pictures. Gestures are identified in this system by a sequence of static hand positions, where a particular hand position is determined by taking the template and convolving it with the entire image to find the best fit. Even though this technique offers a so-called "low level" approach because high level models are not used, the Darrell/Pentland technique is even more computationally intensive than the above-mentioned Yamato-Ohya-Ishii system due to the need for convolution with large masks. Also, being intensity-based, this system is not particularly robust to changes in lighting and, like the other systems described above, does not measure motion directly but rather analyzes a sequence of static poses.
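By way of illustration only, finding the best fit of a template by sliding it over the entire image can be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of the cited report: the zero-mean normalized correlation score, the function name `best_match` and the toy image are all choices made for this example, and the cited system may use a different score.

```python
import math


def best_match(image, template):
    """Slide the template over the image; return ((row, col), score) of the best fit.

    The score is zero-mean normalized cross-correlation, which lies in [-1, 1].
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    t_vals = [v for row in template for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_zero = [v - t_mean for v in t_vals]
    t_norm = math.sqrt(sum(v * v for v in t_zero))
    best_pos, best_score = None, -2.0
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            patch = [image[r + i][c + j] for i in range(th) for j in range(tw)]
            p_mean = sum(patch) / len(patch)
            p_zero = [v - p_mean for v in patch]
            p_norm = math.sqrt(sum(v * v for v in p_zero))
            if p_norm == 0 or t_norm == 0:
                continue  # flat patch or flat template: correlation is undefined
            score = sum(a * b for a, b in zip(p_zero, t_zero)) / (p_norm * t_norm)
            if score > best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


# Hypothetical 5x5 image containing the 2x2 template at row 1, column 2.
TEMPLATE = [[9, 1], [1, 9]]
IMAGE = [[0] * 5 for _ in range(5)]
for i in range(2):
    for j in range(2):
        IMAGE[1 + i][2 + j] = TEMPLATE[i][j]

print(best_match(IMAGE, TEMPLATE))
```

Every candidate position requires one multiplication per template pixel, so the work grows with the product of the image area and the template area, which is why large masks are costly.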
By way of further background, it will be appreciated that so-called "orientation histograms" have been utilized for texture analysis. This work is described by Mojgan Monika Gorkani of the MIT Media Laboratory in MIT Media Laboratory Perceptual Computing Group Technical Report No. 222, May 1993. In this paper, orientation histograms are developed for the purpose of analyzing "textures" by looking at local peaks in the orientation histogram. However, detecting only histogram peaks discards certain relevant information which is useful in analyzing static or dynamic gestures.
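By way of illustration only, an orientation histogram of an image can be sketched as follows. This is a minimal sketch under stated assumptions and not the method of the cited report: the central-difference gradient, the eight bins, the magnitude threshold and the function name `orientation_histogram` are all choices made for this example.

```python
import math


def orientation_histogram(image, n_bins=8, min_magnitude=1e-6):
    """Count interior pixels by local gradient orientation, folded into [0, pi)."""
    hist = [0] * n_bins
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central-difference estimates of the horizontal and vertical gradient.
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0
            if math.hypot(gx, gy) < min_magnitude:
                continue  # no measurable gradient, so no defined orientation
            angle = math.atan2(gy, gx) % math.pi
            hist[int(angle / math.pi * n_bins) % n_bins] += 1
    return hist


# Hypothetical 5x5 test images: brightness ramps across columns and across rows.
RAMP_ACROSS_COLUMNS = [[10 * c for c in range(5)] for r in range(5)]
RAMP_ACROSS_ROWS = [[10 * r for c in range(5)] for r in range(5)]

print(orientation_histogram(RAMP_ACROSS_COLUMNS))
print(orientation_histogram(RAMP_ACROSS_ROWS))
```

The returned list retains the count in every bin, whereas an analysis based only on local peaks would retain only the positions of the largest bins.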
As an application for gesture recognition, there has recently been interest in so-called teleconferencing. In teleconferencing, rather than transmitting full frame video, various scenarios are depicted at the teleconferencing site. What is actually shown to the teleconference participants is determined by, for instance, hand gestures or head gestures, or a combination of both. Such a system is described in IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 15, No. 6, June 1993, in an article entitled "Visually Controlled Graphics" by A. Azarbayejani, T. Starner, B. Horowitz, and A. Pentland. In this system, corner points are detected as the features of interest, with the corners then being tracked in space and time to determine head position. It will be appreciated that this system is not particularly well adapted to articulated objects, like the human hand.
In summary, most hand-controlled human-computer interface devices have severe limitations. The mouse, the pen and the trackball give only two-dimensional position information. Likewise, the joystick gives information about only two angles. All these methods require physical hardware for the hand to hold, which is cumbersome to have to carry, pick up or grasp.
In an effort to get away from physical hardware, model-based visual methods for recognizing hand gestures have been developed, but these tend to be slow because there are many possible ways a hand could be fitted to the visual data. Furthermore, model-based methods require the generation of a new model, and potentially a redesign of the entire algorithm, in order to extend the work to analyze non-hand inputs.
It will be appreciated that what people perceive as a gesture is not simply a series of static snapshot-type poses of a particular object, such as a hand; rather, what is perceived is the motion of the hand between these static poses. Thus, while a system which tries to measure gestures may incorporate the static snapshots of the object as it goes through its motion, it should also describe or recognize the motion itself. Since none of the above systems measures motion directly, they are incapable of the type of gesture recognition which is required.