In combination, gesture and speech constitute the most important modalities in human-to-human communication. People use a large variety of gestures, either to convey what cannot always be expressed using speech alone or to add expressiveness to the communication. There has been considerable interest in incorporating both gestures and speech as a means of improving the design and implementation of interactive computer systems, a discipline referred to as Human-Computer Interaction (HCI). In addition, with the tremendous growth in demand for novel technologies in the surveillance and security field, the combination of gestures and speech is an important source for biometric identification.
Although speech and gesture recognition have been studied extensively, most attempts at combining them in a multimodal interface to improve their classification have been semantically motivated, e.g., word-gesture co-occurrence modeling. The complexity of semantic analysis has limited state-of-the-art systems that employ gestures and speech to predefined signs and controlled syntax such as “put <point> that <point> there”, which identifies the co-occurrence of the keyword “that” and a pointing gesture. While co-verbal gesticulation between humans is virtually effortless and exceedingly expressive, “synthetic” gestures tend to impose excessive cognitive load on a user, consequently defeating the purpose of making HCI natural. Part of the reason for the slow progress in multimodal HCI is the prior lack of sensing technology that would allow non-invasive acquisition of signals (i.e., data) identifying natural behavior.
The state of the art in continuous gesture recognition is far from meeting the “naturalness” requirements of multimodal HCI due to poor recognition rates. Co-analysis of visual gesture and speech signals provides an attractive prospect for improving continuous gesture recognition. However, the lack of a fundamental understanding of the underlying speech/gesture production mechanism has limited the implementation of such co-analysis to the level where a certain set of spoken words, called keywords, can be statistically correlated with certain gestures, but not much more. For example, the term “OK” is correlated with the familiar gesture of the thumb and forefinger forming a circle.
Although the accuracy of isolated sign recognition has reached 95%, the accuracy of continuous gesture recognition in an uncontrolled setting remains low, approaching only 70%. A number of techniques have been proposed to improve kinematical (visual) modeling or segmentation of gestures in the traditionally applied hidden Markov model (HMM) frameworks. Nevertheless, because unconstrained gesticulation contains a significant share of extraneous hand movements, reliance on the visual signal alone is inherently error-prone.
Multimodal co-analysis of visual gesture and speech signals provides an attractive means of improving continuous gesture recognition. This has been successfully demonstrated when pen-based gestures were combined with spoken keywords. Though the linguistic patterns deviated significantly from those of canonical English, co-occurrence patterns have been found to be effective for improving the recognition rate of the separate modalities. Previous studies of Weather Channel narration have also shown a significant improvement in continuous gesture recognition when gestures were co-analyzed with selected keywords. However, such a multimodal scheme inherits the additional challenge of dealing with natural language processing. For natural gesticulation, this problem becomes even less tractable since gestures do not exhibit one-to-one mappings of form to meaning. For instance, the same gesture movement can carry different meanings when associated with different spoken contexts; at the same time, a number of gesture forms can be used to express the same meaning. Though the spoken context is extremely important in understanding a multimodal message and cannot be replaced, the processing delays of a top-down improvement scheme for gesture recognition negatively affect task completion time.
Signal-level fusion has been successfully applied in areas ranging from audio-visual speech recognition to the detection of communication acts. Unlike lip movements (visemes), gestures are only loosely coupled with the audio signal because of the different production mechanisms involved and frequent extraneous hand movements. The variety of articulated movements also separates hand gestures from the rest of the non-verbal modalities. For instance, while head nods, which have been found to mark accentuated parts of speech, have only several movement primitives, gestures are shaped by the spatial context to which they refer. These factors notably complicate any audio-visual analysis framework that could be applied to continuous gesture recognition.
In pursuit of more natural gesture-based interaction, the present inventors previously introduced a framework called iMap. In iMap, a user manipulates a computerized map on a large screen display using free hand gestures and voice commands (i.e., keywords). A set of fundamental gesture strokes (i.e., strokes grouped according to similarity in the smallest observable arm movement patterns, referred to as “primitives”) and annotated speech constructs were recognized and fused to provide adequate interaction. The key problem in building an interface like iMap is the lack of existing natural multimodal data. In a series of previous studies, data from Weather Channel broadcasts was employed to bootstrap gesture-keyword co-analysis in the iMap framework. The Weather Channel broadcasts offer a virtually unlimited supply of bimodal data. Comparative analysis of both domains indicated that the meaningful gesture acts are co-verbal and consist of similar gesture primitives.
In human-to-human communication, McNeill distinguishes four major types of gestures by their relationship to speech. Deictic gestures are used to direct a listener's attention to a physical reference in the course of a conversation. These gestures, mostly limited to pointing, were found to be co-verbal. In previous studies in the computerized map domain, over 93% of deictic gestures were observed to co-occur with spoken nouns, pronouns, and spatial adverbials. A co-occurrence analysis of the weather narration data revealed that approximately 85% of the time when any meaningful strokes are made, they are accompanied by a spoken keyword, mostly temporally aligned during and after the gesture. This knowledge was applied to keyword-level co-occurrence analysis to improve continuous gesture recognition in the weather narration study mentioned above.
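As an illustration of the kind of keyword-level co-occurrence statistic described above, the following is a minimal sketch, not the method used in the cited studies. It assumes that gesture strokes and spoken keywords are already available as labeled time intervals, and it counts a stroke as co-occurring with a keyword when the keyword begins during the stroke or within a short lag after it ends. The `Interval` type, the function names, the 0.5 second `max_lag` window, and the sample timings are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A labeled time span in seconds (hypothetical annotation format)."""
    label: str
    start: float
    end: float


def co_occurs(stroke: Interval, keyword: Interval, max_lag: float = 0.5) -> bool:
    """True if the keyword starts during the stroke or within max_lag after it.

    max_lag is an assumed tolerance, not a value taken from the studies.
    """
    return stroke.start <= keyword.start <= stroke.end + max_lag


def co_occurrence_rate(strokes, keywords, max_lag: float = 0.5) -> float:
    """Fraction of strokes accompanied by at least one keyword."""
    if not strokes:
        return 0.0
    matched = sum(
        1 for s in strokes if any(co_occurs(s, k, max_lag) for k in keywords)
    )
    return matched / len(strokes)


# Made-up annotations for illustration only.
strokes = [
    Interval("point", 1.0, 1.6),
    Interval("contour", 3.0, 4.2),
    Interval("point", 6.0, 6.4),
]
keywords = [
    Interval("here", 1.3, 1.5),   # spoken during the first stroke
    Interval("this", 4.5, 4.7),   # spoken shortly after the second stroke
    Interval("there", 8.0, 8.3),  # too late to match the third stroke
]

rate = co_occurrence_rate(strokes, keywords)
print(f"{rate:.2f} of strokes co-occur with a keyword")
```

In this made-up sample, two of the three strokes are matched. A statistic of this kind only measures temporal alignment; it does not address the harder problems of selecting relevant keywords or resolving gesture meaning.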
Of the remaining three major gesture types, iconic and metaphoric gestures are associated with abstract ideas, mostly particular to the subjective notions of an individual. Finally, beats serve as gestural marks of speech pace. In the Weather Channel broadcasts, these last three categories constitute roughly 20% of the gestures exhibited by the narrators.
Extracting relevant words and associating them with gestures is a difficult process from the natural language understanding (computational processing) point of view. In addition, gestures often include meaningless but necessary movements, such as hand preparation and retraction; however, only the meaningful parts of gestures (strokes) can properly be associated with words. Further, the ambiguity in associating gesture motion with gesture meaning (e.g., the same gesture can refer to different meanings) makes the problem of associating gestures with words even more difficult.