In systems that use multimodal inputs, such as simultaneous speech, writing, and gesturing, for operating software applications, each unimodal input is typically time segmented and recognized by specialized input functions such as speech recognition, word processing, image recognition, and touch detection functions, which produce individual multimodal interpretations. The time segments may be called turns. Each multimodal interpretation is characterized by being identified to a modality (i.e., the identity of the input and recognizer), by being given a multimodal type and a confidence score, and by values that are generated for a set of attributes associated with the multimodal type and modality. The set of information that includes the identification of the modality, the multimodal type, the confidence score, and the attribute values is sometimes called a type feature structure. In some instances, the recognition function generates a plurality of multimodal interpretations from one input. For example, a gesture that points to a map may be interpreted as identifying a region of the map, or a hotel that is on the map. In such instances, the recognizer generates a set (in this example, two) of ambiguous multimodal interpretations, each of which typically has a lower confidence score than when one multimodal interpretation is generated. In other instances, the recognition function can generate a plurality of multimodal interpretations from one input that result from sequential actions (e.g., two gestures may be made during one turn). Such multimodal interpretations are independent (not ambiguous).
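By way of illustration only, a type feature structure of the kind described above might be represented as shown in the following minimal sketch. The class name, the field names, and the sample values (including the hotel name and the confidence scores) are hypothetical and are not taken from any reported implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class TypeFeatureStructure:
    """One multimodal interpretation produced by a recognizer (illustrative)."""
    modality: str      # identity of the input and recognizer, e.g. "gesture"
    mm_type: str       # multimodal type assigned by the recognizer
    confidence: float  # confidence score, assumed here to lie in [0, 1]
    attributes: Dict[str, Any] = field(default_factory=dict)  # attribute values


# A pointing gesture on a map that the recognizer cannot resolve yields a set
# of two ambiguous interpretations, each with a reduced confidence score.
ambiguous_gesture = [
    TypeFeatureStructure("gesture", "region", 0.55, {"map_area": "downtown"}),
    TypeFeatureStructure("gesture", "hotel", 0.40, {"hotel_name": "Example Inn"}),
]
```

In this sketch the ambiguous set is simply a list whose members share one modality and one turn; independent interpretations from sequential actions would need to be kept distinct from such a set.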
The multimodal interpretations generated during a time segment are then analyzed as a set to determine their most probable meaning when interpreted together. One or more joint multimodal interpretations are generated, and a unified type feature structure is generated for each joint multimodal interpretation. An application then uses the unified type feature structure as an input.
In some reported implementations, such as that described in “Multimodal Interfaces That Process What Comes Naturally”, by Sharon Oviatt and Philip Cohen, Communications of the ACM, March 2000, Vol. 43, No. 3, when an ambiguous set of multimodal interpretations is generated, combinations are formed using different members of the set of ambiguous multimodal interpretations, and the confidence score of each multimodal interpretation is evaluated in a variety of ways to select a top-ranked joint multimodal interpretation to send to “the system's ‘application bridge’ agent, which confirms the interpretation with the user and sends it to the appropriate backend application.” This approach is unsatisfactory because the selection of the top-ranked joint multimodal interpretation is not sufficiently reliable to be used without user confirmation, and the confirmation step slows down input.
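The general idea of forming combinations from ambiguous sets and selecting a top-ranked joint multimodal interpretation may be sketched as follows. This is an illustration only: the cited article evaluates confidence scores in a variety of ways, and the product of confidence scores used here is an assumed stand-in, not the scoring method of that article. All names and sample values are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass(frozen=True)
class Interpretation:
    modality: str
    mm_type: str
    confidence: float


def rank_joint(
    ambiguous_sets: List[List[Interpretation]],
) -> List[Tuple[float, Tuple[Interpretation, ...]]]:
    """Form every combination that takes one member from each ambiguous set
    and rank the combinations by an assumed score (product of confidences)."""
    scored = []
    for combo in product(*ambiguous_sets):
        score = 1.0
        for interp in combo:
            score *= interp.confidence
        scored.append((score, combo))
    # Highest score first; the first entry is the top-ranked candidate.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


speech = [Interpretation("speech", "hotel_query", 0.8)]
gesture = [
    Interpretation("gesture", "region", 0.55),
    Interpretation("gesture", "hotel", 0.40),
]
ranked = rank_joint([speech, gesture])
top_score, top_combo = ranked[0]
```

With these sample values the top-ranked combination pairs the speech interpretation with the "region" gesture, although the "hotel" gesture may have been what the user intended, which illustrates why such a selection may still call for confirmation.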
Another limitation of some reported implementations is that there is no proposed mechanism for handling independent, non-ambiguous multimodal interpretations from one modality in one turn; the duration of turns has to be managed to avoid independent, non-ambiguous multimodal interpretations from one modality in one turn, and when such management fails, unreliable joint multimodal interpretations result.
What is needed is a more comprehensive and reliable approach for handling multimodal inputs so as to generate better information to pass on to applications.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.