1. Field of Invention
This invention is directed to parsing and understanding of utterances whose content is distributed across multiple input modes.
2. Description of Related Art
Multimodal interfaces allow input and/or output to be conveyed over multiple different channels, such as speech, graphics, gesture and the like. Multimodal interfaces enable more natural and effective interaction, because particular modes are best-suited for particular kinds of content. Multimodal interfaces are likely to play a critical role in the ongoing migration of interaction from desktop computing to wireless portable computing devices, such as personal digital assistants, like the Palm Pilot®, digital cellular telephones, public information kiosks that are wirelessly connected to the Internet or other distributed networks, and the like. One barrier to adopting such wireless portable computing devices is that they offer limited screen real estate, and often have limited keyboard interfaces, if any keyboard interface at all.
To realize the full potential of such wireless portable computing devices, multimodal interfaces need to support more than just input from multiple modes. Multimodal interfaces also need to support synergistic multimodal utterances, in which the content of a single utterance is optimally distributed over the various available modes. To achieve this, the content from the different modes needs to be effectively integrated.
One previous attempt at integrating the content from the different modes is disclosed in “Unification-Based Multimodal Integration”, M. Johnston et al., Proceedings of the 35th ACL, Madrid, Spain, pp. 281-288, 1997 (Johnston 1), incorporated herein by reference in its entirety. Johnston 1 disclosed a pen-based device that allows a variety of gesture utterances to be input through a gesture mode, while a variety of speech utterances can be input through a speech mode.
In Johnston 1, a unification operation over typed feature structures was used to model the integration between the gesture mode and the speech mode. Unification operations determine the consistency of two pieces of partial information. If the two pieces of partial information are determined to be consistent, the unification operation combines the two pieces of partial information into a single result. Unification operations were used to determine whether a given piece of gestural input received over the gesture mode was compatible with a given piece of spoken input received over the speech mode. If the gestural input was determined to be compatible with the spoken input, the two inputs were combined into a single result that could be further interpreted.
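The unification operation described above can be illustrated with a minimal sketch. This sketch is not the implementation of Johnston 1. It assumes that feature structures are represented as nested Python dictionaries, that atomic values are consistent only when they are equal (no type hierarchy is modeled), and that `None` signals failure. The function name `unify` and the feature names `action`, `location`, `type` and `coords` are hypothetical and chosen only for illustration.

```python
def unify(a, b):
    """Return the unification of two feature structures, or None if they conflict.

    Assumption: feature structures are nested dicts; any other value is atomic.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # copy so the inputs are left unchanged
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # one conflicting feature makes the whole attempt fail
                result[key] = merged
            else:
                result[key] = b_val  # feature present only in b is simply added
        return result
    # Atomic values are consistent only when they are identical.
    return a if a == b else None


# Hypothetical spoken input: a command that still needs an area to act on.
speech = {"type": "command", "action": "zoom", "location": {"type": "area"}}

# Hypothetical gestural inputs: one drawn area and one pointing gesture.
gesture_area = {"location": {"type": "area", "coords": [(0, 0), (10, 10)]}}
gesture_point = {"location": {"type": "point", "coords": [(5, 5)]}}

combined = unify(speech, gesture_area)   # consistent: a single combined result
rejected = unify(speech, gesture_point)  # inconsistent: "area" conflicts with "point"
```

In this sketch, the spoken input and the area gesture each supply partial information about the same command, so they combine into a single structure, while the point gesture is rejected because its location type conflicts with the one required by the spoken input.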
In Johnston 1, typed feature structures were used as a common meaning representation for both the gestural inputs and the spoken inputs. In Johnston 1, the multimodal integration was modeled as a cross-product unification of feature structures assigned to the speech and gestural inputs. While the technique disclosed in Johnston 1 overcomes many of the limitations of earlier multimodal systems, this technique does not scale well to support multi-gesture utterances, complex unimodal gestures, or other modes and combinations of modes. To address these limitations, the unification-based multimodal integration technique disclosed in Johnston 1 was extended in “Unification-Based Multimodal Parsing”, M. Johnston, Proceedings of COLING-ACL 98, pp. 624-630, 1998 (Johnston 2), herein incorporated by reference in its entirety. The multimodal integration technique disclosed in Johnston 2 uses a multi-dimensional chart parser. In Johnston 2, elements of the multimodal input are treated as terminal edges by the parser. The multimodal input elements are combined together in accordance with a unification-based multimodal grammar. The unification-based multimodal parsing technique disclosed in Johnston 2 was further extended in “Multimodal Language Processing”, M. Johnston, Proceedings of ICSLP 1998, 1998 (published on CD-ROM only) (Johnston 3), incorporated herein by reference in its entirety.
Johnston 2 and 3 disclosed how techniques from natural language processing can be adapted to support parsing and interpretation of utterances distributed over multiple modes. In the approach disclosed by Johnston 2 and 3, speech and gesture recognition produce n-best lists of recognition results. The n-best recognition results are assigned typed feature structure representations by speech interpretation and gesture interpretation components. The n-best lists of feature structures from the spoken inputs and the gestural inputs are passed to a multi-dimensional chart parser that uses a multimodal unification-based grammar to combine the representations assigned to the input elements. Possible multimodal interpretations are then ranked. The optimal interpretation is then passed on for execution.
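The flow from n-best lists to a ranked set of interpretations can be sketched as follows. This is a deliberately simplified illustration and not the method of Johnston 2 and 3: it replaces the multi-dimensional chart parser and multimodal grammar with a plain pairwise cross-product of one spoken and one gestural hypothesis, and it assumes that each hypothesis carries a recognition score and that a combined interpretation is ranked by the product of the two scores. The ranking criterion, the function names `unify` and `integrate`, the scores and the feature names are all assumptions made for illustration.

```python
from itertools import product


def unify(a, b):
    """Unify two nested-dict feature structures; return None on conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None
                result[key] = merged
            else:
                result[key] = b_val
        return result
    return a if a == b else None


def integrate(speech_nbest, gesture_nbest):
    """Combine every speech/gesture pair and rank the consistent results.

    Each n-best list holds (score, feature_structure) pairs.
    Assumption: a combined interpretation is scored by the product of its parts.
    """
    candidates = []
    for (s_score, s_fs), (g_score, g_fs) in product(speech_nbest, gesture_nbest):
        combined = unify(s_fs, g_fs)
        if combined is not None:  # keep only mutually consistent pairs
            candidates.append((s_score * g_score, combined))
    candidates.sort(key=lambda c: c[0], reverse=True)  # best interpretation first
    return candidates


# Hypothetical n-best lists produced by the speech and gesture interpreters.
speech_nbest = [
    (0.6, {"action": "zoom", "location": {"type": "area"}}),
    (0.4, {"action": "mark", "location": {"type": "point"}}),
]
gesture_nbest = [
    (0.7, {"location": {"type": "point", "coords": [(5, 5)]}}),
    (0.3, {"location": {"type": "area", "coords": [(0, 0), (10, 10)]}}),
]

ranked = integrate(speech_nbest, gesture_nbest)
best_score, best = ranked[0]  # the interpretation that would be passed on for execution
```

With these made-up scores, the top-ranked speech hypothesis does not survive as the best overall interpretation, because the gesture it requires is the less likely one. This illustrates why the n-best lists from both modes are combined before a final interpretation is chosen, rather than committing to the best hypothesis of each mode independently.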