Graphical user interfaces (GUIs) have significantly improved the human-computer interface by employing intuitive real-world metaphors. However, GUIs are still far from achieving the goal of allowing users to interact with computers without significant training. In addition, GUIs often rely heavily on a graphical display, a keyboard, and pointing devices, which are not always available. Mobile computers are constrained by physical size and battery power, and are often used in hands-busy or eyes-busy scenarios, which makes employing traditional GUIs a challenge. As more and more computers are designed for mobile use, and are hence subject to these constraints, traditional GUIs face an even greater challenge with respect to interaction.
Speech recognition technology enables a computer to automatically convert an acoustic signal uttered by a user into textual words, freeing the user from the constraints of the standard desktop-style interface (e.g., mouse pointer, menu, icon, and window). The technology has been playing a key role in enabling and enhancing human-machine communication. Speaking is the most natural form of human-to-human communication: one learns to speak in childhood, and people exercise spoken communication skills on a daily basis. Translating this naturalness of communication into a capability of the computer is a logical expectation, since computers are equipped with substantial computing and storage capacities.
However, the expectation that computers should be good at speech has not yet become a reality. One important reason is that speech input is prone to error, owing to the imperfection of speech recognition technology in handling variability across speakers, speaking styles, and acoustic environments. While spoken language has the potential to provide a natural interaction model, the difficulty of resolving the ambiguity of spoken language and the high computational requirements of speech technology have so far prevented it from becoming mainstream in computer user interfaces. This imperfection, in addition to a number of social and other reasons, means that speech alone is not sufficient as a most desirable input to computers. The use of multimodal inputs in a human-computer interface (HCI) system, which fuses two input modalities (e.g., speech and pen, or speech and mouse) to overcome the robustness limitations of speech technology and to complement speech input in other ways, has been explored. However, conventional multimodal input systems leave considerable room for improvement toward providing an efficient HCI.