1. Field of the Invention
The present invention relates to a voice-based multimodal input system with interactive context adjustment via explicit mediation, and more particularly, to a software-driven voice-based multimodal input system in which context is adjusted interactively via control signals.
2. Description of the Related Art
The most prevalent and natural means of interactive communication is the spoken language. In particular, real-time spoken communication involves no temporal gap, requires no storage, and requires no conversion into written language. This real-time quality is a constraint and an advantage at the same time. By contrast, voice signals are not nearly as prevalent as an interface to computers or electronic devices. When this natural mode of interactive communication is applied to the human-machine interface, its interactiveness can be leveraged. In other words, other kinds of interactive input modalities can be integrated to mediate the voice processing. Research in cognitive science confirms that the human brain also relies on integrating cues from a plurality of sensing modalities to recognize speech, a phenomenon exemplified by the McGurk effect.
Here, conventional arts are classified by their integration and mediation schemes for voice recognition, as depicted in FIG. 1. Interactive mediation 110 of voice recognition can occur either at the pre-processing stage 112 or the post-processing stage 111. Most existing voice recognition systems used in computers have an interactive interface to confirm the results produced by the recognition module, which occurs at the post-processing stage. U.S. Pat. No. 4,829,576, issued May 9, 1989, to Edward W. Porter, discloses a menu-driven interface 117 for post-process confirmation. Mediation at the pre-processing stage 112 is either hardware-driven 113 or software-driven 114. A hardware-driven pre-processing mediation 113 is disclosed in the aforementioned U.S. Pat. No. 4,829,576: a hardware switch 118 to convert between a dictation mode and a command mode. For software-driven mediation 114 at the pre-processing stage, a further division exists between implicit 115 and explicit 116 mediation. Explicit software-driven mediation 116 at the pre-processing stage provides explicit information, such as the speech period start and termination points, or the referent target of a command. The aforementioned U.S. Pat. No. 4,829,576 discloses a method of using voice signal amplitude 122 to determine the speech period start and termination points. Alternatively, U.S. Pat. No. 5,884,257, issued Mar. 16, 1999, to Idetsugu Maekawa et al., discloses a method of using lip image processing 123 to determine the speech period start and termination points. U.S. Pat. No. 6,990,639 B2, issued Jan. 24, 2006, to Andrew Wilson, discloses integration of a pointing device 124 to determine which component a user wants to control and what control action is desired. In the above three patents, mediation of the voice recognition occurs with an explicit input, such as lip movements or pointing device motions. For the implicit software-driven mediation 115 at the pre-processing stage, a number of prior arts exist as well.
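The amplitude-based determination of speech period start and termination points described above can be illustrated with a simple energy-threshold sketch. This is not the patented method; the frame size, threshold, and hangover count are hypothetical parameters chosen for illustration only.

```python
# Illustrative sketch (not the patented method): determining speech start
# and termination points from signal amplitude via an energy threshold.
# frame_size, threshold, and hangover are hypothetical parameters.

def detect_speech_period(samples, frame_size=160, threshold=0.02, hangover=5):
    """Return (start_frame, end_frame) of the detected speech period, or None."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    energies = [sum(s * s for s in f) / len(f) for f in frames]

    start = end = None
    quiet = 0
    for i, e in enumerate(energies):
        if e >= threshold:
            if start is None:
                start = i          # first frame above threshold: speech starts
            quiet = 0
            end = i
        elif start is not None:
            quiet += 1
            if quiet >= hangover:  # sustained silence: speech has terminated
                break
    return (start, end) if start is not None else None
```

In this sketch, the hangover count prevents brief pauses within an utterance from being mistaken for the termination point.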
Implicit software-driven mediation 115 at the pre-processing stage can aid in context determination, for more efficient recognition. U.S. Pat. No. 5,615,296, issued Mar. 25, 1997, to Vincent M. Stanford et al., discloses a software-based algorithm to implicitly perform high-speed context switching 119 to modify the active vocabulary. Also, U.S. Pat. No. 5,526,463, issued Apr. 9, 1993, to Laurence S. Gillick et al., discloses a software algorithm that uses the beginning part of speech to pre-filter 120 the set of vocabulary to match against. Finally, U.S. Pat. No. 5,677,991, issued Oct. 14, 1997, to Dong Hsu et al., discloses an arbitration algorithm 121 to mediate between a "large vocabulary isolated word speech recognition (ISR) module" and a "small vocabulary continuous speech recognition (CSR) module." All three patents above implicitly infer cues embedded in the speech without explicit user input. All three implicit software-driven mediation 115 schemes at the pre-processing stage, by design, increase recognition accuracy while reducing computation. This is not always the case with integration schemes for multiple sensing modalities. The aforementioned U.S. Pat. No. 6,990,639 B2 124 provides a means of augmenting context information at the cost of increased computation; this patent, through the combined use of a pointing device and voice input, augments voice commands with the referent, or target, of the command as a form of context information. The increased computational cost is due to the independent processing of the voice inputs and the pointing device inputs. Another such example is U.S. Pat. No. 6,499,025 B1, issued Dec. 24, 2002, to Eric J. Horvitz, which discloses a methodology for integrating multiple sensing modalities. With each added sensing modality, a Bayesian inference engine 126 is added, and computation increases proportionately.
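The vocabulary pre-filtering idea described above can be sketched as follows. This toy example is in the spirit of prefix-based pruning, not the patented algorithm; the phonetic spellings and the `prefilter_vocabulary` helper are hypothetical, and serve only to show how a hypothesized utterance beginning narrows the vocabulary the full recognizer must score, reducing computation.

```python
# Toy sketch of implicit pre-filtering (not the patented algorithm): the
# hypothesized beginning of an utterance prunes the active vocabulary
# before full matching. Phonetic spellings here are hypothetical.

def prefilter_vocabulary(vocabulary, prefix_hypotheses):
    """Keep only words whose phone sequence starts with a hypothesized prefix."""
    return [word for word, phones in vocabulary.items()
            if any(phones[:len(p)] == p for p in prefix_hypotheses)]

vocab = {
    "cat":  ["k", "ae", "t"],
    "call": ["k", "ao", "l"],
    "dog":  ["d", "ao", "g"],
    "doll": ["d", "aa", "l"],
}

# The front end hypothesizes the utterance begins with /k/ or /d ao/:
active = prefilter_vocabulary(vocab, [["k"], ["d", "ao"]])
# Only the pruned set ("cat", "call", "dog") is scored by the recognizer.
```

Because the full matching cost scales with the size of the active vocabulary, such pruning both reduces computation and removes confusable candidates, consistent with the accuracy gains described above.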
However, each one of these references suffers from one or more of the above disadvantages. Therefore, development of a more efficient system with increased accuracy and without increased computation is required.
The above information disclosed in this Background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not form the prior art that is already known in this country to a person of ordinary skill in the art.