Multi-modal user input systems support multiple user input modalities with different response delays, even though user interaction occurs in real time and response delays are undesirable. Low-delay inputs such as a keyboard, mouse, pointing device, or touch screen respond to user inputs without significant delay. High-latency inputs, on the other hand, exhibit a significant response latency between receiving a user input and providing the corresponding completed response.
For example, a high-latency input such as automatic speech recognition has a response latency that is inherent in the speech recognition process, which requires a significant amount of audio (corresponding to several words) before it can produce recognition text that matches the input speech with a high degree of probability. In addition, a user input may be associated with a remote server, in which case the response latency also reflects data transfer delays occurring over a computer network. For example, a speech recognition process may need to send the input speech audio over a computer network to a remote server where the speech recognition engine resides, and the corresponding recognition text output may need to be sent back to the local client that displays the user interface to the user. The responsiveness of a multi-modal user input system is usually controlled by the input with the highest response latency.
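The relationship described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each input's response latency is the sum of a processing delay and a network round trip, and that the slowest input bounds overall responsiveness. The class name, function name, and all numeric values are hypothetical and do not come from any particular system.

```python
from dataclasses import dataclass


@dataclass
class InputModality:
    """Hypothetical model of one input modality and its delays (in seconds)."""
    name: str
    processing_latency_s: float        # delay inherent in processing the input
    network_round_trip_s: float = 0.0  # zero for purely local inputs

    @property
    def response_latency_s(self) -> float:
        # Assumed model: total latency is processing plus network transfer.
        return self.processing_latency_s + self.network_round_trip_s


def bottleneck(modalities):
    """Return the modality with the highest response latency."""
    return max(modalities, key=lambda m: m.response_latency_s)


# Illustrative values only; real delays depend on the system and network.
modalities = [
    InputModality("keyboard", 0.01),
    InputModality("touch screen", 0.02),
    InputModality("speech (remote server)", 1.5, 0.3),
]

slowest = bottleneck(modalities)
print(f"{slowest.name}: {slowest.response_latency_s:.2f} s")
```

In this sketch the server-based speech input dominates, which mirrors the observation that the highest-latency input usually controls the responsiveness of the whole system.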
The effects of response latencies can be minimized to some degree, but they cannot be entirely eliminated due to algorithmic limitations in the speech recognition process and physical limitations on computer network speed. Still, it is very desirable to minimize the effects of response latencies for the user.
In a real time speech recognition arrangement, the user effects associated with response latencies are two-fold. First, the user has no clear picture of the current state of the system. If an utterance has been spoken, but the recognized text has not yet appeared on the user interface, the system presents an undefined state to the user. For all the user knows, the system may have failed to record the audio, the network connection may have been interrupted in a server-based speech recognition system, the speech recognition engine may have failed to produce output text, or there may be a delay and results may be produced eventually.
In addition, the speaker cannot continue with workflow tasks until the results from the pending input utterance have been completely processed and the user interface has been updated. For example, if a user has dictated text for a specific location in a document or form and wants to dictate additional text at a different location or form field, this is usually not possible until the recognition text from the first dictation has been inserted into the document.
In some cases, the waiting time caused by response latency simply must be accepted. For example, if the speaker dictates into a search field and wants to act on the search results, no action is possible until the results have been presented. On the other hand, maximizing the duration of a single workflow task can minimize some response latency effects. For example, response latency effects are reduced if the user can dictate a long document in one extended passage rather than waiting for each individual sentence to be displayed before dictating the next sentence. This suggests a “batch processing” work style that may not be desirable in highly interactive multi-modal applications that mix latency-encumbered input modes such as speech recognition with low-delay input modes such as touch, mouse, or keyboard input, which can be processed immediately in real time.
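The benefit of dictating in one extended passage can be shown with simple arithmetic. The sketch below is a simplified model, not a measurement: it assumes the user waits out the full response latency after every sentence in the sentence-by-sentence case, and pays that latency only once, at the end, in the extended-passage case. The function names and the example numbers are hypothetical.

```python
def total_time_sentence_by_sentence(n_sentences, speak_s, latency_s):
    """User speaks a sentence, then waits for its text before continuing."""
    return n_sentences * (speak_s + latency_s)


def total_time_extended_passage(n_sentences, speak_s, latency_s):
    """User speaks all sentences back to back and waits once at the end."""
    return n_sentences * speak_s + latency_s


# Hypothetical example: 10 sentences, 4 s of speech each, 2 s response latency.
n, speak_s, latency_s = 10, 4.0, 2.0
waiting = total_time_sentence_by_sentence(n, speak_s, latency_s)
batched = total_time_extended_passage(n, speak_s, latency_s)
print(f"sentence by sentence: {waiting} s, extended passage: {batched} s")
```

Under these assumptions the time lost to latency grows with the number of sentences in the first case and stays constant in the second, which is why the batch style reduces latency effects at the cost of interactivity.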