A user interface “modality” may be defined as a mode of interaction between a user and an electronic device, where the interaction is implemented through the device's user interface. A user interface modality may be implemented through a combination of hardware and software associated with a particular type of human perceptible information (e.g., information perceptible by sight, sound or touch) and/or human-generated information (e.g., information generated through speech or other physical action). For example, one type of user interface modality is a “visual modality,” which may be implemented through a display screen and associated hardware and software for generating visual displays on the display screen. A visual modality also may be implemented using various input devices that facilitate user interaction with a visual display, such as input devices that enable the user to select information that is rendered in the visual display (e.g., using a scrolling mechanism, touchscreen or arrow keys), to enter information into fields of the visual display (e.g., using a keypad), and/or to change the focus of the visual display from one field to another. Another type of user interface modality is a “voice modality,” which may be implemented using a microphone, speaker, and associated hardware and software adapted to receive and digitize human speech, and/or to output audio information (e.g., audio prompts or other audio information). Other types of user interface modalities include, for example, gestural modalities and pen modalities, to name just two.
In the interest of providing improved usability over “single-modality” devices, many electronic devices include a “multi-modal” user interface, which is a user interface that provides more than one user interface modality. For example, an electronic device may provide both a visual modality and a voice modality. Such a device may, for example, simultaneously output visual information (e.g., displayed information) and associated audio information (e.g., audio prompts), and/or the device may enable a user to input information via speech, a keypad or both, as the user desires. Generally, a device having a multi-modal user interface provides an improved user experience, because the user may choose the modality with which he or she will interact with the device. Interaction using a voice modality may be desirable, for example, in situations in which the user desires hands-free interaction, when typing is too time-consuming, and/or when the user is impaired either permanently (e.g., due to arthritis or some other physical disability) or situationally (e.g., when the user is wearing gloves and/or when the user's hands are occupied with other tasks). In contrast, interaction using a visual modality may be desirable, for example, in situations in which complex information is to be rendered, when auditory privacy is desired, when noise restrictions exist, and/or when there are auditory constraints either permanently (e.g., when the user has a heavy accent, a speech impediment, and/or a hearing impairment) or situationally (e.g., when there is significant background noise).
A multi-modal user interface may be implemented in conjunction with an application that operates in a networked environment (e.g., a client-server system environment). In such a case, the user interacts with the multi-modal user interface on a client device (e.g., a cellular telephone or computer), and the client device communicates with one or more other devices or platforms (e.g., a server) over a network. In such a networked environment, two basic techniques have been implemented to design client-server system elements that support a multi-modal user interface, and more particularly system elements that support a user interface adapted to provide at least visual and voice modalities. Using an “embedded” technique, substantially all of the requisite hardware and software associated with the multiple modalities are included in the client device itself. For example, the client device may include software and hardware adapted to perform audio-related tasks, such as speech processing, speech recognition, and/or speech synthesis, among other things. Generally, such audio-related tasks necessitate special processors or processing engines (e.g., digital signal processors) and substantial amounts of memory (e.g., for storing tables and software associated with the audio-related tasks). Using a “distributed” technique, some of the processing associated with one or more of the modalities may be shifted to another processing element, such as a remote server. For example, when a user speaks, audio data may be sent from the client device to a remote server, and the remote server may perform some or all of the audio-related tasks and return data, error messages, and/or processing results to the client device.
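By way of illustration only, the division of labor in a distributed technique may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular system: the names RecognitionResponse, RemoteSpeechServer, ThinClient, and speak_into_field are hypothetical, the "recognizer" is a toy lookup table, and the network hop is replaced by an in-process call.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RecognitionResponse:
    """Hypothetical server reply: either a processing result or an error message."""
    text: Optional[str] = None
    error: Optional[str] = None


class RemoteSpeechServer:
    """Stand-in for a remote server that performs the audio-related tasks."""

    def __init__(self, vocabulary: Dict[bytes, str]) -> None:
        # Toy "recognizer" (an assumption): maps known audio payloads to text.
        self._vocabulary = vocabulary

    def recognize(self, audio: bytes) -> RecognitionResponse:
        if not audio:
            return RecognitionResponse(error="empty audio")
        text = self._vocabulary.get(audio)
        if text is None:
            return RecognitionResponse(error="no match")
        return RecognitionResponse(text=text)


class ThinClient:
    """Client that keeps the visual modality local and offloads speech processing."""

    def __init__(self, server: RemoteSpeechServer) -> None:
        self._server = server
        self.fields: Dict[str, str] = {}  # contents of the visual display's fields

    def speak_into_field(self, field_name: str, audio: bytes) -> bool:
        # In a real deployment this call would cross the network.
        response = self._server.recognize(audio)
        if response.error is not None or response.text is None:
            return False  # the display is left unchanged on an error message
        self.fields[field_name] = response.text
        return True


if __name__ == "__main__":
    server = RemoteSpeechServer({b"\x01\x02": "Chicago"})
    client = ThinClient(server)
    client.speak_into_field("city", b"\x01\x02")
    print(client.fields)
```

In this sketch the client holds no speech recognition logic at all, which corresponds to shifting the audio-related tasks to the remote server. An embedded technique would instead place the recognizer inside the client.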
Each technique has its advantages and disadvantages. For example, an advantage of some conventional distributed techniques is that some of the computationally intensive processes associated with a multi-modal user interface (e.g., audio-related tasks) may be shifted out of the client device to another processing element (e.g., a remote server), as just mentioned. Accordingly, the client device need not include special processors or processing engines (e.g., digital signal processors) and extra memory for implementing the tasks that are displaced from the client device. This means that the client device may be designed in a more cost-effective manner (e.g., the device may be designed as a “thin” client) than client devices that implement an embedded technique.
However, using conventional distributed techniques, the states of the various modalities need to be synchronized between the client and the server. Consistent synchronization between the states of multiple modalities is difficult to achieve across a network. More particularly, using conventional distributed techniques, synchronization between a visual modality and a voice modality may be unreliable due to latencies inherent in the network communications, among other things. For example, when a user verbally provides an input for a data entry field, the client device sends data reflecting the verbal input to the server and waits for a speech recognition result to be returned by the server before the visual display can be updated to display the speech recognition result. In some cases, updating the visual display to reflect the verbally provided input may not occur in a sufficiently timely manner, and the visual and voice modalities may become unsynchronized. In addition, implementation of a multi-modal user interface using conventional distributed techniques typically is performed using non-standard protocols and unconventional content authoring techniques. Accordingly, such techniques have not been readily embraced by the majority of carriers or application designers.
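The effect of network latency on synchronization may be illustrated with a simple simulation. This is a sketch under stated assumptions: time is modeled as discrete ticks, the round-trip delay is a single fixed number, and the function names simulate_field_update and out_of_sync_ticks are hypothetical. The voice state is taken to be what the user has spoken, and the visual state is what the data entry field displays.

```python
from typing import List, Tuple


def simulate_field_update(
    round_trip_ticks: int, total_ticks: int, spoken_value: str
) -> List[Tuple[int, str, str]]:
    """Return (tick, voice_state, visual_state) for each simulated tick.

    The user speaks at tick 0. The display can show the recognition result
    only after the assumed round trip to the server has completed.
    """
    history = []
    for tick in range(total_ticks):
        visual_state = spoken_value if tick >= round_trip_ticks else ""
        history.append((tick, spoken_value, visual_state))
    return history


def out_of_sync_ticks(history: List[Tuple[int, str, str]]) -> List[int]:
    """List the ticks at which the visual state lags the voice state."""
    return [tick for tick, voice, visual in history if voice != visual]


if __name__ == "__main__":
    history = simulate_field_update(
        round_trip_ticks=3, total_ticks=5, spoken_value="Chicago"
    )
    print(out_of_sync_ticks(history))
```

With a round trip of three ticks, the display lags the spoken input during ticks 0 through 2. With a round trip of zero, the two states never diverge. In practice the delay is variable rather than fixed, which is part of what makes consistent synchronization difficult.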
Accordingly, what are needed are multi-modal user interface methods and apparatus that may facilitate thin client designs and the use of standard protocols and conventional content authoring techniques, and that may overcome the synchronization issues inherent in conventional distributed techniques. Other features and characteristics of the inventive subject matter will become apparent from the subsequent detailed description and the appended claims, taken in conjunction with the accompanying drawings and this background.