Telecommunications systems with a speech recognition capability have been in use for some time for performing basic tasks such as directory dialling. There are also network based speech recognition servers that deliver speech enabled directory dialling to any telephone. Typically, speech prompted interfaces have been used in telecommunications systems in contexts where there is no visual display or the user is unable to use visual displays, for example, a conventional telephone terminal.
Typically, the speech interface prompts the user when to speak by providing a speech prompt, i.e. a recognisable phrase or question prompting for user input, or by emitting a `speak now` beep after which the speech recognizer is turned on, for a limited time window, typically a few seconds, during which the user may respond.
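The fixed recognition window described above can be illustrated with a minimal sketch. The class name, method names, and the 3 second default are assumptions made for illustration only; the text states only that the window typically lasts a few seconds.

```python
from dataclasses import dataclass


@dataclass
class RecognitionWindow:
    """Hypothetical model of a limited 'time to talk' window opened after a prompt or beep."""

    opened_at: float       # time in seconds at which the recognizer is turned on
    duration: float = 3.0  # assumed length; the text says only "a few seconds"

    @property
    def closes_at(self) -> float:
        return self.opened_at + self.duration

    def is_open(self, t: float) -> bool:
        # The recognizer listens only between opening and closing.
        return self.opened_at <= t < self.closes_at

    def captures(self, speech_start: float, speech_end: float) -> bool:
        # An utterance is captured in full only if it lies entirely inside the window.
        return self.opened_at <= speech_start <= speech_end <= self.closes_at


window = RecognitionWindow(opened_at=1.0)
print(window.captures(1.2, 3.5))  # utterance entirely inside the window
print(window.captures(3.0, 4.5))  # utterance that runs past the end of the window
```

In this sketch, speech that begins before the beep or continues after the window closes is not captured in full, which corresponds to the situations in which a user talks while the recognizer is off.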
Users of telecommunications equipment employing speech recognition systems often report feeling rushed when prompted to respond immediately after a beep or other audible prompt.
Part of this rushed feeling may be attributed to a sense that the device will stop recognition before the user has completed their verbal request, because the user receives no indication of the time window available to respond after the recognizer is turned on, or of when the recognition window is open. The user may find it difficult to know when the speech recognizer is on, may talk when the recognizer is off, or may become confused by no response.
Other difficulties may occur if the user does not remember what is acceptable input vocabulary or grammar to use. In addition to the sense of having to respond right now, current speech interface structures do not provide the user with an opportunity to rephrase a request, or change their mind before waiting for the system to respond. The user's utterance is accepted and interpreted, and the system advances to the next logical state, which may result in an error, for example if the user mis-speaks, coughs, or simply makes a mistake. Similarly, undue hesitation after a partial response may cause the system to move prematurely to the next logical state. If this state is not where the user wants to be, the user must navigate back to the previous state and restate the request.
Currently, the best recognizers in use have a 90 to 95 percent recognition performance under optimum conditions. A noisy background environment, other speakers, user accents, or the user speaking too softly may adversely affect recognition performance.
When conditions are not optimum, additional dialogue may assist. For example, the recognizer may give repeat instructions, or provide additional instructions. Nevertheless, using speech to provide additional information is slow. Consequently the user may perceive an excessively long wait for the system to reset and issue a new prompt. Typically, speech is perceived as fast for input, and slow for output.
Many users report becoming frustrated with using interactive voice response (IVR) systems offering many choices or a multi-level menu system of choices. The user may forget long lists of instructions, or become confused or lost in a complex speech application.
User difficulties in interacting with these systems are among the reasons such speech interfaces have not yet gained as widespread acceptance as they might.
Older systems which also provide a graphical user interface, i.e. a screen display, with a speech interface, have used discrete, non-integrated techniques. That is, the system may use either a touch input or a speech input, but not both simultaneously.
To overcome the inconvenience of switching between discrete applications offering different modes of interaction, systems are being developed to handle more than one type of interface, i.e. more than one mode of input and output, simultaneously. In the following description the term input/output modality refers to a sensory modality relating to a user's behaviour in interacting with the system, i.e. by using auditory, tactile and visual senses. Input/output modes refer to specific examples of use of these modalities. For example speech and audio input/output represent an auditory modality; use of a keypad, pen, and touch sensitive buttons represent a tactile input modality, and viewing a graphical display relies on the visual modality.
An example of a multimodal interface is disclosed in application No. 08/992,630 entitled "Multimodal User Interface", filed Dec. 19, 1997, to Smith and Beaton, which is incorporated herein by reference. This application discloses a multi-modal user interface and provides a telecommunications system and methods to facilitate multiple modes of interfacing with users, for example using voice, hard keys, touch sensitive soft key input, and pen input. This system provides, e.g., for voice or key input of data, and for graphical and speech data output. The user may choose to use the most convenient mode of interaction with the system and the system responds to input from all modes.
Thus, interfaces for communications devices and computer systems are becoming increasingly able to accept input and provide output by various modes.
For example, current speech recognition interfaces may be used in association with a visual display showing an icon that indicates the current word recognition state. These icons change visually when the recognition state changes from listening to not-listening. For example, a "talk now" icon may be displayed in the corner of the screen. While these icons indicate to the user that the speech recognizer is on, the icons do not overcome the user's perception of urgency to talk before the window closes. Also, as mentioned above, if an error is made, or speech input is interrupted by extraneous background noise, the system waits until the `talk now` or recognition window closes, and advances to the next logical state to recover from such an error, before issuing a new prompt and reopening the recognition window.
There also exist natural language speech interfaces that are always on, which preclude the need for beeps that inform the user of when to start talking. The user may speak at any time, and the recognizer will always be ready to listen. Currently this type of recognition is not yet widely distributed and used. These more advanced speech recognizers currently rely on a network based speech recognizer to provide the necessary processing power. Thus in the foreseeable future, this type of advanced speech recognition will co-exist with simpler forms of recognition that require a limited duration `time to talk` window, or recognition window.