Speech-prompted interfaces have been used in telecommunications systems in contexts where there is no visual display or the user is unable to use a visual display. Typically, a speech interface prompts the user when to speak by providing a speech prompt, i.e. a recognisable phrase or question prompting for user input, or by emitting a `speak now` beep, i.e. an earcon. After the prompt, a speech recognizer is turned on for a limited time window, typically a few seconds, during which time the user may respond.
Telecommunications systems with a speech recognition capability have been in use for some time for performing basic tasks such as directory dialling. There are also network based speech recognition servers that deliver speech enabled directory dialling to any telephone. Typically, when these systems offer a graphical user interface, i.e. a visual display, and a speech interface in addition to a conventional tactile interface, i.e. a keypad, the interfaces are discrete and non-integrated. That is, the system does not allow user tactile input and speech input at the same time.
Computer users have long been used to inputting data using a keyboard or drawing tablet, and receiving output in graphical form, i.e. visual information from a screen display which may include full motion, colour displays with supporting auditory `beeps`. Speech processors for computers are now available with speech recognizers for receiving speech input and converting it to text, and speech processors for providing speech output. Typically, the speech processing is embedded within an application which is turned on and off by the user as required.
Speech output and speech recognition capability are being added to a variety of other electronic devices. Devices may be provided with tactile interfaces in addition to, or instead of, conventional keypads for inputting data. For example, there are a number of hand-held devices, e.g. personal organisers, that support pen input for writing on a touch sensitive area of a display, and cellular phones may have touch-sensitive displays as well as a regular numeric keypad.
To overcome the inconvenience of switching between discrete applications offering different modes of interaction, systems are being developed to handle more than one type of interface, i.e. more than one mode of input and output, simultaneously. In the following description the term input/output modality refers to a sensory modality relating to a user's behaviour in interacting with the system, i.e. by using auditory, tactile and visual senses. Input/output modes refer to specific examples of use of these modalities. For example, speech and audio input/output represent an auditory modality; use of a keypad, pen, and touch sensitive buttons represents a tactile input modality; and viewing a graphical display relies on the visual modality.
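The distinction between modes and modalities drawn above can be illustrated with a minimal Python sketch. The mode names and the `modalities_in_use` helper are hypothetical and chosen only for illustration; the groupings follow the examples given in the preceding paragraph.

```python
from enum import Enum


class Modality(Enum):
    """The three sensory modalities named in the description."""
    AUDITORY = "auditory"
    TACTILE = "tactile"
    VISUAL = "visual"


# Hypothetical mode names; each specific input/output mode is an
# example of use of exactly one modality, as in the text's examples.
MODE_TO_MODALITY = {
    "speech": Modality.AUDITORY,
    "audio": Modality.AUDITORY,
    "keypad": Modality.TACTILE,
    "pen": Modality.TACTILE,
    "touch_button": Modality.TACTILE,
    "graphical_display": Modality.VISUAL,
}


def modalities_in_use(active_modes):
    """Return the set of sensory modalities covered by the active modes."""
    return {MODE_TO_MODALITY[mode] for mode in active_modes}
```

For instance, a session using the hypothetical modes `"speech"` and `"pen"` spans the auditory and tactile modalities, whereas `"keypad"` and `"pen"` together involve only the tactile modality.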
An example of a multimodal interface is described in copending U.S. application Ser. No. 08/992,630 entitled "Multimodal User Interface", filed Dec. 18, 1997, to Smith and Beaton, which is incorporated herein by reference. This application discloses a multimodal user interface for a telecommunications system, and methods to facilitate multiple modes of interfacing with users, for example using voice, hard keys, touch sensitive soft key input, and pen input. This system provides, e.g., for voice or key input of data, and for graphical and speech data output. The user may choose to use the most convenient mode of interaction with the system, and the system responds to input from all modes.
While interfaces for communications devices and computer systems are becoming increasingly able to accept input and provide output through various sensory modalities, existing systems and devices present some problems when the user tries to use particular input/output modalities according to the task at hand.
In using such an interface, for example, a user might request an item using speech, and then be presented with a list of choices on the screen that requires some scrolling to access the relevant section of the list. At this point the user may choose to touch the scroll control and then touch the required item on the list.
Ideally, the user wants to transition smoothly from one type of input/output modality to another, e.g. from a primarily speech input/output to a graphical and touch control structure. However, there are problems with providing this transition in practice because there is an intrinsic conflict between speech interaction and graphical interaction styles.
Current graphical interfaces are directed through a task by a user. Nothing happens unless a user clicks on a screen based object or types from a keyboard. The user maintains control of the interaction, and can pause and restart the task at any time.
In contrast, speech interfaces tend to direct a user through a task. The user initiates the interaction, and thereafter the speech recognizer prompts the user for a response, e.g. asks the user to repeat a name, and expects an almost immediate input. As mentioned above, speech recognizers for communications devices typically operate within a limited time window, usually within a few seconds after a speech prompt. Thus, the timing of the listening window of the speech recognizer controls the requirement for the user to respond, to avoid an error or reprompting. Users often report feeling rushed when prompted to respond immediately after a beep or other speech prompt.
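The effect of a limited listening window can be sketched as follows. This is a simplified illustration, not a description of any particular recognizer: the `ListeningWindow` class, the `handle_utterance` function, and the 3 second default are assumptions standing in for "a few seconds", and times are passed in as plain numbers so that the behaviour is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListeningWindow:
    """Hypothetical recognizer window that opens when the prompt ends."""
    prompt_end: float      # time at which the prompt or beep finished (seconds)
    duration: float = 3.0  # assumed window length, standing in for "a few seconds"

    def is_open(self, t: float) -> bool:
        # The recognizer listens only from the end of the prompt until
        # the window expires.
        return self.prompt_end <= t < self.prompt_end + self.duration


def handle_utterance(window: ListeningWindow, speech_start: Optional[float]) -> str:
    """Classify one prompt cycle by when, or whether, the user began speaking."""
    if speech_start is None:
        return "reprompt"   # silence: the window timed out
    if window.is_open(speech_start):
        return "recognize"  # speech began inside the window
    return "reprompt"       # speech began before the window opened or after it closed
```

With a prompt ending at 10.0 s, speech beginning at 11.5 s is passed to recognition, while speech beginning at 13.0 s or later, or no speech at all, leads to reprompting. The user is given no indication of where that boundary lies.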
Natural language processors are known which are on all the time, and thus can accept speech input at any time. However, these advanced speech recognizers require the processing power of a network based system and are not yet widely used. Consequently, for most speech recognizers, there is a limited time window to respond after a speech prompt, and the user receives no indication of how long there is to respond.
In use of a multimodal interface, a user may feel particularly pressured after switching to a touch and/or graphical input/output mechanism when the voice prompts remain active. A user who receives both graphical prompts and speech prompts may be confused as to which is the appropriate mode to provide the next input, or may find dual prompts annoying or redundant.
In some systems, speech prompts may be manually turned on and off by the user to avoid this problem. However, this procedure introduces an intrusive, unnecessary step in an interface, requiring the user to remember to switch on the speech interface before providing speech input, and to switch it off before providing input by other modes. Furthermore, manual switching on and off of the speech interface does not address management of speech based error recovery mechanisms. For example, if a user switches from speech input to pen input, and the speech interface remains on with the same sensitivity to detected speech input, a cumbersome error recovery mechanism may be invoked in cases where the recognizer was unable to detect spoken input, or was unable to interpret the detected spoken input, despite the presence of a specific pen input.
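The error recovery problem described above can be illustrated with a short sketch of a non-integrated interface in which the speech channel and the pen channel are handled independently. The function name `naive_turn`, the action labels, and the recovery message are all hypothetical; the sketch shows the problematic behaviour, not a remedy.

```python
def naive_turn(speech_result, pen_input):
    """Process one turn with the speech and pen channels handled separately.

    speech_result: recognized text, or None if nothing was detected
                   or the detected speech could not be interpreted.
    pen_input:     text entered with the pen, or None.
    Returns the list of actions the interface would take.
    """
    actions = []
    if pen_input is not None:
        actions.append(("accept", pen_input))
    # The speech channel is unaware of the pen input, so a failed
    # recognition still triggers speech based error recovery.
    if speech_result is None:
        actions.append(("error_recovery", "Sorry, please repeat."))
    else:
        actions.append(("accept", speech_result))
    return actions
```

Under these assumptions, a turn in which the user writes a name with the pen and says nothing yields both an accepted pen input and a spoken request to repeat, which is the cumbersome behaviour noted above.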