Two common types of speech recognition systems are continuous and discrete. Continuous speech recognition systems detect and discern useful information from continuous speech patterns. In use, an operator may speak phrases and sentences without pausing and the continuous speech recognition system will determine the words being spoken. Continuous speech recognition systems are used, for example, in voice-input word processors that enable operators to dictate letters directly to the computer.
Discrete speech recognition systems are designed to detect individual words and phrases that are interrupted by intentional pauses, resulting in an absence of speech between the words and phrases. Discrete speech recognition systems are often used in “command and control” applications in which an operator speaks individual commands to initiate corresponding predefined control functions. In a typical use, the operator speaks a command, pauses while the system processes and responds to the command, and then speaks another command. The system detects each command and performs the associated function.
In all speech recognition systems, various forms of feedback are used to indicate to the user when the system is active and ready for speech input. In many PC-based systems, feedback is provided by means of onscreen visual elements. As an example, in some commercially available dictation systems, a flashing icon indicates to the user that he/she can begin dictation. Text appears on screen as spoken words begin to be recognized. In this case, users are trained that they can speak at any time until they actively shut the recognition system off.
In data access systems, feedback is provided by spoken or audio prompts. As an example, feedback can be modeled after a conversation. The system speaks a key word or phrase, followed by a pause. It is after this pause that the user must respond with their chosen command. In this example, users are trained that they must speak after the pause and before the system times out.
Not all environments that employ a speech recognition system have the luxury of providing such clean exchanges between the system and user (i.e., knowing when the system speaks and when the user speaks). In some environments, users are concentrating on a primary task and using speech as a method of input because their hands and eyes are otherwise occupied. In this situation, feedback needs to be quick and succinct, requiring little attention from the user.
Speech interface systems can be designed to be always awake and available to accept speech commands from the user. This is very much like how two people hold a conversation. Even if one person is talking, they can still hear responses from the other person. Both talking and listening can be done at the same time. While this is a natural style of interaction, technical limitations of certain speech systems do not allow it. In many cases, if the system is always awake, it may recognize any extraneous sound it hears. For instance, if a speech system in a car is always listening for all speech commands while the radio is playing, the system may pick up words from the radio and carry out actions not intended by the vehicle operator. This is confusing and frustrating for the operator.
To avoid this potentially confusing situation, speech systems can be designed to be awake for limited periods of time and, when awake, to utilize limited sets of recognizable words. A complete list of recognized words or phrases is referred to as the “vocabulary,” and a subset of the vocabulary that the recognition system is attempting to detect at any one time is known as the “grammar.” In general, the smaller the active grammar, the more reliable the recognition because the system is only focusing on a few words or phrases. Conversely, the larger the active grammar, the less reliable the recognition because the system is attempting to discern a word or phrase from many words or phrases.
Once a command is given and accepted by the system, the user is given a predefined time limit to speak other commands in the grammar before the system goes back to sleep and stops accepting commands. Since the system is initially listening for only one or two commands, random and unwanted recognition of extraneous words is greatly reduced. However, operating a speech system that has sleep and active modes, as well as changing grammars, can be difficult and/or confusing to the operator in the absence of some form of feedback.
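By way of illustration only, the sleep and active modes and the changing grammar described above can be sketched as a small state machine. The sketch below is a minimal example under stated assumptions, not a description of any particular system: the class name SpeechInterface, the wake phrase "wake up", the example commands, and the five-second time limit are all hypothetical, and an utterance is represented as an already-recognized text string rather than audio. Time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field

# Hypothetical word lists, chosen only for illustration.
WAKE_GRAMMAR = frozenset({"wake up"})  # small grammar used while asleep
ACTIVE_GRAMMAR = frozenset({"play", "stop", "next track", "call home"})
VOCABULARY = WAKE_GRAMMAR | ACTIVE_GRAMMAR  # every phrase the system knows


@dataclass
class SpeechInterface:
    """Toy model of a speech system with sleep and active modes."""

    timeout_s: float = 5.0  # assumed time limit before going back to sleep
    awake: bool = False
    deadline: float = 0.0
    accepted: list = field(default_factory=list)

    def grammar(self, now: float) -> frozenset:
        """Return the subset of the vocabulary being listened for at time `now`."""
        if self.awake and now >= self.deadline:
            self.awake = False  # time limit elapsed: go back to sleep
        return ACTIVE_GRAMMAR if self.awake else WAKE_GRAMMAR

    def hear(self, utterance: str, now: float):
        """Accept `utterance` if it is in the active grammar, else ignore it."""
        if utterance not in self.grammar(now):
            return None  # extraneous speech (e.g. from the radio) is ignored
        self.awake = True
        self.deadline = now + self.timeout_s  # each accepted command restarts the timer
        self.accepted.append(utterance)
        return utterance
```

In this sketch, a call such as hear("play", 0.0) on a sleeping system is ignored because only the wake phrase is in the active grammar; after hear("wake up", 1.0) the larger grammar is active until the time limit elapses. Nothing in the sketch tells the operator which mode or grammar is currently active, which is the gap in feedback that the preceding paragraph identifies.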
Accordingly, there is a need for speech recognition user interfaces to guide operators through the various states and options of the speech system by using cues that can be readily ascertained by the user in a casual, hands-free, at-a-glance environment.
Another problem contemplated by the inventors concerns other types of communications that rely on asynchronous messages. For example, video conferencing, teleconferencing, and certain network-based software provide a distributed collaboration environment in which two or more people collaborate. In such situations, with some remote collaboration software, it is difficult to tell that a person at one of the sites has tried to break into the conversation.
In this distributed collaboration environment, common face-to-face cues that people intuitively rely on to know when to enter into the conversation may not be available. The video may be blurry, not all participants may be visible, or other problems may prevent participants from perceiving traditional conversational cues.
Accordingly, there is a need for a system that provides visual and/or auditory cues to facilitate distributed communications where traditional visual feedback is unattainable for technical and other reasons.