A. Field of the Invention
The invention generally relates to speaker barge-in in connection with voice recognition systems, and relates more specifically to apparatus for detecting the onset of user speech on a telephone line which also carries voice prompts for the user.
B. Description of Related Art
Voice recognition systems are increasingly forming part of the user interface in many applications involving telephonic communications. For example, they are often used to both take and provide information in such applications as telephone number retrieval, ticket information and sales, catalog sales, and the like. In such systems, the voice system distinguishes between speech to be recognized and background noise on the telephone line by monitoring the signal amplitude, energy, or power level on the line and initiating the recognition process when one or more of these quantities exceeds some threshold for a predetermined period of time, e.g., 50 ms. In the absence of interfering signals, speech onset can usually be detected reliably and within a very brief period of time.
Frequently telephonic voice recognition systems produce voice prompts to which the user responds in order to direct subsequent choices and actions. Such prompts may take the form of any audible signal produced by the voice recognition system and directed at the user, but frequently comprise a tone or a speech segment to which the user is to respond in some manner. For some users, the prompt is unnecessary, and the user frequently desires to "barge in" with a response before the prompt is completed. In such circumstances, the signal heard by the voice recognition system or "recognizer" then includes not only the user's speech but its own prompt as well. This is due to the fact that, in telephone operation, the signal applied to the outgoing line is also fed back, usually with reduced amplitude, to the incoming line as well, so that the user can hear his or her own voice on the telephone during its use.
The return portion of the prompt is referred to as an "echo" of the prompt. The delay between the prompt and its "echo" is on the order of microseconds and thus, to the user, the prompt appears not as an echo but as his or her own contemporaneous conversation. However, to a speech recognition system attempting to recognize sound on the input line, the prompt echo appears as interference which masks the desired speech content transmitted to the system over the input line from a remote user.
Current speech recognition systems that employ audible prompts attempt to eliminate their own prompt from the input signal so that they can detect the remote user's speech more easily and turn off the prompt when speech is detected. This is typically done by means of local "echo cancellation", a procedure similar to, and performed in addition to, the echo cancellation utilized by the telephone company elsewhere in the telephone system. See, e.g., "A Single Chip VLSI Echo Canceler", The Bell System Technical Journal, vol. 59, no. 2, February 1980. Speech recognition systems have also been proposed which subtract a system-generated audio signal broadcast by a loudspeaker from a user audio signal input to a microphone which also is exposed to the speaker output. See, for example, U.S. Pat. No. 4,825,384, "Speech Recognizer," issued Apr. 25, 1989 to Sakurai et al. Systems of this type act in a manner similar to those of local echo cancellers, i.e., they merely subtract the system-generated signal from the system input.
Local echo cancellation is helpful in reducing the prompt echo on the input line, but frequently does not wholly eliminate it. The component of the input signal arising from the prompt which remains after local echo cancellation is referred to herein as "the prompt residue". The prompt residue has a wide dynamic range and thus requires a higher threshold for detection of the voice signal than is the case without echo residue; this, in turn, means that the voice signal often will not be detected unless the user speaks loudly, and voice recognition will thus suffer. Separating the user's voice response from the prompt is therefore a difficult task which has hitherto not been well handled.