1. Field Of The Invention
The invention relates to speaker barge-in in connection with voice recognition systems, and comprises-method and apparatus for detecting the onset of user speech on a telephone line which also carries voice prompts for the user.
2. Prior Art
Voice recognition systems are increasingly forming part of the user interface in many applications involving telephonic communications. For example, they are often used to both take and provide information in such applications as telephone number retrieval, ticket information and sales, catalog sales, and the like. In such systems, the voice system distinguishes between speech to be recognized and background noise on the telephone line by monitoring the signal amplitude, energy, or power level on the line and initiating the recognition process when one or more of these quantities exceeds some threshold for a predetermined period of time, e.g., 50 ms. In the absence of interfering signals, speech onset can usually be detected reliably and within a very brief period of time.
Frequently telephonic voice recognition systems produce voice prompts to which the user responds in order to direct subsequent choices and actions. Such prompts may take the form of any audible signal produced by the voice recognition system and directed at the user, but frequently comprise a tone or a speech segment to which the user is to respond in some manner. For some users, the prompt is unnecessary, and the user frequently desires to xe2x80x9cbarge inxe2x80x9d with a response before the prompt is completed. In such circumstances, the signal heard by the voice recognition system or xe2x80x9crecognizerxe2x80x9d then includes not only the user""s speech but its own prompt as well. This is due to the fact that, in telephone operation, the signal applied to the outgoing line is also fed back, usually with reduced amplitude, to the incoming line as well, so that the user can hear his or her own voice on the telephone during its use.
The return portion of the prompt is referred to as an xe2x80x9cechoxe2x80x9d of the prompt. The delay between the prompt and its xe2x80x9cechoxe2x80x9d is on the order of microseconds and thus, to the user, the prompt appears not as an echo but as his or her own contemporaneous conversation. However, to a speech recognition system attempting to recognize sound on the input line, the prompt echo appears as interference which masks the desired speech content transmitted to the system over the input line from a remote user.
Current speech recognition systems that employ audible prompts attempt to eliminate their own prompt from the input signal so that they can detect the remote user""s speech more easily and turn off the prompt when speech is detected. This is typically done by means of local xe2x80x9cecho cancellationxe2x80x9d, a procedure similar to, and performed in addition to, the echo cancellation utilized by the telephone company elsewhere in the telephone system. See, e.g., xe2x80x9cA Single Chip VLSI Echo Cancelerxe2x80x9d, The Bell System Technical Journal, vol. 59, no. 2, February 1980. Speech recognition systems have also been proposed which subtract a system-generated audio signal broadcast by a loudspeaker from a user audio signal input to a microphone which also is exposed to the speaker output.
See, for example, U.S. Pat. No. 4,825,384, xe2x80x9cSpeech Recognizer,xe2x80x9d issued Apr. 25, 1989 to Sakurai et al. Systems of this type act in a manner similar to those of local echo cancellers, i.e., they merely subtract the system-generated signal from the system input.
Local echo cancellation is helpful in reducing the prompt echo on the input line, but frequently does not wholly eliminate it. The component of the input signal arising from the prompt which remains after local echo cancellation is referred to herein as xe2x80x9cthe prompt residuexe2x80x9d. The prompt residue has a wide dynamic range and thus requires a higher threshold for detection of the voice signal than is the case without echo residue; this, in turn, means that the voice signal often will not be detected unless the user speaks loudly, and voice recognition will thus suffer. Separating the user""s voice response from the prompt is therefore a difficult task which has hitherto not been well handled.
A. Objects of The Invention
Accordingly, it is an object of the invention to provide a method and apparatus for implementing barge-in capabilities in a voice-response system that is subject to prompt echoes.
Further, it is an object of the invention to provide a method and apparatus for implementing barge-in a telephonic voice-response system.
Another object of the invention is to provide a method and apparatus for quickly and reliably detecting the onset of speech in a voice-recognition system having prompt echoes superimposed on the speech to be detected.
Yet another object of the invention is to provide a method and apparatus for readily detecting the occurrence of user speech or other user signalling in a telephone system during the occurrence of a system prompt.
B. Brief Description Of The Preferred Embodiment of The Invention
In accordance with the present invention, I remove the effects of the prompt residue from the input line of a telephone system by predicting or modeling the time-varying energy of the expected residue during successive sampling frames (occupying defined time intervals)over which the signal occurs and then subtracting that residue energy from the line input signal. In particular, I form an attenuation parameter that relates the prompt residue to the prompt itself. When the prompt has sufficient energy, i.e., its energy is above some threshold, the attenuation parameter is preferably the average difference in energy between the prompt and the prompt residue over some interval. When the energy of the prompt is below the stated threshold, the attenuation parameter may be taken as zero.
I then subtract from the line input signal energy at successive instants of time the difference between the prompt signal and the attenuation parameter. The latter difference is, of course, the predicted prompt residue for that particular moment of time. I thereafter compare the resultant value with a defined detection margin. If the resultant is above the defined margin, it is determined that a user response is present on the input line and appropriate action is taken. In particular, in the embodiment that I have constructed that is described herein, when the detection margin is reached or exceeded, I generate a prompt-termination signal which terminates the prompt. The user response may then reliably be processed.
The attenuation parameter is preferably continuously measured and updated, although this may not always be necessary. In one embodiment of the invention that I have implemented, I sample the prompt signal and line input signal at a rate of 8000 samples/second (for ordinary speech signals) and organize the resultant data into frames of 120 samples/frame. Each frame thus occupies slightly less than one-sixtieth of a second. Each frame is smoothed by multiplying it by a Hamming window and the average energy within the frame is calculated. If the frame energy of the prompt exceeds a certain threshold, and if user speech is not detected (using the procedure to be described below), the average energy in the current frame of the line input signal is subtracted from the prompt energy for that frame. The attenuation parameter is formed as an average of this difference over a number of frames. In one embodiment where the attenuation parameter is continuously updated, a moving average is formed as a weighted combination of the prior attenuation parameter and the current frame.
The difference in energy between the attenuation parameter as calculated up to each frame and the prompt as measured in that frame predicts or models the energy of the prompt residue for that frame time. Further, the difference in energy between the line input signal and the predicted prompt residue or prompt replica provides a reliable indication of the presence or absence of a user response on the input line. When it is greater than the detection margin, it can reliably be concluded that a user response (e.g., user speech) is present.
The detection system of the present invention is a dynamic system, as contrasted to systems which use a fixed threshold against which to compare the line input signal. Specifically, denoting the line input signal as Si, the prompt signal as Sp, the attenuation parameter as Sa, the prompt replica as Sr, and the detection margin as Md, the present invention monitors the input line and provides a detection signal indicating the presence of a user response when it is found that:
Sixe2x88x92Md greater than Spxe2x88x92Sa=Sr 
or
Si greater than Md+Spxe2x88x92Sa=Md+Sr 
The term Md+Sr in the above equation varies with the prompt energy present at any particular time, and comprises what is effectively a dynamic threshold against which the presence or absence of user speech will be determined.
In one implementation of the invention that I have constructed, the variables Si, Sp, Sa and Sr are energies as measured or calculated during a particular time frame or interval, or as averaged over a number of frames, and Md is an energy margin defined by the user. The amplitudes of the respective energy signals, of course, define the energies, and the energies will typically be calculated from the measured amplitudes. The present invention allows the fixed margin Md to be smaller than would otherwise be the case, and thus permits detection of user signalling (e.g., user speech) at an earlier time than might otherwise be the case.