1. Field of the Invention
The present invention relates to speech recognition. More particularly, the present invention relates to detection of echo residue in speech recognition systems.
2. Background Information
Speech recognition systems may include a speech recognition engine that recognizes speech received from a user over an incoming channel. In a speech recognition system that interacts with a user, the recording from the incoming channel should not contain data from the outgoing channel. For example, in a system that uses system prompts to prompt a user to speak, system prompt signals should reside on the outgoing channel but should not carry over to the incoming channel. Echo residue occurs when signals on one channel (e.g., incoming) result from signals on another (e.g., outgoing) channel. Echo residue is responsible for users having poor experiences with new speech recognition systems. In particular, the echo residue on an incoming channel distorts the speech signals from the user that are to be recognized by a speech recognition system.
Moderate echo residue can mask a user's speech as noise, and render the system non-responsive to any user input. Loud echo residue may be improperly recognized as user input, in which case a condition known as “self barge-in” occurs. There are many causes of echo residue, including loud prompts, a poor terminating device at the switch, wrong echo-cancellation settings in the telephony board, electromagnetic (EM) interference from other equipment, bad channels, bad line cards and poor speech recognition engine parameter settings. Depending on the cause, the problem may be experienced consistently by all users, selectively by users on certain channels, or temporarily by users during a particular dialog state/prompt in an application.
Numerous articles on the subject of echo residue address severe and widespread echo residue problems. However, intermittent types of echo residue are often not addressed. The result is that many mature speech systems are still plagued with periodic complaints from users about responsiveness, and the technical team has no good way of tracking down the problem.
In many cases, the speech engine vendor is ultimately contacted to manually analyze volumes of data. The data is sometimes compiled by technical teams who manually listen to numerous user input wave files. Even for a 240-channel system handling 3000 calls daily, weeks of man-hours are dedicated to this troubleshooting, and the results are still often unsatisfactory. Although some platforms promise echo-free environments, there are no dedicated commercial products or tools that are designed to efficiently detect echo residue when it does occur. Echo residue detection is the first step to eliminating echo residue itself, particularly in situations where the echo residue is caused by factors outside of the control of the platform provider.
Unlike generic echo problems in other types of audio systems, echo residue in speech applications such as interactive voice response (IVR) applications may have very particular domain-specific causes. Thus, detection techniques may be used to isolate the causes of echo residue, and each identified cause can be individually addressed.
Commercial speech recognition engines are capable of recording the speech received over the incoming channel. FIG. 6 shows an exemplary plot portraying a recording of a conventional speech interaction on an incoming channel as amplitude versus time. In the example shown in FIG. 6, the amplitude of the recorded signal on the plot is flat when a system prompt is playing, as the user is quietly listening and providing no input. The spike shown in FIG. 6 occurs when the user speaks.
FIG. 7 shows an exemplary recording in a wave (.wav) file that contains echo residue in an incoming channel. While the user is listening (i.e., during the initial, relatively flat portion of the plot shown in FIG. 7), significant echo residue is present in the incoming audio data. If a speech recognition system were capable of determining when speech starts from the significantly higher amplitudes in the latter portion of the plot, it might seem that the system could identify the echo residue from the low amplitude signals that precede the start of speech. However, as shown in FIG. 8, an exemplary recording that contains only normal environmental noise (e.g., cell phone static, background noise) in an incoming channel is very similar to the recording that contains echo residue as shown in FIG. 7. Accordingly, the environmental noise has characteristics essentially identical to those of echo residue, and the two cannot be told apart by signal processing techniques such as low-pass filtering. As a result, a tremendous commitment of time is required for a human to manually review audio files in order to distinguish between environmental noise and echo residue.
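The amplitude-based reasoning above can be illustrated with a minimal sketch. It is not part of the described system; the function names, the frame length of 80 samples, the onset threshold of 0.2, and the synthetic test signal are all assumptions chosen for illustration. The sketch locates the start of speech as the first frame whose root-mean-square (RMS) amplitude exceeds a threshold, then reports the average RMS level of the frames before that point.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS amplitude of each consecutive, non-overlapping frame."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return levels


def find_speech_onset(levels, threshold):
    """Return the index of the first frame whose RMS exceeds threshold, else None."""
    for index, level in enumerate(levels):
        if level > threshold:
            return index
    return None


def pre_speech_level(samples, frame_len=80, threshold=0.2):
    """Return (onset_frame, mean RMS of the frames before the onset).

    frame_len and threshold are illustrative assumptions, not values
    taken from the text.
    """
    levels = frame_rms(samples, frame_len)
    onset = find_speech_onset(levels, threshold)
    lead = levels if onset is None else levels[:onset]
    mean_level = sum(lead) / len(lead) if lead else 0.0
    return onset, mean_level


def synthetic_recording():
    """Hypothetical test signal: a quiet lead-in followed by loud 'speech'."""
    quiet = [0.02 * math.sin(2 * math.pi * n / 16) for n in range(800)]
    loud = [0.8 * math.sin(2 * math.pi * n / 16) for n in range(400)]
    return quiet + loud


if __name__ == "__main__":
    onset, level = pre_speech_level(synthetic_recording())
    print("speech onset frame:", onset)
    print("mean pre-speech RMS:", level)
```

A measurement of this kind yields only a low-level number for the portion before speech. A recording with echo residue (FIG. 7) and one with only environmental noise (FIG. 8) could produce similar values, which is why amplitude alone does not separate the two cases.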
Accordingly, a need exists for multi-pass echo residue detection with speech application intelligence. To solve the above-described problems, multi-pass echo residue detection with speech application intelligence is provided.