In communication networks, system operators often find it convenient to implement automated system voice announcements, or “prompts”, to inform subscribers, for example, of certain available features or certain actions that the subscriber must take to activate particular features. For many subscribers, this information is useful the first few times they hear it, but after hearing it several times, subscribers may wish to interrupt, or “barge-in”, during the system voice prompt because they already know what the prompt says and what actions they need to take. In existing communication networks, a barge-in capability is normally realized by running a standard speech recognizer during the voice prompt. In order to avoid erroneous speech recognizer output due to the input of both the user's voice and the sound (echo) of the prompt originating from the device's loudspeaker, an acoustic echo cancellation technique is normally utilized to suppress feedback of the prompt echo to the recognizer.
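As an illustration of the echo cancellation step mentioned above, the following is a minimal sketch of an adaptive echo canceller using the normalized least-mean-squares (NLMS) update. NLMS is one commonly used adaptive filter and is assumed here for illustration only; the text does not name a specific algorithm. The function names, the filter length, the step size, and the simulated four-tap echo path are all hypothetical. Note that the loop pairs each prompt sample with the microphone sample captured at the same instant, so the sketch presumes the two channels are time-aligned.

```python
import random

def simulate_echo(prompt, echo_path):
    """Convolve the prompt with a short impulse response (hypothetical echo path)."""
    return [
        sum(echo_path[k] * prompt[n - k] for k in range(len(echo_path)) if n - k >= 0)
        for n in range(len(prompt))
    ]

def nlms_echo_cancel(prompt, mic, taps=16, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated prompt echo from the microphone signal.

    prompt: reference samples sent to the loudspeaker
    mic:    samples picked up by the microphone, aligned sample-by-sample with prompt
    Returns the residual signal that would be passed on to the recognizer.
    """
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # most recent prompt samples, newest first
    residual = []
    for p, m in zip(prompt, mic):
        history = [p] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = m - echo_estimate
        residual.append(error)
        # NLMS update: step scaled by the energy of the reference history
        step = mu * error / (sum(x * x for x in history) + eps)
        weights = [w + step * x for w, x in zip(weights, history)]
    return residual

if __name__ == "__main__":
    random.seed(0)
    prompt = [random.uniform(-1.0, 1.0) for _ in range(3000)]
    mic = simulate_echo(prompt, [0.6, 0.3, -0.2, 0.1])  # echo only, no user speech
    residual = nlms_echo_cancel(prompt, mic)
    print(sum(e * e for e in residual[-500:]) / 500)
```

In this idealized simulation the echo path is short, fixed, and noise-free, so the filter converges quickly; real acoustic conditions are considerably less favorable.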
As next-generation communication devices are developed, it will be increasingly important to have user-friendly man-machine interfaces (MMIs) that enable the devices to be operated in a hands-free mode. This calls for multi-modal MMIs and intelligent speech-driven dialog interfaces that users readily accept and that allow flexible interaction with the system. An improved barge-in capability will be required so that a user can interrupt a system prompt simply by speaking while the prompt is being played.
There are three major shortcomings of the existing methods of providing a barge-in capability. First, conventional echo cancellation algorithms may attenuate the prompt echo only weakly; for example, echo attenuation of 10 dB or less may occur. The residual echo can then trigger the speech recognizer, causing serious misrecognition problems. Second, standard adaptive echo cancellation methods require a fixed timing correlation between the voice input channel and the voice output channel. In distributed systems, however, it is often too expensive, or not possible, to obtain this temporal correlation, especially when the recognition server and the server that plays the prompt are separated by some distance. Third, standard adaptive echo cancellation methods require a considerable amount of processing power. This is a significant challenge for embedded systems with restrictive hardware constraints and for multi-channel applications in which as many channels as possible have to be processed in parallel.
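To put the 10 dB figure in perspective, the short sketch below converts an attenuation in decibels into the fraction of echo power and amplitude that remains. The function names are hypothetical; the formulas are the standard decibel definitions (10·log10 for power ratios, 20·log10 for amplitude ratios).

```python
import math

def residual_power_ratio(attenuation_db: float) -> float:
    """Fraction of the echo's power that remains after the given attenuation."""
    return 10 ** (-attenuation_db / 10)

def residual_amplitude_ratio(attenuation_db: float) -> float:
    """Fraction of the echo's amplitude that remains after the given attenuation."""
    return 10 ** (-attenuation_db / 20)

def attenuation_db(rms_before: float, rms_after: float) -> float:
    """Attenuation in dB computed from RMS amplitudes before and after cancellation."""
    return 20 * math.log10(rms_before / rms_after)

if __name__ == "__main__":
    print(residual_power_ratio(10.0))      # share of echo power left at 10 dB
    print(residual_amplitude_ratio(10.0))  # share of echo amplitude left at 10 dB
```

At 10 dB of attenuation, one tenth of the echo power (roughly a third of its amplitude) still reaches the recognizer, which is consistent with the concern that the residual prompt echo can be mistaken for speech.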
In view of the above-described shortcomings of conventional echo cancellation methodologies, an alternative methodology is needed for enabling the user of a hands-free communication device to barge in during a system voice prompt.