Voice processing systems, whereby callers interact over the telephone network with computerized equipment, are well known in the art, and include voice mail systems, voice response units, and so on. Typically, such systems ask a caller (or called party) questions using prerecorded prompts, and the caller inputs answers by pressing dual-tone multi-frequency (DTMF) keys on their telephone. This approach has proved effective for simple interactions, but is clearly restricted in scope due to the limited number of available keys on a telephone. For example, alphabetical input is particularly difficult using DTMF keys.
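The difficulty of alphabetical input can be illustrated with a minimal sketch. It assumes the conventional letter layout printed on telephone keypads (three or four letters per key); the function name `candidate_words` is purely illustrative and does not come from any system described herein. Because each key carries several letters, a single sequence of key presses corresponds to many possible letter strings.

```python
from itertools import product
from typing import List

# Conventional letter layout on a telephone keypad (assumed here for illustration).
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}


def candidate_words(digits: str) -> List[str]:
    """Return every letter string that the given key presses could represent."""
    letter_groups = [KEYPAD[d] for d in digits]
    return ["".join(letters) for letters in product(*letter_groups)]


if __name__ == "__main__":
    words = candidate_words("2255")
    # Four key presses with three letters each give 3**4 = 81 candidates.
    print(len(words))
    # Both "CALL" and "BALL" are produced by the same key sequence.
    print("CALL" in words, "BALL" in words)
```

A system relying on DTMF alone must therefore either ask the caller to disambiguate each letter or match the candidates against a known list, both of which lengthen the interaction.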
There has therefore been an increasing tendency in recent years for voice processing systems to use speech recognition in order to augment DTMF input (the terms “speech recognition” and “voice recognition” are used interchangeably herein to denote the act of converting a spoken audio signal into text). The utilization of speech recognition permits the handling of callers who do not have a DTMF phone, and also the acquisition of more complex information beyond simple numerals from the caller.
As an illustration of the above, PCT WO96/25733 describes a voice response system which includes a prompt unit, a Voice Activity Detector (VAD), and a voice recognition unit. In this system, as a prompt is played to the caller, any input from the caller is passed to the VAD, together with the output from the prompt unit. This allows the VAD to perform echo cancellation on the incoming signal. Then, in response to the detection of voice by the VAD, the prompt is discontinued, and the caller input is switched to the recognition unit, thereby providing a barge-in facility.
Speech recognition in a telephony environment can be supported by a variety of hardware architectures. Many voice processing systems include a special digital signal processing (DSP) card for running speech recognition software. This card is connected by a time division multiplex (TDM) bus to a line interface unit, for the transfer of telephony data. Most commercial voice processing systems, and more particularly their line interface units and DSP cards, conform to one of two standard architectures: either the Signal Computing System Architecture (SCSA), or the Multi-Vendor Integration Protocol (MVIP). A somewhat different configuration is described in GB 2280820, in which a voice processing system is connected via a local area network to a remote server, which provides a speech recognition facility. This approach is somewhat more complex than the TDM approach, given the data communication and management required, but does offer significantly increased flexibility.
Speech recognition systems are generally used in telephony environments as cost-effective substitutes for human agents, and are adequate for performing simple, routine tasks. It is important that such tasks are performed accurately, since otherwise there may be significant caller dissatisfaction. It is also important that they are performed as quickly as possible, both to improve caller throughput, and because the owner of the voice processing system is often paying for the call, either via some FreePhone mechanism (e.g., a 1-800 number), or because an outbound application is involved.
(Note that as used herein, the term “caller” simply indicates the party at the opposite end of a telephone connection to the voice processing system, and does not specify which party actually initiated the telephone connection.)
One facility in prior art voice processing systems to help accelerate call handling and also to improve the user interface is barge-in. As briefly indicated above, this is where voice recognition is enabled on an incoming channel at the same time as the system is playing a prompt on the corresponding outgoing channel. This allows a caller to interrupt the prompt as soon as they know what response to give. For example, if the prompt is “Say Account for account information, say Order to order material, or say Transfer to speak to an operator”, and the caller wants account information, barge-in allows the caller to interrupt the prompt by saying “Account” before the complete prompt has finished. This is particularly useful for regular callers who are familiar with the application and the prompt menus. Following such an interruption, the application abandons the rest of the prompt and the caller interruption is passed to the recognition system for processing. The application can then proceed further on the basis of what is returned from the recognition system.
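The barge-in behaviour just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any particular prior art system: the prompt is modelled as a list of words played one at a time, the incoming channel as a parallel list in which `None` denotes silence, and the names `BargeInResult` and `play_with_barge_in` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BargeInResult:
    words_played: List[str] = field(default_factory=list)
    caller_input: Optional[str] = None  # audio handed on to the recognizer


def play_with_barge_in(prompt_words: List[str],
                       caller_frames: List[Optional[str]]) -> BargeInResult:
    """Play the prompt word by word, abandoning it at the first caller input.

    caller_frames[i] is whatever arrives on the incoming channel while
    prompt_words[i] is being played (None means nothing was detected).
    """
    result = BargeInResult()
    for i, word in enumerate(prompt_words):
        result.words_played.append(word)
        heard = caller_frames[i] if i < len(caller_frames) else None
        if heard is not None:
            # Any detected input interrupts the prompt; the input is then
            # passed to the recognition system for processing.
            result.caller_input = heard
            break
    return result


if __name__ == "__main__":
    prompt = "Say Account for account information say Order to order material".split()
    # The caller says "Account" while the fifth prompt word is playing.
    incoming = [None, None, None, None, "Account"]
    outcome = play_with_barge_in(prompt, incoming)
    print(outcome.words_played)
    print(outcome.caller_input)
```

In this sketch the remainder of the prompt is never played once input is detected, and the application would continue on the basis of whatever the recognition system returns for `caller_input`.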
One problem with prior art barge-in systems is that they can be confused by noise on the telephone line. For example, if the caller coughs, the outgoing prompt may be suspended even though the caller actually still desires to hear the rest of the prompt. This can leave a very awkward situation, with the machine expecting further input from the caller, and the caller expecting further output from the machine. The result can be a suspended or confused dialogue with the caller, resulting ultimately in a wasted or highly ineffective call.
A known solution to this for discrete word (small vocabulary) recognition systems, which typically only recognize one or two dozen different inputs (e.g., numerals 0-9), is to wait for the recognition result to be returned before interrupting the outgoing prompt. Thus, if the supposed caller input is not recognized, perhaps because it is noise or some irrelevant caller interjection, then the play out of the prompt is continued. In other words, the prompt is only interrupted where there is a successful recognition result.
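The recognition-gated approach for small vocabulary systems can be sketched as follows. The sketch makes several assumptions for illustration only: the recognizer is a stand-in that simply checks the input against a fixed word list, the vocabulary shown is hypothetical, and the incoming channel is modelled as a list aligned with the prompt words, with `None` denoting silence.

```python
from typing import List, Optional, Tuple

# Hypothetical small vocabulary; a real discrete word system might instead
# recognize, for example, the numerals 0-9.
VOCABULARY = {"account", "order", "transfer"}


def recognize(utterance: str) -> Optional[str]:
    """Stand-in recognizer: return the word if it is in the vocabulary, else None."""
    word = utterance.strip().lower()
    return word if word in VOCABULARY else None


def play_with_gated_barge_in(prompt_words: List[str],
                             caller_frames: List[Optional[str]]
                             ) -> Tuple[List[str], Optional[str]]:
    """Play the prompt, interrupting only on a successful recognition result."""
    played: List[str] = []
    for i, word in enumerate(prompt_words):
        played.append(word)
        heard = caller_frames[i] if i < len(caller_frames) else None
        if heard is None:
            continue
        result = recognize(heard)
        if result is not None:
            # Recognized input: abandon the rest of the prompt.
            return played, result
        # Unrecognized input (noise, a cough, an aside): keep playing.
    return played, None


if __name__ == "__main__":
    prompt = "Say Account for account information say Order to order material".split()
    # A cough during the second word is ignored; "Account" interrupts the prompt.
    incoming = [None, "<cough>", None, None, "Account"]
    played, result = play_with_gated_barge_in(prompt, incoming)
    print(played)
    print(result)
```

The design choice here is that the decision to stop the prompt is deferred until the recognizer has returned, so the prompt is only interrupted when the input matches one of the expected responses.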
Although this approach, which essentially involves modelling the recognition system and application to the likely range of caller responses, is effective for discrete word systems, more modern voice processing applications often involve large vocabulary speech recognition, for which such modelling is not feasible. For these applications, the provision of barge-in is prone to trigger the termination of the prompt even in circumstances where this was not actually the intention of the caller.