The present invention relates to voice processing systems, and more particularly to voice processing systems which utilise particular processing resources, such as voice recognition systems.
Voice processing systems whereby callers interact over the telephone network with computerised equipment are very well-known in the art, and include voice mail systems, voice response units, and so on. Typically such systems ask a caller (or called party) questions using prerecorded prompts, and the caller inputs answers by pressing dual tone multiple frequency (DTMF) keys on their telephones. This approach has proved effective for simple interactions, but is clearly restricted in scope due to the limited number of available keys on a telephone. For example, alphabetical input is particularly difficult using DTMF keys.
There has therefore been an increasing tendency in recent years for voice processing systems to use voice recognition in order to augment DTMF input. The adoption of voice recognition permits the handling of callers who do not have a DTMF phone, and also the acquisition of more complex information beyond simple numerals from the caller.
One particular concern with voice processing systems is to allow a caller to interrupt a prompt before it has finished (for example, if they are familiar with the system from regular use, such as might be the case with a voice mail system, and therefore know in advance their desired option). Most voice processing systems already allow a caller to interrupt a prompt by pressing an appropriate DTMF key. This is achieved by listening for incoming DTMF signals at the same time as an outgoing prompt is being played.
It is desirable to allow the caller to perform a similar interruption by speaking, rather than pressing a DTMF key. The caller input is processed by the voice recogniser, and the system then performs the requested action. The ability to accept such interruptions during an outgoing prompt is known as barge-in or cut-through.
One difficulty with the support of barge-in is that an outgoing prompt may be partially echoed back from the telephone network, and then accidentally mistaken for voice input from the caller, or else distort actual voice input from the caller. Many voice recognition systems therefore include an echo cancellation facility in order to facilitate barge-in. Effectively, such echo cancellation involves subtracting from the incoming signal a suitably transformed version of the outgoing signal, the intention being that the subtracted transformed version of the outgoing signal negates any echo that might be received.
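The subtraction described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the echo is a single delayed and attenuated copy of the prompt, and that the delay and gain are already known; practical echo cancellers typically estimate the echo path adaptively. The function name `cancel_echo` and its parameters are hypothetical and are not taken from any of the systems discussed here.

```python
def cancel_echo(incoming, outgoing, delay, gain):
    """Subtract a delayed, attenuated copy of the outgoing prompt
    from the incoming signal (hypothetical single-tap echo model)."""
    cleaned = []
    for n, sample in enumerate(incoming):
        k = n - delay
        # Estimate of the echo present at sample n; zero outside the prompt.
        echo_estimate = gain * outgoing[k] if 0 <= k < len(outgoing) else 0.0
        cleaned.append(sample - echo_estimate)
    return cleaned


# Illustrative data: a short prompt, and caller speech that overlaps its echo.
prompt = [0.0, 1.0, -1.0, 0.5, 0.0, 0.0]
caller = [0.0, 0.0, 0.0, 0.2, 0.4, -0.2]
delay, gain = 2, 0.5  # assumed echo path: 2 samples late, half amplitude

# Signal as received from the network: caller speech plus echoed prompt.
line_in = [
    caller[n] + (gain * prompt[n - delay] if n >= delay else 0.0)
    for n in range(len(caller))
]

cleaned = cancel_echo(line_in, prompt, delay, gain)
```

With the assumed echo path matching the true one, `cleaned` recovers the caller samples to within floating-point rounding, which is why the canceller needs access to the outgoing prompt as well as the incoming signal.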
The following documents are illustrative of the current art in the voice processing industry.
WO96/25733 (BT) describes a voice response system which includes a prompt unit, a Voice Activity Detector (VAD), and a voice recognition unit. In this system, as a prompt is played to the caller, any input from the caller is passed to the VAD, together with the output from the prompt unit. This allows the VAD to perform echo cancellation on the incoming signal. Then, in response to the detection of voice by the VAD, the prompt is discontinued, and the caller input is switched to the voice recognition unit, thereby providing a barge-in facility.
U.S. Pat. No. 5,459,781 describes a voice processing system with a line interface unit, an echo cancellation unit (adjacent to the line interface unit), a voice activity detector (VAD), a prompt unit, a DTMF detector, and a recorder. In this system, both incoming and outgoing signals pass through the echo cancellation unit, where echo cancellation is performed. This system addresses the problem that caller voice input to the recorder may accidentally be mistaken for DTMF input. Therefore, if the VAD detects incoming speech to be recorded, the DTMF detection is switched off (since it is unlikely that the caller would make a genuine DTMF input at this time). It is also suggested that VAD could be used to avoid recording silence, thereby conserving resources.
U.S. Pat. No. 5,155,760 discloses a voice mail system including a voice recorder, a circular buffer, a voice activity detector (VAD), a prompt unit, a line interface unit, and an echo cancellation unit adjacent to the line interface unit. Caller input and prompt output are passed to the echo cancellation unit to allow echo cancellation to be performed. In operation, a prompt is played to the caller. Caller input is then routed to the VAD, and also to the circular buffer. In response to the VAD detecting voice, the caller input is fed to the voice recorder, along with the buffer contents. This ensures that the first part of the caller input (which triggered the VAD) is also properly recorded.
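The buffering arrangement just described can be sketched as follows. This is a simplified illustration rather than the implementation of the cited patent: audio is represented as a sequence of frames, the VAD is a caller-supplied predicate, and the names `record_with_preroll`, `is_voice`, and `preroll_frames` are hypothetical.

```python
from collections import deque


def record_with_preroll(frames, is_voice, preroll_frames):
    """Return the frames that would reach the recorder: the buffered
    frames preceding voice detection, then everything from detection on."""
    buffer = deque(maxlen=preroll_frames)  # circular buffer: oldest frames drop out
    recording = []
    triggered = False
    for frame in frames:
        if triggered:
            recording.append(frame)
        elif is_voice(frame):
            triggered = True
            recording.extend(buffer)   # flush the buffered lead-in first
            recording.append(frame)
        else:
            buffer.append(frame)
    return recording


# Frames are stood in for by energy values; a threshold of 5 is an assumed VAD rule.
frames = [0, 0, 1, 2, 9, 8, 7]
recorded = record_with_preroll(frames, lambda f: f >= 5, preroll_frames=2)
# recorded == [1, 2, 9, 8, 7]
```

The two low-energy frames immediately before detection are retained, so the onset of the utterance is not lost even though the VAD fires only once the signal is well established.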
U.S. Pat. No. 5,475,791 describes a voice processing system including a prompt unit, a buffer, and a digital signal processor (DSP) unit for performing echo cancellation, voice activity detection (VAD), and speech recognition. In this system, a prompt is played to the caller, and both the caller input and prompt are fed to the DSP unit to perform echo cancellation. The echo cancelled signal is fed (i) to the buffer, and (ii) to a VAD algorithm. On detecting voice, the outgoing prompt is terminated, and the DSP switches from echo cancellation mode to speech recognition mode, whereupon speech recognition is then performed on the caller input, including that stored in the buffer.
One of the drawbacks with the approach described in U.S. Pat. No. 5,475,791 is that the DSP is required throughout the time that barge-in is enabled, firstly to perform echo cancellation, and then to perform voice recognition. However, DSP resource, particularly for voice recognition, is expensive, and this can prove prohibitive if the voice processing system is to support many lines of telephony input simultaneously.
Many voice processing systems include a special DSP card for running voice recognition software. This card is connected to the line interface unit for the transfer of telephony data by a time division multiplex (TDM) bus. Most commercial voice processing systems, more particularly their line interface units and DSP cards, conform to one of two standard architectures: either the Signal Computing System Architecture (SCSA), or the Multi-vendor Integration Protocol (MVIP).
A somewhat different configuration is described in GB 2280820, in which a voice processing system is connected via a local area network to a remote server, which provides a voice recognition facility. This approach is somewhat more complex than the TDM approach, given the data communication and management required, but does offer significantly increased flexibility. For example, there is no need to match a DSP card with a line interface unit conforming to the same bus architecture, and also a single server can support multiple voice processing systems (or vice versa). However, the existence of two such different configurations can cause problems for the user, who generally has to tailor their application for one specific configuration, constraining the generality of such application.
Note also that if the voice recognition system described in U.S. Pat. No. 5,475,791 is provided as part of the server system in the arrangement of GB 2280820, then the need to transmit the prompt output to the DSP to perform the echo cancellation becomes particularly troublesome. Transmitting the prompt output to the remote system in addition to the caller input will tend to double the bandwidth required between the voice recognition server and the voice processing system, increasing overall system costs.
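The doubling can be made concrete with a small calculation. The figures used are assumptions for illustration only: 64 kbit/s per audio stream (as for uncompressed 8 kHz, 8-bit telephony audio) and 120 simultaneous lines. Neither figure, nor the helper name `link_bandwidth_kbps`, comes from the documents cited, and any return traffic carrying recognition results is ignored.

```python
def link_bandwidth_kbps(lines, channel_kbps=64, forward_prompt=False):
    """Bandwidth from the voice processing system to the recognition server.

    Each line always sends its caller audio; if echo cancellation is done
    remotely, the prompt audio must be sent as a second stream per line.
    """
    streams_per_line = 2 if forward_prompt else 1
    return lines * streams_per_line * channel_kbps


caller_only = link_bandwidth_kbps(120)                       # 7680 kbit/s
with_prompt = link_bandwidth_kbps(120, forward_prompt=True)  # 15360 kbit/s
```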