The present invention relates to voice processing systems, and more particularly to voice processing systems which utilise particular processing resources, such as voice recognition systems.
Voice processing systems whereby callers interact over the telephone network with computerised equipment are very well-known in the art, and include voice mail systems, voice response units, and so on. Typically such systems ask a caller (or called party) questions using prerecorded prompts, and the caller inputs answers by pressing dual tone multiple frequency (DTMF) keys on their telephone. This approach has proved effective for simple interactions, but is clearly restricted in scope due to the limited number of available keys on a telephone. For example, alphabetical input is particularly difficult using DTMF keys.
There has therefore been an increasing tendency in recent years for voice processing systems to use voice recognition in order to augment DTMF input. The adoption of voice recognition permits the handling of callers who do not have a DTMF phone, and also the acquisition of more complex information beyond simple numerals from the caller.
One particular concern with voice processing systems is to allow a caller to interrupt a prompt before it has finished (for example, if they are familiar with the system from regular use, such as might be the case with a voice mail system, and therefore know in advance their desired option). Most voice processing systems already allow a caller to interrupt a prompt by pressing an appropriate DTMF key. This is achieved by listening for incoming DTMF signals at the same time as an outgoing prompt is being played.
It is desirable to allow the caller to perform a similar interruption by speaking, rather than pressing a DTMF key. The caller input is processed by the voice recogniser, and the system then performs the requested action. The ability to accept such interruptions during an outgoing prompt is known as barge-in or cut-through.
One difficulty with the support of barge-in is that an outgoing prompt may be partially echoed back from the telephone network, and then accidentally mistaken for voice input from the caller, or else distort actual voice input from the caller. Many voice recognition systems therefore include an echo cancellation facility in order to facilitate barge-in. Effectively, such echo cancellation involves subtracting from the incoming signal a suitably transformed version of the outgoing signal, the intention being that the subtracted transformed version of the outgoing signal negates any echo that might be received.
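The subtraction described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the echo path is a single delayed and attenuated copy of the prompt with a known delay and gain; practical echo cancellers estimate the echo path adaptively. The function name and sample values are hypothetical.

```python
def cancel_echo(incoming, outgoing, delay, gain):
    """Subtract a delayed, attenuated copy of the outgoing prompt from the incoming signal.

    Assumes len(outgoing) >= len(incoming) and a single-tap echo path (illustrative only).
    """
    cleaned = []
    for n, sample in enumerate(incoming):
        # Estimated echo: the prompt sample sent `delay` samples earlier, scaled by `gain`.
        echo_estimate = gain * outgoing[n - delay] if n >= delay else 0.0
        cleaned.append(sample - echo_estimate)
    return cleaned


prompt = [0.0, 1.0, -1.0, 0.5, 0.0, 0.0]
caller = [0.0, 0.0, 0.0, 0.0, 0.2, 0.2]
# The line returns the caller's speech plus an echo of the prompt (delay 2 samples, gain 0.5).
received = [c + (0.5 * prompt[n - 2] if n >= 2 else 0.0) for n, c in enumerate(caller)]
cleaned = cancel_echo(received, prompt, delay=2, gain=0.5)
```

When the assumed delay and gain match the actual echo path, the cleaned signal equals the caller's own input to within rounding error.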
The following documents are illustrative of the current art in the voice processing industry.
WO96/25733 (BT) describes a voice response system which includes a prompt unit, a Voice Activity Detector (VAD), and a voice recognition unit. In this system, as a prompt is played to the caller, any input from the caller is passed to the VAD, together with the output from the prompt unit. This allows the VAD to perform echo cancellation on the incoming signal. Then, in response to the detection of voice by the VAD, the prompt is discontinued, and the caller input is switched to the voice recognition unit, thereby providing a barge-in facility.
U.S. Pat. No. 5,459,781 describes a voice processing system with a line interface unit, an echo cancellation unit (adjacent to the line interface unit), a voice activity detector (VAD), a prompt unit, a DTMF detector, and a recorder. In this system, both incoming and outgoing signals pass through the echo cancellation unit, where echo cancellation is performed. This system addresses the problem that caller voice input to the recorder may accidentally be mistaken for DTMF input. Therefore, if the VAD detects incoming speech to be recorded, the DTMF detection is switched off (since it is unlikely that the caller would make a genuine DTMF input at this time). It is also suggested that VAD could be used to avoid recording silence, thereby conserving resources.
U.S. Pat. No. 5,155,760 discloses a voice mail system including a voice recorder, a circular buffer, a voice activity detector (VAD), a prompt unit, a line interface unit, and an echo cancellation unit adjacent to the line interface unit. Caller input and prompt output are passed to the echo cancellation unit to allow echo cancellation to be performed. In operation, a prompt is played to the caller. Caller input is then routed to the VAD, and also to the circular buffer. In response to the VAD detecting voice, the caller input is fed to the voice recorder, along with the buffer contents. This ensures that the first part of the caller input (which triggered the VAD) is also properly recorded.
U.S. Pat. No. 5,475,791 describes a voice processing system including a prompt unit, a buffer, and a digital signal processor (DSP) unit for performing echo cancellation, voice activity detection (VAD), and speech recognition. In this system, a prompt is played to the caller, and both the caller input and prompt are fed to the DSP unit to perform echo cancellation. The echo cancelled signal is fed (i) to the buffer, and (ii) to a VAD algorithm. On detecting voice, the outgoing prompt is terminated, and the DSP switches from echo cancellation mode to speech recognition mode, whereupon speech recognition is then performed on the caller input, including that stored in the buffer.
One of the drawbacks with the approach described in U.S. Pat. No. 5,475,791 is that the DSP is required throughout the time that barge-in is enabled, firstly to perform echo cancellation, and then to perform voice recognition. However, DSP resource, particularly for voice recognition, is expensive, and this can prove prohibitive if the voice processing system is to support many lines of telephony input simultaneously.
Many voice processing systems include a special DSP card for running voice recognition software. This card is connected to the line interface unit for the transfer of telephony data by a time division multiplex (TDM) bus. Most commercial voice processing systems, more particularly their line interface units and DSP cards, conform to one of two standard architectures: either the Signal Computing System Architecture (SCSA), or the Multi-vendor Integration Protocol (MVIP).
A somewhat different configuration is described in GB 2280820, in which a voice processing system is connected via a local area network to a remote server, which provides a voice recognition facility. This approach is somewhat more complex than the TDM approach, given the data communication and management required, but does offer significantly increased flexibility. For example, there is no need to match a DSP card with a line interface unit conforming to the same bus architecture, and also a single server can support multiple voice processing systems (or vice versa). However, the existence of two such different configurations can cause problems for the user, who generally has to tailor their application for one specific configuration, constraining the generality of such application.
Note also that if the voice recognition system described in U.S. Pat. No. 5,475,791 is provided as part of the server system in the arrangement of GB 2280820, then the need to transmit the prompt output to the DSP to perform the echo cancellation becomes particularly troublesome. Thus having to transmit the prompt output to the remote system for echo cancellation will tend to double the bandwidth required between the voice recognition server and the voice processing system, increasing overall system costs.
Accordingly the present invention provides a voice processing system having multiple telephony channels for making and receiving telephony calls including:
means for communicating with a server including a voice recognition system for processing said telephony calls, said server being remote from said voice processing system;
means for playing a prompt to a caller over a telephone channel;
a voice activity detector for detecting caller input on said telephone channel;
means responsive to said detection of voice activity for initiating transmission of the caller input to said remote voice recognition system;
means for performing echo cancellation for said telephone call; and
a line interface unit which incorporates both said voice activity detector and said echo cancellation means, whereby the caller input is processed by said echo cancellation means prior to being processed by said voice activity detector and prior to the initiation of transmission of the caller input to said remote voice recognition system.
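The ordering of operations in the line interface unit set out above (echo cancellation, then voice activity detection, then the start of transmission) might be sketched as follows. The energy-threshold detector, the frame layout, and all names are assumptions made for illustration; the text does not specify how the voice activity detector operates.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def frames_to_transmit(incoming_frames, echo_frames, threshold):
    """Return the echo-cancelled frames that would be sent to the remote recogniser."""
    sent = []
    transmitting = False
    for incoming, echo in zip(incoming_frames, echo_frames):
        # Step 1: echo cancellation (estimated echo supplied per frame in this sketch).
        cleaned = [x - e for x, e in zip(incoming, echo)]
        # Step 2: voice activity detection on the echo-cancelled frame.
        if not transmitting and frame_energy(cleaned) > threshold:
            transmitting = True
        # Step 3: transmission begins only once voice activity has been detected.
        if transmitting:
            sent.append(cleaned)
    return sent


echo = [[0.5, -0.5], [0.5, -0.5], [0.0, 0.0]]
incoming = [[0.5, -0.5], [0.8, -0.2], [0.3, 0.3]]  # first frame is pure echo
sent = frames_to_transmit(incoming, echo, threshold=0.01)
```

In this example the first frame contains only echo. Its raw energy exceeds the threshold, so a detector placed ahead of the echo canceller would have triggered falsely; with the ordering shown, transmission starts at the second frame.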
The above approach allows the flexibility of a remote voice server to be used in a highly efficient manner, avoiding unnecessary transmission of non-voice data (typically silence) over the network, and unnecessary processing by the voice recognition system itself.
The preferred embodiment further includes means for requesting the voice recognition system to allocate a voice recognition channel for a telephone call. Such a request can be made either prior to playing said prompt in said telephone call, or else responsive to said detection of signal energy in said telephone call. The former approach ensures that voice recognition resources once allocated will be available for a call, but does not maximise usage of the recognition resources, in that there will be some recognition channels allocated but not actually receiving incoming voice data. By contrast, the latter approach offers potentially greater efficiency, since recognition channels are only allocated when needed, i.e., when voice activity is actually detected. However, this approach suffers firstly from possible delays in the allocation of recognition resource after voice energy has been detected, and secondly from problems if no recognition resource is available when requested.
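The trade-off between the two allocation strategies can be illustrated with a toy model. The pool class, the call numbering, and the figures below are hypothetical, and the model deliberately ignores the release of channels and allocation delay.

```python
class RecognitionPool:
    """Hypothetical pool of recognition channels on the voice recognition system."""

    def __init__(self, channels):
        self.free = channels

    def allocate(self):
        """Grant a channel if one is free; report refusal otherwise."""
        if self.free == 0:
            return False
        self.free -= 1
        return True


def simulate(num_calls, speaking_calls, pool_size, allocate_before_prompt):
    """Count channel requests granted and refused under one allocation strategy."""
    pool = RecognitionPool(pool_size)
    granted, refused = 0, 0
    for call in range(num_calls):
        # Early strategy: every call requests a channel before its prompt is played.
        # Late strategy: only calls on which voice activity is detected request one.
        if allocate_before_prompt or call in speaking_calls:
            if pool.allocate():
                granted += 1
            else:
                refused += 1
    return granted, refused


# Four calls, two recognition channels, and only call 3 actually speaks.
early = simulate(4, {3}, 2, allocate_before_prompt=True)
late = simulate(4, {3}, 2, allocate_before_prompt=False)
```

Under these assumed figures, early allocation ties up both channels on silent calls and refuses the one call that speaks, whereas late allocation uses a single channel. Late allocation would still encounter refusals if more callers spoke at once than there were channels.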
The preferred embodiment also includes an echo cancellation facility comprising: means for receiving said prompt being played out; means for processing said prompt to generate an estimated echo signal; and means for subtracting said estimated echo signal from the caller input on said telephone channel. The use of echo cancellation reduces the risk that an echo of an outgoing prompt might be interpreted as caller input, and thus considered as barge-in. It is further preferred that both the voice activity detector and the echo cancellation means are included in the line interface unit of the voice processing system. Caller input is then processed by said echo cancellation means prior to being processed by said voice activity detector (to avoid the risk of accidental echo-induced barge-in). Note that performing the echo cancellation in the line interface unit (as opposed to in the voice recognition system for example) is most convenient, since firstly the echo is removed at the earliest possible point in the voice processing system, and secondly because the line interface unit already receives the outgoing signal from which the echo is calculated; in other words, there is no need to specially route a copy of the outgoing signal to a separate echo cancellation unit, in addition to that being passed to the line interface unit for transmission out over the telephone network.
The preferred embodiment further comprises means for buffering the caller input, wherein responsive to a detection of voice activity, the buffer contents are transmitted to said voice recognition system. Thus caller input for the time-lag between the onset of voice activity and the triggering of the voice activity detector can be preserved. This ensures that the start of the caller's speech is not clipped, and therefore improves recognition accuracy.
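A minimal sketch of such buffering, assuming a fixed-length circular buffer of recent frames and a caller-supplied detection function (both hypothetical), is:

```python
from collections import deque


def capture_with_preroll(frames, is_voice, preroll_frames):
    """Forward the buffered frames on detection, then all subsequent frames."""
    buffer = deque(maxlen=preroll_frames)  # oldest frames are discarded automatically
    forwarded = []
    triggered = False
    for frame in frames:
        if triggered:
            forwarded.append(frame)
            continue
        buffer.append(frame)
        if is_voice(frame):
            triggered = True
            # The buffer holds the triggering frame and the frames just before it.
            forwarded.extend(buffer)
    return forwarded


frames = ["silence", "silence", "onset1", "onset2", "LOUD", "more"]
forwarded = capture_with_preroll(frames, lambda f: f == "LOUD", preroll_frames=3)
```

Here the detector only triggers on the frame labelled "LOUD", yet the two quieter onset frames that preceded it are forwarded as well, so the start of the utterance is retained.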
The invention also provides a method of providing barge-in support on N telephony channels in a voice processing system including a line interface unit for connecting to N or more channels, using a voice recognition system capable of performing voice recognition on up to M channels simultaneously, where N is greater than M and M is greater than 1, the line interface unit also including a voice buffer and voice activity detection means, the method comprising for each of said N telephony channels:
transmitting an outgoing telephony signal through the line interface unit;
buffering the incoming telephony signal in the voice buffer;
detecting voice activity in the incoming telephony signal;
responsive to such a detection, initiating the forwarding of the incoming telephony signal plus buffered portion to the voice recognition system; and
performing echo cancellation in said line interface unit, said echo cancellation being initialised at the start of a telephone call.
This approach seeks to maximise recogniser efficiency by only consuming recogniser resource after voice activity has been detected. Otherwise, the usage of such voice recognition resources for barge-in is very wasteful, since they will be employed throughout the play-out of the prompt (i.e. for all the time that barge-in is enabled), irrespective of when or whether the caller provides any input to be recognised.
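The saving can be illustrated by comparing the peak number of recognition channels needed when a recogniser is held for the whole barge-in window against the peak when it is held only while the caller is speaking. The timings below are invented for illustration, and the helper function is hypothetical.

```python
def peak_concurrency(intervals):
    """Largest number of (start, end) intervals that are active at the same moment."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal times the -1 event sorts first, so back-to-back use does not count as overlap.
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# N = 4 calls; times in seconds (illustrative values).
barge_in_windows = [(0, 10), (1, 11), (2, 12), (3, 13)]  # prompt play-out per call
speech_spans = [(4, 6), (5, 7), (8, 9), (12, 13)]        # detected voice activity per call

channels_without_gating = peak_concurrency(barge_in_windows)
channels_with_gating = peak_concurrency(speech_spans)
```

With these assumed timings, four recognition channels would be needed without gating, but M = 2 suffices for N = 4 calls once recogniser use is triggered by voice activity.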
In a preferred embodiment, the voice recognition system is provided on a remote server, and the step of forwarding the incoming telephony signal plus buffered portion to the voice recognition system comprises transmission over a local area network to the server. In these circumstances triggering data transmission upon the detection of voice activity not only improves the efficiency of recogniser usage, but also avoids unnecessary traffic on the network if there is no caller input of interest. It is also preferred that the server opens a virtual session to receive the incoming telephony signal, and subsequently allocates said virtual session to one of said M voice recognition channels. The use of virtual sessions gives the server the flexibility to maximise the utilisation of its recognition channels.
It is also preferred that the method further comprises the step of performing echo cancellation in said line interface unit, said echo cancellation being initialised at the start of a telephone call. This is to be contrasted with many prior art systems, where the echo cancellation is performed by the recognition resource, and so no initialisation can be performed until the recognition resource is specifically requested and allocated. Note that the prior art approach is particularly wasteful where a single call can result in multiple accesses to the voice recognition resource system, leading to corresponding multiple initialisations of the echo cancellation facility.
The invention also provides a voice processing system having multiple telephony channels for handling telephony calls including:
a voice recognition resource;
means for playing a prompt to a caller over a telephone channel;
means for receiving caller input from said telephone channel;
means for storing said caller input in a buffer;
and means for forwarding said caller input from said buffer into the voice recognition resource at faster than real-time.
The use of the buffer allows incoming telephony data to be accumulated, and then sent at faster than real-time into the recogniser. This allows a recogniser channel to be operated at maximum capacity, rather than clocked by the real-time arrival rate of data. This in turn reduces the number of channels required on the recogniser, and so can reduce costs.
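As a rough illustration of the channel saving, the following sketch estimates the number of recogniser channels from an assumed number of simultaneous speakers and an assumed processing speed relative to real time. Both figures are hypothetical, and the estimate ignores queuing delay and uneven arrival of speech.

```python
import math


def recognition_channels_needed(concurrent_speakers, speedup):
    """Channels needed when each channel consumes buffered audio at `speedup` x real time."""
    return math.ceil(concurrent_speakers / speedup)


# Six callers speaking at once (assumed load).
real_time = recognition_channels_needed(6, 1.0)  # recogniser clocked by arrival rate
buffered = recognition_channels_needed(6, 3.0)   # recogniser fed at 3x real time
```

On these assumptions, a recogniser clocked by the arrival rate needs one channel per speaker, while one fed from the buffer at three times real time needs a third as many.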
In the preferred embodiment, the voice recognition resource is provided on a server machine, with the buffer also being provided on said server machine, and being directly associated with the recognition resource. This allows buffer operation to be aligned to the processing capabilities of this particular recognition resource, for example, in terms of maximum data handling rate.
In one preferred arrangement, there are a plurality of virtual session handlers, each capable of processing caller input from one telephony channel, and each including a buffer for storing caller input from this telephony channel. The buffer control system then associates a virtual session with a real channel in said voice recognition system as appropriate, for example, as and when a free recognition channel becomes available.
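One way such an arrangement might be organised is sketched below. The class, its method names, and the first-come-first-served binding policy are assumptions for illustration only; the text does not prescribe a particular policy.

```python
from collections import deque


class RecognitionServer:
    """Hypothetical server with more virtual sessions than real recognition channels."""

    def __init__(self, real_channels):
        self.free_channels = list(range(real_channels))
        self.waiting = deque()  # virtual sessions not yet bound to a real channel
        self.bound = {}         # session id -> real channel number
        self.buffers = {}       # session id -> buffered caller input

    def open_session(self, session_id):
        """Open a virtual session; it is bound at once if a real channel is free."""
        self.buffers[session_id] = []
        self.waiting.append(session_id)
        self._bind_waiting()

    def receive(self, session_id, frame):
        """Buffer caller input whether or not the session has a real channel yet."""
        self.buffers[session_id].append(frame)

    def finish(self, session_id):
        """Recognition is complete: release the real channel for a waiting session."""
        channel = self.bound.pop(session_id)
        del self.buffers[session_id]
        self.free_channels.append(channel)
        self._bind_waiting()

    def _bind_waiting(self):
        # Associate waiting virtual sessions with real channels as they become free.
        while self.waiting and self.free_channels:
            self.bound[self.waiting.popleft()] = self.free_channels.pop(0)


server = RecognitionServer(real_channels=1)
server.open_session("a")
server.open_session("b")      # no free channel: "b" waits, but its input is buffered
server.receive("b", "frame1")
before = dict(server.bound)
server.finish("a")            # channel 0 is released and handed to "b"
after = dict(server.bound)
```

Session "b" loses no caller input while waiting, because its buffer accumulates frames until a real channel becomes available.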
Note that in the above context, references to callers should be interpreted as simply meaning the person to whom the voice processing system is connected over the telephone network. In other words, the call itself need not necessarily have been initiated by the caller him/herself, but perhaps by the voice processing system itself, or even by some third party (such as an agent at the call centre where the voice processing system might be located).