The present invention relates to speech or voice recognition systems, and more particularly to speech recognition systems for use in voice processing systems and the like.
Voice processing systems whereby callers interact over the telephone network with computerised equipment are very well-known in the art, and include voice mail systems, voice response units, and so on. Typically such systems ask a caller (or called party) questions using prerecorded prompts, and the caller inputs answers by pressing dual tone multi-frequency (DTMF) keys on their telephones. This approach has proved effective for simple interactions, but is clearly restricted in scope due to the limited number of available keys on a telephone. For example, alphabetical input is particularly difficult using DTMF keys.
There has therefore been an increasing tendency in recent years for voice processing systems to use voice recognition in order to augment DTMF input. The adoption of voice recognition permits the handling of callers who do not have a DTMF phone, and also the acquisition of more complex information beyond simple numerals from the caller.
As an illustration of the current art in the voice processing industry, WO96/25733 describes a voice response system which includes a prompt unit, a Voice Activity Detector (VAD), and a voice recognition unit. In this system, as a prompt is played to the caller, any input from the caller is passed to the VAD, together with the output from the prompt unit. This allows the VAD to perform echo cancellation on the incoming signal. Then, in response to the detection of voice by the VAD, the prompt is discontinued, and the caller input is switched to the recognition unit, thereby providing a barge-in facility.
Voice recognition in a telephony environment can be supported by a variety of hardware architectures. Many voice processing systems include a special DSP card for running voice recognition software. This card is connected to a line interface unit for the transfer of telephony data by a time division multiplex (TDM) bus. Most commercial voice processing systems, more particularly their line interface units and DSP cards, conform to one of two standard architectures: either the Signal Computing System Architecture (SCSA), or the Multi-Vendor Integration Protocol (MVIP). A somewhat different configuration is described in GB 2280820, in which a voice processing system is connected via a local area network to a remote server, which provides a voice recognition facility. This approach is somewhat more complex than the TDM approach, given the data communication and management required, but does offer significantly increased flexibility.
Speech recognition systems are generally used in telephony environments as cost-effective substitutes for human agents, and are adequate for performing simple, routine tasks. It is important that such tasks are performed accurately, otherwise there may be significant customer dissatisfaction, and also as quickly as possible, both to improve caller throughput, and also because the owner of the voice processing system is often paying for the call via some FreePhone mechanism (e.g., an 800 number). With continuing improvements in recognition accuracy, the current generation of speech recognition systems is starting to be used in more and more complex situations, which have hitherto been the exclusive realm of human operators. Nevertheless, even with their impressive ability to recognise speech, such systems are still deficient at providing as complete a service to the caller as a human agent could provide.
It is also known to use lie detection technology on an audio signal over the telephone, in particular the TrusterPro product from Trustech Innovative Technologies Ltd. of Israel. However, this represents a relatively specialised approach, which does not appear to be directly compatible with those techniques used for speech recognition (therefore making integration more difficult).
Accordingly, the present invention provides a method of performing speech recognition comprising the steps of:
receiving acoustic spoken input;
processing said acoustic input by performing speech recognition to determine (i) a text equivalent; and (ii) a manner in which said spoken input was rendered; and
performing a further operation, dependent on the manner in which said spoken input was rendered.
Thus the speech recognition system of the invention can discriminate between at least two of the different manners in which the same set of words is spoken, for example, confidently, hesitatingly, angrily, and so on. This manner information supplements the recognised text, to provide additional information about what is being communicated by the speaker. In contrast, prior art systems have provided only the plain text. Thus the present invention provides a more human-like capability for a recognition system, and therefore enhanced power in terms of comprehension and flexibility.
In the preferred embodiment, said processing step is performed using a Hidden Markov Model (HMM) as the basis for the recognition engine. The HMM can be trained to discriminate amongst a predetermined set of available manners, so that said processing step determines which manner from said predetermined set of available manners best corresponds to the manner in which said spoken input was rendered.
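By way of illustration, the discrimination amongst a predetermined set of manners can be sketched as a set of HMMs, one trained per manner for the same word, with the best-scoring model selected at recognition time. The following is a simplified sketch only: the tiny discrete-observation models and their parameter values are purely hypothetical, standing in for real acoustic models.

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    xs = list(xs)
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of an observation sequence under a discrete HMM,
    computed with the forward algorithm (pi: initial state probabilities,
    A: state transition matrix, B: emission probabilities B[state][symbol])."""
    n = len(pi)
    alpha = [math.log(pi[s]) + math.log(B[s][obs[0]]) for s in range(n)]
    for o in obs[1:]:
        alpha = [math.log(B[s][o]) +
                 logsumexp(alpha[t] + math.log(A[t][s]) for t in range(n))
                 for s in range(n)]
    return logsumexp(alpha)

# Hypothetical per-manner models for one word: one HMM trained on
# confident renderings, one on hesitant renderings (illustrative values).
manner_models = {
    "confident": ([0.9, 0.1],
                  [[0.8, 0.2], [0.2, 0.8]],
                  [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]),
    "hesitant":  ([0.5, 0.5],
                  [[0.5, 0.5], [0.5, 0.5]],
                  [[0.3, 0.4, 0.3], [0.3, 0.4, 0.3]]),
}

def classify_manner(obs):
    """Return the manner whose HMM best explains the observations."""
    return max(manner_models,
               key=lambda m: forward_log_likelihood(obs, *manner_models[m]))
```

In practice the observations would be acoustic feature vectors rather than discrete symbols, and the models would be trained as described later, but the principle of selecting the best-fitting manner model is the same.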
Preferably the spoken input comprises a single word, or possibly a short sequence of words. This considerably reduces the complexity of the overall recognition task, thereby rendering more practicable the discrimination according to manner, which is much more easily discerned in relation to a single word than to a whole phrase or sentence (training for a whole phrase would also be extremely difficult, given the wide variety of possible deliveries which would still correspond to the same basic emotion).
In the preferred embodiment, said processing step further comprises determining a confidence level associated with the recognition of the text equivalent. This confidence level indicates how well the caller input matches the recognition model. Note that this is not completely independent of manner, in that an utterance spoken in an unusual manner is likely to give rise to a low recognition confidence. Nevertheless, there is a clear difference between determination of confidence, which describes how well one model fits the caller input, and determination of manner, which effectively tries to determine which of multiple models best fits the caller input. (Note that it would be feasible to also provide a confidence estimate for the manner determination in addition to the text recognition).
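The distinction between the two determinations can be sketched numerically: confidence normalises the score of the best text hypothesis against its competitors, while manner determination selects amongst the manner models for that hypothesis. The log-likelihood values below are purely hypothetical stand-ins for engine output.

```python
import math

def softmax(log_scores):
    """Convert log-likelihoods into a normalised posterior distribution."""
    m = max(log_scores.values())
    exps = {k: math.exp(v - m) for k, v in log_scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

# Hypothetical log-likelihoods for one utterance: each text hypothesis
# scored against its vocabulary model, and the winning hypothesis scored
# against per-manner models (values are illustrative only).
text_log_scores = {"yes": -12.0, "no": -25.0, "help": -27.0}
manner_log_scores = {"confident": -12.0, "hesitant": -14.5, "angry": -16.0}

text_posterior = softmax(text_log_scores)
manner_posterior = softmax(manner_log_scores)

best_text = max(text_posterior, key=text_posterior.get)
confidence = text_posterior[best_text]    # how well one model fits
best_manner = max(manner_posterior, key=manner_posterior.get)  # which model fits best
```

The same normalisation applied to the manner scores would yield the optional confidence estimate for the manner determination mentioned above.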
The invention also provides a speech recognition system comprising:
means for receiving an acoustic spoken input;
means for processing said acoustic input by performing speech recognition to determine (i) a text equivalent; and (ii) a manner in which said spoken input was rendered; and
means for performing a further operation, dependent on the manner in which said spoken input was rendered.
In the preferred embodiment, such a speech recognition system is incorporated into a voice processing system which is connected to a telephone network, and said spoken input is received from a caller over the telephone network. In such an environment, the performing means comprises a voice processing application running on the voice processing system. This application can then move to a different part of a voice processing menu hierarchy, dependent on the manner in which said spoken input was rendered. For example, if a caller sounds confused or angry, the system may decide to transfer them to a human operator, rather than continuing through the application as normal. Likewise, if a caller sounds as if he or she is in a particular hurry, the system may decide to process their call with a higher priority than otherwise.
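The application-level branching described above can be sketched as a simple mapping from detected manner to the next action in the menu hierarchy. All of the manner labels and action names below are hypothetical, chosen only to illustrate the routing decision.

```python
def next_action(manner, default="continue_menu"):
    """Select the next step in a (hypothetical) voice processing menu
    hierarchy based on the manner in which the caller spoke."""
    if manner in ("angry", "confused"):
        # Hand the caller over to a human agent rather than persisting
        # with the automated dialogue.
        return "transfer_to_operator"
    if manner == "hurried":
        # Keep the caller in the automated flow, but at raised priority.
        return "continue_menu_high_priority"
    return default
```

A real application would of course carry richer state (call priority, menu position, retry counts), but the decision point is the same: the manner result, alongside the recognised text, drives the call flow.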
The invention further provides a method of training a speech recognition system including a Hidden Markov Model (HMM) comprising the steps of:
collecting samples of acoustic spoken input data of a particular text;
marking for each sample the manner in which the text was spoken; and
training the HMM to discriminate acoustic spoken input data according to the manner in which it is spoken.
A particular benefit of the invention is that it can be implemented with essentially existing recognition engines based on HMMs, provided that these are trained to provide the appropriate discrimination of manner. In order to obtain the necessary input data, one possibility is to employ an actor to read text according to a predetermined set of different manners (e.g., with confidence, in anger, and so on). For voice processing applications, however, more realistic results may be obtained by using real call data to train the system.
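The collecting and marking steps of the training method can be sketched as follows: each utterance of the target word is labelled with its manner, and the samples are then grouped so that one model per manner can be trained. The sample data and label names below are hypothetical; the per-group training itself would use standard HMM re-estimation (e.g. Baum-Welch), which is not reproduced here.

```python
from collections import defaultdict

# Hypothetical labelled corpus: each entry pairs the feature sequence
# for one utterance of a word with the manner in which it was spoken
# (marked by hand, whether from an actor's readings or real call data).
samples = [
    ([0, 0, 2], "confident"),
    ([0, 1, 2], "confident"),
    ([1, 1, 1], "hesitant"),
    ([1, 2, 1], "hesitant"),
]

def partition_by_manner(samples):
    """Group labelled samples so that one HMM can be trained per manner."""
    groups = defaultdict(list)
    for features, manner in samples:
        groups[manner].append(features)
    return dict(groups)

training_sets = partition_by_manner(samples)
# Each group would then be passed to a conventional HMM training
# procedure, yielding the per-manner models used at recognition time.
```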