1. Field of the Invention
The present invention relates to voice processing systems and the like, and in particular to the way in which such systems can interact with callers.
2. Description of the Related Art
Voice processing systems whereby callers interact over the telephone network with computerised equipment are very well-known in the art, and include voice mail systems, voice response units, and so on. Typically such systems ask a caller (or called party) questions using prerecorded prompts, and the caller inputs answers by pressing dual tone multi-frequency (DTMF) keys on their telephone. In this manner, the caller can navigate through a hierarchy of prompt menus, for example to retrieve desired information, or to be connected eventually to a particular telephone extension or customer department.
There has been an increasing tendency in recent years for voice processing systems to use speech recognition (also sometimes called voice recognition; the two terms are used interchangeably herein), in order to augment DTMF input. The adoption of speech recognition permits the handling of callers who do not have a DTMF phone, and also the acquisition of more complex information beyond simple numerals from the caller. Speech recognition in a telephony environment can be supported by a variety of hardware architectures. Many voice processing systems include a special DSP card for running speech recognition software (firmware or microcode), which is connected to a line interface unit for the transfer of telephony data via a time division multiplex (TDM) bus. Most commercial voice processing systems conform to one of two standard TDM bus architectures: either the Signal Computing System Architecture (SCSA), or the Multi-vendor Integration Protocol (MVIP). A somewhat different configuration is described in GB 2280820, in which a voice processing system is connected via a local area network to a remote server, which provides a voice recognition facility.
Voice processing systems such as interactive voice response systems (IVRs) run applications to play prerecorded prompts to callers. IVRs typically have a set of system provided audio segments for commonly used items, such as numbers, days of the week, and so on. Additional audio segments must then be recorded as required for any specific application. The prompts played to a caller for that application can then be formed from one or more system provided audio segments and/or one or more application specific audio segments, concatenated together as required.
One problem with this approach is that the voice used to record the application specific audio segments will generally sound different from the voice which was used to record the system provided audio segments. Therefore the output when a system provided audio segment is concatenated with an application specific audio segment will sound slightly incongruous. One way around this difficulty is to have the person who records the application specific audio segments re-record the system provided audio segments, so that all are spoken with the same voice. However, the extra time for these re-recordings represents additional expense for the application developer, and the possible duplication of recorded audio segments can increase system storage requirements. These problems are particularly acute where the IVR is running two or more applications, if it is decided to re-record the system prompts separately for each application.
A similar problem is related specifically to voice mail systems (also termed voice messaging systems), which are used to store messages from incoming calls when the intended recipient is absent or otherwise engaged. The intended recipient (often referred to as a subscriber to the voice mail system) can then listen to their stored messages at some future time. A voice mail system is generally implemented either on special purpose computer hardware, or else on a standard computer workstation equipped with a suitable telephony interface. This system is then attached to (or incorporated into) the telephone network, typically via a switch or PBX. Such voice mail systems are well-known; one example is the DirectTalkMail system, available from IBM Corporation (now marketed as the IBM Message Center). Other examples of voice mail systems are described in U.S. Pat. No. 4,811,381 and EPA 0588576.
An important feature of many voice mail systems is their ability to provide callers with a personalized greeting for the intended recipient, for example: "The party you have called, JOHN SMITH, is unavailable at present. Please leave a message after the tone, or hit the zero key for further assistance". This greeting actually comprises three (or more) audio segments which the system automatically concatenates together for audio output:
(1) "The party you have called"
(2) "JOHN SMITH"
(3) "is unavailable at present. Please leave a message after the tone, or hit the zero key for further assistance".
In this case the first and last segments may be standard audio segments provided by the voice mail system. By contrast, the middle segment (sometimes referred to as the "audio name") is a separate audio segment which has to be specifically recorded by the subscriber. This is because it is very difficult to generate a spoken name automatically, for example with a text to speech system, owing to the very wide range of names (many with unusual spellings), and also the variety of pronunciations used by different people even when they have the same name.
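The three-segment greeting above is built by simple concatenation of audio buffers. The following is a minimal sketch of that assembly step; the segment names and bracketed byte placeholders are purely hypothetical (a real voice mail system would join PCM or compressed audio data, not text).

```python
# Hypothetical sketch of greeting assembly: system segments (1) and (3)
# surround the subscriber-recorded audio name (2). Byte strings stand in
# for real audio buffers.

SYSTEM_SEGMENTS = {
    "unavailable_prefix": b"[The party you have called]",
    "unavailable_suffix": b"[is unavailable at present. Please leave a message"
                          b" after the tone, or hit the zero key for further"
                          b" assistance]",
}

def build_greeting(audio_name: bytes) -> bytes:
    """Concatenate segments (1), (2) and (3) into one playable greeting."""
    return b"".join([
        SYSTEM_SEGMENTS["unavailable_prefix"],
        audio_name,  # segment (2), recorded by the subscriber
        SYSTEM_SEGMENTS["unavailable_suffix"],
    ])
```

Because the audio name is the only subscriber-recorded piece, the same recording can be reused across several system greetings; it is precisely the splice between the two voices at these join points that gives rise to the problem discussed next.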
The use of such personalized greetings is further beneficial in voice mail systems, because hearing the name, and indeed the recorded voice, of the subscriber reassures the caller that they have reached the correct mailbox. Nevertheless, the overall output can sound somewhat awkward, in that the system provided audio segments (ie segments (1) and (3) above) may be spoken in a very different voice from that of the subscriber, and so sound cumbersome when concatenated with the subscriber's audio name.
One way to try to overcome this problem is to have the subscriber record the entire greeting, in other words, to record all three segments above (possibly as one long segment). Although this removes any disparity in sound between the different parts of the greeting, it is still not entirely satisfactory. For example, not all subscribers may be prepared for the additional effort required to produce the longer recording. This is particularly the case where the system may provide different greetings for different situations (eg one for general unavailability, one for when the subscriber has left the office for the night, etc), and where the system can normally re-use the same audio name recording for the different greetings. Therefore, if it is desired to have a greeting spoken in its entirety by a subscriber, then the subscribers may now be faced with having to record multiple greetings, rather than just a single audio name. Furthermore, even those subscribers who are prepared to record whole greetings may produce a greeting that is mumbled and difficult to understand, hesitant, lacking in information, or has some other defect compared to the standard system audio segments. This in turn can reflect badly on the professionalism of the subscriber's organization.
Accordingly, the invention provides a voice processing system for connection to a telephone network and running at least one application for controlling interaction with calls over the telephone network, said system comprising:
means for providing at least one audio segment recorded by a first speaker for use by said at least one application;
means for providing at least one vocal parameter characteristic of a second speaker;
means for applying said at least one vocal parameter to said audio segment to produce a modified audio segment such that said modified audio segment sounds substantially as if spoken by said second speaker; and
means for outputting the modified audio segment over the telephone network.
Thus the invention allows audio segments to be modified to sound as if spoken by a different person from the person who originally recorded the segment. This has many possible applications in a voice processing environment, for example where an audio output comprises two or more segments, including at least one recorded by the first speaker and at least one recorded by the second speaker. Each audio segment recorded by the first speaker can then be modified such that the audio output sounds substantially as if it were all spoken by said second speaker. This facility could be used (amongst other possibilities) in a voice messaging system, where an audio name recorded by a subscriber is embedded into one or more system prompts (ie normalization of the carrier segment or segments to the voice of the audio name).
A slightly different possibility is where an application uses multiple audio segments. Each of said multiple audio segments may be modified such that the audio output sounds substantially as if all segments were spoken by said second speaker. This is useful where, for example, it is desired to update an application with new prompts whilst still retaining some of the old prompts (or perhaps using some system provided prompts), and the person who recorded the old prompts or system prompts is no longer available to make the new recordings.
In the preferred embodiment, the system further comprises means for providing at least one vocal parameter characteristic of the first speaker. The modified audio segment can then be produced by altering at least one instantaneous vocal parameter of the segment, dependent on an average value of said vocal parameter for the second speaker relative to an average value of said vocal parameter for the first speaker.
It will be appreciated that the average vocal parameters of the first and second speakers can be determined in advance, perhaps on different machines, and made available for subsequent use. The average parameters for the first speaker may be derived directly from the audio segment to be modified, or from some other recording by that speaker. The instantaneous vocal parameters for the segment will then be typically determined and modified on the fly as part of the audio output process.
In the preferred embodiment the vocal parameters characteristic of the first and second speakers comprise, for each respective speaker, the average value of the frequency of at least two formants, a measure of the degree of distribution of the two formant frequencies about their respective average values, the variation of the formant bandwidth with formant frequency, and the fundamental frequency. It has generally been found to be most effective to use four formants for specifying the vocal parameters (a lower number can reduce speaker discrimination, whilst a higher number does relatively little to improve it). Note however that the principle of the invention is not restricted to the use of formants for vocal parameters, but could be adopted with any suitable mathematical characterization of the vocal tract. (NB as will be appreciated by the skilled person, formants represent peaks in the frequency response of the human vocal tract).
In accordance with the invention, an appropriate algorithm is provided for the modification of each of the above vocal parameters from a value characteristic of a first speaker to one characteristic of a second speaker. For example, an instantaneous fundamental frequency of an audio segment to be modified is altered from an original value to a new value, such that the displacement of the new value from the average value of said fundamental frequency as spoken by the second speaker is equal to the displacement of the original value from the average value of said fundamental frequency as spoken by the first speaker, scaled by the ratio of the respective average fundamental frequencies for said first and second speakers.
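The fundamental frequency rule just stated can be written out concretely. The sketch below uses hypothetical names and models only the average fundamental frequency of each speaker (a full profile would also carry the formant statistics of the preferred embodiment). The displacement-scaling rule, f0' - F0_2 = (f0 - F0_1) x (F0_2 / F0_1), simplifies algebraically to a pure rescaling f0' = f0 x F0_2 / F0_1.

```python
from dataclasses import dataclass

@dataclass
class SpeakerProfile:
    """Average vocal parameters for one speaker (hypothetical structure;
    a full profile would also include formant statistics)."""
    avg_f0: float  # average fundamental frequency, in Hz

def map_f0(f0: float, first: SpeakerProfile, second: SpeakerProfile) -> float:
    """Map an instantaneous fundamental frequency from the first speaker's
    voice onto the second speaker's: the displacement of the new value from
    the second speaker's average equals the original displacement from the
    first speaker's average, scaled by the ratio of the averages."""
    ratio = second.avg_f0 / first.avg_f0
    return second.avg_f0 + (f0 - first.avg_f0) * ratio
```

For example, with averages of 120 Hz and 180 Hz, an instantaneous 132 Hz maps to 198 Hz, and the first speaker's average maps exactly onto the second speaker's average, as required.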
In the preferred embodiment, the audio segments are stored in compressed form using linear predictive coding (LPC), a well-known form of voice signal encoding. The transformation of vocal characteristics can be applied directly to stored audio data (ie without any need for decompression/compression). Indeed it is particularly straightforward to derive the necessary vocal parameters and perform the desired modifications directly on LPC data. This transformation could therefore be performed by a preprocessor to a standard LPC decoder on-the-fly at decode time, or alternatively, it may be desired to separate the transformation from the decoding, perhaps updating a whole library of audio segments effectively in batch mode for output at some arbitrary time in the future.
Conceptually the reason why the invention is accommodated relatively easily using LPC is that this coding method effectively separates out an audio signal into an excitation source (ie the vocal cords) and a filter function (ie the vocal tract). The application of the vocal parameters to an audio segment recorded by a first speaker to produce a modified audio segment such that said modified audio segment sounds substantially as if spoken by a second speaker can then be performed by replacing the excitation source and filter function of the first speaker by the excitation source and filter function of the second speaker.
It will be appreciated of course that the same general approach (modeling and replacement of excitation source and filter function) can be employed even if LPC is not used (LPC is based upon an all-pole filter function, but other filter types may be used). The invention may also be employed with other coding schemes apart from LPC, for example those used for Internet telephony, where the voice processing system is connected to the Internet or other TCP/IP based network. Note that in such environments, the voice information is generally transmitted over the network in compressed format.
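The source-filter separation described above can be sketched directly. In the toy example below (plain Python, with hypothetical coefficient values), a frame is inverse-filtered with the first speaker's all-pole LPC polynomial A1(z) to recover the excitation residual, and that residual then drives the second speaker's synthesis filter 1/A2(z). With identical coefficients the round trip reproduces the input exactly, which is the sense in which the filter function is cleanly separable from the excitation.

```python
def inverse_filter(frame, a):
    """FIR analysis with A(z): e[n] = x[n] + sum over k>=1 of a[k]*x[n-k].
    `a` is the LPC polynomial [1.0, a1, ..., ap]."""
    e = []
    for n in range(len(frame)):
        acc = frame[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc += a[k] * frame[n - k]
        e.append(acc)
    return e

def synthesize(excitation, a):
    """All-pole synthesis with 1/A(z): x[n] = e[n] - sum over k>=1 of a[k]*x[n-k]."""
    x = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * x[n - k]
        x.append(acc)
    return x

def swap_vocal_tract(frame, a_first, a_second):
    """Replace the first speaker's filter function with the second speaker's,
    while reusing the frame's own excitation residual."""
    return synthesize(inverse_filter(frame, a_first), a_second)
```

In a real LPC-coded system the coefficients are already stored per frame, which is why the transformation can be applied to compressed data without a decompress/recompress cycle; excitation parameters such as pitch would be remapped in the same pass.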
The invention further provides a method for modifying audio output from a voice processing system running at least one application for controlling interaction with calls over a telephone network, said method comprising:
providing at least one audio segment recorded by a first speaker for use by said at least one application and at least one vocal parameter characteristic of a second speaker;
applying said at least one vocal parameter to said audio segment to produce a modified audio segment such that said modified audio segment sounds substantially as if spoken by said second speaker; and
outputting the modified audio segment over the telephone network.
The invention further provides apparatus for running an application which plays out audio segments, said apparatus comprising:
means for providing a set of audio segments recorded by a first speaker for use by said application;
means for providing at least one vocal parameter characteristic of a second speaker;
means for applying said at least one vocal parameter to said set of audio segments to produce a modified set of audio segments such that said modified set of audio segments sounds substantially as if spoken by said second speaker; and
means for outputting the set of modified audio segments.
The invention further provides a method for updating an application which plays out a set of audio segments recorded by a first speaker, comprising:
providing at least one vocal parameter characteristic of a second speaker;
applying said at least one vocal parameter to said set of audio segments to produce a modified set of audio segments such that said modified set of audio segments sounds substantially as if spoken by said second speaker; and
outputting the set of modified audio segments.