1. Field of the Invention
The present invention relates to unsupervised training techniques for training natural language call routing systems to improve speech recognition, topic recognition and language models, and a system and method therefor.
2. Discussion of the Related Art
Companies that expect to receive large numbers of telephone calls from customers seeking to access information or perform transactions typically employ call centers to act as the initial interface with the callers. Call centers generally include hardware and software, such as private branch exchanges, routers, interactive voice response systems (IVRs), and other components, that direct callers to a person, software program or other source of information best suited to meet the needs of the caller.
To date, the vast majority of such systems have been touch-tone systems. Call centers using touch-tone technology typically employ a complex hierarchy of touch-tone menus to provide self-service in the IVR or to enable skills-based routing. Skills-based routing is a known technique that attempts to match the caller to an automated fulfillment application, or, if no appropriate automated fulfillment is available, to a customer service agent who has the skill set to handle the caller's needs. Ideally, such routing allows the call center to save on agent labor and training costs while improving customer service.
However, it has been found that as menu complexity increases, IVR usage decreases because callers become frustrated, and routing mistakes increase because of caller confusion. Such problems typically manifest themselves in the caller pressing “0” or “pounding out”, or doing nothing, as if the caller had a rotary dial telephone, in an effort to short-circuit the IVR system and connect to a live agent. The enterprise can benefit substantially by increasing usage of the IVR and decreasing routing mistakes, in turn decreasing the amount of time that agents must spend on each call. The less time agents spend on tasks that could be automated in the IVR, the more time they have to perform important revenue generating functions such as sales.
Analysis of how effective complex menu hierarchies are at routing callers has shown that, on average, only 60% of callers are routed to the correct destination, whether that is the correct agent or the correct self-service (i.e., automated fulfillment) application in the IVR. Callers misroute themselves because they do not understand the menu choices, mis-key their selection, or simply press “0” to get to an agent who must then determine the correct destination.
Natural language call routing systems have been developed with the goal of helping to solve these problems by cutting through the tangle of call flow options and letting callers state their purpose in their own words. As a comparison of FIGS. 1A and 1B will illustrate, what once took many layers of sub-menus with touch-tones can be handled by a single open-ended prompt using natural language call routing.
As shown in FIG. 1A, in a touch-tone, menu-driven system, a caller will encounter a main menu and touch tone/prompt 100. Typically, the menu prompts the caller to press a touch tone key to access, for example, billing information or other services. In response to the touch tone entry, the caller may encounter a second level of touch tone/prompts 102. Some of these second level prompts may result in the call being terminated, while others will lead, by further touch tone entries by the caller, to a third level of touch tone/prompts 104, and, in some cases, where automated fulfillment does not satisfy the customer's needs, to specialists, such as Specialist A 106, Specialist B 107 and Specialist C 108. Each specialist would typically be responsible for a particular customer service.
On the other hand, as is illustrated in FIG. 1B, in a natural language call routing system, a main menu/call router prompt 200 is provided that, in response to a voice input, routes the caller to one of agents 206, 207, and 208 or automated fulfillment 209. Natural language call routing substantially improves the overall rate of successful routes because it routes callers more accurately than they can route themselves using the touch-tone menus, and there is a higher rate of caller participation because the interface is more user-friendly than the touch-tone menu system.
In a conventional natural language call routing system, illustrated in FIG. 2, the caller is greeted with an open-ended prompt, such as “Please tell me, briefly, the reason for your call today.” Callers may then respond in their own words. Using statistical grammars and topic models 302, speech recognizer 304 transcribes the spoken response into a sequence of recognized words. The language understanding engine, or topic identifier, 306 then uses statistical topic-identification technology to determine the reason for the call from the sequence of recognized words and the trained topic models. Finally, the IVR router 308 transfers the call to an area of the IVR where the caller can self-serve, using automated services, or to a customer service agent. In cases where the system is not sufficiently confident of the topic, the caller will hear a directed re-prompt that lists the available options. This style of re-prompting guides the caller to an acceptable response and increases the overall routing accuracy.
The benefit of using a natural language call routing system is fundamentally dependent upon the routing accuracy. The best natural language call routing systems use statistical models both for speech recognition and language understanding. By using statistical n-gram grammars for speech recognition, it is possible to robustly model a huge set of possible customer queries. Unlike phrase-based grammars that work only when the customer utters one of a pre-defined set of sentences (i.e., the grammar), statistical n-gram grammars allow the recognition of customer utterances where the customer uses his or her own natural style to describe the reason for the call. This is achieved by assigning a continuous statistical score to every hypothesized sequence of words. N-gram models are well-known and are described, for example, in F. Jelinek, R. L. Mercer and S. Roukos, “Principles of Lexical Language Modeling for Speech Recognition”, in Readings in Speech Recognition, edited by A. Waibel and Kai-Fu Lee, pages 651–699, Morgan Kaufmann Publishers, 1990; H. Witten and T. C. Bell, “The Zero Frequency Estimation Of Probabilities of Novel Events in Adaptive Text Compression”, IEEE Transactions on Information Theory, volume 7, number 4, pages 1085–1094, 1991; and P. Placeway, R. Schwartz, P. Fung and L. Nguyen, “Estimation of Powerful LM from Small and Large Corpora”, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Volume 2, pages 33–36, 1993. Phrase-based grammars, on the other hand, are essentially bi-modal: they assign a high score to an utterance that conforms with the pre-defined grammar and a very low score to a non-conforming utterance.
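By way of illustration only, the contrast between a continuous n-gram score and a bi-modal phrase-grammar decision can be sketched as follows. The sketch assumes a bigram model with simple additive smoothing, which is a simplification chosen for brevity and is not the estimation method of the cited references; the function names `train_bigram` and `phrase_grammar_accepts` and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences, alpha=1.0):
    """Return a function giving the log-probability of a sentence.

    Assumption: additive (add-alpha) smoothing, used here only so that
    unseen word pairs still receive a small non-zero probability.
    """
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        vocab.update(tokens)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    v = len(vocab)

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigram_counts[(a, b)] + alpha) / (context_counts[a] + alpha * v))
            for a, b in zip(tokens[:-1], tokens[1:])
        )

    return log_prob


def phrase_grammar_accepts(sentence, phrases):
    """Bi-modal decision: the utterance is either in the grammar or it is not."""
    return sentence in phrases


phrases = ["i want to pay my bill", "my bill is wrong"]
log_prob = train_bigram(phrases)

# A phrasing not listed in the grammar is rejected outright by the phrase
# grammar but still receives a finite, graded score from the n-gram model.
print(phrase_grammar_accepts("pay my bill", phrases))
print(log_prob("my bill is wrong") > log_prob("wrong is bill my"))
```

In this sketch, a word order seen in training scores higher than a scrambled version of the same words, while the phrase grammar can only accept or reject.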
After speech recognition has been performed to determine the content of the customer's utterance, the second part of the natural language call routing solution is topic identification, which uses the statistical topic identification engine, or topic identifier. The topic identifier models the statistical relationship between the words used in the caller inquiries for specific topics.
Before deployment, and possibly at other times as well, the topic identifier undergoes a training process in which a matrix of word probabilities, i.e., the likelihood of occurrence of each word, for each topic is generated. The training process also calculates the prior probability of each topic, i.e., the probability that the next caller will ask about that topic. These two parameters are supplied as inputs to the topic identifier, which then determines the topic by applying statistical techniques such as Bayes' law to generate the likelihoods for each topic given the observed word sequence. Statistical speech recognition and topic-identification systems are well-known and are described, for example, in F. Jelinek, “Statistical Methods for Speech Recognition”, MIT Press, 1997, pp. 57–78, and John Golden, Owen Kimball, Man-Hung Siu, and Herbert Gish, “Automatic Topic Identification for Two-Level Call Routing,” Proc. ICASSP, vol. I, pp. 509–512, 1999.
An ordinary recognizer, without the language-understanding component, can be used alone to identify the topics for routing. However, such usage requires phrase-based grammars that explicitly list the various sentences that an “expert” believes the customer might reasonably say while describing the reason for their call. Optimal routing performance is obtained by adding a language-understanding engine, such as the topic identifier, that is trained on responses from actual customers, to capture all the variability of their speech. In terms of error rate, a natural language call routing system using statistical models for both speech recognition and topic identification outperforms a system using an ordinary phrase-based grammar recognizer by a factor of approximately 2 to 1.
Deployment of a natural language call routing system is a well-defined process that involves more than simply initiating operation of an off-the-shelf item. To ensure optimum performance, the routing system must be tuned to meet the particular needs of the call center in which it will be deployed. Deploying a natural language call routing system on a production IVR is a process in which configuration and installation generally proceed in four steps.
First, the topics that will form the basis for routing calls in the call center are identified and defined. These topics preferably should cover the range of reasons why customers call the center. Typically, each topic that will trigger routing to a specialist or self-service automation should receive at least 2% of caller traffic; any topic receiving less than this should be either combined with other similar topics into a larger, more general category, or categorized as a topic that should go to a general agent.
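The 2% traffic guideline described above can be sketched, purely as an illustration, in the following form. The function name `consolidate_topics`, the fallback label `general_agent` and the sample topic labels are hypothetical; the sketch routes every low-traffic topic to a general agent and does not attempt the alternative of merging similar topics into a broader category, which requires human judgment.

```python
from collections import Counter


def consolidate_topics(labels, min_share=0.02, fallback="general_agent"):
    """Map each observed topic to itself, or to a fallback topic when it
    receives less than min_share of the caller traffic."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        topic: (topic if n / total >= min_share else fallback)
        for topic, n in counts.items()
    }


# Hypothetical traffic sample: 100 calls across three candidate topics.
labels = ["billing"] * 60 + ["outage"] * 39 + ["fax_number"] * 1
print(consolidate_topics(labels))
# "fax_number" has a 1% share, so it maps to "general_agent".
```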
Second, responses to an open-ended routing prompt must be collected from real customers. One possible technique is to use what is known in the art as a “Wizard-of-Oz” system, in which real customer service agents listen to and route the calls, using pre-recorded prompts to guide the callers. This type of system gives callers a simulated experience of a natural language call router without any actual investment in speech recognition technology. Another technique is to use a prototype of the eventual speech recognition system in which the speech and language models are created based on anticipated caller responses. Until the models are retrained they will perform sub-optimally, but that performance may be sufficient for a trial in which the system is exposed to a limited number of customers simply to capture their responses. With either technique, the customers' responses are recorded for later analysis.
Third, the customer responses collected in the second step are transcribed and annotated with the correct routing topic. This is typically a manual process.
Fourth, using the annotated data, the speech recognition and the topic identification models are trained to correctly classify caller responses to the initial prompt. Training involves maximum likelihood estimation of both standard n-gram language models for speech recognition, such as are described in the Jelinek et al. reference, and statistical (e.g., Bayesian) models for topic identification such as are described in the Golden et al. reference. The output of this step forms the initial configuration for the natural language call router.
An optional fifth step is to train the acoustic (voice) models on data collected from the deployment call centers, for example to recognize regional accents. Most commercial speech recognizers are shipped with so-called “stock models” that deliver acceptable performance across a wide footprint, but experimental evidence clearly demonstrates the usefulness of re-training, or adapting, the models to data from the current domain or deployment site, which can lead to significant benefits in performance. An important benefit of the increase in accuracy from better acoustic models is that this improvement is independent of the improvement in the language models. As such, any newly correctly recognized words provide additional information for training better language models as well as better topic models.
With the completion of these four steps, or five steps with voice model training, the call routing system can be deployed on a production IVR platform. Aside from replacing the touch-tone routing menus with the natural language call routing, and speech-enabling sub-dialogues as needed (e.g., to collect user identification information), the call flows from the touch-tone IVR may remain unchanged.
FIG. 3 illustrates a conventional approach for voice and language model training for a call routing application. In this approach a trainer 5 utilizes human transcription and annotation. The natural language call routing application 1 to be trained has a speech recognizer 2, which converts the caller speech into text, and a topic ID classifier 3, which determines the topic from the text and instructs the IVR router where to send the call, whether to an automated self-service 13 or to a call center agent 14.
The voice and language model training process includes seven steps. First, spoken responses to a routing prompt are recorded to collect audio data, which is then stored in a waveform database 6. This can be done using an audio logging feature that would typically be provided with an IVR platform. Second, the collected data is transcribed by hand and annotated at 7 with a topic that indicates how the spoken response should be routed. The topics and transcripts are stored in a training database 8, along with an identifier for each entry that allows that entry to be uniquely correlated to the particular call from which the information was gathered. Third, feature selection 9 is performed. The features are the words from the training data; these must be ranked by how well they help to separate the training examples into their labeled topics. The feature selection process increases the efficiency of the speech recognition process by reducing the size of the active dictionary that must be searched. The ranking may use, for example, the well-known Kullback-Leibler distance metric (KLD), which measures the cross-entropy between the query words and the topic distributions, as described in S. Kullback, R. A. Leibler, “On Information and Sufficiency,” Ann. Math. Stat. Vol 22, pp 79–86, 1951. For a given topic and word, a large KLD means that the word occurs much more frequently in queries about this topic than in queries about any other topic, and thus should be useful in categorizing future queries. The 300–500 highest-ranked words, or more, are then selected on this basis.
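One way the word ranking of the feature selection step might be sketched is shown below. The exact scoring formula is not specified above, so the sketch assumes a per-word, per-topic Kullback-Leibler term p(w|T)·log(p(w|T)/p(w)) and ranks each word by its largest term over all topics; that choice, the function name `rank_words_by_kld` and the sample data are illustrative assumptions only.

```python
import math
from collections import Counter, defaultdict


def rank_words_by_kld(examples):
    """Rank words by how strongly they are associated with some topic.

    examples: list of (topic, list_of_words) pairs from annotated transcripts.
    Assumed score: max over topics of p(w|T) * log(p(w|T) / p(w)).
    """
    topic_word_counts = defaultdict(Counter)
    overall = Counter()
    for topic, words in examples:
        topic_word_counts[topic].update(words)
        overall.update(words)
    total = sum(overall.values())
    topic_totals = {t: sum(c.values()) for t, c in topic_word_counts.items()}

    scores = {}
    for word, count in overall.items():
        p_w = count / total
        best = 0.0
        for topic, counts in topic_word_counts.items():
            p_w_t = counts[word] / topic_totals[topic]
            if p_w_t > 0:  # a zero-probability term contributes nothing
                best = max(best, p_w_t * math.log(p_w_t / p_w))
        scores[word] = best
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))


examples = [
    ("billing", ["my", "bill", "is", "wrong"]),
    ("billing", ["question", "about", "my", "bill"]),
    ("outage", ["my", "service", "is", "down"]),
    ("outage", ["service", "outage"]),
]
ranked = rank_words_by_kld(examples)
top_keywords = ranked[:3]  # in practice, the 300-500 highest-ranked words
print(top_keywords)
```

In this toy data set, topic-specific words such as "service" and "bill" rank above words such as "my" and "is" that occur in both topics.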
Fourth, a topic trainer 12 uses the transcribed and annotated data in the training database 8 to build statistical models of how the selected feature words relate to the topics. Preferably, the topic trainer 12 uses a maximum likelihood estimation algorithm to build the statistical models. With maximum likelihood estimation, the parameters for the statistical models that maximize the likelihood of generating the observed training data are selected. The parameters of the model, mentioned above, are the matrix of p(wi|Tj), the probabilities of each keyword given each topic, and p(Tj), the prior probabilities of each topic. Estimation of these parameters is achieved by using a maximum likelihood method such as the one described in the Golden et al. article. However, other statistical methods can be used. The output of the trainer forms topic classifier configuration data and is input to the Topic ID classifier 3.
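For a multinomial keyword model, the maximum likelihood estimates reduce to relative frequencies, which can be sketched as follows. This is a minimal illustration and not the specific procedure of the Golden et al. article; the function name `train_topic_model`, the optional `smoothing` parameter (an assumption added so that unseen keywords need not receive zero probability) and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def train_topic_model(examples, keywords, smoothing=0.0):
    """Estimate p(Tj) and p(wi|Tj) from annotated transcripts.

    examples: list of (topic, list_of_words) pairs.
    smoothing=0.0 gives the plain maximum likelihood (relative frequency)
    estimate; a positive value is an assumed additive-smoothing variant.
    """
    keywords = list(keywords)
    keyword_set = set(keywords)
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    for topic, words in examples:
        topic_counts[topic] += 1
        word_counts[topic].update(w for w in words if w in keyword_set)

    n_examples = sum(topic_counts.values())
    priors = {t: c / n_examples for t, c in topic_counts.items()}

    likelihoods = {}
    for t in topic_counts:
        denom = sum(word_counts[t].values()) + smoothing * len(keywords)
        likelihoods[t] = {
            w: ((word_counts[t][w] + smoothing) / denom if denom else 0.0)
            for w in keywords
        }
    return priors, likelihoods


examples = [
    ("billing", ["my", "bill"]),
    ("billing", ["bill", "wrong"]),
    ("outage", ["service", "down"]),
]
priors, likelihoods = train_topic_model(examples, ["bill", "service", "down"])
print(priors["billing"])               # 2 of the 3 examples
print(likelihoods["outage"]["service"])  # 1 of the 2 keyword tokens
```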
Fifth, a statistical grammar builder 11 for a speech recognizer is run on the collected transcribed data. The statistical grammar builder 11 uses the vocabulary found in the training data, typically greater than 1000 words, and the list of keywords selected by the feature selection 9 to construct a statistical grammar, covering approximately 800 words, to be input to the speech recognizer 2. The set of words used by the topic identification system to classify speech recognition output is called the keyword set or keywords. The keyword list could contain all the words in the recognition vocabulary but is usually a subset of it. In order to accurately recognize the keywords, it is beneficial to add extra words into the grammar for the speech recognizer. Research has shown that the best words to add are the most common non-keywords, all of the words that precede or follow keywords, or both. It has been found that, for training purposes, the most common non-keywords approach was most effective.
The most common non-keywords algorithm works as follows:
1. Create a word list containing all non-keywords from the training corpus.
2. Determine the frequency of each word in the list.
3. Rank the words by frequency, counting the number of words at each frequency.
4. Choose the target maximum percentage of non-keywords to add, namely the desired ratio of keywords to non-keywords; use 3:1 as the default.
5. Using the frequency distribution, count down from the top frequency until the cumulative word count exceeds the target non-keyword count (by default, ⅓ of the number of keywords). This determines the minimum frequency.
6. Add all words that meet or exceed this minimum frequency.
Note that this approach may add more non-keywords than the target ratio calls for; however, it avoids having to select among equally popular non-keywords.
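The most common non-keywords algorithm described above can be sketched as follows. The function name `select_non_keywords`, its parameters and the sample corpus are hypothetical, and the sketch takes the word "exceeds" in the fifth step literally, so the countdown continues while the cumulative count is merely equal to the target.

```python
from collections import Counter


def select_non_keywords(corpus_words, keywords, ratio=3.0):
    """Pick the most common non-keywords to add to the recognizer grammar.

    ratio is the desired keywords-to-non-keywords ratio (3:1 by default),
    so the target count is len(keywords) / ratio.
    """
    keywords = set(keywords)
    # Steps 1-2: frequency of every non-keyword in the training corpus.
    freq = Counter(w for w in corpus_words if w not in keywords)
    # Step 3: number of distinct words occurring at each frequency.
    words_at_freq = Counter(freq.values())
    # Step 4: target number of non-keywords.
    target = len(keywords) / ratio
    # Step 5: count down from the top frequency until the target is exceeded.
    cumulative = 0
    min_freq = None
    for f in sorted(words_at_freq, reverse=True):
        cumulative += words_at_freq[f]
        min_freq = f
        if cumulative > target:
            break
    if min_freq is None:  # the corpus contained no non-keywords
        return []
    # Step 6: keep every word at or above the minimum frequency.
    return sorted(w for w, f in freq.items() if f >= min_freq)


keywords = {"bill", "payment", "outage", "service", "cancel", "upgrade"}
corpus = ["i"] * 5 + ["my"] * 4 + ["the"] * 4 + ["a"] * 2 + ["um"] + ["bill", "service"]
print(select_non_keywords(corpus, keywords))  # ['i', 'my', 'the']
```

With six keywords the target is two non-keywords, but three are returned because "my" and "the" are equally frequent, which illustrates why the approach may overshoot the target ratio.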
Sixth, the speech recognizer 2 uses a speech recognition process to turn input speech into text in an online system using the resulting grammar. The speech recognizer 2 must support statistical language models, that is, must be able to use statistical grammar input to it in the speech recognition process. Finally, topic classifier 3 takes text output by the speech recognizer 2 and, using the input classifier configuration data, generates a list of topics and confidence scores that it passes to the IVR routing mechanism 4. The topic identifier calculates the posterior probability of each topic given the input words or caller utterance, p(Tj|u). Using a Bayesian approach this can be calculated from the conditional probability of the utterance given the topic, p(u|Tj), and the prior probabilities of the topics, p(Tj). It uses a multinomial model of the keywords for the former, p(u|Tj) = ∏i p(wi|Tj)^ni(u), where the product is taken over all keywords wi and ni(u) is the number of times that keyword wi appears in u. The prior probabilities of the topics p(Tj) and the probabilities of the words given the topics, p(wi|Tj), have been estimated during training, as described above. The confidence scores indicate how relevant each topic was to the input text and are typically expressed as a percentage. The topic classifier 3 then maximizes p(Tj|u) over all topics to select the most likely topic for the input utterance.
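The posterior calculation described above can be sketched as follows. The sketch works in log space for numerical stability and assumes that every p(wi|Tj) supplied to it is strictly positive (for example, because the estimates were smoothed); the function name `classify_topic` and the sample probabilities are hypothetical and chosen only for illustration.

```python
import math


def classify_topic(utterance_words, priors, likelihoods):
    """Return (best_topic, posteriors) under a multinomial keyword model.

    priors:      {topic: p(Tj)}
    likelihoods: {topic: {keyword: p(wi|Tj)}}, assumed strictly positive.
    Words that are not keywords are ignored.
    """
    log_scores = {}
    for topic, prior in priors.items():
        score = math.log(prior)
        for w in utterance_words:
            p = likelihoods[topic].get(w)
            if p is not None:
                # Repeated words add their log-probability once per
                # occurrence, which implements p(wi|Tj) ** ni(u).
                score += math.log(p)
        log_scores[topic] = score

    # Bayes' law: normalize p(u|Tj) * p(Tj) over all topics.
    peak = max(log_scores.values())
    norm = sum(math.exp(s - peak) for s in log_scores.values())
    posteriors = {t: math.exp(s - peak) / norm for t, s in log_scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


priors = {"billing": 0.6, "outage": 0.4}
likelihoods = {
    "billing": {"bill": 0.7, "service": 0.2, "down": 0.1},
    "outage": {"bill": 0.1, "service": 0.5, "down": 0.4},
}
best, posteriors = classify_topic(["my", "service", "is", "down"], priors, likelihoods)
print(best)  # outage
```

Here the unnormalized scores are 0.6 × 0.2 × 0.1 = 0.012 for billing and 0.4 × 0.5 × 0.4 = 0.08 for outage, so the outage topic is selected with a posterior of about 87%, which could serve as the confidence score.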
The acoustic model trainer, used for example to assist the speech recognizer in recognizing regional accents, may utilize a vector-quantized (VQ) model. To train the VQ acoustic model, it is first necessary to build a phonetic dictionary of all words in the transcribed text. The dictionary, transcribed text and the associated speech waveforms are then used to train the quantization codebooks and the associated probability mass functions in the VQ model. Training is performed using the standard Baum-Welch algorithm, described, for example, in F. Jelinek, Statistical Methods for Speech Recognition, referenced above, which generates maximum likelihood estimates of the model parameters. The vocabulary used during recognition need not be the same as the training dictionary. Typically, interpolated models are created for new phoneme sequences that have not been seen during training.
As has been described above, in natural language call routing systems, speech recognition programs are used to help determine what the caller is saying. However, it is not enough simply to recognize the words spoken by the caller. Topics corresponding to the spoken words must be ascertained to allow routing of the call to the appropriate specialist. It is important to maintain, and even improve, the accuracy of the system over time, in the least expensive manner.
A conventional approach to such training has been described in connection with the conventional trainer 5 discussed above with respect to FIG. 3. However, in the conventional trainer 5, the important tasks of call transcription and topic annotation used to supply the training data are done manually. Such manual text transcription and topic annotation processes have been found to be a training bottleneck, substantially increasing the cost and the turnaround time for deploying improved configuration data.