Many companies interact with their customers via electronic means (most commonly via telephone, e-mail, and online text chat). Such electronic systems save the companies a large amount of money by limiting the number of customer service or support agents needed. These electronic systems, however, generally provide a less than satisfactory customer experience. The customer experience may be acceptable for simple transactions, but is frequently inconsistent or downright frustrating if the customer is not adept at talking to or interacting with a computer.
Such interactive response systems are well known in the art. One example is an interactive voice response (IVR) system used to provide customer service via telephone. An example of customer service systems utilizing IVR technology is described in U.S. Pat. No. 6,411,686. An IVR system typically communicates with customers using a set of prerecorded phrases, responds to some spoken input and touch-tone signals, and can route or transfer calls. A drawback to such IVR systems is that they are normally built around a “menu” structure, which presents callers with just a few valid options at a time and requires a narrow range of responses from callers.
Many of these IVR systems now incorporate speech recognition technology. An example of a system incorporating speech recognition technology is described in U.S. Pat. No. 6,499,013. The robustness of the speech recognition technology used by IVR systems varies, but such systems often have a predetermined range of responses that they listen for and can understand, which limits the ability of the end user to interact with the system in everyday language. Therefore, callers will often feel that they are being forced to speak to the system “as though they are talking to a computer.” Moreover, even when interacting with a system that utilizes speech recognition, customer input is often either not recognized or incorrectly determined, causing the customer to seek a connection to a human customer service agent as soon as possible.
Human customer service agents continue to be used for more involved customer service requests. These agents may speak to the customer over the phone, respond to customer e-mails, and chat with customers online. Agents normally answer customer questions or respond to customer requests. Companies have customer service groups, which are sometimes outsourced to businesses that specialize in “customer relations management.” Such businesses run centers staffed by hundreds of agents who spend their entire working day on the phone or otherwise interacting with customers. An example of such a system is described in U.S. Pat. No. 5,987,116.
The typical model of customer service interaction is for one agent to assist a customer for the duration of the customer's interaction. At times, one agent (for example, a technical support representative) may transfer the customer to another agent (such as a sales representative) if the customer needs help with multiple requests. But in general, one agent spends his or her time assisting that one customer for the full duration of the customer's call or chat session, or is occupied resolving the customer's issue via e-mail. Most call centers also expect the agent to take the time to log (document) the call. Deficiencies in this heavy agent interface model are that (1) there is a high agent turnover rate and (2) a great deal of initial and ongoing agent training is usually required, all of which add up to making customer service a significant expense for these customer service providers.
In order to alleviate some of the expenses associated with agents, some organizations outsource their customer service needs. One trend in the United States in recent years, as high-speed fiber optic voice and data networks have proliferated, is to locate customer service centers overseas to take advantage of lower labor costs. Such outsourcing requires that the overseas customer service agents be fluent in English. In cases where these agents are used for telephone-based support, the agent's ability to understand and speak clearly in English is often an issue. An unfortunate result of offshore outsourcing is misunderstanding and a less than satisfactory customer service experience for the person seeking service.
Improved interactive response systems blend computer-implemented speech recognition with intermittent use of human agents. To some extent, this has been done for years; U.S. Pat. No. 5,033,088 addresses a system using both a human attendant and an automated speech recognizer. Likewise, U.S. Pat. No. 7,606,718 discloses a system in which a human agent is presented with only portions of a call requiring human interpretation of a user's utterance. The contents of these patents, as well as all other art referred to herein, are hereby incorporated by reference as if fully set forth herein. Interest in such systems is enhanced if they are relatively low in cost, which generally calls for limited human interaction. To achieve such limited human interaction, it would be desirable to have a system that required minimal initial training and for which results continued to improve over time. In particular, a learning/training system that provides “day-one” performance that is suitable for production use and that improves in efficiency quickly over time would be particularly valuable.
Many existing ASR systems suffer from serious training constraints such as the need to be trained to recognize the voice of each particular user of the system or the need to severely limit recognized vocabulary in order to provide reasonable results. Such systems are readily recognizable by users as being artificial. Consider the difference between the typical human prompt, “How can I help you?” and the artificial prompt, “Say MAKE if you want to make a reservation, STATUS if you would like to check on status of a reservation, or CANCEL to cancel a reservation.”
A goal of voice systems with ASR (Automated Speech Recognition) was to achieve a conversational system to perform caller interaction, much like HAL in “2001: A Space Odyssey”. To improve ASR capability, Voice User Interface (VUI) techniques have been developed to phrase prompts precisely and compactly in an attempt to reduce the vocabulary used and give the caller hints about the words they should speak to achieve higher accuracy speech recognition. Since then, ASR has improved and now addresses recognition of open-ended conversations. However, such open-ended conversations involve much larger vocabularies, resulting in much higher speech recognition error rates. The result is that callers are left with more frustration with and disdain for IVR systems that, for instance, excessively confirm what was previously stated and understood, make incorrect choices, and force callers to back up to a previous menu. VUI designs attempt to lead the caller into what is known as a “directed dialog”, trying to narrow conversation from the general to the specific. Because small domains have a limited vocabulary and a significantly smaller repertoire of utterances, ASR and Natural Language Understanding (NLU) have been more successful when applied to directed dialogs. The IVR industry is working to characterize knowledge domains using statistics and “search” with speech recognition to further increase understanding. However, these approaches still handle a significant number of callers poorly, especially those with dialects or pronunciation patterns that are difficult to understand even with sophisticated techniques such as building personalized ASR acoustic models. With the emergence of human-assisted recognition, there are now opportunities to leverage human understanding to recognize speech, text, graphics and video in conjunction with automation, making understanding more accurate and avoiding many of the weaknesses of ASR-based IVR systems.
The fundamental task of IVR systems is to coordinate the filling of information slots in a range of business forms corresponding to user requests. In traditional IVR systems, this coordination is typically performed following a decision tree, fixed in advance, where there is little deviation from a restricted number of ways of interacting with users. Different kinds of recognition strategies have been developed, including variations in VUI design, different criteria that optimize for successful identification of accurate understanding, and techniques for understanding and recognition in the shortest possible time.
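The fixed, slot-by-slot coordination described above can be illustrated with a minimal sketch. It is not drawn from any particular IVR product; the `Form` class, the slot names, and the scripted caller answers are all hypothetical and stand in for recognized caller input.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Form:
    """A business form whose information slots are filled in a fixed order.

    Hypothetical structure, for illustration only.
    """
    name: str
    slots: List[str]
    values: Dict[str, str] = field(default_factory=dict)

    def next_empty_slot(self) -> Optional[str]:
        # Return the first slot that has no value yet, or None when complete.
        for slot in self.slots:
            if slot not in self.values:
                return slot
        return None


def run_dialog(form: Form, caller_answers: Iterable[str]) -> Dict[str, str]:
    """Prompt for each slot in its predetermined order, one answer per prompt.

    `caller_answers` is an assumed stand-in for recognized caller responses.
    """
    answers = iter(caller_answers)
    while (slot := form.next_empty_slot()) is not None:
        print(f"Please say the {slot}.")
        form.values[slot] = next(answers)
    return form.values


reservation = Form("make_reservation",
                   ["departure city", "arrival city", "travel date"])
filled = run_dialog(reservation, ["Boston", "Denver", "March 3"])
```

Because the order of slots is fixed in advance, a caller who volunteers the travel date first would still be asked for the departure city next, which is the kind of rigidity the passage attributes to traditional systems.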
There are many reasons for a system to use a variety of appropriate techniques to make the interactions between a caller and automated system using human-assisted recognition as seamless and natural as possible.
Humans recognize and interpret meaning with much higher accuracy than Automated Speech Recognition (ASR), Graphics and Video Processing, and Natural Language Understanding (NLU) techniques. If humans can be used to understand when automation is insufficiently accurate, it now becomes possible to automate substantially more user interactions while still providing a good user experience. However, unlike computer resources, which can scale to meet unusual and unpredicted volume peaks, human resources need to be scheduled and may not be available in a timely manner for peaks. There is consequently a need for a system to automatically adjust to the required amount of HSR for any particular application, even using DTMF (dual-tone multi-frequency) when accuracy is not sufficient, to minimize the use of HSR. Even though the human interaction would change during unscheduled peaks, self-service could continue to be performed in a more traditional manner.
The traditional techniques used for tuning speech recognition and classifying recognized utterances to achieve the highest level of recognition change in subtle but important ways when the goal becomes how to combine human assistance and automation to best recognize and interpret the caller's utterances and at the same time achieve the most human-like user experience possible. Thus, a challenge not addressed by existing systems is how to use the most efficient combination of humans and automation in the given circumstances, under the given workload, while providing the most successful user experience.
Traditionally, ASR systems start “listening” to utterances as they are spoken. If recognition automation fails, the user must then wait for as long as the complete utterance took to be spoken before HSR can listen to and process it. It would be desirable if a system could instead attempt to understand the interaction as close to real time as possible. For example, as the user speaks more and more words to describe their meaning (or “intent”), processing first by ASR and subsequently by HSR results in a significant time gap between the end of an utterance and the beginning of a response. This time gap could be filled, for example, with played audio such as a typing sound. For some applications, this could be successful, especially for those applications that collect data. For other applications, this time gap makes it difficult to carry on a natural conversation with the system. In addition, longer speech also often results in lower recognition quality. Longer speech contains not only more words but also more word combinations. Taken together, these increase speech recognition errors and reduce understanding accuracy.
Therefore, an automated recognition system is needed that can begin understanding as early as possible and predict whether recognition will succeed before resorting to human assistance, so as to maintain human-like interactions. Furthermore, since human assistance may be called upon, this automated recognition system also needs the ability to monitor staffing of human assistance so that it can adjust understanding confidence automatically and/or go to complete automation depending on system status load and human assistance skill set capability.
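One way to picture the staffing-aware adjustment called for here is a confidence threshold that relaxes as human assistance becomes scarce. The sketch below is only an illustration under stated assumptions: the function names, the linear relaxation rule, and the numeric thresholds are hypothetical and are not taken from the referenced art.

```python
def acceptance_threshold(agents_available: int, queue_length: int,
                         base: float = 0.80, floor: float = 0.50) -> float:
    """Confidence needed to accept an automated result (assumed rule).

    With idle human assistance the threshold stays at `base`; it relaxes
    linearly toward `floor` as the queue approaches the number of agents.
    """
    if agents_available <= 0:
        return floor  # no human help: accept more automated results
    load = min(queue_length / agents_available, 1.0)
    return max(floor, base - (base - floor) * load)


def route(asr_confidence: float, agents_available: int,
          queue_length: int) -> str:
    """Pick a handling path: 'automation', 'human', or 'dtmf' (hypothetical)."""
    if asr_confidence >= acceptance_threshold(agents_available, queue_length):
        return "automation"
    if agents_available > 0 and queue_length < agents_available:
        return "human"
    # Neither automation nor human assistance is adequate: fall back to
    # touch-tone (DTMF) input so self-service can continue.
    return "dtmf"


print(route(0.90, agents_available=5, queue_length=0))
print(route(0.60, agents_available=5, queue_length=0))
print(route(0.40, agents_available=0, queue_length=0))
```

In this sketch, the same automated confidence score can lead to different handling depending on how busy the human assistance pool is, which is the behavior the paragraph above identifies as needed.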
Systems that are more ambitious, such as Natural Language Understanding (NLU) systems, require extensive machine learning periods or laborious hand-crafted grammar writing in order to get usable results from larger grammars and vocabularies. Particularly in environments in which vocabulary may be dynamic (such as a system to take ticket orders for a new play or for a concert by a new musical group), the learning period may be far too long to provide satisfactory results. Inclusion of accents, dialects, regional differences in vocabulary and grammar and the like further complicates the task of teaching such systems so that they can achieve reasonable thresholds of recognition accuracy.
ASR systems currently available are effective at recognizing simple spoken utterances such as numbers, data, and simple grammars (i.e., a small set of words and expressions made from them). However, to date ASR systems have not provided a high enough level of speech recognition performance to create a voice interface that provides a free-flowing conversation. Additionally, ASR performance degrades not only with accents and dialects as noted above, but also with background noise, children's rather than adults' voices, and, in many cases, female rather than male voices. ASR performance is improving over time, with some systems using statistical language models intended to recognize an extremely wide range of responses from callers, so that callers can be recognized even when they speak naturally rather than in a highly constrained manner. Even so, ASR performance has not yet rivaled actual interaction between humans, and the ASR systems that provide the highest levels of performance are time consuming and expensive to build and to tune for specific applications.
Tuning of grammars by considering statistical probabilities of various expected answers, as well as synonyms, is one technique used to improve ASR performance. Another is development of statistical language models, which can involve significant efforts to transcribe recordings of utterances of live phone conversations with live operators. ASR performance is quite acceptable in certain applications but is not yet suitable for others, so known ASR-based systems continue to lack the capability to understand natural unconstrained utterances.
Therefore, there remains a need in the art for an interactive system that provides a consistently high-quality experience without the limitations of constituent ASR components.