Many companies interact with their customers via electronic means (most commonly via telephone, e-mail, and online text chat). Such electronic systems save the companies a large amount of money by limiting the number of customer service or support agents needed. These electronic systems, however, generally provide a less than satisfactory customer experience. The customer experience may be acceptable for simple transactions, but is frequently inconsistent or downright frustrating if the customer is not adept at talking to or interacting with a computer.
Such interactive response systems are well known in the art. One example is an interactive voice response (IVR) system that provides customer service via telephone. An example of customer service systems utilizing IVR technology is described in U.S. Pat. No. 6,411,686. An IVR system typically communicates with customers using a set of prerecorded phrases, responds to some spoken input and touch-tone signals, and can route or transfer calls. A drawback to such IVR systems is that they are normally built around a “menu” structure, which presents callers with just a few valid options at a time and requires a narrow range of responses from callers.
Many of these IVR systems now incorporate speech recognition technology. An example of a system incorporating speech recognition technology is described in U.S. Pat. No. 6,499,013. The robustness of the speech recognition technology used by IVR systems varies, but such systems often have a predetermined range of responses that they listen for and can understand, which limits the ability of the end user to interact with the system in everyday language. Therefore, callers will often feel that they are being forced to speak to the system “as though they are talking to a computer.” Moreover, even when interacting with a system that utilizes speech recognition, customer input is often either not recognized or incorrectly determined, causing the customer to seek a connection to a human customer service agent as soon as possible.
Human customer service agents continue to be used for more involved customer service requests. These agents may speak to the customer over the phone, respond to customer e-mails, and chat with customers online. Agents normally answer customer questions or respond to customer requests. Companies have customer service groups, which are sometimes outsourced to businesses that specialize in “customer relations management.” Such businesses run centers staffed by hundreds of agents who spend their entire working day on the phone or otherwise interacting with customers. An example of such a system is described in U.S. Pat. No. 5,987,116.
The typical model of customer service interaction is for one agent to assist a customer for the duration of the customer's interaction. At times, one agent (for example, a technical support representative) may transfer the customer to another agent (such as a sales representative) if the customer needs help with multiple requests. But in general, one agent spends his or her time assisting that one customer for the full duration of the customer's call or chat session, or is occupied resolving the customer's issue via e-mail. Most call centers also expect the agent to take the time to log (document) the call. Deficiencies of this agent-heavy model are that (1) there is a high agent turnover rate and (2) a great deal of initial and ongoing agent training is usually required, which together make customer service a significant expense for these customer service providers.
In order to alleviate some of the expenses associated with agents, some organizations outsource their customer service needs. One trend in the United States in recent years, as high-speed fiber optic voice and data networks have proliferated, is to locate customer service centers overseas to take advantage of lower labor costs. Such outsourcing requires that the overseas customer service agents be fluent in English. In cases where these agents are used for telephone-based support, the agent's ability to understand and speak clearly in English is often an issue. An unfortunate result of offshore outsourcing is misunderstanding and a less than satisfactory customer service experience for the person seeking service.
Improved interactive response systems blend computer-implemented speech recognition with intermittent use of human agents. To some extent, this has been done for years; U.S. Pat. No. 5,033,088 addresses a system using both a human attendant and an automated speech recognizer. Likewise, U.S. Pat. No. 7,606,718 discloses a system in which a human agent is presented with only portions of a call requiring human interpretation of a user's utterance. The contents of these patents, as well as all other art referred to herein, are hereby incorporated by reference as if fully set forth herein. Interest in such systems is enhanced if they are relatively low in cost, which generally calls for limited human interaction. To achieve such limited human interaction, it would be desirable to have a system that requires minimal initial training and for which results continue to improve over time. In particular, a learning/training system that provides “day-one” performance that is suitable for production use and that improves in efficiency quickly over time would be particularly valuable.
Many existing automatic speech recognition (ASR) systems suffer from serious training constraints, such as the need to be trained to recognize the voice of each particular user of the system or the need to severely limit recognized vocabulary in order to provide reasonable results. Such systems are readily recognizable by users as being artificial. Consider the difference between the typical human prompt, “How can I help you?” and the artificial prompt, “Say MAKE if you want to make a reservation, STATUS if you would like to check on status of a reservation, or CANCEL to cancel a reservation.”
Systems that are more ambitious, such as Natural Language Understanding (NLU) systems, require extensive machine learning periods in order to get usable results from larger grammars and vocabularies. Particularly in environments in which vocabulary may be dynamic (such as a system to take ticket orders for a new play or for a concert by a new musical group), the learning period may be far too long to provide satisfactory results. Inclusion of accents, dialects, regional differences in grammar and the like further complicates the task of teaching such systems so that they can achieve reasonable thresholds of recognition accuracy.
ASR systems currently available are effective at recognizing simple spoken utterances such as numbers, data, and simple grammars (i.e., a small set of words). However, to date ASR systems have not provided a high enough level of understanding to create a voice interface that provides a free-flowing conversation. Additionally, ASR performance degrades not only with accents and dialects as noted above, but also with background noise and, in many cases, female rather than male voices. ASR performance is improving over time, with some systems using statistical language models intended to recognize an extremely wide range of responses from callers, such that callers can be recognized even when they speak naturally rather than in a highly constrained manner. Even so, ASR performance has not yet rivaled actual interaction between humans, and the ASR systems that provide the highest levels of performance are time-consuming and expensive to build and to tune for specific applications.
Tuning of grammars by considering statistical probabilities of various expected answers, as well as synonyms, is one technique used to improve ASR performance. Another is development of statistical language models, which can involve significant efforts transcribing recordings of utterances of live phone conversations with live operators. ASR performance is quite acceptable in certain applications but is not yet suitable for others, so known ASR-based systems continue to lack capability for understanding natural unconstrained utterances.
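By way of illustration only, the grammar tuning mentioned above (weighting expected answers by statistical probability and accepting synonyms) might be sketched as follows. This is a minimal sketch and is not drawn from any of the referenced systems; the grammar contents, the prior probabilities, and the names GRAMMAR and interpret are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical tuned grammar for a reservation line. Each expected answer
# carries an assumed prior probability and a list of accepted synonyms.
GRAMMAR = {
    "MAKE": {"prior": 0.6, "phrases": ["make", "book", "reserve"]},
    "STATUS": {"prior": 0.3, "phrases": ["status", "check", "look up"]},
    "CANCEL": {"prior": 0.1, "phrases": ["cancel", "drop", "call off"]},
}


def interpret(utterance: str) -> Optional[str]:
    """Return the highest-scoring expected answer, or None if nothing matches."""
    # Lowercase, replace punctuation with spaces, and pad with spaces so that
    # phrases are matched only on whole-word boundaries.
    words = re.sub(r"[^a-z' ]", " ", utterance.lower()).split()
    padded = " " + " ".join(words) + " "
    best_answer, best_score = None, 0.0
    for answer, entry in GRAMMAR.items():
        # Count how many synonyms of this answer occur in the utterance.
        matches = sum(1 for phrase in entry["phrases"] if f" {phrase} " in padded)
        # Weight the match count by the assumed prior probability.
        score = matches * entry["prior"]
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

In this sketch, an utterance such as “I would like to book a room” resolves to MAKE through a synonym, while an utterance outside the grammar, such as “My flight was late,” yields no result, which reflects the limitation on unconstrained utterances noted above.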
Therefore, there remains a need in the art for an interactive system that provides a consistently high-quality experience without the limitations of constituent ASR components.