1. Field of the Invention
The present invention relates generally to speech recognition systems. More specifically, systems and methods for constructing a series of interactions with a user to collect multiple pieces of related information for the purpose of accomplishing a specific goal or topic (a multi-slot dialog) using a component-based approach are disclosed.
2. Description of Related Art
Speech recognition systems offer a promising means of automating service functions without requiring extensive changes in user behavior. Many companies have sought to expand or improve their customer service functions by using speech recognition technology to automate tasks that have traditionally been handled by human agents. To achieve this, speech recognition systems should allow a user to ask for and provide information using natural, conversational spoken input. Recent advances in certain areas of speech recognition technology have helped alleviate some of the traditional obstacles to usable speech recognition systems. For example, technology advances have enabled unrehearsed spoken input to be decoded under a wider range of realistic operating conditions, such as background noise and imperfect telephone line quality. Additionally, recent advances have allowed voice applications to recognize voice inputs from a broader population of users with different accents and speaking styles.
Well-engineered voice systems achieve high customer acceptance. Unfortunately, building effective voice systems using past approaches has been difficult.
The earliest approaches required programming in the application program interfaces (APIs) of the speech recognition engine. These approaches burdened developers with low-level, recognition-engine-specific details such as exception handling and resource management. Moreover, since these APIs were specific to a particular recognition engine, the resulting applications could not be easily ported to other platforms.
The advent of open-standard intermediate voice languages such as VoiceXML somewhat simplified the development process. These intermediate voice languages brought with them a division of responsibilities in a voice system between a browser—which interprets the voice language and handles the telephony, voice recognition, and text-to-speech infrastructure—and a client application—which provides the user interaction code (expressed in the voice language). As a result, application developers no longer needed to worry about low-level APIs, but instead were responsible for generating documents that would be executed by the voice browser.
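The browser/client division described above can be sketched as follows: the client application's only job is to emit a voice-language document that the browser then executes. The sketch below is a minimal, illustrative example in which a hypothetical client generates a single-field VoiceXML form; the element names follow the VoiceXML 2.0 standard, while the function name, field name, and prompt text are assumptions for illustration only.

```python
# Minimal sketch of a client application generating a VoiceXML document
# for a voice browser to execute. Element names follow VoiceXML 2.0;
# the function name, field name, and prompt text are illustrative.

def build_prompt_document(field_name: str, prompt_text: str) -> str:
    """Return a single-field VoiceXML form as a string.

    The voice browser, not the client, handles telephony, recognition,
    and text-to-speech when it interprets this document.
    """
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<vxml version="2.0">\n'
        '  <form>\n'
        f'    <field name="{field_name}">\n'
        f'      <prompt>{prompt_text}</prompt>\n'
        '    </field>\n'
        '  </form>\n'
        '</vxml>\n'
    )

document = build_prompt_document("departure_airport",
                                 "From which airport are you departing?")
print(document)
```

Note that each such document collects a single piece of information, which foreshadows the limitation discussed below: the unit of development is a one-field interaction.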
Even with these advances, however, developing voice applications remained complex for a number of reasons. For example, voice applications present a new user interaction model that is sufficiently distinct from the (well understood) graphical user interface to require specialized design and implementation expertise. Speech interface concepts, such as dialog management, grammar optimization, and multi-slot interfaces, must be manually implemented in every custom-built voice system, a burden made heavier by the relative newness of the speech paradigm. In addition, the demands on applications to handle presentation, business logic, and data access functions resulted in piecemeal architectures combining static and dynamically generated documents, backend servlets, grammars, and other disjoint components.
A number of products are available to simplify the development of enterprise voice applications. A central element of many of these products is a library of predefined and customizable voice components whose use reduces the amount of code that needs to be developed by a programmer. These components usually encapsulate the voice language code, grammars, internal call flows, prompts and error recovery routines required to obtain one piece of information from the caller, such as a date, a time, a dollar amount, a sequence of digits, or an item from a set or list of allowable items (such as a set of airports).
A major limitation of these component frameworks is that the components are not combinable to allow the user to provide multiple pieces of information in each utterance. For example, a flight reservation application could use four components: one each for the departure airport, destination airport, departure date, and departure time. The existing frameworks would allow a user to provide the four pieces of information in four separate utterances. However, if the application were to allow the user to say the departure airport, destination airport and departure date in one utterance (e.g. “I'm flying from Boston to San Francisco on Monday”), the departure airport, destination airport, and departure date components could not be simply combined. Instead, a new component would need to be developed with new grammars, call flows, prompts, etc. to recognize the two airports and the date. To carry the example further, if the application were to allow the caller to retain some pieces of information while changing other pieces of information (e.g. “No, I'm actually flying to Oakland on Tuesday”), an even more complex component would have to be developed.
Because of these limitations, voice applications that rely on existing component frameworks implement highly directed dialogs in which the call flow is largely predetermined and each step accepts only a single item of information, such as in an interchange illustrated in FIG. 1a. Such voice systems are rigid and often penalize a caller who provides too much information, such as in an interchange illustrated in FIG. 1b. As a result, these systems are neither intuitive nor efficient since they cannot capture information rapidly or adapt to the user's preferences for providing information.
What is needed is a voice application that utilizes a more intuitive, rapid and natural approach for obtaining information from a user such as a caller.