The use of speech recognition in computer-based interactive applications has become increasingly commonplace in everyday life. Today, a computer-based voice application can allow a telephone caller to direct the application to perform simple tasks through spoken utterances, such as connecting the caller with people in a telephone directory or retrieving airline flight information. Many companies have sought to expand or improve their customer service functions by using technology such as speech recognition to automate tasks that have traditionally been handled by human agents.
Conventional voice applications are well understood in the art, as disclosed for example in U.S. Pat. Nos. 6,173,266 issued to Marx et al. and 6,314,402 issued to Monaco et al., both of which are incorporated herein by reference. PRIOR ART FIG. 1 shows the call flow (100) of an example voice activated phone attendant application that can be used by a company to direct incoming phone calls. When a user calls the company, the application receives the call and outputs a greeting message, such as “Welcome to Company A” (110). The application then prompts the user to provide information (120) by listing options available to the user or by instructing the user on how to respond to the application, for example by providing the prompt: “If you know the name of the person you wish to speak to, please say the first name followed by the last name now. If you would like to speak to an operator, please say ‘Operator’ now.”
Next, the application waits for a response from the user (130) and then processes the response (140). For example, if the user says “Chris Brooks” the application needs to recognize this user utterance and determine if there is a Chris Brooks to whom the call should be transferred. A robust application should be designed to also recognize common variations of names, such as “Christopher Brooks.” If the application finds a match to the user utterance, the application prompts the user for confirmation by providing output such as: “Do you mean ‘Chris Brooks’?” (150). The application waits to receive a confirmation response from the caller (160), processes the response (170), and then acts upon the processed response (180), such as by transferring the call to the designated recipient and informing the caller of this action.
PRIOR ART FIG. 2 shows a flowchart (200) that provides more detail in the processing of a user utterance, such as in step 140 of the example voice application of FIG. 1. First, the audio waveform of the user utterance is recorded (210), and a phonetic representation of the waveform is created (220). Next, the phonetic representation of the utterance is compared to entries in a database of vocabulary words or phrases recognized by the application to generate a hypothesis of what the user said and a confidence level that the hypothesis is correct (230). In this example, the hypothesis is categorized as a high confidence hypothesis (240), a low confidence hypothesis (250), or a null hypothesis (260). Depending on whether a hypothesis is generated and the level of confidence, the application can reprompt the user (270), ask the user to confirm the hypothesis (150), or proceed directly to take appropriate action (180). For example, if the processing of the user utterance leads to a high confidence hypothesis (240), the example phone attendant application can directly transfer the caller to the requested recipient (180) and omit the confirmation and related steps (150, 160, 170).
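The branching on hypothesis category described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular recognizer: the names `Hypothesis`, `next_step`, and `HIGH_CONFIDENCE`, as well as the threshold value 0.85, are assumptions introduced for the example and do not come from the cited references.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold; the text does not specify a numeric value.
HIGH_CONFIDENCE = 0.85


@dataclass
class Hypothesis:
    """Hypothetical recognition result for one user utterance."""
    text: Optional[str]  # None models the null hypothesis (260)
    confidence: float    # assumed to lie in the range 0.0 to 1.0


def next_step(hyp: Hypothesis) -> str:
    """Map a recognition result to the next call-flow step."""
    if hyp.text is None:
        return "reprompt"  # null hypothesis: reprompt the user (270)
    if hyp.confidence >= HIGH_CONFIDENCE:
        return "act"       # high confidence: act directly (180)
    return "confirm"       # low confidence: ask for confirmation (150)
```

Under these assumptions, `next_step(Hypothesis("Chris Brooks", 0.95))` selects the direct transfer, while a result with confidence 0.5 leads to the confirmation prompt.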
Because people communicate naturally via speech, speech recognition systems have come to be viewed as a promising method for automating service functions without requiring extensive changes in user behavior. To achieve this vision, speech recognition systems should allow a user to ask for and provide information using natural, conversational spoken input. Recent advances in certain areas of speech recognition technology have helped alleviate some of the traditional obstacles to usable speech recognition systems. For example, technology advances have enabled unrehearsed spoken input to be decoded under a wider range of realistic operating conditions, such as background noise and imperfect telephone line quality. Additionally, recent advances have allowed voice applications to recognize voice inputs from a broader population of users with different accents and speaking styles.
However, despite such recent advances, conventional speech recognition systems have not provided adequately natural and conversational speech interfaces for users. As a result, the effectiveness of such systems, and the perception of and willingness to adopt such systems by users, have been severely limited.
In particular, understanding arbitrary speech from a human user has been a difficult problem. The acoustic signals related to common speech contain an overlap of phonetic information that cannot be decoded perfectly without knowledge of the context of the conversation, and in turn, knowledge of the real world. Therefore, computer-based speech recognition provides probabilistic results, relying on data-driven statistical approaches to determine a hypothesis (or small set of hypotheses) that has the highest posterior probability for matching the input audio signal. A description of the current state of the art in speech recognition systems may be found in X. Huang, A. Acero, H. Hon, Spoken Language Processing, Prentice Hall, New Jersey, 2001, and M. Padmanabhan, M. Picheny, “Large-Vocabulary Speech Recognition Algorithms”, IEEE Computer, April 2002.
To maintain high levels of recognition accuracy, the user's input must typically be constrained by limiting both the vocabulary of allowed words and the way in which sentences can be formed. These constraints are expressed by a grammar, a set of rules that defines valid sentences as a structured sequence of words and phrases. For example, to recognize user responses to the question “Tell me the name of the person you'd like to call” (for a sample voice activated phone attendant application), the application developer might define the following variations:
    [Name]
    I want to talk to [Name]
    I want to call [Name]
    I want to speak with [Name]
    I'd like to get [Name] please
The difficulty with the above practice is that if the user gives a response that does not exactly match one of the predefined rules (e.g., “Can you get me John Smith if he's in the office?”), the application will not recognize it (an out-of-grammar condition) and will have to reprompt the user, who may not understand why the response was not recognized. Out-of-grammar rates can be quite high unless the application developer is knowledgeable enough to predefine all the common linguistic variations that a user might utter.
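The out-of-grammar condition can be illustrated with a small sketch that treats the five variations as templates with a [Name] slot. This is a simplified stand-in for a real grammar engine, operating on text rather than audio. The directory contents and the names `DIRECTORY`, `RULES`, `compile_rule`, and `parse` are hypothetical and serve only the example.

```python
import re
from typing import Optional

# Hypothetical directory entries, for illustration only.
DIRECTORY = ["Chris Brooks", "John Smith"]

# The five predefined variations, with [Name] as a slot.
RULES = [
    "[Name]",
    "I want to talk to [Name]",
    "I want to call [Name]",
    "I want to speak with [Name]",
    "I'd like to get [Name] please",
]


def compile_rule(rule: str):
    """Turn one template into a regex that captures the name slot."""
    names = "|".join(re.escape(name) for name in DIRECTORY)
    prefix, suffix = rule.split("[Name]")
    pattern = re.escape(prefix) + "(?P<name>" + names + ")" + re.escape(suffix)
    return re.compile(pattern, re.IGNORECASE)


PATTERNS = [compile_rule(rule) for rule in RULES]


def parse(utterance: str) -> Optional[str]:
    """Return the requested name, or None for an out-of-grammar utterance."""
    text = utterance.strip().rstrip(".?!")
    for pattern in PATTERNS:
        match = pattern.fullmatch(text)  # the whole utterance must fit a rule
        if match:
            return match.group("name")
    return None
```

With these assumed rules, `parse("I want to call John Smith")` yields the name, whereas `parse("Can you get me John Smith if he's in the office?")` yields `None` even though the name is present, because no rule covers the surrounding words.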
Alternatively, the prompt must be very detailed to guide and restrict the user response, e.g., “If you know the name of the person you wish to speak to, please say only the first name followed by the last name now. If you would like to speak to an operator, please say ‘Operator’ now.” This technique is awkward and lengthy, and it sounds unnatural to most callers. Moreover, a user's response can still be highly variable and hard to predict, and can contain disfluencies such as restarts and filled pauses (“uhm” and “uh”). Despite these limitations, the use of grammars is common in current voice applications, and most developers are familiar with grammars and able to write and understand grammars of reasonable complexity.
One alternative approach to using pre-defined grammars in handling variations in user responses is an n-gram language model. An n-gram model does not rely on predefining all valid sentences; instead, an n-gram model contains information on which words are more likely to follow a given sequence of (n−1) words. An n-gram model does not enforce a sentence structure, and can assign a probability to a sentence even if it is ungrammatical or meaningless in normal usage. If the probability of a word depends only on the immediately preceding word, the model is known as a bigram. If the probability of a word depends on the previous two words, the model is known as a trigram. An n-gram language model is usually derived by counting word sequence frequencies from a training corpus—a large set of training texts that share the same language characteristics as the expected input. For example, a bigram model for a flight reservation application might specify that the word “to” has a much higher probability of following the word “fly” than the word “will”, since a sample of user utterances in this context would have a higher frequency of the word sequence “fly to” than the word sequence “fly will”. With a sufficient training set size, n-gram models can be built to recognize free-style speech.
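The counting procedure behind a bigram model can be sketched as follows. The four-sentence corpus is made up for illustration and is far smaller than anything usable in practice, and the names `CORPUS`, `train_bigram`, and `probability` are hypothetical. The sketch applies no smoothing, so a word pair never seen in training receives a probability of zero, which a practical model would avoid.

```python
from collections import Counter, defaultdict

# Tiny made-up training corpus; a real model needs far more data.
CORPUS = [
    "i want to fly to boston",
    "i would like to fly to denver",
    "fly to chicago please",
    "i will fly to boston tomorrow",
]


def train_bigram(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Sentence boundary markers let the model score first and last words.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts


def probability(counts, prev, word):
    """Relative-frequency estimate of P(word | prev), without smoothing."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0


model = train_bigram(CORPUS)
```

On this toy corpus, `probability(model, "fly", "to")` is higher than `probability(model, "fly", "will")`, mirroring the flight reservation example in the text.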
However, there are several disadvantages to using n-gram models. First, n-gram models are not as familiar as grammars to most current voice application developers, and cannot be represented in as concise a human-readable form as grammars. Second, n-gram models need to be trained on a large number of samples (many tens of thousands, or up to millions) to achieve adequate levels of accuracy. This training requirement significantly limits the speed with which these systems can be deployed. Furthermore, the training samples typically must be obtained by collecting utterances from an already deployed speech recognition system. Therefore, n-gram models cannot be easily used in building a new voice application that does not have a detailed record of user utterances.
There is a need for a system and method that overcome the above problems and provide additional benefits.