1. Field of the Invention
The present invention relates to spoken dialog systems and more specifically to a system and method of augmenting spoken language recognition and understanding by correcting common errors in linguistic performance.
2. Introduction
Spoken dialog systems have several main components or modules to process information in the form of speech from a user and generate an appropriate, conversational response. FIG. 1 illustrates the basic components of a spoken dialog system 100. The spoken dialog system 100 may operate on a single computing device or on a distributed computer network. The system 100 receives speech sounds from a user 102 and operates to generate a response. The general components of such a system include an automatic speech recognition (“ASR”) module 104 that recognizes the words spoken by the user 102. AT&T's Watson ASR component is an example of this module. A spoken language understanding (“SLU”) module 106 associates a meaning with the words received from the ASR module 104. A dialog management (“DM”) module 108 manages the dialog by determining an appropriate response to the user's question. AT&T's Florence DM engine is an example of this module. Based on the determined action, a spoken language generation (“SLG”) module 110 generates the appropriate words to be spoken by the system in response, and a Text-to-Speech (“TTS”) module 112 synthesizes the speech for the user 102. AT&T's Natural Voices TTS engine provides an example of the TTS module. Data and rules 114 are used to train each module and to process run-time data in each module.
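By way of illustration only, the flow of data through the modules of FIG. 1 may be sketched as a chain of five stages. The sketch below is a minimal example under stated assumptions: the class name DialogPipeline, the toy_* functions, and the dictionary-based meaning representation are hypothetical and do not correspond to the Watson, Florence, or Natural Voices components. Audio is stood in for by UTF-8 bytes so that the example runs without a speech engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialogPipeline:
    """Hypothetical wiring of modules 104-112; each stage is a plain callable."""
    asr: Callable[[bytes], str]   # module 104: speech -> recognized words
    slu: Callable[[str], dict]    # module 106: words -> meaning
    dm: Callable[[dict], str]     # module 108: meaning -> action
    slg: Callable[[str], str]     # module 110: action -> response words
    tts: Callable[[str], bytes]   # module 112: response words -> speech

    def respond(self, audio: bytes) -> bytes:
        words = self.asr(audio)
        meaning = self.slu(words)
        action = self.dm(meaning)
        text = self.slg(action)
        return self.tts(text)


# Toy stand-ins (assumptions for illustration, not real engines).
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


def toy_slu(words: str) -> dict:
    return {"intent": "balance" if "balance" in words else "unknown"}


def toy_dm(meaning: dict) -> str:
    return "report_balance" if meaning["intent"] == "balance" else "ask_repeat"


def toy_slg(action: str) -> str:
    responses = {
        "report_balance": "Your balance is available.",
        "ask_repeat": "Could you repeat that?",
    }
    return responses[action]


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the bytes are synthesized speech


pipeline = DialogPipeline(toy_asr, toy_slu, toy_dm, toy_slg, toy_tts)
```

Calling pipeline.respond(b"what is my savings account balance") passes the input through each stage in turn. Keeping each stage as an independent callable mirrors the statement above that the system may operate on a single computing device or be distributed across a network.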
A key factor in achieving widespread acceptance of interactive spoken dialog services is achieving a sufficiently high percentage of correct interpretations of requests spoken by callers. Typically, the ASR module 104 uses statistical models of acoustic information to recognize patterns as semantic units such as words and phrases. The patterns are typically matched against large or specialized dictionaries of words that are found in general or restricted contexts. In general, the smaller the set of accepted target words, the greater the recognition accuracy.
However, a common problem arises when the speaker or user of the system does not speak in a fluent manner. For example, the user may say “I . . . um . . . um . . . am interested in . . . ah . . . my checking . . . I mean savings . . . account balance.” What is needed in the art is an approach to correctly recognizing and understanding what a caller means to say when, because of disfluencies or slips of the tongue, the caller has said something different from what the caller intended.
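To make the problem concrete, the example utterance above can be run through a deliberately naive text-level cleanup. The sketch below is an illustration only and is not the approach of the present invention: the function name clean_disfluencies, the list of filled pauses, and the rule that “I mean” replaces exactly one preceding word are all assumptions chosen for this example.

```python
# Hypothetical filled-pause list; a real system would need a broader inventory.
FILLED_PAUSES = {"um", "uh", "ah", "er"}


def clean_disfluencies(utterance: str) -> str:
    """Drop filled pauses and apply a one-word 'I mean' repair (naive assumption)."""
    # Treat periods (including spaced ellipses) as whitespace, then drop pauses.
    tokens = [
        t for t in utterance.replace(".", " ").split()
        if t.lower() not in FILLED_PAUSES
    ]
    out = []
    i = 0
    while i < len(tokens):
        is_repair_cue = (
            i + 1 < len(tokens)
            and tokens[i].lower() == "i"
            and tokens[i + 1].lower() == "mean"
            and out  # there must be an earlier word to correct
        )
        if is_repair_cue:
            out.pop()  # discard the word the speaker is correcting
            i += 2     # skip the cue words "I mean"
            continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)


example = ("I . . . um . . . um . . . am interested in . . . ah . . . "
           "my checking . . . I mean savings . . . account balance.")
cleaned = clean_disfluencies(example)
```

For the example utterance, cleaned is "I am interested in my savings account balance". The single-word repair rule fails as soon as the corrected span is longer than one word or no explicit cue is spoken, which is why a more robust approach is needed.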