Speech has been viewed as a promising alternative to the traditional keyboard and mouse interface. With the wide acceptance of portable devices such as PDAs and cell phones that employ very small input interfaces, a more robust interface is needed to allow a user to access the ever-growing pool of content and information that is accessible via a telephone or a portable device.
However, the very nature of a portable device is its convenience, which typically requires the portable device to be relatively small in physical size. Unfortunately, such requirements often constrain the processing power and the characteristics of input/output interfaces on a portable device. For example, it is generally impractical to provide a physical keyboard on a cell phone. Although an electronic keyboard can be displayed on a screen as in a PDA, such a user interface is unwieldy in performing complex tasks. Additionally, the user may be distracted while operating the portable device in this way, e.g., while operating a vehicle.
Thus, a speech-driven user interface is very desirable. To address this need, the VoiceXML Forum developed the Voice Extensible Markup Language (VoiceXML), a new computer language that can be used to create audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and touchtone key input, recording of spoken input, telephony, and mixed-initiative conversations.
FIG. 1 illustrates a traditional VoiceXML architectural model 100 for integrating voice services with data services. A voice service is a sequence of interaction dialogs between a user and the implementation platform 130. In turn, the implementation platform 130 is controlled by the VoiceXML interpreter context 120 and the VoiceXML interpreter 122. The implementation platform 130 generates events in response to user actions, e.g., spoken or character inputs. These events are then acted upon by the VoiceXML interpreter context 120 or the VoiceXML interpreter 122 as specified by the “VoiceXML document”. Namely, a VoiceXML document specifies each interaction dialog that a VoiceXML interpreter conducts. Finally, the document server 110 (e.g., a Web server) processes requests from the VoiceXML interpreter and produces the requested documents or additional VoiceXML documents to continue the user's session.
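The flow of FIG. 1 can be illustrated with a minimal Python sketch. It is a toy simulation under stated assumptions, not an implementation of any VoiceXML interpreter: the names `Event`, `DOCUMENTS`, `fetch_document` and `run_session` are hypothetical, each "document" is reduced to a single prompt plus a table mapping user input to the next document, and the list of events stands in for what the implementation platform 130 would generate.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str   # hypothetical event type, e.g. "speech" or "dtmf"
    value: str  # the recognized word or key


# Hypothetical stand-in for the document server 110: each entry plays
# the role of one document describing a single interaction dialog.
DOCUMENTS = {
    "main": {
        "prompt": "Say sales or support.",
        "next": {"sales": "sales", "support": "support"},
    },
    "sales": {"prompt": "Connecting you to sales.", "next": {}},
    "support": {"prompt": "Connecting you to support.", "next": {}},
}


def fetch_document(name):
    """Simulate a request from the interpreter to the document server."""
    return DOCUMENTS[name]


def run_session(start, events):
    """Conduct dialogs until a terminal document or the events run out.

    Returns the list of prompts that were played, in order.
    """
    prompts = []
    name = start
    pending = iter(events)
    while True:
        doc = fetch_document(name)
        prompts.append(doc["prompt"])
        if not doc["next"]:
            return prompts  # terminal dialog: the session ends
        event = next(pending, None)
        if event is None:
            return prompts  # the user stopped responding
        # Input the document does not anticipate replays the same dialog.
        name = doc["next"].get(event.value, name)
```

For example, `run_session("main", [Event("speech", "support")])` plays the main prompt and then the support prompt, because the document, not the platform, specifies which dialog follows each user action.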
In practice, the standard VoiceXML model for a dialog system consists of some number of speech recognition (SR) grammars competing to recognize the user's utterances. The recognized utterance and the identity of the SR grammar that best recognized the utterance are returned to the VoiceXML application (i.e., the VoiceXML interpreter), where control flow through the application is (partially) dictated by which grammar performed the recognition.
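The competing-grammar model described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: each rule-driven grammar is modeled as a regular expression over already-transcribed text, every match is given the same placeholder score (a real recognizer would produce confidence scores from audio), and the names `Recognition`, `ACTIVE_GRAMMARS`, `recognize` and `next_dialog` are hypothetical.

```python
import re
from typing import Dict, NamedTuple, Optional, Pattern


class Recognition(NamedTuple):
    grammar: str    # identity of the grammar that matched
    utterance: str  # the recognized text
    score: float    # placeholder confidence value


# Hypothetical small rule-driven grammars that are active at the same time.
ACTIVE_GRAMMARS: Dict[str, Pattern[str]] = {
    "yes_no": re.compile(r"(yes|no)"),
    "city": re.compile(r"(boston|chicago|denver)"),
}


def recognize(utterance: str,
              grammars: Dict[str, Pattern[str]]) -> Optional[Recognition]:
    """Let every active grammar compete; return the best match, if any."""
    text = utterance.strip().lower()
    best: Optional[Recognition] = None
    for name, pattern in grammars.items():
        if pattern.fullmatch(text):
            # Assumption: every full match scores 1.0 in this toy model.
            candidate = Recognition(name, text, 1.0)
            if best is None or candidate.score > best.score:
                best = candidate
    return best


def next_dialog(result: Optional[Recognition]) -> str:
    """Control flow is chosen by which grammar did the recognition."""
    if result is None:
        return "nomatch"
    routes = {"yes_no": "confirm", "city": "lookup_weather"}
    return routes[result.grammar]
```

With these grammars, `recognize("Boston", ACTIVE_GRAMMARS)` is attributed to the `city` grammar and routed to `lookup_weather`, whereas a freer phrasing such as "I want to fly to Boston" matches no grammar and falls through to `nomatch`.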
However, the standard VoiceXML model has two drawbacks pertaining to the nature and number of speech recognition grammars. First, the VoiceXML model typically employs small rule-driven speech recognition grammars that are less robust than non-rule-driven statistical language models.
Second, the standard VoiceXML model has a limited number of SR grammars competing to recognize a current utterance. These grammars are typically small in size and typically only a small number of them are competing at any one time. The result of the small size and small number of SR grammars is that VoiceXML applications are generally not very flexible. Users are limited in what they can say to a small number of phrases, and they can only say those phrases when the system is expecting them.
Therefore, a need exists for a speech recognition and natural language understanding dialog system within the context of a VoiceXML environment that provides sufficient flexibility to handle a wide variety of user utterances and is capable of understanding any of the utterances at any time.