This application claims the benefit of U.S. Provisional Application No. 60/425,178, filed Nov. 8, 2002, the entire content of which is hereby incorporated by reference. The field of the invention relates, in general, to speech recognition, and more particularly to a method and apparatus for providing speech recognition resolution on an application server.
A voice application written, for example, in VoiceXML (a derivative of the Extensible Markup Language (XML)) processes spoken input from a user through the use of grammars, which define what utterances the application can resolve. VoiceXML (VXML) allows a programmer to define a “graph” of interactions, known as voice dialogs, that steps a user through a selection process. The user interacts with these voice dialogs through the oldest interface known to mankind: the voice. Hence, VoiceXML is a markup language for building interactive voice applications which, for example, function to provide recognition of audio inputs such as speech and touch-tone Dual Tone Multi-Frequency (DTMF) input, play audio, control a call flow, etc.
A VoiceXML application comprises a set of VoiceXML files. Each VoiceXML file may involve one or more dialogs describing a specific interaction with the user. These dialogs may present the user with information and/or prompt the user to provide information. A VoiceXML application functions similarly to an Internet-based application accessed through a web browser, in that it typically does not access the data at a dial-in site but often connects to a server that gathers the data and presents it. Making a selection within a dialog is akin to selecting a hyperlink on a traditional Web page. Dialog selections may result in the playback of audio response files (either prerecorded or dynamically generated via a server-side text-to-speech conversion).
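By way of illustration only, the dialog graph described above may be sketched in Python as follows. The dialog names, prompts, and the `run_dialogs` helper are hypothetical and are not part of VoiceXML; the sketch merely shows how each selection leads from one dialog to the next, much as a hyperlink leads from one Web page to another.

```python
# Hypothetical dialog graph: each node has a prompt and maps a caller's
# selection to the next dialog, much like following a hyperlink.
DIALOGS = {
    "main": {
        "prompt": "Say sales or support.",
        "next": {"sales": "sales", "support": "support"},
    },
    "sales": {"prompt": "Connecting you to sales.", "next": {}},
    "support": {"prompt": "Connecting you to support.", "next": {}},
}


def run_dialogs(start, selections):
    """Walk the graph from `start`, consuming one selection per dialog.

    Returns the list of prompts that would be played to the caller.
    """
    prompts = []
    current = start
    pending = iter(selections)
    while True:
        node = DIALOGS[current]
        prompts.append(node["prompt"])
        if not node["next"]:
            break  # terminal dialog: nothing further to select
        choice = next(pending, None)
        if choice not in node["next"]:
            break  # no input or unrecognized selection; a real application would reprompt
        current = node["next"][choice]
    return prompts
```

In this sketch, calling `run_dialogs("main", ["sales"])` plays the main prompt and then the sales prompt, whereas an unrecognized selection stops at the main dialog.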
Grammars can be used to define the words and sentences (or touch-tone DTMF input) that can be recognized by a VoiceXML application. These grammars can, for example, be included inline in the application or as files, which are treated as external resources. Instead of a web browser, VoiceXML pages may be rendered through Voice Gateways, which may receive VoiceXML files served up by a web or application server as users call in to the gateway.
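As a simplified illustration of the role a grammar plays, the following Python sketch treats a grammar as the set of allowable inputs, spoken or DTMF, each mapped to the value the application receives. The `GRAMMAR` table and the `resolve` function are hypothetical, and the sketch assumes the input has already been converted to text; an actual ASR engine matches audio against a compiled grammar rather than comparing strings.

```python
# Hypothetical grammar: each allowable input (spoken phrase or DTMF digits)
# maps to the value returned to the application.
GRAMMAR = {
    "voice": {
        "sales": "sales",
        "sales department": "sales",
        "support": "support",
        "help desk": "support",
    },
    "dtmf": {"1": "sales", "2": "support"},
}


def resolve(mode, user_input, grammar=GRAMMAR):
    """Return the value for an in-grammar input, or None for a no-match."""
    # Normalize case and whitespace so equivalent phrasings compare equal.
    normalized = " ".join(user_input.lower().split())
    return grammar.get(mode, {}).get(normalized)
```

Inputs outside the grammar, such as `resolve("voice", "billing")`, return `None`, which corresponds to an utterance the application cannot resolve.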
Voice Gateways typically comprise seven major components, as follows: a Telephony Platform that can support voice communications as well as digital and analog interfacing, an Automated Speech Recognition (ASR) Engine, a Text To Speech synthesis (TTS) engine, a Media Playback engine to play audio files, a Media Recording engine to record audio input, a DTMF Engine for touch-tone input, and a Voice Browser (also known as a VoiceXML Interpreter). When a VoiceXML file is rendered by a Voice Gateway, the grammars may be compiled by the ASR engine on the Voice Gateway.
The resolution capabilities of standard ASR engines are often fairly limited because performance in resolving utterances declines quickly as grammar size increases, typically limiting grammar sizes to the order of a few thousand possible utterances. In the past, this problem with using large grammars for applications such as directory automation services was sometimes addressed through the use of specialized hardware and software solutions, which included a telephony interface, resource manager, specialized ASR and TTS engines, customized backend data connectivity, and proprietary dialog creation environments integrated together in one package. The specialized ASR in these packages is sometimes capable of resolving grammars with millions of allowable utterances. However, this specialized hardware and software solution has many drawbacks; for example, it does not take advantage of the centralization of data and standardization of data access protocols. Furthermore, these specialized systems often require that the call flow elements of a large-scale grammar application be designed as part of the proprietary dialog creation environment, which effectively makes these applications non-portable. In addition, utilization of these specialized systems often locks users into the particular TTS engines and telephony interfaces provided as part of the specialized system, further reducing the ability to switch implementations of the underlying large-scale search technology.
Enabling large-scale grammar resolution through an application server resolves many of these drawbacks. Specifically, enabling large-scale grammar resolution in the application server can allow the information and data resources that will make up the large-scale grammar to remain in a centralized location. Application servers make use of a variety of industry standard data connectivity methods and protocols. Taking advantage of this data centralization allows for reduced duplication of data and memory state. Additionally, by consolidating large-scale grammar resolution through an application server, administration of the large-scale search technology can be simplified.
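The benefit of keeping the data centralized may be illustrated with the following minimal Python sketch, which is offered under stated assumptions and is not the resolution method of the invention. It assumes that the utterance has already been reduced to text by some recognizer and that the directory data is available to the server as simple name and extension pairs; the `CentralDirectory` class and its methods are hypothetical. A single index is built once on the server from the central data, so no copy of the data need be kept elsewhere.

```python
from collections import defaultdict


class CentralDirectory:
    """Hypothetical server-side index built once from a central data source."""

    def __init__(self, records):
        # records: iterable of (full_name, extension) pairs
        self._by_name = defaultdict(list)
        for name, extension in records:
            self._by_name[self._key(name)].append((name, extension))

    @staticmethod
    def _key(text):
        # Normalize case and whitespace so lookups are insensitive to both.
        return " ".join(text.lower().split())

    def resolve(self, utterance_text):
        """Return every directory entry whose name matches the utterance text."""
        return list(self._by_name.get(self._key(utterance_text), []))
```

Because the lookup is keyed on the normalized name, the cost of resolving one utterance in this sketch does not grow with the number of directory entries, and entries sharing a name are returned together so that the application can disambiguate between them.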
Large-scale grammar resolution through an application server can further allow application developers to write their applications in any language supported by the application server, rather than in the proprietary format of a Dialog Creation Environment. Application developers can therefore make use of standard programming conventions, execution models, and APIs (Application Programming Interfaces) when writing their applications.