The field of the invention relates, in general, to speech recognition, and more particularly to a method and apparatus for providing speech recognition resolution in the database layer.
A voice application written, for example, in VoiceXML (a derivative of the Extensible Markup Language (XML)) processes spoken input from a user through the use of grammars, which define what utterances the application can resolve. VoiceXML (VXML) allows a programmer to define a “graph” that steps a user through a selection process by way of what are known as voice dialogs. The user interacts with these voice dialogs through the oldest interface known to mankind: the voice. Hence, VoiceXML is a markup language for building interactive voice applications which, for example, function to provide recognition of audio inputs such as speech and touch-tone Dual Tone Multi-Frequency (DTMF) input, play audio, control a call flow, etc.
A VoiceXML application comprises a set of VoiceXML files. Each VoiceXML file may involve one or more dialogs describing a specific interaction with the user. These dialogs may present the user with information and/or prompt the user to provide information. A VoiceXML application functions similarly to an Internet-based application accessed through a web browser, in that it typically does not access the data at a dial-in site but often connects to a server that gathers the data and presents it. Making a dialog selection is akin to selecting a hyperlink on a traditional Web page. Dialog selections may result in the playback of audio response files (either prerecorded or dynamically generated via a server-side text-to-speech conversion).
Grammars can be used to define the words and sentences (or touch-tone DTMF input) that can be recognized by a VoiceXML application. These grammars can, for example, be included inline in the application or as files, which are treated as external resources. Instead of a web browser, VoiceXML pages may be rendered through Voice Gateways, which may receive VoiceXML files served up by a web or application server as users call in to the gateway.
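By way of illustration only, the role of a grammar can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not VoiceXML or any actual ASR engine: the input is assumed to be already transcribed text, and the names GRAMMAR_SLOTS, expand, and resolve are hypothetical. The sketch shows that a grammar fixes a closed set of allowable utterances, and that the size of that set grows multiplicatively with the alternatives at each position.

```python
from itertools import product

# Hypothetical toy grammar: each slot lists alternative words, and an
# allowable utterance is one choice per slot, taken in order.
GRAMMAR_SLOTS = [
    ["call", "dial"],
    ["alice", "bob", "carol"],
]


def expand(slots):
    """Enumerate every utterance the grammar allows."""
    return {" ".join(words) for words in product(*slots)}


def resolve(text, allowed):
    """Return the normalized utterance if the grammar allows it, else None.

    Transcribed text stands in for audio here; a real engine would match
    acoustic input against the compiled grammar.
    """
    normalized = " ".join(text.lower().split())
    return normalized if normalized in allowed else None


ALLOWED = expand(GRAMMAR_SLOTS)
```

With two verbs and three names, the grammar allows six utterances; adding a slot with ten alternatives would raise that to sixty, which hints at how quickly a directory-sized grammar outgrows a few thousand entries.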
Voice Gateways typically comprise seven major components, as follows: a Telephony Platform that can support voice communications as well as digital and analog interfacing, an Automated Speech Recognition (ASR) engine, a Text To Speech synthesis (TTS) engine, a Media Playback engine to play audio files, a Media Recording engine to record audio input, a DTMF engine for touch-tone input, and a Voice Browser (also known as a VoiceXML Interpreter). When a VoiceXML file is rendered by a Voice Gateway, the grammars may be compiled by the ASR engine on the Voice Gateway.
The resolution capabilities of standard ASR engines are often fairly limited because performance in resolving utterances declines quickly as grammar size grows, typically limiting grammars to the order of a few thousand possible utterances. In the past, this problem with using large grammars for applications such as directory automation services was sometimes addressed through the use of specialized large-scale speech recognition technology capable of efficiently resolving more than a few thousand utterances. This technology often involves hardware and software solutions that include a telephony interface, a resource manager, specialized ASR and TTS engines, customized backend data connectivity, and proprietary dialog creation environments integrated together in one package. The specialized ASR in these packages is sometimes capable of resolving grammars with millions of allowable utterances. However, this specialized hardware and software solution has many drawbacks. For example, it does not take advantage of the centralization of data and the standardization of data access protocols. A data synchronization problem can arise when a set of data (such as a corporate directory) is stored in one location by an enterprise and is replicated by the specialized solution: any time the underlying data set changes (e.g., due to a hiring or firing), the replicated data state also needs to be refreshed. In addition, these specialized systems often require that the call flow elements of a large-scale speech recognition application be designed as part of the proprietary dialog creation environment, which effectively makes these applications non-portable.
Furthermore, utilization of these specialized systems often locks users into the particular TTS engines and telephony interfaces provided as part of the specialized system, further reducing the ability to switch implementations of the underlying large-scale speech recognition technology.
Enabling large-scale grammar resolution through an application server has been proposed to resolve some of these drawbacks. Specifically, enabling large-scale grammar resolution in the application server can allow the information and data resources that will make up the large-scale grammar to remain in a centralized location. Application servers make use of a variety of industry-standard data connectivity methods and protocols. Taking advantage of this data centralization allows for reduced (though not eliminated) duplication of data and memory state. Additionally, by consolidating large-scale grammar resolution through an application server, administration of the large-scale search technology can be simplified. Large-scale grammar resolution through an application server can also allow application developers to write their applications in any language supported by the application server, rather than in the proprietary format of third-party dialog creation environments. Application developers can therefore make use of standard programming conventions, execution models, and APIs (Application Programming Interfaces) when writing their applications. Problems remain with the application server approach, however, including data state replication, data synchronization, and recognition result return problems.
Although large-scale speech recognition through the application server solves some problems, further advantages can be gained by providing the large-scale speech recognition in the database layer. In one such approach, for example, each database row can carry an additional key that corresponds to a sound or utterance and that can be used to access the row. Thus, the large-scale grammar resolution engine can be integrated with the data structures used to store data in the database. As a result, data from a table in the database can be selected by voice by performing automatic speech recognition of an utterance from the user against any set of data within the database. This approach can also permit the dataset to be synchronized with the grammar to be resolved and could enable users to search any table via voice without the overhead of initializing or priming a dataset within a specialized automatic speech recognition engine. One implementation, for example, could augment a relational database with voice access as an additional mode for accessing the data.
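The idea of an additional sound-related key on each row can be sketched as follows. This is a minimal sketch under loud assumptions rather than the described apparatus: a simplified Soundex-style text code stands in for whatever acoustic key an actual engine would use, SQLite stands in for the relational database, and the table name directory and the functions sound_key, build_directory, add_entry, and lookup are hypothetical names chosen for illustration.

```python
import sqlite3

# Letter-to-digit classes for a simplified Soundex-style code. This is an
# assumed stand-in for an acoustic key, not a real speech model.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def sound_key(word):
    """Return the first letter plus up to three digits, zero-padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]


def add_entry(conn, name, extension):
    """Insert a row; its sound key is stored alongside the data itself."""
    conn.execute("INSERT INTO directory VALUES (?, ?, ?)",
                 (name, extension, sound_key(name)))


def build_directory(rows):
    """Create an in-memory table whose rows carry an indexed sound key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE directory (name TEXT, extension TEXT, sound_key TEXT)")
    conn.execute("CREATE INDEX idx_sound ON directory (sound_key)")
    for name, extension in rows:
        add_entry(conn, name, extension)
    return conn


def lookup(conn, heard):
    """Select rows whose sound key matches that of the heard word."""
    cur = conn.execute(
        "SELECT name, extension FROM directory "
        "WHERE sound_key = ? ORDER BY name",
        (sound_key(heard),))
    return cur.fetchall()
```

Because the key lives in the same row as the data, a newly inserted entry becomes reachable by sound as soon as it is written, with no separate grammar or replicated data set to refresh. A lookup may return several similar-sounding rows, which a real system would need to disambiguate.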