With the proliferation of computer systems, an increasing amount of processing is becoming automated. At the same time, the processing power of such systems continues to grow. To make use of this increasingly available processing capability, organizations are attempting to migrate to automated systems those functions that were historically performed by individuals, if they were performed at all. For instance, computer systems are increasingly developed and used to engage humans via speech interaction. Some systems, for example, are implemented to conduct interviews or surveys of individuals via a telephone, while other systems may interact with individuals without the use of a network. Additionally, as speech over the World Wide Web (the “Web”) and the Internet (e.g., voice over IP) becomes more commonplace, one can assume that human-computer speech-based interaction will increasingly be conducted using that medium.
One typical example of human-computer speech-based interaction is a survey system, wherein a computer conducts an automated speech-based survey of an individual over a telephone. In such a case, the survey system may have a scripted survey (i.e., a series of questions) to be asked of the individual. The survey system may ask a first question, as a prompt, and await (e.g., for 5 seconds) a response by the individual. If the survey system does not receive a response, or receives a response that it cannot interpret, the survey system may ask the question again or provide an instructional type of feedback. If the survey system receives a response that it can interpret, the survey system goes on to ask a next question or present a next prompt.
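The prompt-and-wait behavior described above can be sketched as a simple loop. This is only an illustrative sketch, not the implementation of any particular survey system: the callables `ask` and `interpret`, the retry limit `max_attempts`, and the wording of the instructional feedback are all assumptions introduced here; only the 5-second wait comes from the example in the text.

```python
from typing import Callable, Dict, List, Optional

RESPONSE_TIMEOUT_S = 5.0  # the "e.g., 5 seconds" wait from the text


def run_survey(
    questions: List[str],
    ask: Callable[[str, float], Optional[str]],   # hypothetical: plays a prompt, returns raw reply or None on timeout
    interpret: Callable[[str], Optional[str]],    # hypothetical: returns an interpreted answer or None
    max_attempts: int = 3,                        # assumption: the text does not state a retry limit
) -> Dict[str, Optional[str]]:
    """Ask each scripted question, re-prompting when no usable reply arrives."""
    answers: Dict[str, Optional[str]] = {}
    for question in questions:
        prompt = question
        for _ in range(max_attempts):
            raw = ask(prompt, RESPONSE_TIMEOUT_S)
            answer = interpret(raw) if raw is not None else None
            if answer is not None:
                # Interpretable response: record it and move to the next question.
                answers[question] = answer
                break
            # No response, or an uninterpretable one: repeat the question
            # with an instructional type of feedback.
            prompt = "Sorry, I did not understand. " + question
        else:
            # Every attempt failed; record the question as unanswered.
            answers[question] = None
    return answers
```

In this sketch, a timeout and an uninterpretable reply are handled identically, which mirrors the text; a real system might give different feedback for each case.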
Such human-computer systems usually include an automatic speech recognition (ASR) system that converts incoming acoustic information into useful linguistic units, such as words or phrases. In a transactional ASR, for example one operating over a telephone network, there is a set of allowed words and phrases, which are defined by grammars. The process of sorting through the grammars for a particular word or phrase usage is referred to as syntactic search, wherein the words and their order are determined, typically based on probability. Such syntactic search subsystems typically evaluate a word using a fixed start point and a fixed end point, and process that data to determine the word with a related probability. However, this approach tends to be inefficient, since the timeframe between the start and end points may be adequate for some audio inputs but inadequate for others: in some cases data beyond the end point may be cut off, and in other cases more time may be spent on a word than is required. Additionally, if the results do not exceed a certain threshold probability, such systems may backtrack and continue to process the audio input to improve the phonetic estimates. Otherwise, the system may simply put forth a best guess, albeit with low confidence.
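The fixed-endpoint evaluation and threshold-driven backtracking described above can be illustrated with a toy sketch. Everything here is a simplifying assumption made for illustration: audio frames are stood in for by characters, `toy_score` is a hypothetical stand-in for an acoustic or phonetic model, and the names `recognize_word`, `threshold`, and `step` do not come from any real ASR engine.

```python
from typing import Callable, List, Sequence, Tuple


def toy_score(segment: Sequence[str], word: str) -> float:
    """Hypothetical score: fraction of positions where segment and word agree."""
    matches = sum(a == b for a, b in zip(segment, word))
    return matches / max(len(segment), len(word))


def recognize_word(
    frames: Sequence[str],
    start: int,
    end: int,
    grammar: List[str],                       # the set of allowed words
    score: Callable[[Sequence[str], str], float] = toy_score,
    threshold: float = 0.8,                   # assumed threshold probability
    step: int = 2,                            # assumed number of frames added per retry
) -> Tuple[str, float, int]:
    """Pick the most probable grammar word between fixed start and end points.

    If the best probability falls below the threshold, go back and process
    more of the input; once the input is exhausted, return the best guess.
    """
    best_word, best_p = grammar[0], 0.0
    while True:
        segment = frames[start:end]
        word, p = max(((w, score(segment, w)) for w in grammar), key=lambda wp: wp[1])
        if p > best_p:
            best_word, best_p = word, p
        if best_p >= threshold or end >= len(frames):
            return best_word, best_p, end
        end = min(end + step, len(frames))
```

For example, with `frames = list("seven")`, a fixed window of `start=0, end=3`, and the grammar `["seven", "six", "yes"]`, the first pass sees only three frames and scores below the threshold, so the sketch extends the end point before settling on a word. When the input ends at the fixed end point, it returns a low-confidence best guess instead, which is the inefficiency the text describes.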
In such systems, audio inputs, whether speech or background noise, are for the most part processed as valid speech. That is, such systems do not usually maintain sufficient contextual knowledge about the expected response to eliminate extraneous noises (or “barge in”). As a result, such systems may attempt to interpret such noises as speech, thereby producing a result having embedded errors or rejecting the result altogether.
Development of speech applications that utilize speech recognition (SR) systems, to create such human-computer systems, is generally an expensive, time-consuming effort that requires a multi-disciplinary team. The dominant approach to improving the ease of such application development has been to create Web-based applications using HTML extensions. For example, VOXML, VoiceXML, and SpeechML are known types of extensions created specifically for SR systems. However, these approaches have been seriously limited in their ability to represent complex speech interactions, due to strong limitations in their coding power, as well as limitations on their control of, and access to, the underlying SR engines. That is, HTML is not a true programming language, but rather a markup language. Therefore, it provides only a very limited framework, which is not particularly conducive to creating robust applications. Access to the speech recognition engines by such VoiceXML applications is limited by the bottlenecks of markup languages, such as the lack of programming language facilities and the fixed, predefined interfaces to the SR engine.
Such VoiceXML applications typically reside with an SR system on a voice portal (or gateway) that acts as a client to a Web server that provides back-end services to the VoiceXML application. The back-end services include standard Web services and, usually, custom software required by the VoiceXML application. For example, a back-end (i.e., server-side) product data servlet is typically included that is responsible for talking to back-end services, including converting received replies into XML. A product presentation servlet is typically also included at the server side. This servlet is used to put content in a format required by the VoiceXML application (or client). A repository of VoiceXML-specific XSL templates resides at the back-end and defines the formats used by the product presentation servlet. A product service is also provided at the back-end that manages the dissemination of product-related information, for example, to facilitate product browsing. Finally, a product database used by the various server-side servlets and services also resides at the back-end.
This strong reliance on back-end, server-side services is required with such VoiceXML applications, since VoiceXML applications are not, themselves, capable of delivering complex and robust functions.