1. Field of the Invention
This invention relates to the field of declarative markup languages for describing speech applications as state machines. More specifically, the invention relates to improved methods and systems for solving speech recognition problems in such a programming language.
2. Description of the Related Art
Prior to the advent of VoiceXML (Voice Extensible Markup Language) and its precursor languages, VoxML, SpeechML, and others, speech applications were described (or programmed) using standard programming techniques, e.g. C/C++ programs that made function (or object invocation) calls to lower level device drivers and speech recognition engines. For example, companies such as Nuance Communications, Inc., Menlo Park, Calif., and SpeechWorks International, Inc., Boston, Mass., have developed sophisticated automated speech recognition (ASR) systems and provide complex C/C++ interfaces called software development kits (SDKs) to allow customers to develop systems.
Both companies have also provided higher level building blocks (and development tools) for speech applications. However, these approaches are vendor specific, e.g. a C program designed for the Nuance SDK would not necessarily run using the SpeechWorks SDK, and vice versa.
Tellme (as well as other companies such as AT&T, Lucent, IBM, and Motorola) investigated the use of declarative markup languages to describe applications as state machines. AT&T, Lucent, IBM, and Motorola had each separately created a declarative markup language for speech; they ultimately combined these efforts and proposed a common standard, VoiceXML, that has been submitted to standards bodies, e.g. the World Wide Web Consortium (W3C).
The advantage of using a language such as VoiceXML is that application programmers can describe their application without regard to a specific ASR. Thus, platform independence of the kind seen on the World Wide Web with hypertext markup language (HTML) is possible.
However, one disadvantage is that application programmers are limited to the feature set of VoiceXML and the ability to access vendor-specific features is limited. The state-machine model used by VoiceXML in turn leads to several problems surrounding the ability to handle list navigation, false accepts, and other features. For example, the voice application state machines defined by the language support “barge in” (allowing a user to speak before queued audio prompts are finished playing), but the language does not expose information about the point in time at which the barge in occurred to the application programmer.
Early uses of VoiceXML at Tellme Networks, Inc., attempted to address the “shoot the duck” problem (hereinafter described) using a variety of ECMAScript (better known as JavaScript) variables to create and start timers. However, the execution model of VoiceXML is such that prompt playback timing is independent of interpretation timing. Hence the foregoing method yields only a rough approximation and requires that the programmer have access to, or prior knowledge of, the length of every prompt. To better understand this, consider the following extremely small VoiceXML code fragment:
<form>
  <var name="starttime"/>
  <var name="endtime"/>
  <block>
    <audio src="file1.wav">File 1 here</audio>
    <assign name="starttime" expr="current.time()"/>
    <audio src="file2.wav">File 2 here</audio>
  </block>
  <field name="foo" type="boolean">
    <property name="timeout" value="0"/>
    <filled>
      <assign name="endtime" expr="current.time()"/>
      <assign name="duration" expr="endtime - starttime"/>
    </filled>
  </field>
</form>
The time returned would be milliseconds of playback timing after the mark was encountered in the prompt playback queue. Accordingly, the application programmer thinks she has recorded the starting time for the playback of the second prompt, e.g. “file2.wav”, and may plan to set a second variable, endtime, to the current time when “#state2” is entered and then compute the time for barge in through subtraction. In actuality, however, the VoiceXML execution model is such that all of the JavaScript for the current state is executed while the prompts are being queued.
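The effect can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not a VoiceXML interpreter: the function simulateState, the simulated clock, and the prompt lengths (3000 ms and 5000 ms) are hypothetical and do not come from the fragment above. It assumes the script for the state runs entirely while prompts are queued, before any audio plays.

```javascript
// Hypothetical simulation, not a VoiceXML interpreter. A fake clock stands in
// for wall-clock time, and the prompt lengths are invented for illustration.
function simulateState(promptLengthsMs, bargeInOffsetIntoLastPromptMs) {
  let clockMs = 0; // simulated current time in milliseconds

  // Phase 1: the state's script runs while prompts are queued.
  // No audio has played yet, so the clock has not advanced.
  const queue = [];
  let starttime = null;
  promptLengthsMs.forEach((lengthMs, index) => {
    if (index === promptLengthsMs.length - 1) {
      // Stands in for the <assign> placed just before the last prompt.
      starttime = clockMs;
    }
    queue.push(lengthMs);
  });

  // Phase 2: playback. Every prompt but the last plays in full, then the
  // caller barges in partway through the last one.
  for (let i = 0; i < queue.length - 1; i++) {
    clockMs += queue[i];
  }
  clockMs += bargeInOffsetIntoLastPromptMs;

  // Phase 3: the <filled> handler records the end time.
  const endtime = clockMs;
  return { starttime, endtime, duration: endtime - starttime };
}

// The caller barges in 1200 ms into the second prompt.
const result = simulateState([3000, 5000], 1200);
console.log(result.duration); // 4200, not the 1200 the programmer expected
```

In this sketch the computed duration includes the full length of the first prompt, because starttime was captured at queuing time rather than when the second prompt began to play.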
Thus, while subtraction of the starttime and endtime JavaScript variables would result in a fairly good approximation of the time between the start of all audio playback for a given VoiceXML state and the entry into the next VoiceXML state, the result will not be relative to the apparent position of the <var/> declaration in the code or to the second prompt. To perform any calculations about barge-in, it would therefore be necessary to know the playback time of all audio prompts for the previous VoiceXML state. This may be impossible to determine in the interpreter if speed-adjusting technologies are used to increase playback speeds and reduce pauses between words. In that case, the duration implied by a prompt's file size and sampling rate may not be the same as its actual playback time.
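The gap between nominal and actual playback time can be made concrete with a little arithmetic. The following is a hedged sketch: the function names, the audio format (8 kHz, 16-bit, mono uncompressed PCM), and the simple speed-adjustment model (a uniform speed-up factor plus a fixed amount of trimmed silence) are all assumptions chosen for illustration, not details taken from any particular platform.

```javascript
// Nominal duration of uncompressed PCM audio, derived from size and format.
function nominalDurationMs(dataBytes, sampleRateHz, bytesPerSample, channels) {
  return (dataBytes / (sampleRateHz * bytesPerSample * channels)) * 1000;
}

// Hypothetical model of speed adjustment: silence is trimmed from pauses,
// then the remainder is played back faster by a uniform factor.
function actualPlaybackMs(nominalMs, speedFactor, trimmedSilenceMs) {
  return (nominalMs - trimmedSilenceMs) / speedFactor;
}

// 48000 bytes of 8 kHz, 16-bit, mono audio (assumed format).
const nominal = nominalDurationMs(48000, 8000, 2, 1);
// 500 ms of pauses removed and playback sped up by 25% (assumed values).
const actual = actualPlaybackMs(nominal, 1.25, 500);

console.log(nominal); // 3000
console.log(actual);  // 2000
```

Under these assumed values, an application that subtracted the nominal 3000 ms from a measured interval would be off by a full second, and it has no way to learn the speed factor or trimmed silence from the file alone.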
Accordingly, what is needed is a method and system for addressing the above problems.