The present invention is directed to a method and apparatus for processing prompt streams in a telephony system and, more specifically, to a method and apparatus for processing prompt streams including computer program instructions.
A variety of automated systems have been developed that interact with people over the telephone. For example, commercial services exist that provide automated stock quotes and other financial information over the telephone. Such telephony systems implement a sequence of dialogs between a person (the “user”) on one end of the telephone connection, and the automated telephony system on the other end of the telephone connection. The telephony system plays audio output to the user. This output may, for example, consist of recorded announcements, tones or other generated sounds, or synthesized speech generated by a “text-to-speech” engine. A single unit of such audio output is referred to as a “prompt.” The automated system conveys information to the user by “playing” a sequence of prompts (referred to as a “prompt stream”) in an appropriate order. For example, to convey the sentence “A message was received from John Smith at 10:15 today,” the system might play the following prompts in order: (1) a recording of the words “A message was received from,” (2) synthesized speech for the name “John Smith” generated by a text-to-speech engine, (3) a recording of the word “at,” (4) a recording of the word “ten,” (5) a recording of the word “fifteen,” and (6) a recording of the word “today.”
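As a rough illustration of the six-prompt example above, the following Python sketch assembles that sequence. The `Prompt` type, the "recording" and "tts" kinds, and the `NUMBER_WORDS` table are hypothetical names introduced here for illustration only; they do not describe any particular telephony system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Prompt:
    kind: str     # "recording" for pre-recorded audio, "tts" for synthesized speech
    content: str  # words in the recording, or text handed to the text-to-speech engine


# Tiny illustrative lookup; a real system would cover every number it can speak.
NUMBER_WORDS = {10: "ten", 15: "fifteen"}


def message_received_prompts(sender: str, hour: int, minute: int) -> List[Prompt]:
    """Build the prompts for 'A message was received from <sender> at <time> today.'"""
    return [
        Prompt("recording", "A message was received from"),
        Prompt("tts", sender),  # names are synthesized rather than recorded
        Prompt("recording", "at"),
        Prompt("recording", NUMBER_WORDS[hour]),
        Prompt("recording", NUMBER_WORDS[minute]),
        Prompt("recording", "today"),
    ]


stream = message_received_prompts("John Smith", 10, 15)
```

Only the sender's name is routed to the text-to-speech engine in this sketch; every other element is a fixed recording, matching the order listed above.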
The user is typically allowed to respond to the telephony system in any of a variety of ways, such as by pressing one or more DTMF (touch tone) keys, by hanging up, by flashing the telephone switchhook, or, in a system that is capable of speech recognition, by speaking or making other noises that are recognizable by the system. The user may also do nothing, leading to what is referred to as a “timeout.” In a typical interaction, the dialog implemented by the telephony system consists of alternating actions by the user and the system; e.g., the system plays a prompt stream, the user responds, the system plays another prompt stream based on the user’s response, and so forth, until either the user or the system terminates the dialog by hanging up. Note that in some cases a hang up by the user may not be voluntary, such as when a cellular phone connection is unexpectedly dropped due to interference or some other problem with the connection.
Some telephony systems allow the user to interrupt the system’s audio output. Such an interruption is referred to as “barge-in” (also referred to as “cut-through”). This feature may be used to provide a more user-friendly interface. For example, a user who is already familiar with the operation of the telephony system can barge-in on the prompt stream to respond without waiting for the prompt stream to complete, making dialogs complete more quickly and feel more natural to the user. Some telephony systems allow barge-in to be turned on or off by the system or by the user as desired. Some systems allow the user to barge-in with DTMF but not with voice input. Hang up by the user while a prompt stream is playing is also typically considered to be a form of barge-in.
Conventional automated telephony systems are typically controlled by software that is designed to operate in accordance with the “prompt queue” model. In such a model, the application program that controls the telephony system sequentially stores prompts in a prompt queue (a first-in first-out list). The telephony system typically provides a software interface through which the application program can manage the prompt queue. The software interface typically provides a variety of methods for adding prompts to the prompt queue. For example, the interface typically allows the application to supply a text string to be added to the prompt queue, in which case a text-to-speech engine converts the text string into a digital audio stream that is added to the prompt queue in the form of an audio file. The interface may also allow the application program to supply an audio file to be added directly to the prompt queue. Regardless of the method that the application program uses to add prompts to the prompt queue, all prompts stored in the prompt queue are typically stored in the form of audio files suitable for playback to the user. The telephony system’s software interface also typically provides a method for playing the prompts in the prompt stream. The application program uses this method to sequentially play the prompts in the prompt queue. The prompts are played to the user over the telephone and removed from the prompt queue as they are being played.
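The prompt queue model described above might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the interface of any actual telephony product: `PromptQueue`, `add_text`, `add_audio`, `play`, and `fake_text_to_speech` are hypothetical names, and the "audio" is represented by placeholder bytes.

```python
from collections import deque


def fake_text_to_speech(text: str) -> bytes:
    """Stand-in for a text-to-speech engine: returns placeholder 'audio' bytes."""
    return ("<speech:" + text + ">").encode("utf-8")


class PromptQueue:
    """First-in first-out queue whose entries are all stored as audio."""

    def __init__(self):
        self._queue = deque()

    def add_text(self, text: str) -> None:
        # Text is converted to audio before it is queued.
        self._queue.append(fake_text_to_speech(text))

    def add_audio(self, audio: bytes) -> None:
        # Audio supplied by the application is queued directly.
        self._queue.append(audio)

    def play(self, output) -> None:
        # Play prompts in order, removing each one from the queue as it is played.
        # `output` stands in for the audio hardware.
        while self._queue:
            output(self._queue.popleft())

    def __len__(self) -> int:
        return len(self._queue)


played = []
prompt_queue = PromptQueue()
prompt_queue.add_audio(b"<recording:A message was received from>")
prompt_queue.add_text("John Smith")
prompt_queue.play(played.append)
```

In this sketch the queue holds only opaque audio once `add_text` returns, and `play` runs to completion without handing control back to the application between prompts.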
The prompt queue model provides a simple interface to the telephony system that makes it easy for the application program to generate and play prompts to the user. The application programmer who develops an application program according to the prompt queue model need not know how the underlying components, such as the text-to-speech engine and the speech recognition engine, work. Rather, the application programmer need merely know how to use some straightforward commands for manipulating the prompt queue (e.g., commands to add prompts to the prompt queue) and for causing the prompts in the prompt queue to be played to the user. The telephony system’s software interface shields the application program (and the application programmer) from communication with low-level components such as the text-to-speech engine, the speech recognition engine, and the audio hardware.
Conventional systems using the prompt queue model, however, have a number of problems, some of which result at least in part from the abstraction provided by the prompt queue model. For example, in such a system, it is difficult to design an application program to perform an action at a predetermined time during playing of the prompt stream to the user. Once an application program instructs the telephony system to play the prompts in a prompt queue, the telephony system plays the prompts without further intervention from the application program. Furthermore, the telephony system does not provide the application program with any information about the time at which a particular prompt in the prompt queue is played to the user. It is therefore difficult for the application program to determine precisely when a particular word, for example, in the prompt stream is being played. This can make it difficult for the application program to perform an action that must be performed at a particular time while the prompt stream is playing. One reason for this difficulty is that, as described above, the application program can provide prompts in the form of text strings which are converted into audio by a text-to-speech engine. Once the text in such prompts is converted to speech, the application program does not have any information about the temporal position of particular words within the prompt.
Similarly, it is difficult to design application programs for such systems which can accurately and reliably determine when an event occurred during playing of a prompt stream. For example, it is difficult to design application programs that can accurately and reliably determine when a user barged in with input (such as a DTMF keypress) during playing of a prompt stream. Furthermore, even if the application program is provided with the time at which barge-in occurred, it may be difficult for the application program to determine which prompt was being played at the time of barge-in.
More generally, it is difficult to guarantee that application programs in such systems will perform as desired in the face of the wide variety of asynchronous interactions that may occur between the prompt stream, the user, and external events. Such asynchronous events include, for example, any events that occur at unpredictable times, such as barge-in or the arrival of a new e-mail message addressed to the user. For example, in many cases where barge-in is available, the desired behavior of the system changes when barge-in occurs. Furthermore, the correct behavior of the system may depend upon the precise instant at which the barge-in occurred. For example, it may be necessary for the system to repeat a critical message that was not played in its entirety because the user interrupted it by barging in. Although some systems attempt to solve this problem by disallowing barge-in during such messages, it is still necessary to detect an unexpected hang up in such a situation.
Asynchronous events from outside the system can affect the desired behavior of the system in complex ways. For example, consider a voice-controlled messaging system in which the user is allowed to speak commands such as “next message” and “save message.” In such a system, messages from outside the system may arrive at any time. Suppose that when a new message arrives, the desired behavior of the system is to (a) wait until the user hears the end of the prompt currently being played, (b) play “A new message has arrived; do you wish to hear it now?”, and then (c) get a yes/no reply from the user and take appropriate action in response.
Implementing this desired behavior is complicated by the fact that the user may barge-in with another command before step (b) has been performed. Suppose, for example, that the user is listening to a prompt stream when the new message arrives, but before hearing the end of the prompt stream the user barges in by saying “tell me the time.” Although it would be possible to simply notify the user about the new message instead of playing the time, it might be more desirable to inform the user of the current time, deferring notification about the new message until later. For example, it might be desirable for the system to say: “The time is now 6:45. A new message has arrived . . .”
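The deferral described above could be sketched in Python as shown below. This is only one possible arrangement, given as an assumption for illustration: the `pending_notifications` queue and the functions `on_new_message` and `respond_to_command` are hypothetical names, and the sketch ignores the timing problems discussed in this section.

```python
from collections import deque
from typing import Deque, List

# Notifications that arrived while the user was busy with something else.
pending_notifications: Deque[str] = deque()


def on_new_message() -> None:
    # External event: remember the notification instead of interrupting the user.
    pending_notifications.append(
        "A new message has arrived; do you wish to hear it now?"
    )


def respond_to_command(command: str, current_time: str) -> List[str]:
    """Return the prompts to play: the answer first, then deferred notifications."""
    prompts = []
    if command == "tell me the time":
        prompts.append("The time is now " + current_time + ".")
    # Deferred notifications follow the answer to the user's own command.
    while pending_notifications:
        prompts.append(pending_notifications.popleft())
    return prompts


on_new_message()
response = respond_to_command("tell me the time", "6:45")
```

Here the barge-in command is answered first and the new-message notification is appended afterward, which corresponds to the second, more desirable ordering described above.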
The situation is further complicated by the requirements of typical speech recognizers, which need to be pre-loaded with a description of all commands that the user is permitted to say at a particular time. Such a description of all permissible commands is referred to as a “grammar.” In this example, it is desirable for the grammar to contain all standard user commands (e.g., “next message”) up to the instant at which the system reports the new message, at which point the grammar must be modified to additionally accept the possibilities “yes” and “no.” In other words, to implement the desired behavior it is desirable to change grammars at a particular point in time while the prompt stream is playing. As described above, in conventional systems it is difficult to design application programs to perform actions at a particular point in time while the prompt stream is playing.
What is needed, therefore, is a system that facilitates development of application programs, for use in telephony systems, that can handle a variety of asynchronous interactions between the system, the user, and external events.
One illustrative embodiment of the present invention is directed to a method for use in a telephony system. The method includes acts of: (A) inserting into a prompt stream at least one voice prompt to be played to a user of the telephony system when the prompt stream is processed; and (B) inserting into the prompt stream at least one active prompt associated with computer program instructions to be executed when the prompt stream is processed.
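Acts (A) and (B) of this method might look as follows in Python. The sketch is offered under stated assumptions and is not the claimed implementation: `VoicePrompt`, `ActivePrompt`, `PromptStream`, `insert_voice`, and `insert_active` are hypothetical names, the prompt text "End of message." is invented for the example, and the grammar is modeled as a plain set of phrases.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class VoicePrompt:
    text: str  # audio to be played, represented here by its text


@dataclass
class ActivePrompt:
    action: Callable[[], None]  # instructions to run when the stream is processed


class PromptStream:
    def __init__(self):
        self.prompts: List[Union[VoicePrompt, ActivePrompt]] = []

    def insert_voice(self, text: str) -> None:
        # Act (A): add a voice prompt to be played later.
        self.prompts.append(VoicePrompt(text))

    def insert_active(self, action: Callable[[], None]) -> None:
        # Act (B): add an active prompt; the action is stored, not executed.
        self.prompts.append(ActivePrompt(action))


grammar = {"next message", "save message"}


def allow_yes_no() -> None:
    grammar.update({"yes", "no"})


stream = PromptStream()
stream.insert_voice("End of message.")
stream.insert_active(allow_yes_no)
stream.insert_voice("A new message has arrived; do you wish to hear it now?")
```

Because the active prompt sits between the two voice prompts, its position in the stream fixes the point at which the grammar would change once the stream is processed; inserting it leaves `grammar` untouched.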
Another illustrative embodiment of the present invention is directed to a method for processing a prompt stream in a telephony system. The method includes acts of: (A) playing a first audio stream associated with a first voice prompt in the prompt stream; and (B) executing computer program instructions associated with at least one active prompt in the prompt stream.
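A minimal Python sketch of such processing is given below, assuming for illustration that a voice prompt is represented by a text string and an active prompt by a callable. The function `process_prompt_stream`, the `play` callback, and the example prompt text are hypothetical and are not drawn from any particular system.

```python
from typing import Callable, List, Union

# A str stands for a voice prompt; a callable stands for an active prompt.
Prompt = Union[str, Callable[[], None]]


def process_prompt_stream(stream: List[Prompt], play: Callable[[str], None]) -> None:
    """Process prompts in order: play voice prompts, execute active prompts."""
    for prompt in stream:
        if callable(prompt):
            prompt()      # act (B): execute the associated instructions
        else:
            play(prompt)  # act (A): play the associated audio


events = []
grammar = {"next message", "save message"}


def allow_yes_no() -> None:
    # Runs at the exact point in the stream where it was placed.
    grammar.update({"yes", "no"})
    events.append("grammar updated")


def play(text: str) -> None:
    # Stand-in for audio playback: record what would have been played.
    events.append("played: " + text)


process_prompt_stream(
    [
        "You have one saved message.",
        allow_yes_no,
        "A new message has arrived; do you wish to hear it now?",
    ],
    play,
)
```

In this sketch the grammar is extended only after the first voice prompt has been played and before the second begins, so the instructions execute at a known position relative to the audio.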
A further illustrative embodiment of the present invention is directed to a computer-readable medium encoded with a program for execution on a host computer in a telephony system. The program, when executed on the host computer, performs a method including acts of: (A) inserting into a prompt stream at least one voice prompt to be played to a user of the telephony system when the prompt stream is processed; and (B) inserting into the prompt stream at least one active prompt associated with computer program instructions to be executed when the prompt stream is processed.
Yet another illustrative embodiment of the present invention is directed to a computer-readable medium encoded with a program for execution on a host computer in a telephony system. The program, when executed on the host computer, performs a method for processing a prompt stream in the telephony system. The method includes acts of: (A) playing a first audio stream associated with a first voice prompt in the prompt stream; and (B) executing computer program instructions associated with at least one active prompt in the prompt stream.
Another illustrative embodiment of the present invention is directed to a telephony system including a storage device to store a prompt stream including a stream of prompts; and a controller to insert into the prompt stream at least one voice prompt to be played to a user of the telephony system when the prompt stream is processed, and to insert into the prompt stream at least one active prompt associated with computer program instructions to be executed when the prompt stream is processed.
Yet another illustrative embodiment of the present invention is directed to a telephony system including a storage device to store a prompt stream including a stream of prompts; and a controller to play a first audio stream associated with a first voice prompt in the prompt stream, and to execute computer program instructions associated with at least one active prompt in the prompt stream.