1. Field of the Invention
The present invention is directed to processing user input in a menu-based, speech actuated system and, more particularly, to recognizing speech input in response to a prompt and respectively applying words or phrases in a single user response to a sequence of prompts in a menu.
2. Description of the Related Art
Automated systems that respond to user input from a telephone accept keypad signals, usually dual-tone multifrequency (DTMF) signals, speech input, or both. Such systems are often referred to as interactive voice response (IVR) systems. For both types of inputs, prompts are generated by the system using synthesized or recorded speech. Many of these systems interact with users based on a menu structure that defines the prompts to be generated and a set of user commands that will be accepted at any instant in the user interaction.
Many IVR systems use DTMF, or touchtone, keys to allow a user to give input to an interactive phone-based system. Typically, audio is played which presents the user with a set of options, each corresponding to a particular DTMF key. Some of these options represent commands and some options navigate the caller to further menus.
It is common for systems that accept DTMF signals to permit users to enter a series of inputs, such as “1, 1, 2”, in response to a prompt requesting the input of only a single digit. The subsequent digits, “1” and “2”, are interpreted as responses to, respectively, the prompts that would have been generated after receiving the initial “1” and the second “1”. This is ordinarily referred to as “type-ahead” capability. As users become familiar with DTMF-based systems and memorize the keys that correspond to certain commands, they begin to use type-ahead to enter a sequence of keys, and thereby execute a sequence of commands and navigations, without listening to the intervening prompts. This feature provides a substantially faster interface for the experienced user, and users have come to expect it from IVR systems.
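By way of illustration only, the type-ahead behavior described above can be sketched as follows. The menu tree, its prompts, its command names, and the function `apply_type_ahead` are hypothetical and are not taken from any particular IVR system; the sketch merely shows each buffered key being applied to the menu whose prompt would otherwise have been played next.

```python
# Hypothetical menu tree (an assumption for illustration): each key maps
# either to a submenu (a dict) or to a command name (a string).
MENU = {
    "prompt": "Main menu",
    "options": {
        "1": {
            "prompt": "Messages",
            "options": {
                "1": {
                    "prompt": "New messages",
                    "options": {"1": "play", "2": "delete"},
                },
                "2": "list_saved",
            },
        },
        "2": "record_greeting",
    },
}


def apply_type_ahead(menu, keys):
    """Apply each buffered key to the menu that would have been prompted next.

    Returns a tuple of (prompts that were skipped, commands executed,
    prompt of the menu the caller ends up in).
    """
    current = menu
    skipped_prompts = []
    commands = []
    for key in keys:
        target = current["options"].get(key)
        if target is None:
            raise ValueError(f"key {key!r} is not valid in menu {current['prompt']!r}")
        if isinstance(target, dict):
            # Navigation: descend without playing the submenu prompt.
            current = target
            skipped_prompts.append(current["prompt"])
        else:
            # Command: execute it and stay in the current menu.
            commands.append(target)
    return skipped_prompts, commands, current["prompt"]


skipped, commands, position = apply_type_ahead(MENU, "112")
print(skipped, commands, position)
```

In this sketch the input “1, 1, 2” navigates through two submenus and executes one command without either intervening prompt being played.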
Many speech-based systems are structured in much the same way as these IVR systems. As in DTMF systems, the user has a small number of options at any given time. The user is presented with menus and can say one of a small number of words to perform a command or navigate to a submenu, so that the user can say “play”, for example, rather than press 1. However, conventional systems that accept speech input do not have a “speak-ahead” capability, analogous to “type-ahead”, for processing speech input as responses to prompts that have not yet been generated.
Grammar-based speech recognition engines, as they typically operate, do not support “speak-ahead”. If the user says “next next play”, the whole utterance is matched against the current grammar, which may contain only “next”, “play” and other single words, so the utterance as a whole does not match. Thus, the user has to say “next”, pause long enough for the word to be recognized as a full utterance, say “next” again, pause, and then say “play”. This is inconvenient for the expert user.
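The limitation can be illustrated with a simplified model that operates on text transcripts rather than audio. The grammar contents and the function `match_utterance` are assumptions made for this sketch and do not represent the interface of any actual recognition engine; the point is only that a whole utterance is compared against the phrases of the current grammar.

```python
# Hypothetical single-word grammar active at the current menu.
CURRENT_GRAMMAR = {"next", "previous", "play", "delete"}


def match_utterance(utterance, grammar):
    """Return the matched phrase, or None if the whole utterance is not in the grammar."""
    phrase = " ".join(utterance.lower().split())
    return phrase if phrase in grammar else None


# The multi-word utterance is matched as a whole and fails.
whole = match_utterance("next next play", CURRENT_GRAMMAR)

# Spoken with pauses, each word is a separate utterance and matches.
separate = [match_utterance(word, CURRENT_GRAMMAR) for word in "next next play".split()]

print(whole, separate)
```

Under these assumptions the combined utterance yields no match, whereas the three separately spoken words are each recognized.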
Providing “speak-ahead” capability is more difficult than providing “type-ahead” capability because speech input is more difficult to recognize than keypad input. Recognition accuracy for DTMF is essentially 100%, while conventional systems that accept speech input have more difficulty recognizing words. Accepting speech input from any user, i.e., without prior training on individual voices, increases the difficulty of determining which response was input. Therefore, the set of permissible utterances is normally limited as much as possible to achieve the best recognition accuracy. Thus, a DTMF detection module can listen for all DTMF sequences, even invalid sequences, while in a speech-based system, listening for invalid command sequences reduces the recognition accuracy.
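As a rough illustration of why the set of permissible utterances is kept small, the following sketch counts how many distinct utterances a grammar would have to cover if every sequence of command words up to a given length were allowed, valid or not. The function `count_sequences` and the example vocabulary size are assumptions for illustration; the sketch shows only the growth in candidate utterances and does not model recognition accuracy itself.

```python
def count_sequences(vocabulary_size, max_length):
    """Count all word sequences of length 1..max_length over the vocabulary."""
    return sum(vocabulary_size ** n for n in range(1, max_length + 1))


# With five single-word commands, allowing sequences of up to three words
# enlarges the set of candidate utterances considerably.
single_words = count_sequences(5, 1)
up_to_three = count_sequences(5, 3)
print(single_words, up_to_three)
```

With five command words, the candidate set grows from 5 single-word utterances to 155 sequences of up to three words.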