Interactive telephone systems are used to provide users access to a variety of services without requiring a live operator to interface with the user. Initially such systems were of the type that played an audio message and then reacted upon sensing the user's response, which the user entered as one or more keystrokes on a touch tone telephone keypad to generate well known DTMF tones. It is well recognized that such systems are not preferred for a variety of reasons, including that they are cumbersome to use and that this type of response is not natural for most users. In particular, it is undesirable for a user to listen to a series of response choices and determine which of the `programmed responses` is most appropriate while concurrently remembering which of the several keystrokes must be used to elicit that response.
With the advent of natural language recognition systems, users could respond to interactive telephone systems using more natural spoken responses. Such systems are used for a variety of applications. One known example is for providing information and services regarding flight availability, flight times, invoicing, payments and flight reservations and the like for a predetermined airline. Another well known use for such systems includes gaining information regarding stocks, bonds and other securities, purchasing and selling such securities, and gaining information regarding a user's stock account. Also, systems exist for controlling transactions in accounts at a bank. Other applications are also used.
FIG. 1 shows a finite state diagram of an exemplary prior art natural language interactive telephone system for an airline application such as that set forth above. In state 100 the application receives an incoming telephone call. This is the welcome state, in which the system plays a greeting to the caller and identifies itself to the user. The state 102 represents the main menu of the airline interactive telephone system. During this state the system asks the user which of its several modes the user wishes to invoke. For example, the user could choose to obtain flight scheduling information in a first mode 104. Alternatively, the user could choose to learn about availability of seats for given flights in a second mode 106. Also, the user could choose to learn about actual arrival times for in-the-air flights in a third mode 108. The user can also use the system to purchase a ticket in state 110. Additionally, the user could obtain information regarding the cost of flying between two cities in the state 112. Such a system could be configured to have many other modes for achieving other desired functions.
In any typical system, each of the mode states 104 through 112 will cause the system to traverse through a predetermined series of states. To avoid obscuring the invention in extraneous details which are unrelated to the principal elements of this invention, only one of the possible series of states is shown herein. In this example, a sample series of states is shown for purchasing a ticket. In the state 114 the system queries the user regarding the city in which the flight is to begin. In the state 116 the system queries the user regarding the city in which the flight is to terminate. The system then queries the user to learn when the user wants to travel in the state 118. The state 118 will be discussed in more detail below. Upon determining these facts, the system can then access the database of flight information in the state 120 and present the list of relevant flights to the user. In the state 122 the user selects a flight from among the several choices. Thereafter, the user can exit the system or return to a previously described portion of the state diagram to perform another operation.
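By way of illustration only, the ticket-purchase series of states described above can be sketched as a simple transition table. The following Python sketch is not taken from FIG. 1 itself: the state labels, the `NEXT_STATE` table and the `traverse` function are hypothetical names chosen for this example, and only the reference numerals come from the description.

```python
# Hypothetical labels for the states named in the description of FIG. 1.
# Only the reference numerals are from the text; the labels are illustrative.
STATES = {
    100: "welcome",
    102: "main menu",
    110: "purchase ticket",
    114: "ask origin city",
    116: "ask destination city",
    118: "ask travel time",
    120: "look up flights",
    122: "select flight",
}

# Assumed linear path for the ticket-purchase series only.
# Modes 104, 106, 108 and 112 are omitted, as in the description.
NEXT_STATE = {100: 102, 102: 110, 110: 114, 114: 116, 116: 118, 118: 120, 120: 122}


def traverse(start=100):
    """Return the ordered list of states visited from `start` until no transition remains."""
    path = [start]
    while path[-1] in NEXT_STATE:
        path.append(NEXT_STATE[path[-1]])
    return path


if __name__ == "__main__":
    for numeral in traverse():
        print(numeral, STATES[numeral])
```

A real system would branch at the main menu state 102 according to the user's reply; the single fixed path here is a simplifying assumption.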
Depending upon the complexity of the system design, the user may be required to provide each piece of information sequentially, as described in the example above, or may be allowed to provide all the pieces of information concurrently in a single dialog state transaction. Thus the finite state diagram of FIG. 1 could be shown with a greater or lesser number of finite states for achieving the same function. This relative system complexity is known in the art and is ancillary to the present invention.
Conventionally, a single information transaction comprising a single utterance by the system and then a single response utterance by the user is termed a `dialog state.` A series of several transfers of information related to a particular topic and carried on between the system and the user via the telephone interface port is termed a `dialog` in the state of the art. Generally a dialog includes several dialog states. A telephone call includes all the dialogs and dialog states which occur between a system and a user from the time the equipment goes off-hook until it returns to an on-hook state.
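The containment relationship among these three terms can be illustrated with a small data structure. The following Python sketch is one possible representation under stated assumptions: the class names `DialogState`, `Dialog` and `Call`, their fields, and the sample utterances are all hypothetical and are not drawn from any particular system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogState:
    """One system utterance followed by one user response utterance."""
    system_utterance: str
    user_utterance: str


@dataclass
class Dialog:
    """Several dialog states related to a particular topic."""
    topic: str
    states: List[DialogState] = field(default_factory=list)


@dataclass
class Call:
    """All dialogs occurring between off-hook and on-hook."""
    dialogs: List[Dialog] = field(default_factory=list)

    def dialog_state_count(self) -> int:
        # Total number of dialog states across every dialog in the call
        return sum(len(d.states) for d in self.dialogs)


# Illustrative call with a single ticket-purchase dialog of two dialog states
call = Call(dialogs=[
    Dialog(topic="purchase ticket", states=[
        DialogState("Where does the flight begin?", "Boston"),
        DialogState("Where does the flight terminate?", "Denver"),
    ]),
])
```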
According to usual practice, upon receiving a voice communication from a user within a dialog state the system undertakes two principal operations in response thereto. First, the system performs a natural language speech recognition operation to determine the words that were spoken by the user. To aid in the recognition operation, the system is programmed with a predetermined set of anticipated responses. A non sequitur response will generally be unrecognizable as "out of grammar." For example, if the system queried the user about the destination city for a flight and the user responded "7 p.m. on Thursday", the system will likely not be programmed to recognize those words in the context of this dialog state. Second, the system must determine whether it `understands` the words to be recognized in the context of the anticipated dialog state.
It will be understood by persons of ordinary skill in the art that establishing proper dialog state interactions is a demanding problem for natural language speech recognition systems. Consider, for example, the main menu state 102. The system utterance could be "what do you want to do?" In response, the user could say "I need to visit my Aunt Gertrude who's sick in the hospital as soon as possible." The result of such a dialog state would not likely provide any useful information.
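The notion of a predetermined set of anticipated responses for each dialog state can be illustrated with a text-only toy. The sketch below is a deliberate simplification: an actual recognizer operates on audio, whereas this example merely compares already-transcribed text against a per-state set. The `GRAMMARS` table, the state names, the city names and the `recognize` function are assumptions made for illustration.

```python
# Hypothetical per-state grammars: each dialog state anticipates its own responses.
GRAMMARS = {
    "destination_city": {"boston", "chicago", "denver"},
    "travel_time": {"7 p.m. on thursday", "tomorrow morning"},
}


def recognize(state, utterance):
    """Return the normalized utterance if it is in the state's grammar, else None."""
    normalized = utterance.strip().lower()
    if normalized in GRAMMARS[state]:
        return normalized
    return None  # treated as "out of grammar" for this dialog state


# The same words are out of grammar in one state and in grammar in another.
print(recognize("destination_city", "7 p.m. on Thursday"))
print(recognize("travel_time", "7 p.m. on Thursday"))
```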
In the alternative, the system utterance could be "do you want to obtain flight information, flight availability, arrival times, purchase ticket, obtain cost information, or . . . ?" This system utterance is far more likely to receive a user utterance reply which it can recognize and understand. Thus, if the user replies "purchase ticket" the system will understand what the user wants to do. On the other hand, if the user replies "book a flight", "reserve a ticket", "make a reservation" or other similar utterances the system may or may not understand the user's intent. The main menu dialog can be designed to allow for recognizing and understanding a variety of user utterance replies to a specific system utterance.
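One way a main menu dialog might allow for a variety of user replies is to map several phrasings to the same mode. The following sketch assumes exact matching on transcribed text, which is far simpler than natural language understanding in practice. The `MAIN_MENU_PHRASES` table and the `interpret_main_menu` function are hypothetical; the mode numerals and the sample phrases are those used in the description.

```python
# Hypothetical mapping from anticipated replies to the mode numerals of FIG. 1.
MAIN_MENU_PHRASES = {
    "flight information": 104,
    "flight availability": 106,
    "arrival times": 108,
    "purchase ticket": 110,
    "book a flight": 110,
    "reserve a ticket": 110,
    "make a reservation": 110,
    "cost information": 112,
}


def interpret_main_menu(utterance):
    """Return the mode numeral for a reply, or None if the reply is not understood."""
    return MAIN_MENU_PHRASES.get(utterance.strip().lower())
```

Under this sketch, a reply such as "book a flight" reaches state 110 just as "purchase ticket" does, while an unanticipated reply yields `None`.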
Due to the idiomatic nature of natural language it will be readily understood that some users will not understand a particular system utterance and, similarly, some systems will not understand a particular user utterance. Generally speaking, when a user fails to understand a system utterance, the user will either hang up or respond with a user utterance that the system views as a non sequitur. In either case the subject dialog state failed to produce the intended result. Consider the state 118 of FIG. 1. That example indicates that the system queries the user for when travel is desired. It may be that the system wants to know time of day, day of the week or calendar date or some combination or all of these elements. A user could easily answer with wrong or unexpected information. This would be viewed as a failed dialog state.
The results for dialog states fall within several categories.
1. The natural language speech recognition system can properly recognize the user's utterance.
2. The natural language speech recognition system can mis-recognize the user's utterance.
3. The system rejects the user's response for a variety of reasons, including that the user has a strong accent which the system is not conditioned to understand, an ambient noise (e.g., slamming door, passing fire engine, barking dog and the like) occurred during the utterance, or some other condition, any of which causes the system to be unable to recognize the user's utterance.
4. The user makes no utterance and the system provides a no-speech time out.
5. The user's utterance rambles excessively and the system provides a too-much-speech time out.
6. The user makes an out of grammar utterance, which is an utterance that the system is not programmed to expect.
7. A long recognition time out occurs where the recognizer takes too long to perform a recognition operation, such as when the system is temporarily overloaded.
8. A failure is indicated where a user requests a transfer to an operator or a help expert system.
9. A failure is indicated where a user hangs up a call without achieving a successful result.
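The nine result categories listed above lend themselves to representation as an enumeration. The following Python sketch is illustrative only: the `Outcome` member names and the `success_rate` helper are hypothetical, and the sample data is invented for the example rather than measured.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Result categories for a dialog state, numbered as in the list above."""
    RECOGNIZED = 1
    MISRECOGNIZED = 2
    REJECTED = 3
    NO_SPEECH_TIMEOUT = 4
    TOO_MUCH_SPEECH_TIMEOUT = 5
    OUT_OF_GRAMMAR = 6
    LONG_RECOGNITION_TIMEOUT = 7
    TRANSFER_REQUESTED = 8
    HANG_UP = 9


def success_rate(outcomes):
    """Fraction of dialog state results that were properly recognized (category 1)."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(o is Outcome.RECOGNIZED for o in outcomes) / len(outcomes)


# Invented sample of four dialog state results, for illustration only
sample = [Outcome.RECOGNIZED, Outcome.RECOGNIZED, Outcome.HANG_UP, Outcome.OUT_OF_GRAMMAR]
counts = Counter(sample)
```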
In order to properly design dialogs for a natural language speech recognition system which succeed in producing the intended result, system designers have tried a variety of approaches. In some cases, designers provide more and more alternative system and user utterances. Unfortunately, even if this approach achieves a favorable result there is no way to accurately understand this outcome. Other designers prepare a series of dialog states and engage the services of human test subjects to attempt to use the system. The designers will study and analyze the users and their responses in order to determine the efficacy of the proposed dialogs and dialog states in a laboratory type setting. Generally, a user will attempt to access an operational system in what is termed a usability study. In the alternative, a researcher will act as the system and provide oral cues as if the researcher were indeed the system and record the user's responses to these pseudo-dialogs in a so-called `Wizard-of-Oz System`. Such analysis is a painstaking undertaking requiring considerable effort.
Regardless of the testing approach, the user is brought to the laboratory and given a task to perform using the system prototype or using the Wizard-of-Oz System. For example, the user could be told to reserve a ticket on a flight from a specified first city to a specified second city. The researcher either records the responses in the `Wizard-of-Oz System,` or observes the user, for example from behind a two-way mirror.
Often, the designers can only study the responses of fewer than 50 users, and often in the range of only 20 actual users. These meager results are obtained while incurring considerable expense. The amount of data actually collected is insufficient to provide meaningful results.
Further, these results may be unrealistically skewed in favor of such dialog states. This is particularly true because test subjects are less likely to hang up in frustration than actual non-test users in a real world situation, since test subjects are motivated to properly interact with the subject system. Thus, the data does not represent real system use.
What is needed is a method of analyzing the structure of a dialog and dialog state to achieve a successful result. What is further needed is a method of analyzing the wording in each dialog and dialog state to achieve a successful result. Yet another need is a method of analyzing natural language speech recognition system dialogs in order to determine what types of dialogs and dialog states most commonly fail. What is also needed is a method of analyzing natural language speech recognition system dialogs in order to properly design and develop dialogs and dialog states which are most likely to succeed.