The present invention deals with identifying semantic intent in acoustic information. More specifically, the present invention deals with grouping acoustic information (such as acoustic information from call logs) into clusters, each representing a category of semantic intent.
Automatic voice response systems have gained increasing popularity in enhancing human-machine interaction. Conventional automatic voice response systems allow a user to call the system using a telephone and then navigate through a voice-responsive menu in order to receive desired information, or to be routed to a desired destination. For instance, in some such systems, a user may call to review an account summary of the user's account with a particular business. In that case, the user may navigate through an account summary menu, using voice commands, to obtain an account balance, for example.
In another such system, the user may dial the general telephone number of a company and navigate through a voice-responsive menu to reach a particular individual at the company, or to reach a department, such as “technical service”.
These types of systems have encountered a number of problems. In such systems, rules-based finite state or context free grammars (CFGs) are often used as a language model (LM) for simple, system-initiative dialog applications. This type of restricted strategy often leads to high recognition performance for in-grammar utterances, but completely fails when a user's response is not contained in the grammar.
There are at least two causes for such “out-of-grammar utterances”. First, the syntactic structure of the utterance may not be covered by the CFG. For instance, a user's response of “twentieth of July” may cause failure in a grammar which is structured to include only a rule [month] [day]. Second, the user's utterance may reflect a semantic intent which was not anticipated by the author of the grammar. For instance, in a corporate voice dialer application, the grammar for the response to the opening prompt “Good morning, who would you like to contact?” may be designed to expect the user to provide a name. However, the user may instead respond by identifying a department, such as “human resources.”
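The two failure modes can be illustrated with a minimal sketch. The toy grammar below (the rule set, word lists, and function names are hypothetical, chosen only to mirror the [month] [day] rule described above) accepts in-grammar phrasings but rejects both a reordered date and an unanticipated department name:

```python
import re

# Toy grammar for a date slot ordered as [month] [day].
# The word lists are abbreviated, illustrative stand-ins for a real grammar.
MONTHS = r"(january|february|march|april|may|june|july|august|september|october|november|december)"
DAYS = r"(first|second|third|twentieth)"

DATE_RULE = re.compile(rf"^{MONTHS}\s+{DAYS}$")

def in_grammar(utterance: str) -> bool:
    """Return True only if the utterance matches the [month] [day] rule."""
    return DATE_RULE.match(utterance.lower()) is not None

print(in_grammar("July twentieth"))      # in-grammar: True
print(in_grammar("twentieth of July"))   # syntactic mismatch: False
print(in_grammar("human resources"))     # unanticipated intent: False
```

The second and third utterances fail for different reasons, but the grammar cannot distinguish them: both are simply out-of-grammar.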
In sum, at the application design stage, it is difficult for an application developer to anticipate all the different ways in which a user may frame a request, which leads to the first problem. Similarly, it is difficult for an application developer to anticipate all the different semantic intents that the user may have, leading to the second problem.
Many attempts have been made to address the first problem (the difficulty in anticipating the different ways a user may frame a request) by building more robust language models. For example, hand-authored combinations of CFGs with statistical language models have been attempted.
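One simple way such a combination can work is linear interpolation: the score from the rule-based grammar is blended with the score from a statistical n-gram model, so that out-of-grammar utterances still receive nonzero probability. The sketch below is illustrative only; the interpolation weight, the unigram probabilities, and all function names are made-up values, not taken from any particular system:

```python
# Illustrative interpolation of a rule-based grammar score with a
# statistical n-gram score. All numbers here are hypothetical.

LAMBDA = 0.7  # interpolation weight between grammar and n-gram terms

# Toy unigram model, as if estimated from transcribed utterances.
UNIGRAM = {"july": 0.02, "twentieth": 0.01, "of": 0.05, "resources": 0.004}
OOV_PROB = 1e-6  # floor probability for unseen words

def grammar_prob(utterance: str) -> float:
    """1.0 if the utterance parses under the CFG, else 0.0 (toy stand-in)."""
    return 1.0 if utterance == "july twentieth" else 0.0

def ngram_prob(utterance: str) -> float:
    """Unigram product over the words of the utterance."""
    p = 1.0
    for word in utterance.split():
        p *= UNIGRAM.get(word, OOV_PROB)
    return p

def interpolated_prob(utterance: str) -> float:
    return LAMBDA * grammar_prob(utterance) + (1 - LAMBDA) * ngram_prob(utterance)

# An out-of-grammar utterance is no longer assigned zero probability:
print(interpolated_prob("twentieth of july") > 0)  # True
```

In-grammar utterances remain dominated by the grammar term, while out-of-grammar utterances fall back on the statistical term instead of failing outright.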
Prior attempts at solving the second problem (anticipating all the different semantic intents used by the user) typically require a large amount of transcribed and semantically annotated data from actual user calls. Of course, such data tends to be extremely expensive to generate. For instance, in order to generate this type of semantically annotated data, the actual incoming calls must be recorded. Then, a human being must typically listen to all of these recordings in order to identify any semantic intents expressed by the callers that were not expected or anticipated by the developer. However, a large company, which generates the call volumes necessary to obtain a useful quantity of data, may receive several thousand calls per day. Even if the human being listens only to the calls which failed in the interactive voice response unit (e.g., calls which ended in hang-ups), and even if those calls made up only ten to twenty percent of the entire call volume, this would still require the human to listen to hundreds of calls each day. This is extremely time consuming and expensive.