1. Field of the Invention
The present invention relates to spoken dialog systems and, more specifically, to using meta-data for language models to improve speech processing by speech modules such as an automatic speech recognition module.
2. Introduction
Spoken dialog systems are becoming more prevalent in society. Technology is improving to enable users to have a good experience in speaking to a dialog system and receive useful information. The basic components of a typical spoken dialog system are shown in FIG. 1. A person 100 utters a word or a phrase that is received by the system and transmitted to an automatic speech recognition (ASR) module 102. This module converts the audible speech into text and transmits the text to a spoken language understanding (SLU) module 104. This module interprets the meaning of the speech. For example, if a person says “I want to find out the balance of my checking account,” the SLU module 104 will identify that the user wants his account_balance (checking). The output of the SLU module 104 is transmitted to a dialog manager 106 that determines what response to provide. The response is transmitted to a spoken language generation (LG) module 108 that generates text for the response. Continuing the above example, the response may be “OK, thank you. Your checking account balance is one hundred dollars.” The text of the response is then transmitted to a text-to-speech module 110 that converts the text into audible speech, which the user then hears to complete the cycle.
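The flow of a single dialog turn through the modules of FIG. 1 can be illustrated with a minimal Python sketch. It is an illustration only, not the implementation of any actual system: every function name, the Intent type, the keyword-matching logic, and the stored balance are hypothetical stand-ins, and the “audio” is represented as a text string so that the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    """Hypothetical SLU output: an intent name plus one argument."""
    name: str
    argument: str


def asr(audio: str) -> str:
    # Stand-in for ASR module 102. A real module decodes audio;
    # here the "audio" is already text and is only normalized.
    return audio.lower().strip()


def slu(text: str) -> Intent:
    # Stand-in for SLU module 104, using simple keyword spotting.
    if "balance" in text and "checking" in text:
        return Intent("account_balance", "checking")
    return Intent("unknown", "")


def dialog_manager(intent: Intent) -> dict:
    # Stand-in for dialog manager 106. The balance is made-up data.
    balances = {"checking": "one hundred dollars"}
    if intent.name == "account_balance" and intent.argument in balances:
        return {
            "act": "inform_balance",
            "account": intent.argument,
            "amount": balances[intent.argument],
        }
    return {"act": "clarify"}


def generate_language(response: dict) -> str:
    # Stand-in for LG module 108: turns the chosen response into text.
    if response["act"] == "inform_balance":
        return (
            f"OK, thank you. Your {response['account']} account "
            f"balance is {response['amount']}."
        )
    return "Sorry, I did not understand. Could you rephrase?"


def tts(text: str) -> str:
    # Stand-in for text-to-speech module 110: tags the text as speech.
    return f"<speech>{text}</speech>"


def run_turn(audio: str) -> str:
    # One full cycle: ASR -> SLU -> dialog manager -> LG -> TTS.
    return tts(generate_language(dialog_manager(slu(asr(audio)))))


if __name__ == "__main__":
    print(run_turn("I want to find out the balance of my checking account"))
```

Each stage consumes only the output of the stage before it, which is why an ASR error on a word such as a name propagates through every later module.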
One of the challenges of spoken dialog systems is dealing with names. A transcription system that requires accurate general name recognition and transcription may be faced with covering a large number of names that it will encounter. When developing a spoken dialog system, language models are trained using expected words and phrases to help the system interact with the user according to an expected “domain.” For example, a spoken dialog system for a bank will have a set of expectations regarding user requests. Having a known domain helps designers prepare the spoken dialog system to achieve a recognition accuracy that is acceptable. In a banking domain, words and phrases such as “account balance”, “checking”, “savings”, “transfer funds” are expected and may be part of a finite grouping.
However, without prior knowledge of the names of people, a spoken dialog system will require a large increase in the size and complexity of the system due to the expansion of the lexicon. Furthermore, this increase will adversely affect the system performance due to the increased possibility of confusion when trying to recognize different names. One example of a system that must have accurate name transcription by its ASR module is a directory assistance and name dialer system. Building such a system is complex due to the very large number of different names it may encounter. An additional complicating factor is the pronunciation of names which can vary significantly among speakers. As a result, ASR research on name recognition has received a fair amount of attention. The feasibility of a directory assistance application with as many as 1.5 million names has been investigated and it has been shown that recognition accuracy drops approximately logarithmically with increasing vocabulary size. A significant degradation in performance with increasing lexicon size has also been shown. Larger lexicons that allow more diverse pronunciations can be beneficial. Most efforts have focused on soliciting more detailed speech input from the user in the form of spelling, and have shown that this improves the system performance. Neural networks have also been shown to focus the search on the most discriminative segments in a multi-pass approach. One attempt has shown improvement in name recognition accuracy by incorporating confidence scores into the decision process.
Common among all previous work is that the coverage issue was addressed by increasing the vocabulary size. The confusability introduced by the larger vocabulary is then addressed by more complex search and acoustic modeling, which is more costly. Therefore, what is needed in the art is an improved system and method for recognizing names or other similarly situated words or phrases in a spoken dialog. The improved system and method should be less costly and less time consuming.