The present invention relates in general to speech recognition and, more particularly, to interactive speech applications for use in automated assistance systems.
Pattern recognition generally, and recognition of patterns in continuous signals such as speech signals, has been a rapidly developing field. A limitation in many applications has been the cost of providing sufficient processing power for the complex calculations often required. This is particularly the case in speech recognition, all the more so when real-time response is required, for example to enable automated directory inquiry assistance, or for control operations based on speech input. Information can be gathered from callers and/or provided to callers, and callers can be connected to appropriate parties within a telephone system. To simulate the speed of response of a human operator, and thus avoid a perception of “unnatural” delays, which can be disconcerting, the spoken input needs to be recognized very quickly after the end of the spoken input.
The computational load varies directly with the number of words or other elements of speech (also referred to as “orthographies”) which are modeled and held in a dictionary database for comparison to the spoken input. The number of orthographies is also known as the size of vocabulary of the system. The computational load also varies according to the complexity of the models in the dictionary, and how the speech input is processed into a representation ready for the comparison to the models. Also, the actual algorithm for carrying out the comparison is a factor.
Numerous attempts have been made over many years to improve the trade-off between computational load, accuracy of recognition, and speed of recognition. Depending on the size of vocabularies used, and the size of each model, both the memory requirements and the number of calculations required for each recognition decision may limit the speed/accuracy/cost trade-off. For usable systems having a tolerable recognition accuracy, the computational demands are high. Despite continuous refinements to models, speech input representations, and recognition algorithms, and advances in processing hardware, there remains great demand to improve the above-mentioned trade-off, especially in large vocabulary systems, such as those having greater than 100,000 words.
The present invention is directed to a system and method for speech recognition using multiple processing stages that use different vocabulary databases to improve processing time, efficiency, and accuracy in speech recognition. The entire vocabulary is divided into smaller vocabulary subsets, which are associated with particular keywords. A small vocabulary subset is generated or retrieved based on certain information, such as a calling party's locality. A user is prompted to provide input information, such as the locality in which a business whose phone number is requested is located, in the form of a spoken utterance to the system. If the utterance matches one of the entries in the initial small vocabulary subset, then the utterance is considered to be recognizable. If the utterance is not recognizable when compared to the initial small vocabulary subset, then the utterance is stored for later use. The user is then prompted for a keyword related to another subset of words in which his initial utterance may be found. A vocabulary subset associated with the received keyword is generated or retrieved. The initial stored utterance is then retrieved and compared to the newly loaded vocabulary subset. If the utterance matches one of the entries in the newly loaded vocabulary subset, then the utterance is recognizable. Otherwise, it is determined that the initial utterance was unrecognizable, and the user is prompted to repeat the initial utterance.
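The two-stage flow summarized above can be illustrated with a minimal sketch. It is not the claimed implementation: utterances are represented as plain strings rather than audio, and the names `match`, `recognize`, and `ask_keyword` are hypothetical, introduced here only for illustration. A real system would replace the exact-match lookup with acoustic scoring against the models in each vocabulary subset.

```python
from typing import Callable, Dict, Optional, Set


def match(utterance: str, vocabulary: Set[str]) -> Optional[str]:
    """Hypothetical stand-in for the recognizer: exact, case-insensitive lookup."""
    key = utterance.strip().lower()
    return key if key in vocabulary else None


def recognize(
    utterance: str,
    initial_subset: Set[str],
    subsets_by_keyword: Dict[str, Set[str]],
    ask_keyword: Callable[[], str],
) -> Optional[str]:
    """Return the matched entry, or None if the caller must repeat the utterance."""
    # Stage 1: compare against the small subset chosen up front
    # (for example, from the calling party's locality).
    hit = match(utterance, initial_subset)
    if hit is not None:
        return hit

    # Not recognizable yet: store the utterance for later use.
    stored_utterance = utterance

    # Prompt for a keyword and load the vocabulary subset associated with it.
    keyword = ask_keyword().strip().lower()
    keyword_subset = subsets_by_keyword.get(keyword, set())

    # Stage 2: compare the stored utterance against the newly loaded subset.
    # A None result means the utterance is treated as unrecognizable.
    return match(stored_utterance, keyword_subset)


# Example data (assumed, for illustration only).
initial = {"springfield"}
subsets = {"ohio": {"dayton", "akron"}}
result = recognize("Dayton", initial, subsets, lambda: "ohio")
```

In this sketch only one small subset is searched at a time, so the number of comparisons per stage depends on the subset size rather than on the size of the entire vocabulary.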
The foregoing and other aspects of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.