1. Field of the Invention
The present invention relates to speech recognition, and, more particularly, to a method and system for improving the execution speed of an acoustic fast match.
2. Discussion of the Prior Art
In a speech recognition system using hidden Markov models, several thousand computations are required to compare each segment of acoustics against a pre-stored word model. Carrying out these computations for all words in the stored vocabulary is prohibitive if the goal is large vocabulary real-time speech recognition on a modest amount of hardware. In such a system, a matching algorithm is required that can rapidly identify a small set of candidate words from the whole acoustic vocabulary for further evaluation in a particular time region of the decoded utterance. In speech recognition systems based on an asynchronous stack search, the acoustic fast match provides the desired rapid identification capability. The acoustic fast match represents one of the three major functional components of a speech recognition system, the other two being the detailed match and the language model.
Conventional approaches to the implementation of the acoustic fast match can be divided into two major groups, the synchronous search and the asynchronous search. Synchronous searches suffer from several disadvantages. First, all the active word models have to be stored in memory, and thus memory requirements can be prohibitive in large vocabulary systems. Second, the estimation of word beginning probabilities requires the search to be performed in the backward direction, which significantly limits the use of this method in real-time applications. For a discussion of this type of approach, see Austin, S., Schwartz, et al., "The Forward-Backward Search Algorithm", ICASSP91, Toronto, Canada, pp. 697-700 (1991).
In the asynchronous search, for a given time region of a speaker's utterance, a search is performed by computing the total acoustic score for each word in the acoustic vocabulary, one word at a time. Each word in the acoustic vocabulary is represented by its phonetic sequence (e.g., deal="d"-"eh"-"l"). To obtain the acoustic score of a particular word from the vocabulary, the acoustic scores of all of the individual phones that collectively define that word are computed and then combined into a single word score. To reduce the amount of computation, the phonetic sequences that define each word in the acoustic vocabulary are organized into a tree structure, so that words sharing the same initial phones share the same initial nodes.
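By way of illustration only, the organization of phonetic sequences into a tree may be sketched as follows. The sketch assumes a small hypothetical lexicon; the names `PhoneNode`, `build_phone_tree`, `count_nodes`, and `LEXICON`, as well as the pronunciations shown, are not taken from any particular system. It shows that words sharing initial phones share tree nodes, so fewer phone scores need to be computed than when each word is scored separately.

```python
from dataclasses import dataclass, field


@dataclass
class PhoneNode:
    """One node of the phonetic tree (hypothetical structure)."""
    children: dict = field(default_factory=dict)  # phone label -> PhoneNode
    words: list = field(default_factory=list)     # words whose sequence ends here


def build_phone_tree(lexicon):
    """Insert every phonetic sequence into a shared prefix tree."""
    root = PhoneNode()
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            # Reuse the existing child if another word already created it.
            node = node.children.setdefault(phone, PhoneNode())
        node.words.append(word)
    return root


def count_nodes(node):
    """Count the phone nodes below `node` (the root itself is not a phone)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Illustrative pronunciations only (assumed, not from a real dictionary).
LEXICON = {
    "deal": ["d", "eh", "l"],
    "dead": ["d", "eh", "d"],
    "den": ["d", "eh", "n"],
    "ten": ["t", "eh", "n"],
}

tree = build_phone_tree(LEXICON)
flat_phone_count = sum(len(phones) for phones in LEXICON.values())
tree_phone_count = count_nodes(tree)
print(flat_phone_count, tree_phone_count)  # 12 phone scores flat, 8 in the tree
```

In this toy lexicon the three words beginning with "d"-"eh" share two nodes, which accounts for the reduction from 12 to 8 phone evaluations; the saving would be expected to grow with the amount of prefix sharing in the vocabulary.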
In addition to constructing an acoustic vocabulary tree structure, further computational savings may be realized by applying a pruning algorithm. Pruning relies on the observation that, when the tree is traversed to compute word scores, the candidate words of interest will generate the highest word scores. More particularly, in an asynchronous search, a search algorithm traverses the tree structure from a root node along a nodal path in which each node represents a constituent phone of the word to be scored. If the partial word score computed at a particular node either falls below some absolute threshold or is low when compared to the scores at other nodes, it is apparent at that time that the scores of all words derived from this node will be low and, as a consequence, the whole subtree can be ignored. This process is called pruning.
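The pruning described above may be sketched as follows, under simplifying assumptions that are not part of any actual fast match: each phone is given a single fixed log score (in practice the score depends on the acoustic segment being matched), phone scores are combined by summation, and only the absolute-threshold test is applied. The names `build_phone_tree`, `fast_match`, `PHONE_LOG_SCORES`, and the numeric values are hypothetical.

```python
import math


def build_phone_tree(lexicon):
    """Nested-dict prefix tree; the key "#words" lists words ending at a node."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node.setdefault("#words", []).append(word)
    return root


def fast_match(tree, phone_log_scores, threshold):
    """Traverse the tree, pruning any subtree whose partial score is too low.

    Returns ({word: score}, number_of_phone_nodes_scored).
    """
    results = {}
    scored = 0
    stack = [(tree, 0.0)]  # (node, partial word score accumulated so far)
    while stack:
        node, partial = stack.pop()
        for word in node.get("#words", []):
            results[word] = partial
        for phone, child in node.items():
            if phone == "#words":
                continue
            scored += 1
            score = partial + phone_log_scores.get(phone, -math.inf)
            if score < threshold:
                continue  # prune: no word below this node is evaluated
            stack.append((child, score))
    return results, scored


# Illustrative lexicon and scores (assumed values, exact in binary floating point).
LEXICON = {
    "deal": ["d", "eh", "l"],
    "dead": ["d", "eh", "d"],
    "den": ["d", "eh", "n"],
    "ten": ["t", "eh", "n"],
}
PHONE_LOG_SCORES = {"d": -1.0, "eh": -0.5, "l": -0.75, "n": -3.0, "t": -6.0}

tree = build_phone_tree(LEXICON)
pruned, pruned_count = fast_match(tree, PHONE_LOG_SCORES, threshold=-4.0)
full, full_count = fast_match(tree, PHONE_LOG_SCORES, threshold=-math.inf)
print(sorted(pruned), pruned_count, full_count)  # ['dead', 'deal'] 6 8
```

With the assumed threshold of -4.0, the node for "t" is rejected at the first phone, so the remainder of "ten" is never scored, and "den" is rejected at its last phone. A relative test, comparing a node's partial score against the best score found so far, could be substituted for the absolute threshold in the same place.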
Despite the advantages achieved by utilizing an acoustic vocabulary tree structure along with a pruning algorithm, when the acoustic vocabulary becomes very large (e.g., more than 60,000 words) the time spent in the fast match algorithm can become very significant. Generally, the efficiency of the algorithm is reduced in proportion to the increased vocabulary size, since the fast match complexity is directly proportional to the number of words in the acoustic vocabulary. It is therefore desirable to devise an improved fast match algorithm that eliminates or significantly reduces the effects of increased vocabulary size on the algorithm's efficiency.