1. Field of the Invention
The present invention relates to speech recognition and, more particularly, to methods and apparatus for improving the speed of the acoustic fast match in a speech recognition system using a cache for phone probabilities.
2. Description of the Prior Art
In a speech recognition system, the acoustic fast match represents one of the three major functional components of the system, the other two being the detailed match and the language model. The role of the fast match is to select a short list of word candidates from the whole acoustic vocabulary for further evaluation in a particular time region of the decoded utterance.
Conventional approaches to the implementation of the acoustic fast match can be divided into two major groups: the synchronous search and the asynchronous search. The synchronous search is usually a form of the Viterbi search algorithm. At each time instant, all necessary computations are performed, so the same time region of the utterance is never evaluated more than once. There are several disadvantages to this method. First, all the active word models have to be stored in memory, so memory requirements can be prohibitive in large vocabulary systems. Second, the estimation of word beginning probabilities requires the search to be performed in the backward direction, which significantly limits the use of this method in real-time applications. For a discussion of this type of approach, see Austin, S., Schwartz, et al., "The Forward-Backward Search Algorithm", ICASSP91, Toronto, Canada, pp. 697-700 (1991).
In a conventional asynchronous search, for a given time region of the utterance, the search is performed by computing the total acoustic score for each word in the vocabulary, one word at a time. To reduce the amount of computation, the word phonetic sequences can be organized into a tree structure. The memory requirements are negligible when compared to the time-synchronous method. The fast match search is performed each time a partial hypothesis (a sequence of words evaluated by the detailed match and the language model) needs to be extended. The ending time of such a hypothesis is the starting time for the fast match, which means the beginnings of word candidates are already given and do not need to be calculated; thus, the second problem of the synchronous search is eliminated. However, one disadvantage of this method is that the match has to be repeated for each new beginning time of the fast match search (even if the new time is relatively close to the previous one), so a given region of the utterance may be evaluated many times with the same phone sequence. For a discussion of this type of approach, see L. R. Bahl, et al., "A Fast Approximate Acoustic Match for Large Vocabulary Speech Recognition", IEEE Transactions on Speech and Audio Processing, vol. 1, no. 1, pp. 59-67 (January 1993).
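The saving obtained by organizing word phonetic sequences into a tree can be illustrated with a minimal sketch. The toy lexicon, the phone symbols, and the function names build_phone_tree and count_nodes are hypothetical and chosen only for illustration; they are not taken from the cited references. The sketch counts phone evaluations only and does not compute acoustic scores.

```python
def build_phone_tree(lexicon):
    """Build a prefix tree from a mapping of word -> list of phones.

    Words that share an initial phone sequence share the same nodes,
    so a shared prefix needs to be evaluated only once per search.
    """
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(
                phone, {"children": {}, "words": []}
            )
        node["words"].append(word)  # word ends at this node
    return root


def count_nodes(node):
    """Count phone nodes below `node` (the root itself carries no phone)."""
    return sum(1 + count_nodes(child) for child in node["children"].values())


# Hypothetical toy lexicon (assumption, for illustration only).
lexicon = {
    "cat": ["K", "AE", "T"],
    "cab": ["K", "AE", "B"],
    "can": ["K", "AE", "N"],
    "dog": ["D", "AO", "G"],
}

tree = build_phone_tree(lexicon)

# One word at a time: every phone of every word is evaluated.
flat_evaluations = sum(len(phones) for phones in lexicon.values())

# Tree-organized: each distinct prefix node is evaluated once.
tree_evaluations = count_nodes(tree)

print(flat_evaluations, tree_evaluations)
```

In this toy case the three words beginning with "K AE" share two nodes, so the tree holds fewer phone nodes than the flat word list. The tree does not remove the second cost described above: each fast match call with a new beginning time walks the tree again.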