The present invention relates generally to speech recognition systems and, more particularly, to a system and method having reduced memory access requirements associated with the recognition of an isolated word.
Isolated speech recognition is a process in which an unknown spoken utterance (or word) is identified. Through a process known as training, signals representing known words are examined and features of the words are determined and recorded for use as recognition models (or patterns) in a speech recognizer memory. The recognition models represent typical acoustic renditions of known words. In the training process, a training algorithm is applied to the recognition models to form these stored representations that are used to recognize future unknown words.
Speech recognition is generally implemented in three stages, as illustrated in FIG. 1. In step 100, an unknown speech signal is received via, for example, a microphone and processed to produce digital data samples. In step 110, features that are based on a short-term spectral analysis of the unknown speech signal are determined at predetermined time intervals. As will be appreciated, these features, commonly called "feature vectors," are usually the output of some type of spectral analysis technique, such as a filter bank analysis, a linear predictive coding analysis, or a Fourier transform analysis. In step 120, the feature vectors are compared to one or more of the recognition models that have been stored during the above-described training process. During this comparison, the degree of similarity between the feature vectors and recognition models is computed. Finally, in step 130, the speech recognizer determines, based on the recognition model similarity scores, the recognition model that best matches the unknown speech signal. The speech recognizer then outputs the word corresponding to the recognition model having the highest similarity score.
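The feature-extraction step (step 110) can be illustrated with a minimal sketch, assuming a naive Fourier transform analysis over fixed-length, overlapping frames. The function names, the frame length, and the hop size are hypothetical choices for illustration and are not taken from the source.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split the sampled signal into overlapping frames, one per time interval."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def dft_magnitudes(frame):
    """Magnitude spectrum of one frame (naive DFT, non-negative frequency bins)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def extract_features(samples, frame_len=8, hop=4):
    """Return one feature vector (a magnitude spectrum) per frame."""
    return [dft_magnitudes(frame) for frame in frame_signal(samples, frame_len, hop)]
```

A practical recognizer would more likely use a filter bank or linear predictive coding analysis, as the text notes; the naive transform is used here only to keep the sketch self-contained.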
Most of today's speech recognizers are based on the Hidden Markov Model (HMM). As will be appreciated, the HMM provides a pattern matching approach to speech recognition as described in detail in "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," by Lawrence R. Rabiner, Proceedings of the IEEE, Vol. 77, No. 2, February 1989, pp. 257-286, the entirety of which is incorporated by reference herein.
An HMM is generally defined by the following elements:
1. The number of states in the model, $N$;
2. The state-transition matrix
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix},$$
where $a_{ij}$ is the probability of the process moving from state $q_i$ to state $q_j$ at time $t = 1, 2, \ldots$, given that the process is at state $q_i$ at time $t-1$;
3. The observation probability distribution $b_i(o)$ for all states $q_i$, $i = 1, \ldots, N$; and
4. The initial state probability $\pi_i$ for $i = 1, 2, \ldots, N$.
The Viterbi algorithm is commonly used in HMM speech recognizers to perform the comparison and decision operations described in FIG. 1. The Viterbi algorithm may quite simply be stated as: given the observations $o_t$, $t = 1, 2, \ldots, T$, where $T$ is the duration of the detected speech measured in number of feature vectors, find the most probable state sequence for each model in the vocabulary and choose the model with the highest probability. The following represents a conventional pattern matching algorithm for performing this task.
For every speech model $\lambda_m$, where $m = 1, \ldots, M$, the following processes are performed:
Pre-processing: compute and store
$$\log \pi_i,\ 1 \le i \le N; \qquad \log b_i(o_t),\ 1 \le i \le N,\ 1 \le t \le T; \qquad \log a_{ij},\ 1 \le i, j \le N$$
Initialization:
$$\delta_1(i) = \log \pi_i + \log b_i(o_1), \quad 1 \le i \le N$$
Recursion:
$$\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i) + \log a_{ij} \right] + \log b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$$
Termination:
$$P_m^{*} = \delta_T(N)$$
where the score $\delta_t(j)$ is an approximation of the logarithm of the probability for the most probable path passing node $j$ at time $t$ and $P_m^{*}$ is the logarithm of the probability for the most probable path ending at node $N$ at time $T$. The recognition result (i.e., the word to which the unknown speech signal corresponds) is $\lambda = \lambda_m$, where
$$m = \arg\max_{1 \le k \le M} P_k^{*}$$
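The log-domain scoring described above can be sketched as follows. This is a minimal illustration, assuming the observation log-probabilities have already been evaluated into a table `log_b[t][j]`. The names `to_log`, `viterbi_log_score`, and `recognize`, and the dictionary layout of the models, are hypothetical and not taken from the source.

```python
import math

NEG_INF = float("-inf")

def to_log(p):
    """Logarithm with the convention log(0) = -infinity."""
    return math.log(p) if p > 0.0 else NEG_INF

def viterbi_log_score(log_pi, log_a, log_b):
    """Best-path log score for one model.

    log_pi[i]   : log initial probability of state i
    log_a[i][j] : log transition probability from state i to state j
    log_b[t][j] : log observation probability of frame t in state j
    Returns the score of the best path ending in the last state at the last frame.
    """
    n_states = len(log_pi)
    # Initialization: scores at the first frame.
    delta = [log_pi[i] + log_b[0][i] for i in range(n_states)]
    # Recursion: one max-plus update per node (frame t, state j).
    for t in range(1, len(log_b)):
        delta = [
            max(delta[i] + log_a[i][j] for i in range(n_states)) + log_b[t][j]
            for j in range(n_states)
        ]
    # Termination: path must end in the final state.
    return delta[n_states - 1]

def recognize(models):
    """Return the name of the model with the highest best-path score.

    models: dict mapping a word name to a (log_pi, log_a, log_b) tuple.
    """
    return max(models, key=lambda name: viterbi_log_score(*models[name]))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow for long utterances, and a score of negative infinity simply marks a path that is impossible under the model.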
The above-described conventional pattern matching algorithm has four stages, namely, a pre-processing stage, an initialization stage, a recursion stage and a termination stage. In the pre-processing stage, logarithmic values of the initial state probability $\pi_i$ for $i = 1, \ldots, N$, the description of the feature probability distribution $b_i(o_t)$, where $1 \le i \le N$ and $1 \le t \le T$, and the state-transition probabilities $a_{ij}$, where $i \ge 1$ and $j \le N$, are computed and stored in memory. The function $b_j(o)$ and the values of the initial state probability $\pi_i$ and the state-transition probabilities $a_{ij}$ generally depend upon the particular speech model $\lambda_m$ being considered. However, in order to decrease the amount of data describing the models, some of the constants are set to be equal regardless of the model. For example, the initial state probabilities are often set to $\pi_1 = 1$ and $\pi_i = 0$ when $i > 1$ for all the speech models. These logarithmic values that are determined during the pre-processing stage are generally computed and stored during the "training" of a speech recognizer. It will be appreciated that, in those situations where the value of a particular probability is equal to zero, the following convention will be used: $\log(0) = -\infty$. Since the pre-processing stage is performed once and saved, the cost of this processing stage to most systems is negligible.
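The pre-processing stage can be sketched as a one-time conversion of the stored probabilities to logarithms. This is a minimal illustration, assuming the model parameters are held as plain Python lists; the names `safe_log` and `preprocess` are hypothetical.

```python
import math

def safe_log(p):
    """Logarithm using the convention log(0) = -infinity."""
    return math.log(p) if p > 0.0 else float("-inf")

def preprocess(pi, a):
    """Convert initial and transition probabilities to log values once, for storage."""
    log_pi = [safe_log(p) for p in pi]
    log_a = [[safe_log(p) for p in row] for row in a]
    return log_pi, log_a

# Common simplification: every model starts in its first state.
log_pi, log_a = preprocess([1.0, 0.0, 0.0],
                           [[0.6, 0.4, 0.0],
                            [0.0, 0.7, 0.3],
                            [0.0, 0.0, 1.0]])
```

With $\pi_1 = 1$, the first entry of `log_pi` is zero and all others are negative infinity, so only paths starting in the first state can obtain a finite score.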
In the initialization stage, the path score $\delta_1(i)$ is calculated at time 1 for each state $i$, where $1 \le i \le N$. This calculation involves fetching the logarithmic values of the state-transition probabilities $a_{ij}$ and the description of the function $b_j(o)$ for $j = 1$. In the recursion stage, the score $\delta_t(j)$ is calculated over state $i$, ranging from 1 to $N$, at time $t$, where $2 \le t \le T$, and for state $j$, where $1 \le j \le N$. It will be appreciated that the first score calculation is determined for $t = 2$ and state $j$ ranging from 1 to $N$. Since the value of $j$ changes for each calculation, the description of the function $b_j(o)$ also changes for each calculation. As such, a memory fetch operation is performed for each calculation involving a different $j$ value in order to retrieve the appropriate description of the function $b_j(o)$. The value of $t$ is then incremented to 3 and the score is calculated for state $j$ ranging from 1 to $N$. It is evident from this calculation that memory accesses on the order of $O(N^2)$ are performed during this stage for each model $m$.
Finally, during the termination stage, the highest probability result (or best path score) for each specific model is determined from the calculations obtained in the recursion stage. An overall best path score is obtained by comparing the best path scores obtained for each model m. Additional information regarding the above-described conventional algorithm can be found in "Fundamentals of Speech Recognition," by Lawrence R. Rabiner et al., Prentice Hall, 1993, pp. 321-389, which is incorporated by reference herein.
Looking closely at the recursion operation described above, it is apparent that memory accesses on the order of $O(MN^2)$ are needed for each time interval $t$ from 2 to $T$. Often, the so-called Bakis model (or left-right model) is used, which requires that $a_{ij} = 0$ for $j < i$ and for $j > i + \Delta$, where $\Delta$ represents the maximum number of states that can be jumped in moving along a single path. If the Bakis model is used, the number of memory accesses needed for the above-described algorithm is reduced to the order of $O(M \Delta N)$.
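The reduction can be made concrete by counting the transition values that must be fetched per time interval for one model. This is a minimal sketch, assuming one fetch per permitted transition; the function names are hypothetical.

```python
def full_transition_fetches(n_states):
    """Fully connected model: every state can precede every state."""
    return n_states * n_states

def bakis_transition_fetches(n_states, max_jump):
    """Bakis model: state j (0-based) is reachable only from states j - max_jump .. j."""
    return sum(min(j, max_jump) + 1 for j in range(n_states))

def fetches_per_interval(n_models, n_states, max_jump=None):
    """Total transition fetches per time interval across all models."""
    per_model = (full_transition_fetches(n_states) if max_jump is None
                 else bakis_transition_fetches(n_states, max_jump))
    return n_models * per_model
```

For example, with 10 states and a maximum jump of 2, the fully connected model needs 100 transition fetches per interval while the Bakis model needs 27, which is bounded by $(\Delta + 1)N = 30$.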
FIG. 2 illustrates the order in which calculations are performed for a speech recognizer using the Bakis model with $\Delta = 2$. Each time instant $t$ in FIG. 2 may correspond to a time instant when the feature extraction unit is delivering feature vectors. As illustrated, one score calculation of the form
$$\delta_t(j) = \max_{j - \Delta \le i \le j} \left[ \delta_{t-1}(i) + \log a_{ij} \right] + \log b_j(o_t)$$
is performed at each node, starting at time 2, state 1, and proceeding to time 2, state $N$. Each calculation involves, inter alia, one or more memory access operations in order to retrieve the necessary data stored during the pre-processing stage. The calculations and memory access operations are then performed for the node at time 3, state 1 to the node at time 3, state $N$. These calculations and memory access operations continue for all nodes, ending at the node at time $T$, state $N$.
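The node-visiting order described above can be sketched as a time-major traversal of the trellis. This is a minimal illustration using 1-based time and state indices to match the text; the names `trellis_order` and `bakis_predecessors` are hypothetical.

```python
def trellis_order(n_frames, n_states):
    """Yield (t, j) nodes in the conventional order: all states at t=2, then t=3, and so on."""
    for t in range(2, n_frames + 1):
        for j in range(1, n_states + 1):
            yield t, j

def bakis_predecessors(j, max_jump=2):
    """States at time t-1 that may precede state j under the Bakis model."""
    return list(range(max(1, j - max_jump), j + 1))

# Each node (t, j) triggers fetches for state j and its permitted predecessors.
for t, j in trellis_order(n_frames=3, n_states=4):
    predecessors = bakis_predecessors(j)
```

Because the inner loop runs over states, the model data for every state is fetched again at every time instant, which is the access pattern that becomes costly when the reference data resides in external memory.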
It will be appreciated that designers of speech recognition systems desire the presence of a large vocabulary size since a large vocabulary size allows the speech recognition system to match an input speech signal to a greater number of words. However, a large vocabulary size often requires that the speech reference data (i.e., the data used in the recursion stage of the above-described pattern matching algorithm) be stored in external memory. It is well established that accesses to external memory are slower than accesses to a system's internal memory. As such, it is desirable to limit the number of external memory access operations since a large number of external memory access operations can lead to intolerable delays.
There exists a need for a system and method for reducing the number of external memory access operations in speech recognition systems compared to conventional techniques.