The present invention relates to pattern recognition. In particular, the present invention relates to processing signals used in pattern recognition.
A pattern recognition system, such as a speech recognition system, takes an input signal and attempts to decode the signal to find a pattern represented by the signal. For example, in a speech recognition system, a speech signal (often referred to as a test signal) is received by the recognition system and is decoded to identify a string of words represented by the speech signal.
To decode the incoming test signal, most recognition systems utilize one or more models that describe the likelihood that a portion of the test signal represents a particular pattern. Examples of such models include Neural Nets, Dynamic Time Warping, segment models, and Hidden Markov Models (HMMs).
Most commercially available speech recognition systems use HMMs to match speech patterns in speech that is divided into overlapping “frames”, often separated from one another by approximately ten milliseconds. Decomposing speech into these ten-millisecond frames is just one example of an input being transformed into a series of time-sequenced frames. Traditionally, these frames are evaluated one at a time: all HMMs are updated for a single frame in round-robin fashion before the system moves on to the next frame.
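The framing step described above can be sketched as follows. This is a minimal illustration rather than any particular system's front end: the function name `split_into_frames` is hypothetical, and the 25-millisecond window length is an assumption chosen only so that frames spaced ten milliseconds apart overlap.

```python
def split_into_frames(samples, sample_rate, hop_ms=10, window_ms=25):
    """Split a sampled signal into overlapping, time-sequenced frames.

    hop_ms is the spacing between frame starts (about ten milliseconds in
    the text). window_ms is an assumed frame length; it must exceed hop_ms
    for consecutive frames to overlap.
    """
    hop = int(sample_rate * hop_ms / 1000)        # samples between frame starts
    window = int(sample_rate * window_ms / 1000)  # samples per frame
    frames = []
    start = 0
    # Keep only full-length frames; a trailing partial frame is dropped.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop
    return frames
```

With a 16 kHz signal, each frame under these assumptions holds 400 samples and starts 160 samples after the previous one.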
FIG. 3 illustrates a basic representation of an HMM, as described in detail in many texts including, for example, Chapter 8 of Spoken Language Processing, by Huang, Acero and Hon. At any given time, the model has a given probability of being in any of the various states. Each state has an output probability distribution and transition probabilities to other states. In the case of speech, the output distribution models an acoustic feature set derived from raw speech waveforms broken into the 10 millisecond frames. These transition and output probabilities are generated by a training step in accordance with known techniques.
The decoding problem for HMMs can be stated as follows: given an HMM and a sequence of observations, what is the most likely state sequence that produces the sequence of observations? The standard method of solving this problem is Dynamic Programming (DP), commonly known in this setting as Viterbi decoding, and is illustrated in FIG. 4. The six-state HMM described with respect to FIG. 3 has been turned on its side, and time runs along the horizontal axis. Each “point” (a combination of state and time) in this grid represents a probability that the HMM is in that state at that time, given the observations. One possible path through the DP matrix is highlighted in bold (1-8-14-21-27-33-40-47-53-59-66), representing a particular alignment or state sequence. The probability for a point depends on the probabilities of the previous points, on the transition probabilities, and on the output probabilities for that time step. Since each point depends on several previous points, the probability for a point can be calculated only after those previous points have been calculated, which limits the order of calculation. A “time-synchronous” evaluation order is shown in FIG. 4 by the numbers within the points; the system evaluates all states for a given time step before starting again with the next time step. Note that this is not the only possible evaluation order: any order that calculates a point only after its predecessors have been calculated is allowable. The gray points illustrate states that are either unreachable or do not lead to possible finish states and so do not need to be evaluated, although many implementations evaluate them anyway.
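The time-synchronous evaluation described above can be sketched as a small Viterbi-style dynamic program. This is an illustrative sketch under stated assumptions, not the implementation of any particular recognizer: the function and argument names are hypothetical, scores are kept as log probabilities to avoid underflow, and the two-state example model is invented for illustration (it is not the six-state model of FIG. 3).

```python
import math

NEG_INF = float("-inf")

def log_prob(p):
    """Log of a probability, mapping 0 to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(log_init, log_trans, log_emit):
    """Return (best state sequence, its log probability).

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log output probability of frame t in state s
    """
    n_states, n_frames = len(log_init), len(log_emit)
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    for s in range(n_states):
        score[0][s] = log_init[s] + log_emit[0][s]
    # Time-synchronous order: finish every state of frame t before t + 1.
    for t in range(1, n_frames):
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda p: score[t - 1][p] + log_trans[p][s])
            back[t][s] = best
            score[t][s] = (score[t - 1][best] + log_trans[best][s]
                           + log_emit[t][s])
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: score[-1][s])
    path = [last]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, score[-1][last]

# Hypothetical two-state, left-to-right example with three frames.
example_init = [log_prob(1.0), log_prob(0.0)]
example_trans = [[log_prob(0.5), log_prob(0.5)],
                 [log_prob(0.0), log_prob(1.0)]]
example_emit = [[log_prob(0.9), log_prob(0.1)],
                [log_prob(0.8), log_prob(0.2)],
                [log_prob(0.1), log_prob(0.9)]]
best_path, best_score = viterbi(example_init, example_trans, example_emit)
```

This sketch evaluates every point, including unreachable ones, which simply keep a score of negative infinity. The inner loops could be reordered in any way that computes each point after its predecessors.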
In a real-time system, there may be tens of thousands of such HMMs running at the same time. These models consume enough computer memory that each pass through the entire model set often exhausts the CPU cache capacity. This slows speech processing considerably, since memory operations involving solely the CPU cache occur many times faster than memory operations involving higher-level memory, such as main system memory.
Another method of solving the problem is described in a paper entitled Time-First Search For Large Vocabulary Speech Recognition, by Tony Robinson and James Christie. This method essentially switches the order of HMM evaluation from evaluating multiple models for a given time frame, to evaluating multiple time frames for a given model. This method purports to reduce processing memory requirements while cooperating with standard CPU memory cache operations because a number of operations fall into the same physical memory range.
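The difference between the two evaluation orders can be illustrated with a simplified sketch. It is not the algorithm of the cited paper: it assumes, for illustration only, that the models are independent of one another (no transitions between models, no pruning), and all function names are hypothetical. Under that assumption both loop orders compute the same scores, and only the memory access pattern differs.

```python
def step(log_trans, prev, log_emit_t):
    """Advance one model by one frame (a single Viterbi update)."""
    n = len(prev)
    return [max(prev[p] + log_trans[p][s] for p in range(n)) + log_emit_t[s]
            for s in range(n)]

def start(log_init, log_emit_0):
    """Scores of one model after the first frame."""
    return [li + e for li, e in zip(log_init, log_emit_0)]

def time_synchronous(models, emissions):
    """Outer loop over frames: every model is touched once per frame."""
    scores = [start(li, emissions[m][0]) for m, (li, _) in enumerate(models)]
    for t in range(1, len(emissions[0])):
        for m, (_, log_trans) in enumerate(models):
            scores[m] = step(log_trans, scores[m], emissions[m][t])
    return scores

def model_first(models, emissions):
    """Outer loop over models: one model's data is reused across all frames."""
    scores = []
    for m, (log_init, log_trans) in enumerate(models):
        current = start(log_init, emissions[m][0])
        for t in range(1, len(emissions[m])):
            current = step(log_trans, current, emissions[m][t])
        scores.append(current)
    return scores
```

Here `models` is a list of `(log_init, log_trans)` pairs and `emissions[m][t][s]` is the log output probability of frame `t` in state `s` of model `m`. In the second ordering each model's parameters are used repeatedly before the next model is loaded, which is the locality property the text attributes to evaluating multiple time frames for a given model.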
To provide real-time continuous speech recognition for large vocabulary applications, further developments are required to improve not only the efficiency of CPU cache use but also the efficiency of the processing routine itself. Processing speed is of critical importance, and so is processing accuracy.