Real-time data pattern recognition is increasingly used to analyse data streams in the process of controlling small and networked electronic systems. For example, speech recognition systems are increasingly common in the mobile, server, and PC markets. On the low end of the capability spectrum, speech recognition systems need to recognize connected digits (vocabulary of 10) or alphabet letters (vocabulary of 26), while on the high end of the spectrum a 5,000 word continuous dictation capability may be necessary. If grammatical models are also included, then a 20,000 trigram vocabulary could be required.
The word error rate of speech recognition systems is significantly higher than that of human listeners. In some cases (in particular in noisy environments), machine speech recognition systems may have an error rate an order of magnitude higher than that of a human listener.
Large vocabulary speech recognition systems are typically composed of a signal processing stage (feature extractor), followed by an acoustic modeling stage (senone calculator), a phoneme evaluator (Viterbi search), and a word modeler.
In the signal processing stage, techniques such as linear predictive coding (LPC) or fast Fourier transforms (FFT) are applied in order to extract a parametric digital representation of the incoming signal. This procedure is repeated at regular time intervals, or frames, of approximately 10 ms.
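The framing step described above can be sketched as follows. This is a minimal illustration only, not the feature extractor of any particular system: the 16 kHz sampling rate, the non-overlapping frames, and the helper names `split_frames` and `magnitude_spectrum` are all assumptions made for the example, and a naive discrete Fourier transform stands in for an FFT so that the block needs only the standard library.

```python
import cmath
import math

def split_frames(samples, sample_rate_hz, frame_ms=10):
    """Cut a sample list into consecutive, non-overlapping frames (assumed layout)."""
    frame_len = sample_rate_hz * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..n/2; an FFT would give the same values faster."""
    n = len(frame)
    return [abs(sum(frame[k] * cmath.exp(-2j * math.pi * b * k / n)
                    for k in range(n)))
            for b in range(n // 2 + 1)]

# Hypothetical input: 30 ms of a 1 kHz tone sampled at 16 kHz.
RATE_HZ = 16000
signal = [math.sin(2 * math.pi * 1000 * t / RATE_HZ) for t in range(480)]

frames = split_frames(signal, RATE_HZ)           # three 10 ms frames of 160 samples
spectrum = magnitude_spectrum(frames[0])         # one parametric vector per frame
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

With 160 samples per frame at 16 kHz, each bin spans 100 Hz, so the 1 kHz tone peaks in bin 10. A practical front end would typically add windowing and further processing of the spectrum, which this sketch omits.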
In the acoustic modeling stage, these parametric observation vectors are then compared to the senones stored in memory (the term “senone” denotes a basic subphonetic unit). The comparison of the parametric observation vector with the senones is a computation- and memory-intensive task, as up to 20,000 senones are compared every 10 ms. During this comparison, a multivariate Gaussian probability may be calculated for each senone; this probability represents the mathematical “distance” between the incoming feature vector and that stored senone.
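By way of illustration, the per-frame senone comparison can be sketched as below. The sketch assumes that each senone is a single Gaussian with a diagonal covariance matrix and that scores are kept as log-probabilities; neither assumption is stated above, and the names `log_gaussian_diag` and `score_senones` are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at observation x."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi
    return -0.5 * total

def score_senones(x, senones):
    """Return one log-likelihood per stored senone for a single frame."""
    return [log_gaussian_diag(x, mean, var) for mean, var in senones]

# Hypothetical two-dimensional senones given as (mean, variance) pairs.
senones = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([3.0, 3.0], [1.0, 1.0]),
]
frame = [0.1, -0.2]

scores = score_senones(frame, senones)
best = max(range(len(scores)), key=scores.__getitem__)  # index of the closest senone
```

Every senone requires a pass over its mean and variance vectors for every frame, which is why the cost grows with both the number of senones and the dimension of the feature vector.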
In the phoneme evaluation stage, Hidden Markov Models (HMMs) may be used to model phonemes as sequences of senones, where specific senones are probabilistically associated with a state in an HMM. For a given observed sequence of senones, there is a most likely sequence of states in a corresponding HMM. This corresponding HMM is then associated with the observed phoneme. In order to find the most likely phoneme corresponding to a sequence of senones, the Viterbi algorithm is often employed.
The Viterbi algorithm performs a computation which starts at the first frame and then proceeds one frame at a time, in a time-synchronous manner. A probability score is computed for each senone in the HMMs being considered. In this way, a cumulative probability score is successively computed for each of the possible senone sequences as the Viterbi algorithm analyzes the sequential observation vectors. By the end of an utterance, the HMM having the highest probability score computed by the Viterbi algorithm provides the most likely phoneme for the entire sequence.
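The time-synchronous computation described above can be sketched for a single HMM as follows. This is a generic log-domain Viterbi pass under assumed inputs, not the search used by any particular recognizer: the function name `viterbi`, the two-state left-to-right model, and all probability values are hypothetical.

```python
import math

NEG_INF = float("-inf")  # log of probability zero

def viterbi(log_init, log_trans, log_emit):
    """Return (best log score, best state path).

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log score of frame t under the senone tied to state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(log_emit)):          # one frame at a time
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            ptr.append(best_prev)
        score = new_score                      # cumulative score per state
        backptrs.append(ptr)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):             # trace the best path backwards
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path

# Hypothetical two-state left-to-right HMM observed over three frames.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)],
             [NEG_INF, 0.0]]
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.8), math.log(0.2)],
            [math.log(0.1), math.log(0.9)]]

best_score, best_path = viterbi(log_init, log_trans, log_emit)
```

To choose among phonemes, the same pass would be run over each candidate HMM and the model with the highest final score selected. Working in log-probabilities is a design choice of this sketch that avoids numerical underflow over long utterances.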
The acoustic modeling stage is the computational bottleneck of the speech recognition process. This is due to two factors: 1) the large number of floating point calculations required to evaluate the multivariate Gaussian probabilities of each senone, and 2) the memory bandwidth limitations of accessing the senone data.
Evaluation of a standard SPHINX3 speech recognition system on a 1.7 GHz x86 microprocessor-based platform showed that a 1000-word task took 160% longer than real time to process and consumed a significant portion of the memory bus bandwidth. This bottleneck severely restricts the ability of mobile appliances to run large vocabulary speech recognition software with a similar architecture, because mobile processors have slower processing speeds and tighter power budgets.
Issues with the speed and storage/processing capabilities of speech recognition systems exemplify the complexities associated with analysing data streams in real time or close to real time. Thus, the problems associated with speech recognition may be generalized to the analysis of other data streams, ranging from streaming media to signal behavior in smart utility networks.
Thus, a need still remains for systems and methods for reducing bottlenecks in the analysis of data patterns in electronic and networked systems such as speech recognition systems used in cell phones. In view of the increasing need for real-time data analysis in the control of electronic devices and networks, it is increasingly critical that answers be found to these problems.
Further, in view of the ever-increasing commercial competitive pressures, along with growing consumer expectations and the diminishing opportunities for meaningful product differentiation in the marketplace, it is critical that answers be found for these problems.
Additionally, the need to reduce costs, improve efficiencies and performance, and meet competitive pressures adds an even greater urgency to the critical necessity for finding answers to these problems.
Solutions to these problems have been long sought but prior developments have not taught or suggested any solutions and, thus, solutions to these problems have long eluded those skilled in the art.