There are two primary approaches to performing speech recognition: 1) template matching based approaches such as Dynamic Time Warping (DTW); and 2) statistical analysis based approaches such as the Hidden Markov Model (HMM). In speech recognition, an analog speech signal generated by a suitable sound capture device such as a microphone is sampled and quantized, typically through an analog-to-digital converter operating at an appropriate sampling rate (e.g., 8-20 kHz), to provide a digitized speech signal in the form of digital samples. Short-time spectral properties of the digitized speech signal are then analyzed by successively placing a data window over the digitized speech signal, each data window corresponding to a group of successive digital samples that will be referred to as a “frame” herein. The digital samples in each frame are then analyzed according to the speech recognition approach being utilized.
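By way of illustration only, the framing step described above may be sketched as follows. The function name `frame_signal`, the 25 ms frame duration, and the 10 ms hop between frames are assumptions chosen for the example and are not taken from this description; a simple rectangular data window is assumed.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=25.0, hop_ms=10.0):
    """Split digitized samples into successive, possibly overlapping frames.

    frame_ms and hop_ms are illustrative assumptions, not values from the text.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)  # samples per frame
    hop = int(sample_rate_hz * hop_ms / 1000)          # samples between frame starts
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame and hop durations must cover at least one sample")

    frames = []
    start = 0
    # Slide the data window over the signal; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


if __name__ == "__main__":
    one_second = [0] * 8000  # one second of silence sampled at 8 kHz
    frames = frame_signal(one_second, sample_rate_hz=8000)
    print(len(frames), "frames of", len(frames[0]), "samples each")
```

Each returned frame would then be passed to whatever per-frame analysis the chosen recognition approach requires.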
In a template matching based approach such as DTW, two sequences that may have temporal variation are typically compared. During a training phase, speech features such as Mel Frequency Cepstral Coefficients (MFCCs) are generated and stored per frame to serve as a “template”. In a testing phase, MFCCs are once again generated per frame for the test speech being processed, this MFCC sequence is compared with each stored template MFCC sequence, and the word whose template gives the smallest difference is chosen as the output.
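The comparison described above may be sketched, purely as an example, with the classic dynamic-programming form of DTW. The names `frame_distance`, `dtw_distance`, and `recognize` are hypothetical, the Euclidean per-frame distance is an assumption, and the short one-dimensional feature vectors in the usage lines merely stand in for real per-frame MFCC vectors.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two per-frame feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best time-warped alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_b frame repeated
                                 cost[i][j - 1],      # seq_a frame repeated
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]


def recognize(test_seq, templates):
    """Return the word whose stored template has the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(test_seq, templates[word]))


if __name__ == "__main__":
    templates = {
        "yes": [[0.0], [1.0], [2.0]],
        "no": [[5.0], [5.0], [5.0]],
    }
    slow_yes = [[0.0], [0.0], [1.0], [2.0], [2.0]]  # same shape, spoken more slowly
    print(recognize(slow_yes, templates))
```

Because the warping path may repeat frames of either sequence, an utterance spoken faster or slower than the template can still align with it at low cost.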
Circuitry implementing the DTW approach generates a template of a word or phrase, and then when a user speaks a phrase the circuitry generates a sequence representing this speech and compares this sequence to the template. Both the template and the speech have spectral and temporal components, and when the comparison yields a close enough match the circuitry determines that an authorized user has spoken the proper phrase, typically referred to as a code phrase. With the DTW approach the circuitry, such as a processor and associated memory devices, must store a template for every code phrase that is to be recognized. As a result, the circuitry implementing the DTW approach typically requires a relatively large amount of memory. For example, if a particular application requires the circuitry implementing the DTW approach to recognize a relatively large number of code phrases, then a relatively large amount of memory may be required for storage of the associated templates. The availability or cost of this required memory may be an issue in certain types of electronic devices, such as mobile devices like smartphones.
In statistical analysis based approaches such as HMM, a statistical model of the word to be recognized is built. The general structure of the HMM approach is shown in FIG. 1 and typically consists of a number of states, each with its observable output and its transition probabilities to the other states. In order to define the required parameters, multiple utterances of the words to be recognized are needed. FIG. 2 shows how an HMM is built for the word “apple.” The phonetic pronunciation of this word is shown at the top of FIG. 2. In HMM and other statistical analysis based approaches, typically a subword or a phoneme of the word to be recognized is modeled instead of the whole word, so that these subwords or phonemes can then be used to model different desired words.
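As a minimal sketch of how such a model may be evaluated, the following example scores an observation sequence with the standard forward algorithm for a discrete-output HMM. The two-state left-to-right model, its probabilities, and the name `forward_likelihood` are assumptions made for illustration only; they do not correspond to the models of FIG. 1 or FIG. 2.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Probability that the HMM generates the observation sequence.

    initial[s]       : probability of starting in state s
    transition[i][j] : probability of moving from state i to state j
    emission[s][o]   : probability that state s outputs symbol o
    """
    n_states = len(initial)
    # alpha[s] is the probability of the observations so far, ending in state s.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[j][obs]
            * sum(alpha[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)


if __name__ == "__main__":
    # Hypothetical two-state left-to-right model with two output symbols.
    initial = [1.0, 0.0]
    transition = [[0.5, 0.5],
                  [0.0, 1.0]]
    emission = [[0.9, 0.1],
                [0.2, 0.8]]
    print(forward_likelihood([0, 1], initial, transition, emission))
```

In a recognizer of this kind, one such likelihood would be computed per candidate word or subword model, and the model with the highest likelihood would be selected.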
The HMM is a statistical model that requires a relatively complex training procedure and consequently is typically relatively computationally complex. This complexity may cause circuitry that implements the HMM, such as a processor executing corresponding software, to consume significant amounts of power. As a result, this computational complexity and the resulting power consumption make the HMM unsuitable for many low power applications, such as handheld, battery powered devices like smartphones and tablet computers, which require the circuitry implementing the HMM to consume relatively low power. For example, in an application where a smartphone is in a sleep or low-power mode of operation, any circuitry operable during this mode, including voice recognition circuitry, must of course be low power circuitry or the purpose of the low-power mode would be defeated.
There is a need for improved methods and circuits for voice recognition in low power applications such as in battery powered devices like smartphones.