In speech recognition systems, the first filter bank applied to an incoming speech signal, typically generated by a microphone mounted in a piece of portable communication equipment such as a phone, a toy, a TV set or a PC, is normally a variant of a Mel Frequency Cepstral/Cepstrum Coefficient (MFCC) filter bank, regardless of whether the underlying speech recognition system is based on neural nets (NN), Hidden Markov Models (HMM), or factor graphs (FG). The purpose of the complete speech recognition system is to provide voice activated control over device functions such as device wake-up or power-on from a sleep-mode. However, the MFCC filter banks of known speech recognition devices and systems are computationally complex and for this reason are often executed on a programmable application processor such as a programmable fixed or floating point DSP core or engine. These types of DSP cores often use 24 or 32 bit word lengths for representation of incoming speech/audio signal samples, leading to datapath circuits, data registers and logic with correspondingly large word lengths to accommodate the word format of the incoming audio samples. These large word lengths lead to high power consumption in the MFCC filter bank during processing of the incoming speech or audio signal, which is a significant obstacle to the application of MFCC based speech recognition in portable/battery powered equipment.
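As a rough illustration of what an MFCC-style filter bank looks like, the sketch below places filter center frequencies at equal intervals on the mel scale. It is a minimal sketch under stated assumptions, not the filter bank of any system discussed here: the mel formula is one commonly used approximation, and the function names, the filter count of 20 and the 0 to 4000 Hz band are hypothetical choices made for the example.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mel using a commonly used approximation (an assumption)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, f_low_hz, f_high_hz):
    """Return center frequencies (Hz) of filters equally spaced in mel."""
    mel_low = hz_to_mel(f_low_hz)
    mel_high = hz_to_mel(f_high_hz)
    # num_filters + 1 intervals leave room for the band edges at both ends.
    step = (mel_high - mel_low) / (num_filters + 1)
    return [mel_to_hz(mel_low + step * (i + 1)) for i in range(num_filters)]


# Hypothetical example values: 20 filters covering 0 to 4000 Hz.
centers = mel_center_frequencies(num_filters=20, f_low_hz=0.0, f_high_hz=4000.0)
```

The resulting centers are closely spaced at low frequencies and progressively wider apart at high frequencies, which is the property that distinguishes a mel-spaced bank from a uniformly spaced one.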
Furthermore, since the speech recognition application or program traditionally executes on the programmable external application processor, e.g. a DSP core, that processor has to reside continuously in an active mode of operation to detect the presence of the target word, phrase or command in the incoming microphone signal. This requirement for continuous operation of the programmable external application processor presents an obstacle to providing voice activated system power-up because of the high power consumption of the continuously operating processor. The high power consumption is a significant problem for the speech recognition application in both battery powered portable equipment and mains connected electrical equipment, in view of battery life-time and the on-going world-wide efforts to reduce the energy consumption of electrical equipment. Hence, it would be of considerable benefit to provide a separate microphone circuit assembly comprising a speech recognition unit with low power consumption that is capable of operating independently of the external application processor. The microphone circuit assembly could comprise a speech recognition unit capable of recognizing one or more predetermined target word(s) or phrase(s) and indicating the recognition of such target word(s) or phrase(s) to the external application processor by transmission of a suitable recognition signal. Such a microphone circuit assembly would allow the external application processor to reside in sleep-mode without processing the microphone signal, by delegating the task of recognizing the target word or phrase in the incoming microphone signal to the microphone circuit assembly.
The microphone circuit assembly may indicate the recognition of the target word or phrase to the external application processor by a suitable recognition signal allowing the application processor to switch from the sleep-mode to an active mode and take appropriate action in response.
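The division of labor described above can be sketched as a small model in which the microphone circuit assembly signals the application processor only when a target phrase is recognized. This is a minimal sketch under stated assumptions: the class names, method names and the use of text labels in place of audio are all hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ApplicationProcessor:
    """Hypothetical host processor that stays asleep until it is signaled."""
    mode: str = "sleep"
    handled: list = field(default_factory=list)

    def on_recognition_signal(self, phrase):
        # The recognition signal switches the host from sleep to active mode.
        self.mode = "active"
        self.handled.append(phrase)


@dataclass
class MicrophoneAssembly:
    """Hypothetical microphone circuit assembly with its own recognizer."""
    targets: set
    host: ApplicationProcessor

    def process(self, recognized_text):
        # Stand-in for the recognizer: a real unit would classify audio
        # features, not compare text labels.
        if recognized_text in self.targets:
            self.host.on_recognition_signal(recognized_text)
            return True
        return False


host = ApplicationProcessor()
mic = MicrophoneAssembly(targets={"wake up"}, host=host)
ignored = mic.process("hello")      # not a target; host is left asleep
detected = mic.process("wake up")   # target phrase; host is woken
```

In this model the host never sees the microphone signal itself; it only receives the recognition event, which is what allows it to remain in sleep-mode in the meantime.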
EP 0 871 157 A2 discloses a speech recognition method and apparatus. The speech recognition device receives its input speech signal s(n) from a microphone. The speech signal is transformed into a digital form by means of an A/D converter using a sampling frequency of 8 kHz and 12 bits of resolution per sample. The speech recognition device comprises a front-end where the speech signal is analyzed and a feature vector is modeled. The feature vector may be modeled by defining Mel-Frequency Cepstral Coefficients (MFCC).
U.S. 2003/110033 A1 discloses a method and system for real-time speech recognition. The speech recognition is based on the MFCC algorithm and Hidden Markov Models (HMM). The speech recognition system may be implemented on a DSP suitable for a low resource environment. A WOLA filter bank operates as a co-processor to a DSP core and applies a 256 point FFT to consecutive or running segments of the digitized input speech signal.
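To make the idea of applying a 256 point FFT to running segments concrete, the sketch below splits a sample sequence into overlapping 256-sample frames and transforms each one. It is a plain radix-2 FFT on rectangular frames, not the WOLA filter bank structure of the cited application, and the hop size of 128 samples, the function names and the test tone are assumptions made for the example.

```python
import cmath
import math

FRAME_LEN = 256  # FFT size named in the text


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def frames(samples, frame_len=FRAME_LEN, hop=FRAME_LEN // 2):
    """Yield running segments; the 50% overlap is an assumed hop size."""
    for start in range(0, len(samples) - frame_len + 1, hop):
        yield samples[start:start + frame_len]


def peak_bins(samples):
    """Return the strongest FFT bin (below Nyquist) for each frame."""
    peaks = []
    for frame in frames(samples):
        mags = [abs(c) for c in fft(frame)[:FRAME_LEN // 2]]
        peaks.append(max(range(len(mags)), key=mags.__getitem__))
    return peaks


# Hypothetical test tone with exactly 16 cycles per 256-sample frame.
tone = [math.sin(2 * math.pi * 16 * n / FRAME_LEN) for n in range(1024)]
tone_peaks = peak_bins(tone)
```

Because the tone completes a whole number of cycles in every frame, each frame's energy falls into a single bin, which makes the framing and the transform easy to check by inspection.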
The paper ‘A Real Time Noise-Robust Speech Recognition System’, Wada et al., ECTI November 2005 discloses a speech recognition method and apparatus based on custom hardware such as a full-custom ASIC design or an FPGA design. A speech recognition device is based on an FPGA board. Speech input signals to the speech recognition device on the FPGA board are generated by sampling a microphone signal with an A/D converter at a sampling rate of 11.025 kHz, quantizing the speech samples to a 12-bit word length.
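The word lengths mentioned in this section (12 bits in the cited prior art, 24 or 32 bits in typical DSP cores) can be compared with a short quantization sketch. It is a minimal illustration under stated assumptions: the function name, the normalized input range of -1.0 to 1.0 and the signed two's-complement coding are choices made for the example, not details taken from the cited documents.

```python
def quantize(sample, bits=12):
    """Map a sample in [-1.0, 1.0] to a signed integer code of `bits` bits."""
    levels = 1 << (bits - 1)            # 2048 for 12 bits
    code = int(round(sample * levels))
    # Clamp to the representable two's-complement range.
    return max(-levels, min(levels - 1, code))


code_12 = quantize(0.5, bits=12)   # half scale at 12 bits
code_24 = quantize(0.5, bits=24)   # the same sample at 24 bits
```

Each additional bit doubles the number of representable levels, so a 24-bit code distinguishes 4096 times as many levels as a 12-bit code and needs registers and datapaths twice as wide to hold it.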