Automatic speech recognition (ASR), which converts words spoken by a person into readable text, has been a popular area of research and development for the past several decades. The goal of ASR is to allow a machine to understand continuous speech in real time with high accuracy, independent of speaker characteristics, noise, or temporal variations. Today, ASR technology can be found in a number of products, including Google Now (Google), Siri (Apple), and Echo (Amazon). In these ASR systems, speech recognition is activated by specific keywords such as “Okay Google,” “Hey Siri,” and “Alexa.” Many ASR systems perform such wake-up keyword detection in an always-on mode, continuously listening to the surrounding acoustic environment without a dedicated start control. Minimizing the power consumption of such always-on operation that can detect multiple keywords is crucial for mobile and wearable devices. The speech recognition task that follows keyword detection is far more computation- and memory-intensive, so it is typically offloaded to the cloud; in fact, a number of commercially available systems do not support speech recognition unless the device is connected to the Internet. To expand the usage scenarios for mobile and wearable devices, it is therefore important that the speech recognition engine have low hardware complexity and operate within a low-power budget.
One widely used approach to speech recognition employs a hidden Markov model (HMM) to model the sequence of words/phonemes and a Gaussian mixture model (GMM) for acoustic modeling. The most likely state sequence can be determined from the HMMs using the Viterbi algorithm. For keyword detection, a separate GMM-HMM can be trained for each keyword, while out-of-vocabulary (OOV) words are modeled with a garbage or filler model. In recent years, employing deep neural networks (DNNs) in conjunction with HMMs for keyword detection and speech recognition has shown substantial improvements in classification accuracy. Other prior work on ASR has featured recurrent neural networks or convolutional neural networks.
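The Viterbi decoding step mentioned above can be sketched as follows. This is a minimal illustration over a discrete-observation HMM in log-probability space, not the acoustic GMM-HMM pipeline itself; all model parameters in the example are toy values chosen for demonstration.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    """Find the most likely HMM state sequence for a discrete observation sequence.

    log_init:     (S,)   log initial-state probabilities
    log_trans:    (S, S) log transition probabilities (row = from-state)
    log_emit:     (S, O) log emission probabilities per observation symbol
    observations: list of observation symbol indices
    Returns (best_state_path, best_log_probability).
    """
    S = log_init.shape[0]
    T = len(observations)
    delta = np.full((T, S), -np.inf)   # best log-prob of any path ending in state s at time t
    psi = np.zeros((T, S), dtype=int)  # backpointers to the best predecessor state

    delta[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans      # scores[i, j]: come from i, land in j
        psi[t] = np.argmax(scores, axis=0)              # best predecessor for each state j
        delta[t] = scores[psi[t], np.arange(S)] + log_emit[:, observations[t]]

    # Backtrack from the best final state to recover the full path.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path.reverse()
    return path, float(delta[-1].max())

# Toy two-state HMM: state 0 tends to emit symbol 0, state 1 tends to emit symbol 1.
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.8, 0.2],
                    [0.2, 0.8]])
log_emit = np.log([[0.9, 0.1],
                   [0.1, 0.9]])
path, log_prob = viterbi(log_init, log_trans, log_emit, [0, 0, 1, 1])
print(path)  # → [0, 0, 1, 1]
```

In a keyword-spotting setting, one such decoder would score the observation sequence against each keyword's HMM and against the garbage/filler model, and the keyword whose model yields the highest likelihood (above a detection threshold) would be reported.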