This specification relates to natural language processing, and more specifically, to automatic speech recognition.
Speech input received by a speech recognition system is typically a signal captured through a noisy channel, e.g., a microphone in a noisy environment. Automatic speech recognition, or speech processing, is a computational process for converting a speech signal into a sequence of symbols or tokens in a desired output domain, such as a sequence of known phonemes, syllables, letters, and/or words. In many applications, such as automated dictation and automated digital assistance, fast and accurate transcription of a voice input into a corresponding word sequence is critical to the quality and effectiveness of the application.
Statistical acoustic modeling techniques, such as those involving hidden Markov models (HMMs) and n-gram modeling, are often used to create the framework for automatic speech recognition. Typically, state-of-the-art acoustic modeling uses numerous parameters to describe the variations in speech in a given language. For example, while English has fewer than 50 phonemes (elementary units of sound), acoustic models in state-of-the-art systems commonly employ tens to hundreds of thousands of parameters (e.g., Gaussian components) to characterize the variations in real speech samples. The high dimensionality required by these acoustic models reflects the extreme variability involved in the acoustic realization of the underlying phoneme sequences. As a result of this over-dimensioning, state-of-the-art systems consume vast computational resources, making them difficult to deploy on a mobile platform, such as a smartphone, without compromising recognition accuracy.
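The gap between the small phoneme inventory and the size of the acoustic model can be illustrated with simple arithmetic. The following is a minimal sketch only, assuming a Gaussian-mixture acoustic model with diagonal covariances. The function names and every numeric value (states per phoneme, number of tied states, mixtures per state, feature dimension) are hypothetical and chosen for illustration; none of them is taken from this specification.

```python
def gaussian_count(num_states: int, mixtures_per_state: int) -> int:
    """Total Gaussian components when each HMM state has its own mixture."""
    return num_states * mixtures_per_state


def scalar_parameter_count(num_gaussians: int, feature_dim: int) -> int:
    """Scalar values stored for diagonal-covariance Gaussians.

    Each component holds a mean vector (feature_dim values), a variance
    vector (feature_dim values), and one mixture weight.
    """
    return num_gaussians * (2 * feature_dim + 1)


# Hypothetical context-independent model: 50 phonemes, 3 states each,
# a single Gaussian per state.
small_model = gaussian_count(num_states=50 * 3, mixtures_per_state=1)

# Hypothetical context-dependent model: 5000 tied states with 16 Gaussians
# per state. Both values are assumptions made for illustration.
large_model = gaussian_count(num_states=5000, mixtures_per_state=16)

# Assumed 39-dimensional feature vectors.
large_model_scalars = scalar_parameter_count(large_model, feature_dim=39)

if __name__ == "__main__":
    print("Gaussians, small model:", small_model)
    print("Gaussians, large model:", large_model)
    print("Scalar values, large model:", large_model_scalars)
```

Under these assumed values, the context-dependent model holds 80,000 Gaussian components compared with 150 for the context-independent one, which falls within the range of tens to hundreds of thousands noted above and suggests why memory and computation become a concern on a mobile platform.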