Most large vocabulary continuous speech recognition systems use continuous density hidden Markov models (HMMs) for the acoustic modeling of speech. An HMM may comprise several active states, and the output of each active state may be modeled with a Gaussian Mixture Model (GMM) probability density function. HMMs are typically used to model sub-word units of sound or entire words. In the English language, there are approximately forty phonemes, or individual units of sound, that can be combined to form more complex utterances. Phonemes may also be considered in context: with approximately forty phonemes, there are up to 40 × 40 × 40 = 64,000 triphones (i.e., sequences of three phonemes) in the English language.
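As a rough illustration of how a GMM may score one frame of acoustic features for a single HMM state, the following Python sketch evaluates the log-likelihood of a feature vector under a mixture of Gaussians. It assumes diagonal covariances (one variance per feature dimension), which the text does not specify, and the function and parameter names are hypothetical.

```python
import math


def diag_gmm_log_likelihood(frame, weights, means, variances):
    """Return log p(frame) under a GMM with diagonal covariances.

    frame:     list of feature values for one audio frame
    weights:   mixture weights, one per component (should sum to 1)
    means:     per-component list of means, one per feature dimension
    variances: per-component list of variances, one per feature dimension
    Diagonal covariance is an assumption made for this sketch.
    """
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        # Log of the Gaussian normalization term, summed over dimensions.
        log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
        # Log of the exponential term (scaled squared distance to the mean).
        log_exp = -0.5 * sum((x - m) ** 2 / v for x, m, v in zip(frame, mu, var))
        component_logs.append(math.log(w) + log_norm + log_exp)

    # Combine components with log-sum-exp to avoid numerical underflow.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))
```

Working in the log domain is a common design choice for this kind of scoring, because the product of many small per-dimension densities can underflow in floating point.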
A model of a phoneme in isolation may be referred to as a context independent (CI) phoneme model. A model of a sequence of phonemes may be referred to as a context dependent (CD) phoneme model. For example, in the word “cat” the /c/ sound may be modeled with a CI phoneme model and the /c/a/ sound sequence may be modeled with a CD phoneme model. GMMs may be used to represent the state output probability density functions of CI phoneme models (i.e., CI GMMs) and of CD phoneme models (i.e., CD GMMs).
In conventional speech recognition systems, scores for the GMMs associated with phonemes and triphones are computed and stored for each frame of an audio signal. Computing and storing these scores requires significant processing and memory. For real-time processing, all GMM parameters (e.g., means, variances, mixture weights) must be continually loaded, resulting in a high memory bandwidth requirement. In a portable device, heavy computation and high memory bandwidth may lead to a slow response time for the end user as well as shortened battery life.
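To give a sense of scale for the memory bandwidth concern, the following Python sketch estimates how many bytes of GMM parameters would be read per second if every GMM were scored on every frame. All of the figures used in the example (number of GMMs, mixture components, feature dimension, parameter width, frame rate) are hypothetical values chosen for illustration and are not taken from the text; diagonal covariances are again assumed.

```python
def gmm_bandwidth_bytes_per_second(num_gmms, mixtures_per_gmm, feature_dim,
                                   bytes_per_param, frames_per_second):
    """Estimate bytes of GMM parameters loaded per second.

    Assumes every GMM is fully scored on every frame and that each
    mixture component stores one mean and one variance per feature
    dimension plus a single mixture weight.
    """
    params_per_component = 2 * feature_dim + 1
    params_per_gmm = mixtures_per_gmm * params_per_component
    bytes_per_frame = num_gmms * params_per_gmm * bytes_per_param
    return bytes_per_frame * frames_per_second


# Hypothetical configuration, for illustration only.
estimate = gmm_bandwidth_bytes_per_second(
    num_gmms=6000,          # assumed number of distinct state GMMs
    mixtures_per_gmm=16,    # assumed components per GMM
    feature_dim=39,         # assumed feature vector length
    bytes_per_param=4,      # assumed 32-bit floating point parameters
    frames_per_second=100,  # assumed 10 ms frame step
)
print(f"{estimate / 1e9:.2f} GB/s")  # prints "3.03 GB/s"
```

Under these assumed values the estimate is on the order of gigabytes per second, which suggests why loading every parameter for every frame may be costly on a portable device.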