The Gaussian Mixture Model (GMM) scoring operation often involves heavy computation over large data structures with poor locality of reference. Accordingly, memory organization and packing are typically required to control the memory footprint of the application. When the operation is executed on standard compute engines, the memory management of the data and its re-use patterns may limit the efficiency of the operation.
For example, when software (SW) based solutions handle this algorithm (e.g., the GMM scoring operation), they may rely on the statistical behavior of caches and on cache prefetching to handle data locality. Because many parts of the application are highly streaming in nature (e.g., 10-100 MB of data being read before being re-used), this can cause thrashing of data caches.
Further, data is typically organized as bytes, words, or double words depending on the data type. Multiple scatter/gather instructions are typically required to unpack the memory and set it up for the compute phase. For example, when only part of the outputs are processed using an active list, which may be sparse, the data may be badly scattered.
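The gather step described above can be sketched as follows. This is a minimal illustration, not a description of any particular implementation: the function name `gather_active`, the table `all_gmm_params`, and the example indices are all hypothetical. It shows how a sparse active list forces reads from non-adjacent positions in the parameter table before the data can be packed contiguously for the compute phase.

```python
def gather_active(params, active_list):
    """Copy the parameter rows named in active_list into one contiguous list."""
    return [params[i] for i in active_list]


# Hypothetical table: one parameter row per GMM, stored in GMM-index order.
all_gmm_params = [[float(g), float(g) + 0.5] for g in range(8)]

# Sparse active list: only these GMMs are scored for the current frame,
# so the reads touch non-adjacent rows of the table.
active_list = [1, 4, 6]

packed = gather_active(all_gmm_params, active_list)
```

After the gather, `packed` holds only the active rows in a dense layout, which is the form a compute phase would typically prefer.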
In some implementations, the GMM scoring operation may be applied to automated electronic processing of speech and other acoustic signals. Such processing is challenging due, in part, to the wide variety of pronunciations, accents, and speech characteristics of individual speakers. Constraints such as language models and acoustic models may be used to make decisions about the words the user speaks, but acoustic models are often mathematically intensive.
For example, most large-vocabulary continuous speech recognition systems use continuous density hidden Markov models (HMMs) for the acoustic modeling of speech. An HMM may include several active states, and each active state output may be modeled with a Gaussian Mixture Model (GMM) probability density function. HMMs are typically used to model sub-word units of sound or entire words. In the English language, there are approximately forty phonemes, or individual units of sound, that can be employed to form more complex utterances. Phonemes may be considered in context, and there are up to 64,000 triphones (i.e., sequences of three phonemes, 40 x 40 x 40 combinations) in the English language.
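As an illustration of what scoring a single state output against a GMM may involve, the following sketch computes the log-likelihood of one feature vector. It assumes diagonal covariances (one variance per dimension), which is a common choice but is not stated in the text above. The function name `gmm_log_score` and its parameters are hypothetical.

```python
import math


def gmm_log_score(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.

    weights:   one mixture weight per component (assumed to sum to 1)
    means:     one mean vector per component
    variances: one per-dimension variance vector per component
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of a single diagonal Gaussian component.
        log_pdf = 0.0
        for xi, mi, vi in zip(x, mu, var):
            log_pdf += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(math.log(w) + log_pdf)
    # Log-sum-exp over the components, for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

Every component's means, variances, and weight are read once per feature vector, which is one reason the scoring operation may be both compute and memory intensive.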
A model of a phoneme in isolation may be referred to as a context independent (CI) phoneme model. A model of a sequence of phonemes may be referred to as a context dependent (CD) phoneme model. For example, in the word “cat” the /c/ sound may be modeled with a CI phoneme model and the /c/a/ sound may be modeled with a CD phoneme model. GMMs may be used to represent the state output probability density functions of CI phoneme models (i.e., a CI GMM) and CD phoneme models (i.e., a CD GMM).
In conventional speech recognition systems, scores for GMMs associated with phonemes and triphones are computed for each frame of an audio signal and stored. Computing and storing these scores requires significant processing and memory usage. For real-time processing, all GMM parameters (e.g., means, variances, mixture weights) must be continually loaded, resulting in a high memory bandwidth requirement. In a portable device, high computation usage and memory bandwidth may lead to a slow response time for an end user as well as a shortened battery life.
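The scale of the bandwidth requirement can be estimated with a back-of-the-envelope calculation. Every number in the sketch below is an illustrative assumption rather than a figure from the text: 8,000 GMMs, 32 mixture components per GMM, 39-dimensional features, one byte per parameter, and 100 frames per second. The function name `gmm_bandwidth_bytes_per_sec` is likewise hypothetical.

```python
def gmm_bandwidth_bytes_per_sec(num_gmms, mixtures, dims, bytes_per_param, frames_per_sec):
    """Estimate the parameter bytes loaded per second if every GMM is scored on every frame."""
    # Each mixture component carries dims means, dims variances, and one weight.
    params_per_gmm = mixtures * (2 * dims + 1)
    bytes_per_frame = num_gmms * params_per_gmm * bytes_per_param
    return bytes_per_frame * frames_per_sec


# Illustrative values only (all assumed, not taken from the text).
rate = gmm_bandwidth_bytes_per_sec(
    num_gmms=8000, mixtures=32, dims=39, bytes_per_param=1, frames_per_sec=100
)
print(rate)  # 2022400000 bytes per second under these assumptions
```

Under these assumed values, roughly 20 MB of parameters would be read for each frame, which is consistent in scale with the streaming behavior described earlier.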