Natural language processing systems include various modules and components for receiving textual input from a user and determining what the user meant. In some implementations, a natural language processing system includes an automatic speech recognition (“ASR”) module that receives audio input of a user utterance and generates one or more likely transcriptions of the utterance. Automatic speech recognition modules typically include an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which subword units (e.g., phonemes or triphones) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance based on lexical features of the language in which the utterance is spoken.
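As a rough illustration of how the two models may work together, the sketch below picks a transcription from a few candidate hypotheses by adding an acoustic log-score to a weighted language-model log-score. The function name `pick_transcription`, the `lm_weight` parameter, the dictionary keys, and all score values are hypothetical and chosen only for illustration; the text above does not specify how the scores are combined.

```python
def pick_transcription(hypotheses, lm_weight=1.0):
    """Return the text of the hypothesis with the highest combined log-score.

    Assumption: each hypothesis carries a log-probability from the acoustic
    model and one from the language model, and the two are combined by a
    weighted sum. This is a common convention, not a method stated above.
    """
    best = max(
        hypotheses,
        key=lambda h: h["acoustic_logp"] + lm_weight * h["lm_logp"],
    )
    return best["text"]


# Made-up scores: the second hypothesis fits the audio slightly better,
# but the language model finds the first far more plausible.
hypotheses = [
    {"text": "recognize speech", "acoustic_logp": -12.0, "lm_logp": -4.0},
    {"text": "wreck a nice beach", "acoustic_logp": -11.5, "lm_logp": -9.0},
]

print(pick_transcription(hypotheses))
```

In this toy case, setting `lm_weight=0.0` ignores the language model, so the hypothesis with the better acoustic score would be selected instead.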
Automatic speech recognition often uses acoustic models based on Gaussian mixture models (“GMMs”). GMMs are commonly used to model the probability density functions of features used in speech recognition. In an acoustic model, GMMs are used to score portions of utterance audio to determine which subword units were likely uttered. For example, each GMM may be associated with a particular subword unit. Each GMM includes various individual components (Gaussians) associated with the different ways in which the subword unit may be spoken. Some automatic speech recognition systems may use acoustic models with hundreds of thousands of Gaussians in total (e.g., 100,000-200,000 Gaussians). During speech recognition, features are computed from portions of utterance audio and scored against each Gaussian to determine the subword unit to which the portion of audio likely corresponds.
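The scoring step described above can be sketched as follows. This is a minimal example under stated assumptions, not a production acoustic model. It assumes diagonal-covariance Gaussians and two-dimensional feature vectors, and it scores one feature vector against every Gaussian of every GMM. The helper names (`log_gaussian`, `gmm_log_likelihood`, `best_subword_unit`), the subword labels, and all parameter values are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, components):
    """Log-likelihood of x under a GMM given as (weight, mean, var) tuples."""
    logs = [
        math.log(weight) + log_gaussian(x, mean, var)
        for weight, mean, var in components
    ]
    peak = max(logs)
    # Log-sum-exp keeps the sum stable when individual densities are tiny.
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def best_subword_unit(x, models):
    """Return the subword unit whose GMM gives x the highest log-likelihood."""
    return max(models, key=lambda unit: gmm_log_likelihood(x, models[unit]))


# Toy models: one GMM per subword unit, two Gaussians each (made-up values).
models = {
    "AH": [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [1.0, 1.0], [0.5, 0.5])],
    "IY": [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 3.0], [1.0, 1.0])],
}

frame = [0.2, -0.1]  # hypothetical feature vector for one portion of audio
print(best_subword_unit(frame, models))
```

Every feature vector here is evaluated against every Gaussian, so the cost per portion of audio grows linearly with the total number of Gaussians. With acoustic models of the size mentioned above, that is on the order of 100,000 or more Gaussian evaluations for each feature vector.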