Computing devices can be used to process a user's spoken commands, requests, and other utterances into written transcriptions. Models representing data relationships and patterns, such as functions, algorithms, systems, and the like, may accept audio data input (sometimes referred to as an input vector), and produce output (sometimes referred to as an output vector) that corresponds to the input in some way. In some implementations, a model is used to generate a probability or set of probabilities that the input corresponds to a particular language unit (e.g., phoneme, phoneme portion, triphone, word, n-gram, part of speech, etc.). For example, an automatic speech recognition (“ASR”) system may utilize various models, such as an acoustic model and a language model, to recognize speech. The acoustic model is used to generate hypotheses regarding which words or subword units (e.g., phonemes) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance.
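As an illustrative sketch of how the two models described above may work together, the following Python example combines a per-hypothesis acoustic score with a language model score and selects the hypothesis with the highest combined score. The function name `pick_transcription`, the `lm_weight` parameter, and all probability values are hypothetical and chosen only for illustration; an actual ASR system may combine the scores differently.

```python
import math


def pick_transcription(acoustic_log_probs, language_log_probs, lm_weight=1.0):
    """Return (hypothesis, score) with the highest combined log score.

    acoustic_log_probs: dict mapping hypothesis text to its acoustic log score.
    language_log_probs: dict mapping hypothesis text to its language model log score.
    lm_weight: hypothetical scaling factor applied to the language model score.
    """
    best, best_score = None, -math.inf
    for hypothesis, acoustic_score in acoustic_log_probs.items():
        # A hypothesis the language model has never seen gets a score of -inf.
        language_score = language_log_probs.get(hypothesis, -math.inf)
        score = acoustic_score + lm_weight * language_score
        if score > best_score:
            best, best_score = hypothesis, score
    return best, best_score


# Made-up scores: the acoustic model slightly prefers the wrong hypothesis,
# and the language model corrects the ranking.
acoustic = {
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
}
language = {
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.01),
}

best, score = pick_transcription(acoustic, language)
print(best)
```

Working in log space turns the product of the two probabilities into a sum, which avoids numerical underflow when many small probabilities are combined.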
ASR systems commonly utilize Gaussian mixture models (“GMMs”) to model acoustic input. Features can be extracted from an utterance in the form of feature vectors, each of which includes one or more numbers that describe the audio input. The feature vectors can be processed using GMMs to determine the word or subword unit that was most likely spoken to produce the extracted feature vectors. Typically, GMMs are trained, using training data, to maximize the likelihood of the correct words or subword units corresponding to the feature vectors of the training data.
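The scoring step described above can be sketched as follows, under several simplifying assumptions: each subword unit has its own GMM with diagonal covariances, a single feature vector is scored in isolation, and the unit labels (`"AA"`, `"IY"`), mixture weights, means, and variances are made-up values. The function names are hypothetical; a practical system would typically score sequences of feature vectors and use trained parameters.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, components):
    """Log likelihood of x under a GMM.

    components: list of (weight, mean, var) tuples, one per mixture component.
    """
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    # Log-sum-exp: subtract the largest term before exponentiating for stability.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def most_likely_unit(x, unit_gmms):
    """Return the subword unit whose GMM assigns x the highest likelihood."""
    return max(unit_gmms, key=lambda unit: gmm_log_likelihood(x, unit_gmms[unit]))


# Hypothetical two-component GMMs over 2-dimensional feature vectors.
unit_gmms = {
    "AA": [(0.6, [1.0, 0.0], [0.5, 0.5]), (0.4, [2.0, 1.0], [1.0, 1.0])],
    "IY": [(0.5, [-1.0, 2.0], [0.5, 0.5]), (0.5, [-2.0, 3.0], [1.0, 1.0])],
}

feature_vector = [1.2, 0.1]
print(most_likely_unit(feature_vector, unit_gmms))
```

Maximum likelihood training would adjust the weights, means, and variances so that the training feature vectors receive high likelihood under the GMM of their correct unit; that fitting step is omitted from this sketch.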