Computing devices can be used to process a user's spoken commands, requests, and other utterances into written transcriptions. Models representing data relationships and patterns, such as functions, algorithms, systems, and the like, may accept audio data input (sometimes referred to as an input vector), and produce output (sometimes referred to as an output vector) that corresponds to the input in some way. In some implementations, a model is used to generate a probability or set of probabilities that the input corresponds to a particular language unit (e.g., phoneme, phoneme portion, triphone, word, n-gram, part of speech, etc.). For example, an automatic speech recognition (“ASR”) system may utilize various models to recognize speech, such as an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which words or subword units (e.g., phonemes) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance.
ASR systems commonly utilize Gaussian mixture models/hidden Markov models (“GMMs/HMMs”) to tag language units in sequences of natural language input. However, artificial neural networks may also be used. Scores in neural-network-based ASR systems are obtained by multiplying trained weight matrices, representing the parameters of the model, with vectors corresponding to feature vectors or intermediate representations within the neural network. This process is referred to as a forward pass. The output can be used to determine which language unit most likely corresponds to the input feature vector. Due to the sequential nature of spoken language, the correct output for a feature vector for a particular frame of audio data may depend upon the output generated for a feature vector for the sequentially previous frame of audio data. Some systems incorporate sequential aspects of language by defining input feature vectors as a function of a contiguous span of observations (e.g., based on the feature vectors for the current frame and the previous n frames of audio data).