Computing devices can be used to process a user's spoken commands, requests, and other utterances into written transcriptions. Models representing data relationships and patterns, such as functions, algorithms, systems, and the like, may accept audio input (sometimes referred to as an input vector), and produce output (sometimes referred to as an output vector) that corresponds to the input in some way. In some implementations, a model is used to generate a likelihood or set of likelihoods that the input corresponds to a particular value. For example, an automatic speech recognition (“ASR”) module may utilize various models to recognize speech, such as an acoustic model and a language model. The acoustic model is used on features of audio data to generate hypotheses regarding which words or subword units (e.g., phonemes) correspond to an utterance captured in the audio data. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcript of the utterance.
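As a minimal sketch of how a language model might be used to choose among hypotheses generated using an acoustic model, the following example adds a language-model score to each hypothesis's acoustic score and keeps the highest-scoring transcript. The hypothesis strings, the log-scores, and the `lm_weight` parameter are all hypothetical values chosen for illustration; an actual ASR module may combine the two models differently.

```python
# Hypothetical acoustic-model output: candidate transcripts with log-likelihoods.
ACOUSTIC_SCORES = {
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
}

# Hypothetical language-model log-probabilities for the same candidates.
LANGUAGE_SCORES = {
    "recognize speech": -4.0,
    "wreck a nice beach": -9.0,
}


def best_transcript(acoustic, language, lm_weight=1.0):
    """Return the hypothesis with the highest combined log-score."""
    def combined(text):
        # Log-scores are added; lm_weight scales the language model's influence.
        return acoustic[text] + lm_weight * language[text]

    return max(acoustic, key=combined)
```

With these illustrative numbers, the acoustic model alone slightly prefers "wreck a nice beach", but the combined score favors "recognize speech" because the language model considers that word sequence far more probable.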
ASR modules commonly utilize Gaussian mixture models/hidden Markov models (“GMM/HMM”) for vocabulary tasks. However, artificial neural networks (“NN”), including deep neural networks, may also be used. Acoustic scores in NN-based ASR modules are obtained by performing an NN forward pass. The forward pass involves multiplying large trained NN weight matrices, which represent the parameters of the model, by vectors corresponding to feature vectors or hidden representations. The output can be used to determine which subword unit (e.g., phoneme, phoneme portion, or triphone) is most likely to correspond to the input feature vector.
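The forward pass described above can be sketched as repeated matrix-vector multiplication. The example below is a toy illustration only: the layer sizes, weight values, ReLU hidden activation, softmax output, and subword unit labels are assumptions made for this sketch, not details of any particular acoustic model.

```python
import math


def matvec(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward_pass(layers, feature_vector):
    """Pass a feature vector through each (weights, biases) layer in turn."""
    hidden = feature_vector
    for index, (weights, biases) in enumerate(layers):
        pre = [z + b for z, b in zip(matvec(weights, hidden), biases)]
        is_last = index == len(layers) - 1
        # Hidden layers use ReLU; the final layer yields a distribution.
        hidden = softmax(pre) if is_last else [max(0.0, z) for z in pre]
    return hidden


# Hypothetical toy model: 3 input features, 2 hidden units, 3 subword units.
SUBWORD_UNITS = ["ah", "k", "t"]
LAYERS = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [-0.5, 0.7], [0.2, 0.2]], [0.0, 0.0, 0.0]),
]


def most_likely_unit(feature_vector):
    """Return the highest-scoring subword unit and the full score vector."""
    scores = forward_pass(LAYERS, feature_vector)
    best = max(range(len(scores)), key=scores.__getitem__)
    return SUBWORD_UNITS[best], scores
```

Calling `most_likely_unit([1.0, 0.0, 0.0])` returns the label whose output score is largest, along with the score for every subword unit. In a production model the matrices would be far larger, which is why the forward pass dominates the cost of computing acoustic scores.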
The parameters of an acoustic model can be set in a process referred to as training. An acoustic model can be trained using training data that includes input data and the correct or preferred output of the model for the corresponding input data. The model can be used to process the input data, and the parameters of the model can be modified until the model produces (or “converges” on) the correct or preferred output.
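The training process described above can be illustrated with a minimal sketch that assumes a single-layer softmax classifier updated by gradient descent. The model form, the learning rate, the stopping rule, and the two-example training set are all hypothetical choices for this example; practical acoustic model training uses far larger models and data sets.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, features):
    """Return the index of the class with the highest score."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def train(examples, num_classes, num_features, learning_rate=0.5, max_epochs=500):
    """Adjust the weights until every training input yields its target output."""
    weights = [[0.0] * num_features for _ in range(num_classes)]
    for _ in range(max_epochs):
        errors = 0
        for features, target in examples:
            if predict(weights, features) != target:
                errors += 1
            scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
            probs = softmax(scores)
            for c in range(num_classes):
                # Cross-entropy gradient: predicted probability minus target.
                grad = probs[c] - (1.0 if c == target else 0.0)
                for j in range(num_features):
                    weights[c][j] -= learning_rate * grad * features[j]
        if errors == 0:
            break  # the model produced the preferred output for every example
    return weights


# Hypothetical training data: (input feature vector, correct class index).
TRAINING_DATA = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
TRAINED_WEIGHTS = train(TRAINING_DATA, num_classes=2, num_features=2)
```

After training, `predict(TRAINED_WEIGHTS, features)` returns the correct class index for each input in `TRAINING_DATA`. The loop stops once the model reproduces every target, which mirrors the idea of modifying parameters until the model converges on the correct or preferred output.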