Models representing data relationships and patterns, such as functions, algorithms, systems, and the like, may accept input (sometimes referred to as an input vector), and produce output (sometimes referred to as an output vector) that corresponds to the input in some way. In some implementations, a model is used to generate a likelihood or set of likelihoods that the input corresponds to a particular value. For example, an automatic speech recognition (“ASR”) system may utilize various models to recognize speech, such as an acoustic model and a language model. The acoustic model is used on acoustic features of audio data to generate hypotheses regarding which words or subword units (e.g., phonemes) correspond to an utterance represented by the audio data. The language model is used to determine the most likely transcription of the utterance based on the acoustic model hypotheses and the features of the language modelled by the language model.
Some ASR systems use artificial neural networks (“NNs”), including deep neural networks (“DNNs”), to model speech (e.g., a NN-based acoustic model or language model). The neural networks are artificial in the sense that they are computational entities, analogous to biological neural networks in animals, but implemented by computing devices. Scores in NN-based models are obtained by performing a “forward pass.” The forward pass involves multiplying large NN weight matrices, representing the parameters of the model, by vectors corresponding to feature vectors or hidden intermediate representations. In speech processing systems, NNs may generate scores, such as language model scores, via the forward pass. In such implementations, the NN output can be used to determine the most likely transcription of an utterance.
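Illustratively, the forward pass described above may be sketched as follows. This is a minimal sketch rather than any particular system's implementation: the function names, the toy layer sizes, and the weight values are hypothetical, and the use of bias terms, a tanh activation between layers, and a softmax over the final layer are assumptions made for the example that are not stated in the text above.

```python
import math

def matvec(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward_pass(layers, features):
    """Pass a feature vector through each layer in turn.

    layers is a list of (weights, biases) pairs. Hidden layers use tanh
    and the final layer uses softmax (both are assumptions of this sketch).
    """
    activation = features
    for index, (weights, biases) in enumerate(layers):
        pre = [z + b for z, b in zip(matvec(weights, activation), biases)]
        if index < len(layers) - 1:
            activation = [math.tanh(p) for p in pre]  # hidden representation
        else:
            activation = softmax(pre)  # output scores
    return activation

# Hypothetical toy network: 3 input features -> 2 hidden units -> 3 outputs.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5], [-0.3, 0.9]], [0.0, 0.0, 0.0]),
]
scores = forward_pass(layers, [1.0, 0.5, -1.0])
```

In this sketch, each entry of scores may be read as the likelihood assigned to one candidate output (e.g., one word), and the largest entry identifies the candidate the toy model considers most likely.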
The parameters of a NN can be set in a process referred to as training. For example, a NN-based language model can be trained using training data that includes input data and the correct or preferred output of the model for the corresponding input data. The NN can repeatedly process the input data, and the parameters (e.g., the weight matrices) of the NN can be modified in what amounts to a trial-and-error process until the model produces (or “converges” on) the correct or preferred output. Illustratively, a correct output of a NN-based language model would be the correct transcription of an utterance represented by the input data. The modification of weight values may be performed through a process referred to as “back propagation.” Back propagation includes determining the difference between the expected model output and the obtained model output, and then determining how to modify the values of some or all parameters of the model to reduce that difference.
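Illustratively, the training process described above may be sketched for a deliberately tiny network with one weight per layer. This is a minimal sketch under stated assumptions, not a description of how any particular model is trained: the function name, the training examples, the initial weights, the learning rate, the squared-error measure of the difference, and the gradient-descent update rule are all hypothetical choices made for the example.

```python
import math

def train(examples, w1=0.5, w2=0.5, learning_rate=0.1, epochs=200):
    """Train a two-weight network: output = w2 * tanh(w1 * x).

    Returns the final weights and the total loss recorded for each epoch.
    """
    history = []
    for _ in range(epochs):
        total_loss = 0.0
        for x, target in examples:
            # Forward pass: compute the obtained model output.
            hidden = math.tanh(w1 * x)
            output = w2 * hidden

            # Difference between the obtained and the expected output.
            error = output - target
            total_loss += 0.5 * error ** 2

            # Backward pass: propagate the error to each parameter
            # using the chain rule.
            grad_w2 = error * hidden
            grad_hidden = error * w2
            grad_w1 = grad_hidden * (1.0 - hidden ** 2) * x

            # Modify the parameters to reduce the difference.
            w1 -= learning_rate * grad_w1
            w2 -= learning_rate * grad_w2
        history.append(total_loss)
    return w1, w2, history

# Hypothetical training data: (input, expected output) pairs.
examples = [(-1.0, -0.8), (0.0, 0.0), (0.5, 0.45), (1.0, 0.8)]
w1, w2, history = train(examples)
```

In this sketch, history records the total difference for each pass over the training data, so comparing its first and last entries indicates whether the repeated parameter modifications are moving the model toward the expected outputs.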