Models representing data relationships and patterns, such as functions, algorithms, systems, and the like, may accept input (sometimes referred to as an input vector), and produce output (sometimes referred to as an output vector) that corresponds to the input in some way. In some implementations, a model is used to generate a likelihood (or set of likelihoods) that the input corresponds to a particular value. For example, an automatic speech recognition (“ASR”) system may utilize various models to recognize speech, such as an acoustic model and a language model. The acoustic model is used on features of audio data to generate hypotheses regarding which words or subword units (e.g., phonemes) correspond to an utterance captured in the audio data. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance. As another example, natural language understanding (“NLU”) systems typically include models for named entity recognition, intent classification, and the like. The natural language understanding models can be used to determine an actionable intent from the words that a user speaks or writes.
The parameters of models can be set in a process referred to as “training.” Models can be trained using training data that includes input data and the correct or preferred output of the model for the corresponding input data. The model can be used to process the input data, and the parameters of the model can be modified until the model produces (or “converges” on) the correct or preferred output. One method of training models uses the stochastic gradient descent technique. In stochastic gradient descent, a modification to each parameter of a model is based on the error in the output produced by the model. A derivative, or “gradient,” can be computed that corresponds to the direction in which each individual parameter of the model is to be adjusted in order to improve the model output (e.g., to produce output that is closer to the correct or preferred output for a given input). In stochastic gradient descent, the gradient is computed and applied for a single input vector at a time, or aggregated from a small number of input vectors, rather than the entire set of training data. Therefore, the gradient may be referred to as a “partial gradient” because it is not based on the entire corpus of training data. Instead, it is based on the error that occurs when processing only a particular subset of the training data. Subsets or “mini-batches” of training data can be processed iteratively and the model parameters can be iteratively updated until the output converges on the correct or preferred output.