Computing devices can use models representing data relationships and patterns, such as functions, algorithms, systems, and the like, to process input (sometimes referred to as an input vector) and produce output (sometimes referred to as an output vector) that corresponds to the input in some way. In some implementations, a model is used to generate a likelihood or set of likelihoods that the input corresponds to a particular value. For example, artificial neural networks (“NNs”), including deep neural networks (“DNNs”), may be used to model speech such as via a NN-based acoustic model, predict the likelihood that a customer will purchase a product to determine what products should be recommended to the customer, recognize features included in an image such as faces or shapes, and the like. NN models can be useful for solving problems that are difficult to solve using rule-based models, such as pattern recognition, speech processing, natural language understanding, face recognition, etc. The neural networks are artificial in the sense that they are computational entities implemented in hardware and, in some instances, software, but mimic the biological neural networks in animals. Each node of the artificial neural network computes an output based on one or more input values. When analogized to the nervous system, the input values mirror a stimulus while the output mirrors a response.
Scores in NN-based models are obtained by performing an NN forward pass. The forward pass involves multiplying large trained NN weight matrices, representing the parameters of the model, with vectors corresponding to feature vectors or hidden representations/nodes. The NN may progress from lower-level structures to higher-level structures. For example, for a NN trained to recognize faces in images, the input of the NN can comprise pixels. A lower level of the NN may recognize pixel edges, a higher level may identify parts of objects, such as eyes, noses, ears, etc., and an even higher level may recognize a face (or other object). In speech processing systems, NNs may generate scores, such as acoustic scores, via the forward pass. In such implementations, the NN output can be used to determine which sub-word unit (such as a phoneme, phoneme portion, or triphone) is most likely to correspond to an input feature vector. The trained models can be transmitted to recognition or prediction systems and used to predict one or more values for a user input such as an image or utterance.
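As a minimal sketch of the forward pass described above, the following Python example multiplies a feature vector by a sequence of weight matrices and converts the final result into a set of likelihoods. The layer sizes, the weight values, the sigmoid and softmax functions, and every name in the code are illustrative assumptions; they are not taken from any particular implementation.

```python
import math

def matvec(weights, vector):
    # Multiply a weight matrix (a list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def sigmoid(values):
    # Element-wise logistic activation (an assumed choice of nonlinearity).
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def softmax(values):
    # Turn raw output scores into likelihoods that sum to 1.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward_pass(layers, feature_vector):
    # Propagate the input through each hidden layer, then score the outputs.
    activation = feature_vector
    for weights in layers[:-1]:
        activation = sigmoid(matvec(weights, activation))
    return softmax(matvec(layers[-1], activation))

# Hypothetical "trained" weights: 3 inputs -> 2 hidden nodes -> 3 output units.
layers = [
    [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]],
    [[0.6, -0.1], [-0.3, 0.8], [0.05, 0.05]],
]

scores = forward_pass(layers, [1.0, 0.5, -1.5])
# Index of the output unit (e.g., a sub-word unit) with the highest likelihood.
best_unit = max(range(len(scores)), key=scores.__getitem__)
```

In a speech processing setting, each entry of `scores` could be read as the likelihood that the input feature vector corresponds to one sub-word unit, with `best_unit` identifying the most likely one. Practical systems typically use far larger matrices and optimized numerical libraries.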
The parameters of a model can be set in a process referred to as training. For example, a model can be trained using customer data that includes input data and the correct or preferred output of the model for the corresponding input data. The model can be used to process the input data, and the parameters of the model can be modified until the model produces (or “converges” on) the correct or preferred output. For instance, a correct output of an image recognition model would be an output that identifies the subject included in the image. This allows the model to evolve by adjusting the weight values to affect the output for one or more hidden nodes. The changing of weight values may be performed through a variety of methods, such as random weight updates or backward propagation, sometimes referred to as “back propagation”. Back propagation includes comparing the expected model output with the obtained model output and then traversing the model to determine the difference between the expected node output that produces the expected model output and the actual node output. An amount of change for one or more of the weight values may be identified using this difference such that the model output more closely matches the expected output.
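The back propagation procedure described above can be sketched, under simplifying assumptions, as follows. The example assumes a network with two inputs, two hidden nodes, and one output node, sigmoid activations, a squared-error measure of the difference between expected and obtained output, and a fixed learning rate; all names and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, inputs):
    # Compute the hidden node outputs, then the single model output.
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def train_step(w_hidden, w_out, inputs, expected, rate=0.5):
    hidden, output = forward(w_hidden, w_out, inputs)
    # Compare the obtained output with the expected output (scaled by the sigmoid slope).
    delta_out = (output - expected) * output * (1.0 - output)
    # Traverse backward: attribute part of the difference to each hidden node.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    # Change each weight by an amount derived from those differences.
    new_w_out = [w - rate * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [
        [w - rate * d * x for w, x in zip(row, inputs)]
        for row, d in zip(w_hidden, delta_hidden)
    ]
    loss = 0.5 * (output - expected) ** 2
    return new_w_hidden, new_w_out, loss

# Hypothetical starting weights and a single training example.
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.2, -0.1]
inputs, expected = [1.0, 0.0], 1.0

losses = []
for _ in range(200):
    w_hidden, w_out, loss = train_step(w_hidden, w_out, inputs, expected)
    losses.append(loss)

_, final_output = forward(w_hidden, w_out, inputs)
```

Over repeated steps the recorded loss shrinks and `final_output` moves toward `expected`, which is one way to picture the model converging on the preferred output. Training on a single example is used here only to keep the sketch short.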