Computing devices can use models representing data relationships and patterns, such as functions, algorithms, systems, and the like, to process input (sometimes referred to as an input vector) and produce output (sometimes referred to as an output vector) that corresponds to the input in some way. In some implementations, a model is used to generate a likelihood or set of likelihoods that the input corresponds to a particular value. For example, artificial neural networks (“NNs”), including deep neural networks (“DNNs”), may be used to model speech (e.g., as a NN-based acoustic model).
NNs generate scores, such as acoustic scores, by performing a forward pass. The forward pass progresses through the layers of the NN, multiplying large trained weight matrices, which represent the parameters of the model, by vectors corresponding to input feature vectors or intermediate layer representations. The NN output can be used to determine which subword unit (e.g., phoneme, phoneme portion, or triphone) is most likely to correspond to an input feature vector.
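The forward pass described above can be illustrated with a minimal sketch. The function names, the tiny weight matrices, the ReLU and softmax nonlinearities, and the `SUBWORD_UNITS` labels are all assumptions made for illustration; the text above specifies only the repeated matrix-vector multiplication and the selection of the most likely subword unit.

```python
import math

def mat_vec(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def relu(vec):
    # Assumed hidden-layer nonlinearity; the text does not name one.
    return [max(0.0, v) for v in vec]

def softmax(vec):
    # Assumed output normalization so the scores behave like likelihoods.
    peak = max(vec)
    exps = [math.exp(v - peak) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]

def forward_pass(weight_matrices, feature_vector):
    """Progress through the layers, one matrix-vector product per layer."""
    activation = feature_vector
    last = len(weight_matrices) - 1
    for i, weights in enumerate(weight_matrices):
        activation = mat_vec(weights, activation)
        activation = softmax(activation) if i == last else relu(activation)
    return activation

# Hypothetical two-layer model: 2 inputs -> 2 hidden values -> 3 output scores.
weight_matrices = [
    [[1.0, -1.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
]
SUBWORD_UNITS = ["unit_a", "unit_b", "unit_c"]  # placeholder labels

scores = forward_pass(weight_matrices, [2.0, 1.0])
best_index = max(range(len(scores)), key=scores.__getitem__)
best_unit = SUBWORD_UNITS[best_index]
```

In a production acoustic model the matrices would be far larger and the products would typically be computed by an optimized linear algebra library; plain lists are used here only to keep the sketch self-contained.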
Some NNs, such as convolutional neural networks, use a technique referred to as “max pooling” in which multiple values are generated at a given layer and only the maximum values are forwarded to the next layer. For example, a weight matrix of a lower dimension than the vector being processed may be applied to the vector using a sliding window technique in which the matrix is repeatedly applied to different portions of the vector. Individual values of the vector are thus multiplied by two or more different portions of the weight matrix, generating multiple candidate values for each dimension of the vector, from which the best (e.g., maximum) value can be passed to the next layer.
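The sliding window and max pooling steps can be sketched as follows. This is one conventional reading of the description, not a definitive implementation: it assumes a one-dimensional weight vector in place of the weight matrix, a stride of one, summation of the products within each window, and non-overlapping pooling groups. The function names and sample values are hypothetical.

```python
def sliding_window_products(weights, vec):
    """Apply a short weight vector to every window of `vec` (stride 1).

    As the window moves, each interior value of `vec` is multiplied by a
    different element of `weights`.
    """
    k = len(weights)
    return [
        sum(w * x for w, x in zip(weights, vec[start:start + k]))
        for start in range(len(vec) - k + 1)
    ]

def max_pool(values, pool_size):
    """Keep only the maximum of each non-overlapping group of values."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values), pool_size)
    ]

vec = [1.0, 3.0, -2.0, 0.5, 4.0]
weights = [1.0, -1.0]  # lower dimension than `vec`

# Windows [1, 3], [3, -2], [-2, 0.5], [0.5, 4] give the candidate values.
candidates = sliding_window_products(weights, vec)  # [-2.0, 5.0, -2.5, -3.5]

# Only the maximum of each pair of candidates is passed to the next layer.
pooled = max_pool(candidates, 2)  # [5.0, -2.5]
```

Here the value 3.0 is multiplied by -1.0 in the first window and by 1.0 in the second, which is the sense in which a single input value meets different portions of the weights.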