Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Computerized speech recognition can be broken down into a series of procedures. One procedure is to convert a stream of “acoustic features”, or sampled and filtered speech data, to a stream of phonemes, which are then recognized as words.
Each acoustic feature can represent one or more samples of speech. For example, a fixed duration of speech can be sampled at fixed intervals of time; e.g., every 10-30 milliseconds. The sample can be transformed into a set of mel-frequency cepstral coefficients (MFCCs) using well-known techniques. The set of MFCCs corresponding to one sample of speech can be considered to be one acoustic feature.
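As an illustration only, the sampling of speech at fixed intervals can be sketched as follows. The function name frame_signal, the 25 millisecond frame duration, the 10 millisecond interval, and the 16 kHz sample rate are assumptions chosen for the example and are not taken from this application. The transformation of each frame into MFCCs is omitted; in practice, each frame would be passed to a cepstral analysis routine to yield one acoustic feature.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Split samples into fixed-duration frames taken every hop_ms milliseconds.

    frame_ms and hop_ms are illustrative values (assumptions), not values
    prescribed by the text.
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)      # samples between frame starts
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# One second of a synthetic 440 Hz tone stands in for recorded speech.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(signal, sample_rate)
```

With these assumed values, consecutive frames overlap, so each portion of the speech contributes to more than one frame.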
The acoustic features can be provided to a neural network for derivation of probability models of speech corresponding to the features. Once trained, the neural network can take later acoustic features as input, recognize the acoustic features as spoken speech, and generate text and/or audible output(s) corresponding to the recognized spoken speech.
The current state of the art involves training neural networks with a Stochastic Gradient Descent (SGD) procedure. SGD involves presenting small batches of training samples to a neural network being trained, and updating the probability models in response to a gradient derived from the error made when comparing the neural network's output to a target output. After a first batch of training samples is presented to the neural network, the first batch is placed back in a pool of training samples, and the pool is resampled to get a second batch of training samples for presentation to the neural network. In some scenarios, the first and second batches can include some or all of the same training samples.
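The batch-and-resample procedure described above can be sketched, under simplifying assumptions, as follows. A two-parameter linear model stands in for the neural network, and the function name sgd_train, the squared-error measure, the batch size, the step count, and the learning rate are all hypothetical choices made for the example.

```python
import random


def sgd_train(pool, batch_size=4, steps=500, lr=0.1, seed=0):
    """Fit y = w * x + b by SGD, resampling a batch from the pool at each step.

    The linear model is a stand-in (assumption) for a neural network.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Each batch is drawn from the whole pool, so samples used in an
        # earlier batch can appear again in a later one.
        batch = rng.sample(pool, batch_size)
        grad_w = 0.0
        grad_b = 0.0
        for x, target in batch:
            err = (w * x + b) - target  # model output compared to target output
            grad_w += 2 * err * x / batch_size
            grad_b += 2 * err / batch_size
        # Update the parameters against the gradient of the mean squared error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Noise-free training pool generated from y = 3x + 1.
pool = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = sgd_train(pool)
```

Because the pool in this sketch contains only 21 samples, successive batches frequently share samples, which mirrors the scenario in which the first and second batches overlap.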
Using SGD to train a neural network can involve presentation of billions of samples of speech multiple times, and can take a long time even with parallelized processing and/or fast processors. Additionally, SGD and other gradient methods risk getting stuck in “local optima”, which can result in degraded accuracy compared to other solutions that could be achieved.
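The risk of local optima can be illustrated with a small example that is unrelated to speech data. The one-dimensional function below, the learning rate, and the starting points are assumptions chosen so that the function has two minima of different depth; plain gradient descent settles in whichever minimum lies downhill from its starting point.

```python
def f(x):
    """Illustrative non-convex error surface with two minima (an assumption)."""
    return x ** 4 - 3 * x ** 2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=1000):
    """Follow the negative gradient from x0 for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_from_right = gradient_descent(2.0)   # settles in the shallower minimum
x_from_left = gradient_descent(-2.0)   # settles in the deeper minimum
```

In this sketch, the run started at 2.0 ends at a point with a higher error than the run started at -2.0, even though both runs have converged.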