Computer systems are currently in wide use. Some such computer systems receive input signals indicative of various patterns, and generate a pattern recognition result indicative of one or more patterns recognized in the input. By way of example, some computer systems include speech processing systems, such as speech recognition systems, that receive an audio signal and recognize speech in the audio signal. The speech can be transcribed, for instance, into text. Other computer systems include handwriting recognition systems that receive an input signal indicative of a handwriting character. For instance, the input signal may indicate pixels of a touch sensitive screen that were activated based on a user's touch input on the touch sensitive screen. The input is subjected to handwriting recognition where a character is recognized based on the input. Other computing systems can include, for instance, image recognition systems (such as facial recognition systems, finger print recognition systems, etc.).
Some computing systems that are used in pattern recognition can deploy neural networks (or artificial neural networks). Such networks have an interconnected set of nodes (or neurons) that exchange messages with each other. The connections have numeric weights which indicate the strength of connection between nodes. The weights can be tuned and therefore the neural networks are capable of learning.
During recognition, a set of features (such as a feature vector) is extracted from an input signal representing an input. The features are applied to the neural network to activate a first set of nodes (e.g., an input level) in the neural network. The feature values are weighted and transformed by a function, and then passed to another level in the neural network (which represents another set of nodes). This continues until an output neuron (or node) is activated that corresponds to a pattern (e.g., a speech unit, a handwriting character, etc.) represented in the input signal.
Deep neural networks (DNNs) are neural networks with a relatively large number of levels (or multiple layers of nodes) between the input and output layers. Deep neural networks are thus a powerful tool for modeling complex non-linear relationships. Therefore, they are powerful for performing many character recognition tasks, such as large vocabulary speech recognition. By way of example, some speech recognition systems employ deep neural network acoustic models using millions of parameters. The deeper networks can represent certain function classes better than shallower networks, and the use of deep networks can offer both computational and statistical efficiency for complex tasks.
Training a deep neural network with a large number of layers, however, can be difficult. This is because, during training, the training system attempts to attribute error values to the different parameters in the model using back propagation. This is often done by computing the derivative of the error function with respect to the parameters. Since the activation functions in the neural network often include a compressive non-linear component, this leads to a compression of the error gradient that propagates through that non-linearity. The compression increases with the number of levels, through which the error gradient is propagated, in the neural network. The gradient thus vanishes exponentially with the number of layers it passes through causing training to slow considerably.
To address this gradient vanishing problem, some have attempted to perform unsupervised pre-training to help train deep networks with improved parameter initialization. Others have attempted to change the loss function by introducing batch normalization to the individual hidden layers, in addition to the overall objective at the output layer. Training and evaluating a deep neural network, using these techniques, can consume a great deal of computational overhead, resulting in undesirably high computation costs.