Field of Invention
The present invention relates generally to the field of artificial neural networks (ANNs). More specifically, the present invention relates to a neuron-centric local learning rate for artificial neural networks that increases performance and learning rate margin and reduces power consumption.
Discussion of Related Art
Artificial neural networks (ANNs) are distributed computing systems, which consist of a number of neurons interconnected through connection points called synapses. Each synapse encodes the strength of the connection between the output of one neuron and the input of another. The output of each neuron is determined by the aggregate input received from other neurons that are connected to it, and thus by the outputs of these “upstream” connected neurons and the strength of the connections as determined by the synaptic weights. The ANN is trained to solve a specific problem (e.g., pattern recognition) by adjusting the weights of the synapses such that a particular class of inputs produces a desired output. The weight adjustment procedure is known as “learning.” There are many algorithms in the ANN literature for performing learning that are suitable for various tasks such as image recognition, speech recognition, language processing, etc. Ideally, these algorithms lead to a pattern of synaptic weights that, during the learning process, converges toward an optimal solution of the given problem.
An attractive implementation of ANNs uses circuitry (e.g., CMOS) to represent the neuron, the function of which is to integrate or sum the aggregate input from the upstream neurons to which a particular neuron is connected, and to apply some nonlinear function to that input to derive the output of the neuron. Because, in general, each neuron is connected to some large fraction of the other neurons, the number of synapses (connections) is much larger than the number of neurons; thus it is advantageous to use an implementation of synapses that can achieve very high density on a neuromorphic computing chip. One attractive choice is a non-volatile memory (NVM) technology such as resistive random access memory (RRAM) or phase-change memory (PCM). Since both positive and negative (i.e., excitatory and inhibitory) weights are desired, one scheme uses a pair of NVM elements to represent the weight as the difference in conductance between the two (see M. Suri et al., IEDM Tech. Digest, 4.4 (2011)). This scheme is shown in FIG. 1. The outputs of the upstream Ni neurons are summed in parallel through pairs of NVM conductances into the positive and negative inputs of the downstream Mj neurons. This parallelism is highly advantageous for efficient computation.
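By way of illustration only, the summation through paired conductances described above may be sketched as follows. The function name, the list-of-lists layout of the conductance arrays, and the choice of tanh as the nonlinear function are assumptions made for this sketch and are not taken from FIG. 1.

```python
import math


def downstream_outputs(x, g_plus, g_minus):
    """Compute the outputs of the downstream neurons Mj.

    x       : outputs of the N upstream neurons (list of floats)
    g_plus  : N x M conductances feeding the positive inputs
    g_minus : N x M conductances feeding the negative inputs
    """
    n_down = len(g_plus[0])
    outputs = []
    for j in range(n_down):
        # The effective weight of synapse (i, j) is the difference
        # between the two conductances of the pair.
        total = sum(
            x[i] * (g_plus[i][j] - g_minus[i][j]) for i in range(len(x))
        )
        # Assumed nonlinear neuron response; any squashing function
        # could be substituted here.
        outputs.append(math.tanh(total))
    return outputs


# Two upstream neurons feeding two downstream neurons.
x = [1.0, 2.0]
g_plus = [[0.5, 0.1], [0.25, 0.0]]
g_minus = [[0.0, 0.1], [0.0, 0.5]]
y = downstream_outputs(x, g_plus, g_minus)
```

In this example the first downstream neuron receives a net excitatory input and the second a net inhibitory input, although every individual conductance is non-negative.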
During learning, the conductances of the NVM elements are programmed by sending them pulses that can either increase or decrease the conductance according to a learning rule. One common learning rule is backpropagation (see D. Rumelhart et al., Parallel Distributed Processing, MIT Press (1986)), which is used extensively in deep learning networks that are currently being implemented on graphics processing units (GPUs) for image recognition, learning to play video games, etc. The backpropagation algorithm calls for a weight update Δwij=η·xi·δj that is proportional to the product of the output of the upstream neuron, xi, and the error contribution from the downstream neuron, δj, with the proportionality constant, η, known as the learning rate. When an array of NVM elements is used to represent the weights as in FIG. 1, the advantage of parallel programming of weight updates during learning can be maintained if the pulses sent to the NVM by the Ni are determined solely by xi, and the pulses sent by the Mj are determined solely by δj. Several schemes have been proposed for this. We have implemented one in which the magnitudes of xi and δj are encoded by the number and timing of programming pulses sent by the upstream and downstream neurons in such a way that the number of pulses from both directions that overlap in time approximates the product η·xi·δj. The selector devices in series with the NVM elements ensure that only pairs of pulses that overlap in time are effective in programming the NVM conductance.
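By way of illustration only, one possible pulse encoding with the overlap property described above may be sketched as follows. The slot layout shown here is a hypothetical, deterministic encoding chosen for clarity and is not necessarily the scheme implemented by the inventors. The sketch further assumes that xi lies between 0 and 1, that η is folded into the downstream pulse count, that η·|δj| does not exceed 1, and that every effective pulse changes the weight by the same fixed amount.

```python
def ideal_update(eta, x_i, delta_j):
    """Conventional backpropagation weight update: eta * x_i * delta_j."""
    return eta * x_i * delta_j


def upstream_pulses(x_i, slots_a, slots_b):
    """Pulse train determined by x_i alone.

    Time is divided into slots_b groups of slots_a slots each; the
    upstream neuron fires in the first a slots of every group.
    """
    a = round(min(max(x_i, 0.0), 1.0) * slots_a)
    return [(t % slots_a) < a for t in range(slots_a * slots_b)]


def downstream_pulses(eta, delta_j, slots_a, slots_b):
    """Pulse train determined by delta_j alone (eta is a global constant).

    The downstream neuron fires in every slot of the first b groups.
    """
    b = round(min(abs(eta * delta_j), 1.0) * slots_b)
    return [(t // slots_a) < b for t in range(slots_a * slots_b)]


def pulsed_update(eta, x_i, delta_j, slots_a=10, slots_b=10):
    """Approximate eta * x_i * delta_j by counting overlapping pulses."""
    up = upstream_pulses(x_i, slots_a, slots_b)
    down = downstream_pulses(eta, delta_j, slots_a, slots_b)
    # Only slots in which both neurons fire program the device,
    # mirroring the role of the selector described above.
    overlaps = sum(1 for u, d in zip(up, down) if u and d)
    step = 1.0 / (slots_a * slots_b)  # assumed constant change per pulse
    sign = 1.0 if delta_j >= 0 else -1.0
    return sign * overlaps * step


exact = ideal_update(0.5, 0.8, 0.6)
approx = pulsed_update(0.5, 0.8, 0.6)
```

Because each neuron generates its pulse train from its own value alone, all synapses in the array can be updated in parallel. The approximation is quantized by the number of slots, so a coarser slot grid trades accuracy for shorter programming time.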
To show that this “crossbar-compatible” learning rule is as effective as the conventional backpropagation rule, we have performed simulations of a three-layer perceptron network applied to the recognition of the handwritten digit images in the MNIST database. These simulations show that this learning rule gives equivalent network performance (see FIG. 2) when the conductance response of the NVM is linear and unbounded, i.e., when every pulse produces a constant change in conductance and the conductance does not saturate. FIG. 2 compares the simulated performance of the ANN using the usual direct weight updates (dashed curves) with that obtained using the pulsed, crossbar-compatible weight update method, which allows parallel weight update, and shows that equivalent performance is achieved. The results are shown on a linear vertical scale and, in the inset, on a stretched vertical scale that better displays the portion of the results close to 100%. The curves show performance during training, and the star shows generalization performance.
Any real NVM element has a non-ideal response. It is nonlinear and has a limit to the maximum conductance it can achieve. The conductance change produced by a pulse designed to increase conductance differs from that produced by a pulse designed to decrease conductance, i.e., the response is asymmetric. There are variations among devices, and some devices will be inoperable, stuck in either a high conductance state or a low conductance state. Our work has shown that many of these defects cause very little decrease in ANN performance. However, nonlinearity, bounded conductance and asymmetric response cause a reduction in accuracy for the MNIST digit recognition problem from 99+% accuracy during training to something between 80% and 85%. Worse, the range of learning rates that can achieve this performance is much narrower than for a more ideal response; thus parameters such as the learning rate and the slope of the neuron response function must be carefully tuned to achieve even this reduced performance. The proposed mechanism for this is as follows: During training, many different inputs are presented to the network and the backpropagation learning rule is used to update the NVM conductances after each input (or after some small number of inputs, called a minibatch). Some weights in the network tend to evolve steadily toward some stable value, while others tend to dither up and down, sometimes increasing, other times decreasing. When the NVM response is nonlinear or asymmetric, the response to a pulse intended to decrease the absolute value of a weight will usually be stronger than the response to a pulse intended to increase it. This tends to push many of these weights toward zero, which makes the backpropagation learning rule ineffective. This phenomenon, which is a problem the prior art fails to remedy, is referred to as a network freeze-out.
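By way of illustration only, the tendency of a dithering weight to collapse toward zero may be reproduced with the following toy model. The model is an assumption made for this sketch and is not the simulation referred to above: the weight is taken as the difference of a pair of conductances, both weight increases and weight decreases are realized by a conductance-increasing pulse on one member of the pair, and the nonlinear response is modeled as a step that shrinks linearly as the conductance approaches its maximum. The step size and starting conductances are arbitrary.

```python
def apply_pulse(g, step, g_max=1.0, linear=False):
    """Return the conductance after one conductance-increasing pulse."""
    if linear:
        # Ideal device: constant change per pulse, no upper bound.
        return g + step
    # Assumed non-ideal device: the change shrinks as g nears g_max.
    return g + step * (1.0 - g / g_max)


def dither(g_plus, g_minus, n_cycles, step=0.05, linear=False):
    """Alternate 'increase weight' and 'decrease weight' requests.

    Returns the final weight, g_plus - g_minus.
    """
    for _ in range(n_cycles):
        # Request a weight increase: pulse the positive element.
        g_plus = apply_pulse(g_plus, step, linear=linear)
        # Request a weight decrease: pulse the negative element.
        g_minus = apply_pulse(g_minus, step, linear=linear)
    return g_plus - g_minus


# A positive weight of 0.6 subjected to 100 balanced up/down requests.
w_ideal = dither(0.8, 0.2, 100, linear=True)
w_real = dither(0.8, 0.2, 100, linear=False)
```

Under the ideal response the balanced requests cancel and the weight is unchanged. Under the saturating response the element holding the larger conductance responds more weakly, so each pair of requests removes a fixed fraction of the weight and the weight decays toward zero.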
Embodiments of the present invention are an improvement over prior art systems and methods.