Many researchers have recently proposed architectures for very large scale integration (VLSI) implementation of a type of neural network called a multi-layer perceptron in which the "training" is performed on-chip. A technique known as "back-propagation" has been proposed for training the network in both digital and analog implementations.
Back-propagation is a "supervised" training technique. This means that to train a network to recognise an input pattern, the "expected" output of the network associated with the input pattern needs to be known.
Back-propagation trains a network by calculating modifications in the strength factors of individual synapses in order to minimise half of the sum of the squared differences between the network output and the "expected" output (the total mean squared error or TMSE). The minimisation process is performed using the gradient of the TMSE with respect to the strength factor being modified (gradient descent). Although the gradient with respect to the strength factors in the output layer (synapses connected to the output neurons) can be easily calculated, the gradient with respect to strength factors in the hidden layers is more difficult to evaluate. Back-propagation offers an analytical technique that propagates the error backward through the network from the output in order to evaluate the gradient, and therefore to calculate the required strength factor modifications.
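The quantity being minimised and the gradient descent step can be written compactly as follows. This is only a sketch of the verbal description above: the symbols o_k (the k-th network output), d_k (the corresponding "expected" output), w (any one strength factor) and the positive step size \eta are names assumed here for illustration.

```latex
E = \frac{1}{2}\sum_{k}\left(o_k - d_k\right)^{2},
\qquad
\Delta w = -\eta\,\frac{\partial E}{\partial w}
```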
Analog implementation of back-propagation requires bi-directional synapses, which are expensive, and the generation of the derivative of neuron transfer functions with respect to their input, which is difficult.
The Madaline Rule III has been suggested as a less expensive alternative to back-propagation for analog implementation. This rule evaluates the required derivatives using "node perturbation". This means that each neuron is perturbed by an amount $\Delta net_i$, which produces a corresponding change in the TMSE. The change in value of the required strength factor $\Delta w_{ij}$ is estimated by the following equation:

$$\Delta w_{ij} = -\eta \, \frac{\Delta E}{\Delta net_i} \, x_j$$

where
$\Delta E = E_{pert} - E$, i.e., the difference between the mean squared errors produced at the output of the network for a given pair of input and training signals when a node is perturbed ($E_{pert}$) and when it is not ($E$);
$net_i = \sum_j w_{ij} x_j$;
$x_j = f(net_j)$ where $f$ is the non-linear squashing function; and
$\eta$ is a constant.
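The node perturbation estimate can be illustrated in software with a minimal sketch. It assumes a small feed-forward network without bias terms, tanh as the squashing function f, and a strength factor change of the form -eta * (dE / delta_net) * x_j. The function names (`forward`, `error`, `mr3_updates`) and the default values of `eta` and `delta_net` are hypothetical choices made for this example, not part of any described hardware.

```python
import math

def forward(layers, inputs, perturb=None):
    """Run the network; return (outputs, list of input vectors seen by each layer).

    layers[l][i][j] is the strength factor w_ij of neuron i in layer l.
    perturb = (layer, neuron, delta) adds delta to that neuron's net input.
    """
    x = list(inputs)
    layer_inputs = []
    for l, weights in enumerate(layers):
        layer_inputs.append(x)
        # net_i = sum_j w_ij * x_j
        nets = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
        if perturb is not None and perturb[0] == l:
            nets[perturb[1]] += perturb[2]
        # x_i = f(net_i), with tanh assumed as the squashing function
        x = [math.tanh(n) for n in nets]
    return x, layer_inputs

def error(outputs, targets):
    """Half the sum of squared differences between output and expected output."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

def mr3_updates(layers, inputs, targets, eta=0.1, delta_net=1e-3):
    """Estimate a change for every strength factor by perturbing one neuron at a time."""
    base_out, layer_inputs = forward(layers, inputs)
    e_base = error(base_out, targets)
    updates = []
    for l, weights in enumerate(layers):
        layer_updates = []
        for i in range(len(weights)):
            # Perturb neuron i of layer l and measure the change in error.
            pert_out, _ = forward(layers, inputs, perturb=(l, i, delta_net))
            d_e = error(pert_out, targets) - e_base
            slope = d_e / delta_net
            # One multiplication per incoming synapse: -eta * slope * x_j
            layer_updates.append([-eta * slope * xj for xj in layer_inputs[l]])
        updates.append(layer_updates)
    return updates
```

Note that the sketch needs one extra forward pass per neuron, plus one multiplication by x_j per synapse, which mirrors the per-neuron perturbation and multiplication steps the rule calls for.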
In addition to the hardware needed for the operation of the network, the implementation of the Madaline Rule III training for a neural network having N neurons in analog VLSI requires: an addressing module and wires routed to select and deliver the perturbations to each of the N neurons; multiplication hardware to compute the term $\frac{\Delta E}{\Delta net_i} x_j$ N times (if one multiplier is used then additional multiplexing hardware is required); and an addressing module and wires routed to select and read the $x_j$ terms.
If off-chip access to the gradient values is required, then the states of the neurons (x.sub.j) need to be made available off-chip as well, and this will require a multiplexing scheme and N chip pads.