The roots of the current work on neural networks (or models) can be found in a 1943 paper by McCulloch and Pitts, W. S. McCulloch and W. H. Pitts, "A logical calculus of the ideas immanent in nervous activity", Bulletin of Mathematical Biophysics, 5, 115 (1943). There the brain is modeled as a collection of neurons, each with one of two states, s.sub.i =0 (not firing) or s.sub.i =1 (firing at maximum rate). If there is a connection from neuron i to neuron j, the strength or weight of this connection is defined as w.sub.ij. Each neuron adjusts its state asynchronously according to the threshold rule: ##EQU1## where .theta..sub.i is the threshold for neuron i to fire.
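The threshold rule described above can be illustrated with a minimal sketch. The equation itself is not reproduced in this text, so the sketch assumes the usual form in which a neuron fires when the weighted sum of the other neurons' states exceeds its threshold. The function names and the indexing convention (weights[i][j] is the weight on the input arriving at neuron i from neuron j) are assumptions made for illustration only.

```python
def threshold_update(states, weights, thresholds, i):
    """Return the next 0/1 state of neuron i under the threshold rule."""
    # Assumption: weights[i][j] is the weight on the input that neuron i
    # receives from neuron j; a neuron has no connection to itself.
    total = sum(weights[i][j] * states[j]
                for j in range(len(states)) if j != i)
    return 1 if total > thresholds[i] else 0


def asynchronous_sweep(states, weights, thresholds):
    """Update the neurons one at a time, each seeing the latest states."""
    states = list(states)
    for i in range(len(states)):
        states[i] = threshold_update(states, weights, thresholds, i)
    return states
```

Because the updates are asynchronous, a neuron updated late in a sweep sees the new states of the neurons updated before it.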
A model of this sort formed the basis for the perceptron built by Rosenblatt in the early 1960s, F. Rosenblatt, "Principles of Neurodynamics: Perceptrons and the theory of brain mechanisms", Spartan Books, Washington, D.C. (1961). This perceptron consisted of an input array hard-wired to a set of feature detectors whose output can be an arbitrary function of the inputs. These outputs were connected through a layer of modifiable connection strength elements (adjustable resistors) to threshold logic units, each of which decides whether a particular input pattern is present or absent. The threshold logic units of this machine can be implemented in hardware by using a bistable device such as a Schmitt trigger, or a high-gain operational amplifier. There exists an algorithm, the perceptron convergence procedure, which adjusts the adaptive weights between the feature detectors and the decision units (or threshold logic units). This procedure is guaranteed to find a solution to a pattern classification problem, if one exists, using only the single set of modifiable weights. Unfortunately, there is a large class of problems which perceptrons cannot solve, namely those which have an order of predicate greater than 1. The Boolean operation of exclusive-or has order 2, for example. Also, the perceptron convergence procedure does not apply to networks in which there is more than one layer of modifiable weights between inputs and outputs, because there is no way to decide which weights to change when an error is made. This is the so-called "credit assignment" problem and was a major stumbling block until recent progress in learning algorithms for multi-level machines.
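The following sketch illustrates the perceptron convergence procedure and its failure on exclusive-or. It is a simplification under stated assumptions: the fixed feature detectors are omitted so that the raw inputs feed a single threshold logic unit directly, and the names train_perceptron and predict, the bias term and the learning rate are hypothetical choices, not taken from Rosenblatt's machine.

```python
def predict(weights, bias, inputs):
    """Single threshold logic unit: 1 if the weighted sum exceeds zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=50, rate=1.0):
    """Adjust one layer of weights; return (weights, bias, converged)."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error != 0:
                errors += 1
                # Move the weights toward the correct answer for this input.
                weights = [w + rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += rate * error
        if errors == 0:
            return weights, bias, True
    return weights, bias, False
```

Trained on the four input-output pairs of the Boolean AND function, this procedure settles on a correct set of weights. Trained on exclusive-or, it never completes an error-free pass, because no single set of weights separates the two classes.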
Rosenblatt's perceptron consisted of a bank of 400 photocells each of which looked at a different portion of whatever pattern was presented to it. The photocells were connected to a bank of 512 neuron-like association units which combined signals from several photocells and in turn relayed signals to a bank of threshold logic units. The threshold logic units correlated all of the signals and made an educated guess at what pattern or letter was present. When the machine guessed right, the human operator left it alone, but when it guessed wrong the operator re-adjusted the circuit's electrical connections. The effect of repeated readjustments was that the machine eventually learned which features characterized each letter or pattern. That machine thus was manually adaptive and not self-adaptive.
Another seminal idea in neural or brain models also published in the 1940s was Hebb's proposal for neural learning, D. O. Hebb, "The Organization of Behavior", Wiley, N.Y. (1949). This states that if one neuron repeatedly fires another, some change takes place in the connecting synapse to increase the efficiency of such firing, that is, the synaptic strength or weight is increased. This correlational synapse postulate has in various forms become the basis for neural models of distributed associative memory found in the works of Anderson, J. A. Anderson, J. W. Silverstein, S. A. Ritz, and R. S. Jones, "Distinctive features, categorical perception, and probability learning: Some applications of a neural model", Psych. Rev. 84, 413-451 (1977); and Kohonen, T. Kohonen, "Associative memory--A system-theoretic approach", Springer-Verlag, Berlin (1977).
Various neural transfer functions have been used in neural models. The all-or-none McCulloch-Pitts neuron is represented by a step at the threshold and can be implemented by any one of several bistable (or binary) electronic circuits. A real (or biological) neuron exhibits a transfer function comprising two horizontal lines representing zero and maximum output, connected by a linear sloping region. This characteristic is often represented by a sigmoid function shown for example in S. Grossberg, "Contour enhancement, short term memory, and constancies in reverberating neural networks", in Studies in Applied Mathematics, LII, 213, MIT Press, (1973); and T. J. Sejnowski, "Skeleton Filters in the brain", in "Parallel Models of Associative Memory", G. Hinton and J. A. Anderson (eds.), Erlbaum, Hillsdale, N.J., 189-212 (1981). An operational amplifier can be designed to have a transfer function close to the sigmoid.
Recent activity in neural network models was stimulated in large part by a non-linear model of associative memory due to Hopfield, J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities", Proc. Natl. Acad. Sci. USA, 79, 2554-2558 (1982). The neurons in this model are of the all-or-none, i.e. bistable or two-state, type with a threshold assumed to be zero. Memories, labeled k, are stored in the outer product sum over states: ##EQU2## where the (2s-1) terms have the effect of transforming the (0,1) neural states to (-1,1) states. It is apparent that for a particular memory, s.sup.1, ##EQU3## The first summation term has mean value (N-1)/2 for the j terms summed over N neurons while the last term in brackets has a mean value of zero for random (and therefore pseudo-orthogonal) memories when the sum over M memories (label k) is taken. Thus: ##EQU4## Since this is positive (>.theta..sub.i =0) if s.sub.i.sup.1 =1 and negative if s.sub.i.sup.1 =0, the state does not change under the threshold rule and is stable except for the statistical noise coming from states k.noteq.1, which has a variance of [(M-1)(N-1)/2].sup.1/2.
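The outer product storage scheme and the stability argument above can be checked on a small example. The sketch below assumes the storage rule has the form suggested by the surrounding text, namely a sum over memories of (2s.sub.i -1)(2s.sub.j -1) with no self-connections. The function names and the two eight-neuron memories used in the usage note are hypothetical.

```python
def store(memories):
    """Build symmetric weights from 0/1 memories by an outer product sum."""
    n = len(memories[0])
    weights = [[0] * n for _ in range(n)]
    for memory in memories:
        for i in range(n):
            for j in range(n):
                if i != j:  # assumed: no self-connection, w_ii = 0
                    # (2s - 1) maps the (0,1) states to (-1,1) states.
                    weights[i][j] += (2 * memory[i] - 1) * (2 * memory[j] - 1)
    return weights


def is_stable(weights, state):
    """True if no neuron would change under the zero-threshold rule."""
    n = len(state)
    for i in range(n):
        total = sum(weights[i][j] * state[j] for j in range(n) if j != i)
        if (1 if total > 0 else 0) != state[i]:
            return False
    return True
```

For example, storing the two memories [1,1,1,1,0,0,0,0] and [1,1,0,0,1,1,0,0], which are orthogonal in the (-1,1) representation, leaves both of them stable, while an unrelated pattern such as [1,0,1,0,1,0,1,0] is not.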
Hopfield's proposed neural network is fully connected and symmetric. This means that every neuron is connected to every other neuron by means of direct and reciprocal synapses of equal strengths or weights. Thus for every pair of neurons, i and j, w.sub.ij =w.sub.ji, but w.sub.ii =0. Using an analogy from physics, namely the Ising model of a spin-glass, S. Kirkpatrick and D. Sherrington, "Infinite-ranged models of spin-glasses", Phys. Rev. B 17, 4384-4403 (1978), we can define an "energy" or "cost", E, as ##EQU5## If one neuron, s.sub.k, changes state, the energy change is: ##EQU6## By the threshold rule, this change could only have occurred if the sign of the summation term were the same as the sign of .DELTA.s.sub.k. Therefore, all allowed changes decrease E and gradient descent is automatic until a local minimum is reached. This energy measure is an example of a class of systems with global Liapunov functions which exhibit stability under certain conditions, M. A. Cohen and S. Grossberg, "Absolute stability of global pattern formation and parallel memory storage by competitive neural networks", Trans. IEEE SMC-13, 815, (1983). The neural states at these minima represent the memories of the system. This is a dynamical system which, in the process of relaxation, performs a collective computation.
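The descent property can be observed numerically. The energy expression is not reproduced in this text, so the sketch below assumes the commonly used form in which E is the negative sum of w.sub.ij s.sub.i s.sub.j over distinct pairs, with zero thresholds. The function names and the weight matrix in the usage note are hypothetical.

```python
def energy(weights, state):
    """Assumed form: E = -(sum over pairs i<j of w_ij * s_i * s_j)."""
    n = len(state)
    return -sum(weights[i][j] * state[i] * state[j]
                for i in range(n) for j in range(i + 1, n))


def relax(weights, state, max_sweeps=100):
    """Apply the zero-threshold rule asynchronously until nothing changes.

    Returns the final state and the energy recorded after every change.
    """
    state = list(state)
    n = len(state)
    trace = [energy(weights, state)]
    for _ in range(max_sweeps):
        changed = False
        for k in range(n):
            total = sum(weights[k][j] * state[j] for j in range(n) if j != k)
            new_value = 1 if total > 0 else 0
            if new_value != state[k]:
                state[k] = new_value
                changed = True
                trace.append(energy(weights, state))
        if not changed:
            break
    return state, trace
```

With the symmetric weights [[0,2,-1,1],[2,0,3,-2],[-1,3,0,1],[1,-2,1,0]] and the starting state [1,0,1,1], the recorded energies never increase and the network comes to rest at [1,1,1,0], a local minimum.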
Integrated circuits implementing this type of associative memory have been made by groups at the California Institute of Technology, M. Sivilotti, M. Emerling, and C. Mead, "A Novel Associative Memory Implemented Using Collective Computation", Proceedings of the 1985 Chapel Hill Conference on Very Large Scale Integration, p. 329; and at AT&T Bell Laboratories, H. P. Graf et al., "VLSI Implementation of a Neural Network Memory with Several Hundreds of Neurons", Proceedings of the Conference on Neural Networks for Computing, p. 182, Amer. Inst. of Phys., 1986. A system of N neurons has O(N/log N) stable states and can store about 0.15N memories (N.apprxeq.100) before noise terms make it forget and make errors. Furthermore, as the system nears capacity, many spurious stable states also creep into the system, representing fraudulent memories. The search for local minima demands that the memories be uncorrelated, but correlations and generalizations therefrom are the essence of learning. A true learning machine, which is the goal of this invention, must establish these correlations by creating "internal representations" and searching for global (i.e. network-wide) minima, thereby solving a constraint satisfaction problem where the weights are constraints and the neural units represent features.
Perceptrons were limited in capability because they could only solve problems that were first order in their feature analyzers. If however extra (hidden) layers of neurons are introduced between the input and output layers, higher order problems such as the Exclusive-Or Boolean function can be solved by having the hidden units construct or "learn" internal representations appropriate for solving the problem. The Boltzmann machine has this general architecture, D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines", Cognitive Science 9, 147-169 (1985). A Boltzmann machine is a neural network (or simulation thereof) which uses the Boltzmann algorithm to achieve learning. In the Boltzmann machine, unlike the strictly feedforward perceptron, connections between neurons run both ways and with equal connection strengths, i.e. the connections are symmetric, as in the Hopfield model. This assures that the network can settle by gradient descent in the energy measure: ##EQU7## where .theta..sub.i are the neuron thresholds. These threshold terms can be eliminated by assuming that each neuron is connected to a permanently "on" true unit by means of a connection of strength w.sub.i true =-.theta..sub.i to neuron i. Thus the energy may be restated as: ##EQU8## while the energy gap or difference between a state with neuron k "off" and with the same neuron "on" is ##EQU9##
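The elimination of the thresholds by a permanently "on" true unit can be verified with a short sketch. The energy expressions are not reproduced in this text, so the forms used here are assumptions: a pairwise term of the Hopfield type plus a sum of .theta..sub.i s.sub.i for the version with thresholds, and an energy gap equal to the weighted sum of the states of the other units. All function names are hypothetical.

```python
def energy_with_thresholds(weights, thresholds, state):
    """Assumed form: E = -sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i."""
    n = len(state)
    pairs = sum(weights[i][j] * state[i] * state[j]
                for i in range(n) for j in range(i + 1, n))
    return -pairs + sum(thresholds[i] * state[i] for i in range(n))


def add_true_unit(weights, thresholds):
    """Append a unit that is always on, connected with strength -theta_i."""
    n = len(thresholds)
    bigger = [list(weights[i]) + [-thresholds[i]] for i in range(n)]
    bigger.append([-thresholds[i] for i in range(n)] + [0])
    return bigger


def energy(weights, state):
    """Threshold-free form: E = -sum_{i<j} w_ij s_i s_j."""
    n = len(state)
    return -sum(weights[i][j] * state[i] * state[j]
                for i in range(n) for j in range(i + 1, n))


def energy_gap(weights, state, k):
    """Assumed form: energy with unit k off minus energy with unit k on."""
    return sum(weights[k][i] * state[i]
               for i in range(len(state)) if i != k)
```

In use, the state vector of the enlarged network is the original state with a final 1 appended for the true unit. Both energy functions then give the same value, and the energy gap equals the difference between the "off" and "on" energies of the chosen unit.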
Instead of a deterministic threshold, neurons in the Boltzmann machine have a probabilistic rule such that neuron k has state s.sub.k =1 with probability: ##EQU10## where T is a parameter which acts like temperature in a physical system. The output of the neuron is always either 0 or 1, but its probability distribution is sigmoid, so, on the average, its output looks like the sigmoid. Note that as T approaches 0, this distribution reduces to a step (on-off) function. This rule allows the system to jump occasionally to a higher energy configuration and thus to escape from local minima. This machine gets its name from the mathematical properties of thermodynamics set forth by Boltzmann.
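The probabilistic rule can be sketched as follows. The probability expression is not reproduced in this text, so the logistic form 1/(1+exp(-gap/T)) used in the cited Ackley, Hinton and Sejnowski paper is assumed here, where gap is the energy gap of the neuron. The function names are hypothetical, and the two-branch evaluation is only a numerical precaution against overflow at low temperature.

```python
import math
import random


def on_probability(gap, temperature):
    """Assumed logistic rule: probability that the neuron turns on."""
    x = gap / temperature
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow when x is large and negative
    return z / (1.0 + z)


def stochastic_state(gap, temperature, rng):
    """Draw a 0/1 output; rng is a random.Random instance."""
    return 1 if rng.random() < on_probability(gap, temperature) else 0
```

A zero energy gap gives an "on" probability of one half at any temperature. As the temperature is lowered toward zero, the probability approaches 1 for a positive gap and 0 for a negative gap, which recovers the step function of the deterministic rule.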
While the Hopfield model uses local minima as the memories of the system, the Boltzmann machine uses simulated annealing to reach a global, network-wide energy minimum since the relative probability of two global states A and B follows the Boltzmann distribution: ##EQU11## and thus the lowest energy state is most probable at any temperature. Since, at low temperatures, the time to thermal equilibrium is long, it is advisable to anneal by starting at a high temperature and gradually reducing it. This is completely analogous to the physical process of annealing damage to a crystal where a high temperature causes dislocated atoms to jump around to find their lowest energy state within the crystal lattice. As the temperature is reduced the atoms lock into their proper places within the lattice. The computation of such annealing is complex and time-consuming for two reasons. First, the calculation involves imposing probability distributions and physical laws on the motions of particles. Second, the computations are serial. A physical crystal's atoms naturally obey physical laws without calculation and they obey these laws in parallel. For the same reasons the Boltzmann machine simulations on computers are also complex and time-consuming, since they involve the use of Eq. (10) to calculate the "on" probability of neurons. The present invention utilizes physical noise mechanisms to jitter or perturb the "on" probability of the electronic neurons.
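A serial simulation of annealing, of the kind the paragraph above describes as time-consuming, can be sketched in a few lines. The sketch assumes the logistic "on" probability and a Boltzmann ratio of the form exp(-(E.sub.A -E.sub.B)/T). The function names, the bias list (standing for the connections to the true unit) and the temperature schedule are hypothetical choices for illustration.

```python
import math
import random


def on_probability(gap, temperature):
    """Assumed logistic rule, evaluated without overflow."""
    x = gap / temperature
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def boltzmann_ratio(energy_a, energy_b, temperature):
    """Assumed relative probability P_A / P_B of two global states."""
    return math.exp(-(energy_a - energy_b) / temperature)


def anneal(weights, bias, temperatures, sweeps_per_temperature=20, seed=0):
    """Visit the temperatures in order, updating every unit serially."""
    rng = random.Random(seed)
    n = len(bias)
    state = [rng.randint(0, 1) for _ in range(n)]
    for temperature in temperatures:
        for _ in range(sweeps_per_temperature):
            for k in range(n):
                # Energy gap of unit k, including its true-unit connection.
                gap = bias[k] + sum(weights[k][j] * state[j]
                                    for j in range(n) if j != k)
                p_on = on_probability(gap, temperature)
                state[k] = 1 if rng.random() < p_on else 0
    return state
```

For a two-unit network with weights [[0,5],[5,0]] and bias [1,1], the schedule [10.0, 1.0, 0.01] ends with both units on, which is the lowest energy state. Every unit update costs one evaluation of the exponential, and the updates are performed one after another, which is the serial cost noted above.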
The "credit assignment" problem that blocked progress in multi-layer perceptrons can be solved in the Boltzmann machine framework by changing weights in such a way that only local information is used. The conventional Boltzmann learning algorithm works in two phases. In phase "plus" the input and output units are clamped to a particular pattern that is desired to be learned while the network relaxes to a state of low energy aided by an appropriately chosen annealing schedule. In phase "minus", the output units are unclamped and the system also relaxes to a low energy state while keeping the inputs clamped. The goal of the learning algorithm is to find a set of synaptic weights such that the "learned" outputs in the minus phase match the desired outputs in the plus phase as nearly as possible. The probability that two neurons i and j are both "on" in the plus phase, P.sub.ij.sup.+, can be determined by counting the number of times they are both activated averaged across some or all patterns (input-output mappings) in the training set. For each mapping, co-occurrence statistics are also collected for the minus phase to determine P.sub.ij.sup.-. Both sets of statistics are collected at thermal equilibrium, that is, after annealing. After sufficient statistics are collected, the weights are then updated according to the relation:
.DELTA.w.sub.ij =n(P.sub.ij.sup.+ -P.sub.ij.sup.-) (12)
where n scales the size of each weight change.
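The weight update of Eq. (12) can be sketched directly from co-occurrence counts. The sketch assumes that the equilibrium states sampled in each phase are already available as lists of 0/1 vectors; how they are obtained (clamping and annealing) is outside the sketch. The function names and the rate argument, which plays the role of n, are hypothetical.

```python
def cooccurrence(samples):
    """Fraction of samples in which units i and j are both on."""
    n = len(samples[0])
    counts = [[0.0] * n for _ in range(n)]
    for state in samples:
        for i in range(n):
            for j in range(n):
                if i != j and state[i] == 1 and state[j] == 1:
                    counts[i][j] += 1.0
    return [[c / len(samples) for c in row] for row in counts]


def boltzmann_update(weights, plus_samples, minus_samples, rate):
    """Return new weights: w_ij + rate * (P_ij_plus - P_ij_minus)."""
    p_plus = cooccurrence(plus_samples)
    p_minus = cooccurrence(minus_samples)
    n = len(weights)
    return [[weights[i][j] + rate * (p_plus[i][j] - p_minus[i][j])
             for j in range(n)] for i in range(n)]
```

Each weight change depends only on the statistics of the two units that the weight connects, and the update leaves a symmetric weight matrix symmetric.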
It can be shown that this algorithm minimizes an information theoretic measure of the discrepancy between the probabilities in the plus and minus states. It thus teaches the system to give the desired outputs. An important point about this procedure is that it uses only locally available information, the states of two connected neurons, to decide how to update the weight of the synapse connecting them. This makes possible a very large scale integrated (VLSI) circuit implementation where weights can be updated in parallel without any global information and yet optimize a global measure of learning.
Recently a promising deterministic algorithm for feedforward neural networks has been found which takes less computing time for solving certain problems, D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning internal representations by error propagation", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, D. E. Rumelhart and J. L. McClelland (eds.), MIT Press, Cambridge, MA (1986), p. 318. This algorithm also uses a generalization of the perceptron convergence procedure in a variation due to Widrow and Hoff called the delta rule, B. Widrow and M. E. Hoff, "Adaptive switching circuits", Inst. of Radio Engineers, Western Electronic Show and Convention, Convention Record, Part 4, 96-104 (1960). This rule is applied to layered feedforward networks in which only one-way, or forward, synapses connect adjacent layers of the network. The neurons have a graded semi-linear transfer function similar to a sigmoid wherein the output, .sigma., is a differentiable function of the total input to the neuron. This algorithm involves first propagating the input training pattern forward to compute the values of .sigma..sup.-. The output is then compared to the target outputs .sigma..sup.+ to yield an error signal, .delta., for each output unit. The error signals are then recursively propagated backward, with the synaptic weights changed accordingly. This backward error propagation will result in learning.
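One forward and backward pass of this procedure can be sketched for a network with one hidden layer and a single output unit. The sketch assumes logistic (sigmoid) units and a squared-error measure, as in the generalized delta rule. Bias terms are omitted for brevity, and the function names, weights and learning rate are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, inputs):
    """Propagate the input pattern forward through both layers."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
              for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def backprop_step(w_hidden, w_out, inputs, target, rate):
    """Return updated weights after one pattern presentation."""
    hidden, output = forward(w_hidden, w_out, inputs)
    # Error signal of the output unit (derivative of the logistic included).
    delta_out = (target - output) * output * (1.0 - output)
    # Error signals of the hidden units, propagated back through w_out.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    new_w_out = [w_out[j] + rate * delta_out * hidden[j]
                 for j in range(len(hidden))]
    new_w_hidden = [[w_hidden[j][i] + rate * delta_hidden[j] * inputs[i]
                     for i in range(len(inputs))]
                    for j in range(len(hidden))]
    return new_w_hidden, new_w_out
```

The hidden-unit error signals depend on the output error signal and on the weights of the layer above. This is the non-local information that must be passed backward through the network before the lower weights can be changed.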
Both the Boltzmann and the back-propagation procedures learn. They both create the internal representations required to solve a problem by establishing hidden units as features and connection strengths as constraints. Then, by doing a global search of a large solution space, they solve the problem. While a back-propagation procedure is computationally more efficient than the Boltzmann algorithm, it is not as suitable for VLSI implementation. Firstly, in the back-propagation procedure, except for the weights feeding the final output layer of neurons, adjusting the weights requires non-local information that must be propagated down from higher layers. This necessitates synchrony and global control and would mean that weight processing could not be a parallel operation. Secondly, the network must be specified in advance as to which units are input, hidden, and output because there would have to be special procedures, controls, and connections for each layer as well as different error formulae to calculate. Thirdly, the deterministic algorithm has some unaesthetic qualities. The weights cannot start at zero, or the hidden units will receive identical error signals from the outputs so that the weights cannot grow unequal. This means that the system must first be seeded with small random weights. This also means that if no error is made, no learning takes place. Additionally, a deterministic algorithm may be more likely to get stuck in local minima. Finally, there is no clear way to specify at what activation level a neuron is "on", or what should be the output target value without a real threshold step for the output. A real-valued floating point comparison and its backward propagation is quite difficult to implement in a parallel VLSI system although it could be accomplished by having separate specialized units for that task.
In contrast, the Boltzmann algorithm uses purely local information for adjusting weights and is suitable for parallel asynchronous operation. The network looks the same everywhere and need not be specified in advance. The neurons have two stable states, ideal for implementation in digital circuitry. The stochastic nature of the computation allows learning to take place even when no error is made and avoids getting stuck in local minima. Finally, the processes in the algorithm which take so much time on a conventional digital, serial computer are annealing and settling to equilibrium, both of which can be implemented efficiently and naturally on a chip using the physical properties of analog voltages rather than digital computation.
Prior art patents in this field include the Hiltz U.S. Pat. No. 3,218,475, issued on Nov. 16, 1965. This patent discloses an on-off type of artificial neuron comprising an operational amplifier with feedback. The Jakowatz U.S. Pat. No. 3,273,125, issued on Sept. 13, 1966 discloses a self-adapting and self-organizing learning neuron network. This network is adaptive in that it can learn to produce an output related to the consistency or similarity of the inputs applied thereto. The Martin U.S. Pat. No. 3,394,351, issued on July 23, 1968 discloses neuron circuits with sigmoid transfer characteristics which circuits can be interconnected to perform various digital logic functions as well as analog functions.
The Rosenblatt U.S. Pat. No. 3,287,649, issued Nov. 22, 1966 shows a perceptron circuit which is capable of speech pattern recognition. The Winnik et al. U.S. Pat. No. 3,476,954, issued on Nov. 4, 1969 discloses a neuron circuit including a differential amplifier, 68, in FIG. 2. The Cooper et al. U.S. Pat. No. 3,950,733, issued on Apr. 13, 1976 discloses an adaptive information processing system including neuron-like circuits called mnemonders which couple various ones (or a multiplicity) of the input terminals with various ones (or a multiplicity) of the output terminals. Means are provided for modifying the transfer function of these mnemonders in dependence on the product of at least one of the input signals and one of the output responses of what they call a Nestor circuit.
None of these patents utilizes the Boltzmann algorithm or any variation thereof as part of the learning process, none utilizes simulated annealing, and none of these circuits is particularly suitable for VLSI implementation.