1. Field of Invention
This invention relates to building computers with a type of intelligence that is typical of humans and animals. In particular, a parallel distributed signal processing device is described that uses Hebbian learning rules to train spiking neural networks for data recognition and prediction.
2. Discussion of Prior Art
Computational Methods for Internal Models
Animal learning psychologists, engineers, and philosophers have long speculated that basic aspects of cognition may be explained by the hypothesis that the brain learns and uses internal models. Internal models may underlie aspects of imagination and dreaming. Internal models can be used to learn sequences and to form novel associative chains by connecting learned sequences. Two types of internal models are usually distinguished: forward models predict future experience, whereas inverse models estimate the past causes of the current experience. I use the term “internal model” according to the definition used in the engineering sciences: if a set of sensors measures the past and current inputs as well as the past outputs of a dynamic system, a correct internal model is able to predict the current system output. Furthermore, the potential future evolution of the sensory input can be emulated by repeatedly using forward models to predict the next sensory input. A correct internal model emulates the experience of an agent in the real world by providing the sensory experience that would have resulted from its actions without actually executing them. This enables the agent to evaluate the consequences of potential actions and to select the action with the best predicted outcome.
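The role of a forward model in action selection can be illustrated with a minimal sketch. The linear dynamics, the matrices `A` and `B`, and the goal state below are hypothetical and purely illustrative; the sketch only shows the general principle of emulating the consequences of candidate actions without executing them.

```python
import numpy as np

# Hypothetical learned forward model of a linear dynamic system
# (illustrative parameter values only).
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # state-transition matrix
B = np.array([[0.0], [0.1]])             # action-effect matrix

def predict_next(state, action):
    """Forward model: emulate the sensory input an action would produce."""
    return A @ state + B * action

def select_action(state, candidate_actions, goal):
    """Evaluate candidate actions via the forward model, without executing them."""
    def predicted_error(a):
        return np.linalg.norm(predict_next(state, a) - goal)
    return min(candidate_actions, key=predicted_error)

state = np.array([[0.0], [0.0]])
goal = np.array([[0.0], [0.5]])
best = select_action(state, [-1.0, 0.0, 1.0, 5.0], goal)
print(best)  # 5.0 brings the predicted state closest to the goal
```

The agent never changes the real system state; it only compares predicted outcomes and picks the action with the smallest predicted error.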
Internal models have widespread applications in the engineering sciences for the control of physical systems with temporal dynamics. They are often represented by a set of linear differential equations. Several methods have been developed that fit the correct model parameter values for a given set of linear differential equations by using the experienced sensor signals. These computational methods are usually called parameter identification or system identification methods. The mathematical methods for system identification are reviewed by Ljung and Söderström, “System Identification,” MIT Press (1987). One of these methods is Correlation Analysis, which uses temporal correlations between the inputs and the outputs to compute the values of the model parameters that minimize the difference between the predicted and the actual outputs. Correlation Analysis is reviewed by Godfrey, “Correlation Methods,” Automatica 16, 527-534 (1980). Similar to the spike-coding used for the current invention, Correlation Analysis filters the input signals such that they become uncorrelated and the covariance matrix becomes diagonal. Correlation Analysis and the current invention share the crucial advantage that the inversion of this covariance matrix is trivial, which simplifies the system identification task. However, Correlation Analysis has several drawbacks compared to the method proposed here. First, its applications are limited, as it only deals with linear dynamic systems. Second, since Correlation Analysis is not a parallel distributed method, unlike neural networks, hardware implementations of it would be slow for rich data sources. Third, Correlation Analysis uses analogue input and output signals instead of spike-coding and is therefore not robust to the inaccurate electronic elements that are typically used in hardware implementations such as VLSI circuits.
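The advantage of a diagonal covariance matrix can be made concrete with a sketch of Correlation Analysis. The impulse response and signal lengths below are illustrative assumptions: with a white-noise input the input autocovariance is diagonal, so the model parameters can be read off directly from input-output cross-correlations, with no matrix inversion beyond division by the input variance.

```python
import numpy as np

# Illustrative sketch of Correlation Analysis for system identification.
rng = np.random.default_rng(0)
h_true = np.array([0.5, 0.3, 0.1])     # hypothetical unknown impulse response

u = rng.standard_normal(20000)          # white-noise input, unit variance
y = np.convolve(u, h_true)[:len(u)]     # measured system output

# Because u is white, the input covariance matrix is (approximately) the
# identity, and h[k] ~ E[y(t) u(t-k)] / Var(u): a trivial "inversion".
h_est = np.array([np.mean(y[k:] * u[:len(u) - k]) for k in range(3)])
print(np.round(h_est, 2))
```

With correlated inputs, the same estimate would require inverting a full input covariance matrix, which is exactly the step that white (decorrelated) inputs render trivial.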
Fourth, for many problems, some of the information is not directly accessible to the sensors, such as a driving car that is temporarily hidden behind a house. Many algorithms reconstruct such information as so-called hidden states. The current invention uses non-sensory, hidden neurons to represent such hidden states, with methods including the computation of the first principal component. Unfortunately, Correlation Analysis does not provide a method to estimate these hidden states.
Instead of expressing an internal model by linear differential equations, data may be represented in a binary format such that they can be modeled by a Hidden Markov Model (HMM). HMMs assume that exactly one node in a network of nodes is activated in each time step and that this activation is transmitted to other nodes with certain transition probabilities. These transition probabilities serve as the model parameters that are acquired, or learned, by experience. Most methods for fitting the transition probabilities of HMMs are suitable neither for parallel distributed processing nor for on-line learning. Some on-line learning algorithms have been proposed for fitting the transition probabilities (Baldi and Chauvin, “Smooth On-Line Learning Algorithms for Hidden Markov Models,” Neural Computation, Vol. 6, 2, 305-316, 1994). These authors derive their algorithms using several approximations. Unfortunately, no computer simulations are shown, which leaves some doubt about the performance of these algorithms. Furthermore, the capabilities of HMMs are limited, as they require that exactly one non-exit node be activated in each time step. (This requirement does not apply to the trivial case of exit nodes that cannot influence the activations of other nodes.) In contrast, for the current invention, the number of spikes occurring at the same time in the subset of network components that influence other nodes in the network can differ from one. The restriction of HMMs to a single activation per time step requires coordinating, for every time step, the activation of all nodes that have non-zero activation probabilities. The need for such a coordination function would hamper the efficiency of hardware implementations. The use of spike-coding by the current invention avoids this problem.
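The single-activation constraint can be seen in a minimal sketch of an HMM state chain. The transition matrix below is a hypothetical example; the point is that each time step requires one globally coordinated draw over all candidate successor nodes, normalized across the whole row.

```python
import numpy as np

# Illustrative HMM state chain: exactly one node is active per time step.
rng = np.random.default_rng(1)

# Hypothetical transition probabilities: row i gives P(next node | node i).
# These rows are the model parameters learned from experience.
T = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])

def step(current):
    """A single global draw selects the one node active at the next step."""
    return int(rng.choice(len(T), p=T[current]))

trajectory = [0]
for _ in range(10):
    trajectory.append(step(trajectory[-1]))
print(trajectory)  # one active node index per time step
```

Note that `step` must consult the probabilities of all successor nodes at once, the per-step coordination that the text argues is costly in hardware, whereas spike-coding allows any number of simultaneous spikes and needs no such global normalization.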
There is a large body of literature on learning sequences and identifying non-linear dynamic systems with artificial neural networks. These neural networks typically use variants of the backpropagation learning rule, and some variants were developed for on-line learning (see Williams and Peng, “An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories,” Neural Computation, 2, 490-501, 1990). Due to the peculiar features of time series, the traditional backpropagation algorithm becomes very complicated for on-line learning.
Computational Methods for Data Representation
Another characteristic feature of human intelligence is the ability to categorize sensory information without a teacher, which is also called unsupervised learning. Humans are able to summarize sensory information, for example by categorizing, recognizing, and naming objects. Data compression methods and source separation methods perform such data representation tasks. The most popular data compression method is Principal Component Analysis (PCA), which compresses the data such that the error relative to the true data is minimal for a given number of components. PCA can usually be performed by using a Singular Value Decomposition (SVD). If a continuous time variable is involved, PCA is called the Karhunen-Loève expansion. PCA is also closely related to Eigenanalysis, to the Hotelling transform, and to a popular type of factor analysis. I summarize these methods under the term PCA. As an alternative to data compression methods, source separation methods can be used for data representation, such as Independent Component Analysis (ICA). ICA separates data into factors that are statistically independent. PCA and ICA can be computed with parallel distributed algorithms that have been used to train artificial neural networks (U.S. Pat. No. 5,706,402, January 1998, Bell; Baldi and Hornik, “Learning in Linear Neural Networks: A Survey,” IEEE Trans. ASSP 6: 837-858, 1995). Parallel distributed algorithms for computing PCA and ICA have also been developed for non-linear problems (Hyvärinen and Oja, “Independent Component Analysis by General Non-linear Hebbian-like Learning Rules,” Signal Processing, 64(3):301-313, 1998; Oja et al., “Principal and independent components in neural networks—Recent developments,” in Proc. VII Italian Wkshp. Neural Nets WIRN'95, Vietri sul Mare, Italy, 1995).
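The relation between PCA and SVD mentioned above can be sketched in a few lines. The anisotropic test data below are synthetic and illustrative; the sketch shows that the first right singular vector of the centered data matrix is the first principal component, and that projecting onto it gives the best rank-one compression.

```python
import numpy as np

# Illustrative sketch of PCA via SVD on synthetic anisotropic data.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.1])  # synthetic data
Xc = X - X.mean(axis=0)                                       # center the data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                     # first principal component (unit vector)
scores = Xc @ pc1               # data compressed to a single coordinate
X_hat = np.outer(scores, pc1)   # rank-1 reconstruction minimizing the error

# The first component aligns with the largest-variance axis (column 0 here).
print(np.round(np.abs(pc1), 2))
```

The singular values in `s` are sorted in decreasing order, so keeping the top components in turn gives the minimum-error compression for any chosen number of components.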
To analyze binary data with PCA or ICA, the same software algorithms have been used as those developed for analogue data (Zuo, Wang, and Tan, “PCA-Based Personal Handwriting Identification,” International Conference on Image and Graphics (ICIG), 2002; Himberg and Hyvärinen, “Independent component analysis for binary data: An experimental study,” in Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2001), San Diego, Calif., 2001). In contrast to the current invention, these algorithms are based on analogue signals and are thus sensitive to inaccurate processing elements, which hampers efficient hardware implementations. The application of these algorithms to binary input and output signals of a neuron-like element is not trivial, since the principal component of a binary signal is not a binary signal.
Hopfield networks have also been used for data representation. They are particularly suitable for filtering out noise in binary patterns in order to recognize learned binary patterns. Patterns are presented to Hopfield networks for a sufficiently long time that the dynamics of the network equilibrate to the minimum energy, which serves as an error function. After equilibrium has been reached, the interaction strengths (weights) between neuron-like units are updated with a Hebbian learning rule. Each neuron is typically required to receive input from all other neurons in the network, and the interaction strengths between two neurons are often required to be symmetric. Boltzmann machines are an improvement on Hopfield networks. Boltzmann machines use stochastic neurons, where the random noise decreases during learning, similar to the annealing of a spin glass. Boltzmann machines have been implemented in hardware to increase the processing speed (Skubiszewski, M., “An Exact Hardware Implementation of the Boltzmann Machine,” 4th IEEE Symposium on Parallel and Distributed Processing, 1992). Unfortunately, Hopfield networks and Boltzmann machines require substantial processing power, as the network interactions have to converge to a stable equilibrium before the next pattern can be presented. Such a convergence phase is not required for the current invention. Hopfield networks and Boltzmann machines use binary coding, whereas the current invention uses spike-coding. In contrast to the spikes of the current neural network, the states in Hopfield networks and Boltzmann machines do not mark time points, as the result of the computation is given by the final state after convergence, and the time when a state changes is irrelevant. In contrast to the current invention, Hopfield networks and Boltzmann machines typically have to be fully connected, and the interaction strength between each pair of connected neurons has to be symmetric, which limits the capabilities of these networks.
Furthermore, to train the synaptic strengths of hidden neurons, Hebbian and anti-Hebbian rules need to be applied in alternation, which is computationally demanding.
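The Hopfield scheme described above, Hebbian outer-product storage, symmetric fully connected weights, and a convergence phase before the result can be read out, can be sketched minimally. The pattern and update schedule below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of Hopfield pattern storage and noisy recall.
pattern = np.array([1, -1, 1, -1, 1, -1, 1, -1])

# Hebbian outer-product storage: symmetric weights, no self-connections.
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0.0)

def recall(state, sweeps=5):
    """Asynchronous updates that descend the network's energy function."""
    state = state.copy()
    for _ in range(sweeps):
        for i in range(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

noisy = pattern.copy()
noisy[0] = -noisy[0]          # flip one bit to simulate noise
print(recall(noisy))          # converges back to the stored pattern
```

Note that the answer is only available after the update sweeps have converged to a stable state; this convergence phase, and the required weight symmetry `W == W.T`, are exactly the properties criticized in the text.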
Biological Neural Network Models
The structure and operations of biological neural networks are extremely complex and involve many physical, biological, and chemical processes. A biological nervous system is a complex network of nerve cells (neurons) that receives and processes external stimuli to produce, exchange, and store information. A neuron may be described as a cell that receives electric signals from other neurons and has an output terminal, called the axon, for exporting a signal to other neurons. The signals exchanged between neurons contain action potentials (spikes), which are pulses of about equal amplitude and shape for a given neuron class. The junction between two neurons is called a synapse. Human learning and memory are believed to be stored in the strengths of these synapses. Various simplified neural models have been developed based on certain aspects of biological nervous systems. One description of the operation of a general neural network is as follows. An action potential generated by a presynaptic neuron produces a postsynaptic potential in a postsynaptic neuron. The membrane potential of the postsynaptic neuron is the sum of all the postsynaptic potentials caused by input spikes. The soma of the postsynaptic neuron generates an action potential if the summed potential exceeds a threshold potential. This action potential then propagates through the axon and its branches to the synapses of other neurons. The above process forms the basis for information processing, storage, and exchange in many neural network models. In such models, the synaptic strengths are also called weights. These weights are often compared to parameters of optimization algorithms, since it is assumed that the synaptic strengths in biological networks are optimized during learning for a currently unknown task. See Bose and Liang, “Neural network fundamentals with graphs, algorithms, and applications,” McGraw-Hill (1996).
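The summation-and-threshold operation just described can be sketched in a few lines. The weights and threshold below are illustrative values, not parameters from any particular model in the literature.

```python
import numpy as np

# Illustrative sketch of the summation-and-threshold neuron model:
# postsynaptic potentials are summed, and a spike is emitted when the
# membrane potential exceeds the threshold.
weights = np.array([0.4, 0.3, 0.5])   # synaptic strengths (weights)
threshold = 1.0

def membrane_step(input_spikes):
    """Sum the weighted input spikes; fire if the sum exceeds threshold."""
    v = float(weights @ input_spikes)  # membrane potential for this step
    spike = 1 if v > threshold else 0
    return spike, v

spike, v = membrane_step(np.array([1, 1, 1]))  # all presynaptic neurons fire
print(spike, v)  # 1 1.2 -> summed potential 1.2 exceeds the threshold
```

With only a single active input, e.g. `membrane_step(np.array([1, 0, 0]))`, the potential stays below threshold and no spike is emitted.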
Recent experimental findings revealed that adaptation of the connection strengths of synapses between cortical pyramidal neurons depends on the time interval between the presynaptic and the postsynaptic spike. These studies demonstrated long-term potentiation (LTP, weight increase) of synaptic strengths between cortical pyramidal neurons if the presynaptic spike precedes the postsynaptic spike, and long-term depression (LTD, weight decrease) if the temporal order is reversed (reviewed by Roberts and Bell, “Spike timing dependent synaptic plasticity in biological systems,” Biol Cybern. 87(5-6):392-403, 2002). This biological form of learning is also referred to as Temporally Asymmetric Hebbian (TAH) learning or Spike-Timing Dependent (STD) learning and is believed to be the biological basis of cortical learning. This finding triggered a series of simulation studies that used TAH learning to train computational models of spiking neurons using biologically plausible assumptions. These studies have not yet led to a consensus on what types of tasks biological networks may learn with TAH learning, since this seems to depend on how the neurons and TAH learning are modeled, how the neurons are connected, what the initial synaptic strengths are, how the output spikes are interpreted, how the task is chosen, and how the input spike trains are chosen.
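The temporally asymmetric dependence on spike timing can be illustrated with a common exponential-window form of the TAH/STDP rule. The window shape and the parameter values `a_plus`, `a_minus`, and `tau` are illustrative assumptions, not the rule of any specific study cited above.

```python
import numpy as np

# Illustrative TAH/STDP weight update with an exponential timing window:
# LTP when the presynaptic spike precedes the postsynaptic spike,
# LTD when the temporal order is reversed.
def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    dt = t_post - t_pre                     # ms; positive: pre before post
    if dt > 0:
        w += a_plus * np.exp(-dt / tau)     # LTP: weight increase
    else:
        w -= a_minus * np.exp(dt / tau)     # LTD: weight decrease
    return w

w = 0.5
w = stdp_update(w, t_pre=10.0, t_post=15.0)  # pre precedes post -> LTP
print(w > 0.5)  # True
```

Reversing the spike order, `stdp_update(0.5, t_pre=15.0, t_post=10.0)`, decreases the weight, reproducing the asymmetry reported in the biological experiments.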
A simulation study demonstrated that TAH learning may be useful for learning navigational maps (Gerstner, W., and Abbott, L. F., “Learning navigational maps through potentiation and modulation of hippocampal place cells,” J Comput Neurosci 4, 1: 79-94, 1997). Other studies demonstrated that TAH learning may adapt synaptic strengths such that spike sequences can be learned by chains of neurons (Levy, W. B., “A sequence predicting CA3 is a flexible associator that learns and uses context to solve hippocampal-like tasks,” Hippocampus 6, 6: 579-590, 1996; Rao and Sejnowski, “Spike-timing-dependent Hebbian plasticity as temporal difference learning,” Neural Comput 13, 10: 2221-2237, 2001; August and Levy, “Temporal sequence compression by an integrate-and-fire model of hippocampal area CA3,” J Comput Neurosci 6, 1: 71-90, 1999; Porr, B., and Wörgötter, F., “Isotropic sequence order learning,” Neural Comput 15, 4: 831-864, 2003). Furthermore, it was suggested that TAH learning may be related to classical conditioning (Roberts, P. D., and Bell, C. C., “Spike timing dependent synaptic plasticity in biological systems,” Biol Cybern. 87(5-6): 392-403, 2002). A further line of research on cerebellar neurons suggested that TAH learning can be used for sensory image cancellation (Roberts and Bell, “Computational consequences of temporally asymmetric learning rules: II. Sensory image cancellation,” J Comput Neurosci 9(1): 67-83, 2000; Williams et al., “Stability of negative-image equilibria in spike-timing-dependent plasticity,” Phys Rev E Stat Nonlin Soft Matter Phys 68(2 Pt 1): 021923, 2003). The properties of these computer models are not sufficiently understood for them to be used for learning more complex tasks. In contrast to the current invention, none of these articles claimed that these models optimize an error or energy function, or explained how to use these models for information processing.
Therefore, the goal or task of these models is not known in a mathematical sense, but only intuitively as “image cancellation” or “sequence learning”. In contrast to the current invention, these algorithms do not have a clearly defined purpose. In particular, it is typically unclear what the correct, best, or desired performance of the algorithm after learning is supposed to be. In addition, it is not known which conditions have to be fulfilled for these models to show a useful performance. These conditions include the initial, minimal, and maximal values of the weights, the maximal lengths of sequences, the duration of delays, the neuronal connectivity, the transmission delays between neurons, the number of neurons, the normalization of synaptic strengths, as well as the choice of the functions used in the learning rule and for modeling the neuron. This lack of analytical knowledge makes it unlikely that larger neural networks would demonstrate a performance that is useful for engineering science applications. Furthermore, these algorithms are computationally demanding, since many biological mechanisms were modeled that may not be required for their performance.
There have been some attempts to derive a TAH learning rule from an optimization criterion (Pfister, Barber, and Gerstner, “Optimal Hebbian Learning: A Probabilistic Point of View,” in ICANN/ICONIP 2003, Istanbul, Turkey, Jun. 26-29, 2003: 92-98). This publication appeared after submission of the PPA for the current invention. These authors derived a spike timing-dependent learning rule from an optimization criterion for a neuron with a noisy membrane potential that receives input through a single synapse. Their learning rule maximizes the likelihood of observing a postsynaptic spike train with a desired timing, given the firing rate. Unfortunately, this optimization criterion has only been expressed in neurophysiologic terms, and the mathematically interested reader is left wondering what it could be used for. Furthermore, the learning rule was only derived for a single synapse, and the authors do not provide any arguments suggesting that this optimization criterion would also be maximized if multiple synapses were trained. Moreover, the derived learning rule decreases the synaptic strengths for most time intervals between presynaptic and postsynaptic spikes. This is particularly troublesome since the synaptic strengths decrease for all long time intervals between pre- and postsynaptic spikes. Therefore, the synaptic strengths would usually all decrease to zero. This raises the suspicion that the algorithm may not provide a useful result, in particular since these authors do not show simulations to demonstrate the performance of their algorithm.
Hardware Implementations of TAH Learning
Many components of neural networks have been implemented in hardware to increase computing speed (see U.S. Pat. No. 6,363,369 B1, March 2002, Liaw et al.; U.S. Pat. No. 6,643,627 B2, November 2003, Liaw et al.). In particular, there have been attempts to implement the basic biological findings on TAH learning in VLSI (Bofill, A., Murray, A., and Thompson, D., “Circuits for VLSI implementation of temporally-asymmetric Hebbian learning,” Neural Information Processing Systems (NIPS) 14, 2001). These authors describe a hardware implementation of a neuron model and of the TAH learning rule. Unfortunately, they do not suggest a task their hardware implementation is supposed to solve. Their hardware implementation is thus a model replicating biological findings, but not a method for information processing. It seems impossible to find the correct tuning of the model parameters, the correct network connections and delays, and the correct representation of the input data without knowing what the network is supposed to achieve.
Biological Networks
Instead of implementing the computational functions of a neural network in hardware, researchers have explored using nervous tissue directly as a computing device (Bi, Guo-qiang, and Poo, Mu-ming, “Distributed synaptic modification in neural networks induced by patterned stimulation,” Nature 401, 792-796, 1999). These researchers found changes in the reaction of neural tissue due to training with certain spike patterns. Unfortunately, the computing capabilities of such tissue did not become clear. In particular, it did not become clear how training and recall should be structured and what result should be achieved. Therefore, the use of biological networks for computing tasks has not yet been successful.