A neural network (NN), called an artificial neural network (ANN) or simulated neural network (SNN) when it is built from artificial neurons, is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing based on a connectionist approach to computation. In most cases an ANN is, in formulation and/or operation, an adaptive system that changes its structure based on external or internal information that flows through the network. In more practical terms, neural networks are non-linear statistical data modeling or decision making tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data. See en.wikipedia.org/wiki/Artificial_neural_network
An artificial neural network involves a network of simple processing elements (artificial neurons) which can exhibit complex global behavior, determined by the connections between the processing elements and element parameters. One classical type of artificial neural network is the recurrent Hopfield net. In a neural network model simple nodes, which can be called variously “neurons”, “neurodes”, “Processing Elements” (PE) or “units”, are connected together to form a network of nodes, hence the term “neural network”. While a neural network does not have to be adaptive per se, its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow. However, training of the network does not have to be continuous. The Perceptron is essentially a linear classifier for classifying data x∈ℝⁿ specified by parameters w∈ℝⁿ, b∈ℝ and an output function ƒ=w′x+b. Its parameters are adapted with an ad-hoc rule similar to stochastic steepest gradient descent. Because the inner product is a linear operator in the input space, the Perceptron can only perfectly classify a set of data for which different classes are linearly separable in the input space, while it often fails completely for non-separable data. While the development of the algorithm initially generated some enthusiasm, partly because of its apparent relation to biological mechanisms, the later discovery of this inadequacy caused such models to be abandoned until the introduction of non-linear models into the field.
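The Perceptron behavior described above can be illustrated with a minimal sketch. The function names, the ±1 label encoding, the learning rate and the toy AND/XOR data are assumptions made for this illustration and do not come from the text; the update shown is the classical mistake-driven Perceptron rule.

```python
def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Mistake-driven Perceptron rule for the output function f = w'x + b."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):  # y is +1 or -1 (assumed encoding)
            f = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * f <= 0:  # misclassified, or exactly on the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:  # every sample is classified correctly
            break
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


X = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]  # linearly separable
xor_labels = [-1, 1, 1, -1]   # not linearly separable

w_and, b_and = train_perceptron(X, and_labels)
and_predictions = [predict(w_and, b_and, x) for x in X]

w_xor, b_xor = train_perceptron(X, xor_labels)
xor_predictions = [predict(w_xor, b_xor, x) for x in X]
```

On the separable AND data the rule stops once all four points are classified correctly, whereas no choice of w and b can reproduce the XOR labels, which is the inadequacy noted above.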
The rediscovery of the backpropagation algorithm was probably the main reason behind the re-popularization of neural networks after the publication of “Learning Internal Representations by Error Propagation” in 1986 (Though backpropagation itself dates from 1974). The original network utilized multiple layers of weight-sum units of the type ƒ=g(w′x+b), where g was a sigmoid function or logistic function such as used in logistic regression. Training was done by a form of stochastic steepest gradient descent. The employment of the chain rule of differentiation in deriving the appropriate parameter updates results in an algorithm that seems to ‘backpropagate errors’, hence the nomenclature. Determining the optimal parameters in a model of this type is not trivial, and steepest gradient descent methods cannot be relied upon to give the solution without a good starting point. In recent times, networks with the same architecture as the backpropagation network are referred to as Multi-Layer Perceptrons. This name does not impose any limitations on the type of algorithm used for learning.
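As a hedged illustration of how the chain rule yields parameter updates that appear to “backpropagate errors”, the sketch below uses a deliberately tiny network with one sigmoid hidden unit and one sigmoid output unit and compares the backpropagated gradient with a finite-difference estimate. The network size, the squared-error loss, the parameter values and all function names are assumptions chosen for the illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)    # hidden unit of the form g(w'x + b)
    out = sigmoid(w2 * h + b2)  # output unit of the same form
    return h, out


def loss(params, x, y):
    return 0.5 * (forward(params, x)[1] - y) ** 2


def backprop(params, x, y):
    """Gradient of the loss w.r.t. [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    h, out = forward(params, x)
    delta_out = (out - y) * out * (1.0 - out)  # error at the output pre-activation
    delta_h = delta_out * w2 * h * (1.0 - h)   # error propagated back to the hidden unit
    return [delta_h * x, delta_h, delta_out * h, delta_out]


def numerical_grad(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(params)):
        up, down = params[:], params[:]
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads


params = [0.3, -0.1, 0.8, 0.2]
analytic = backprop(params, 1.5, 1.0)
numeric = numerical_grad(params, 1.5, 1.0)
```

A steepest descent step would then subtract a small multiple of `analytic` from `params`; the sketch stops at the gradient because that is the part the chain rule contributes.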
There are three major learning paradigms, each corresponding to a particular abstract learning task. These are supervised learning, unsupervised learning and reinforcement learning. Usually any given type of network architecture can be employed in any of those tasks. In supervised learning, we are given a set of example pairs (x,y), x∈X, y∈Y and the aim is to find a function ƒ in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data. In unsupervised learning, we are given some data x, and a cost function to be minimized, which can be any function of x and the network's output, ƒ. The cost function is determined by the task formulation. Most applications fall within the domain of estimation problems such as statistical modeling, compression, filtering, blind source separation and clustering. In reinforcement learning, data x is usually not given, but generated by an agent's interactions with the environment. At each point in time t, the agent performs an action y_t and the environment generates an observation x_t and an instantaneous cost c_t, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost, i.e. the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated. ANNs are frequently used in reinforcement learning as part of the overall algorithm. Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks. These will be discussed in further detail below.
There are many algorithms for training neural networks; most of them can be viewed as a straightforward application of optimization theory and statistical estimation. They include: Back propagation by gradient descent, Rprop, BFGS, CG etc. Evolutionary computation methods, simulated annealing, expectation maximization, non-parametric methods, particle swarm optimization and other swarm intelligence techniques are among other commonly used methods for training neural networks.
Neural network models in artificial intelligence are usually referred to as artificial neural networks (ANNs); these are essentially simple mathematical models defining a function ƒ: X→Y. Each type of ANN model corresponds to a class of such functions. The word network in the term ‘artificial neural network’ arises because the function ƒ(x) is defined as a composition of other functions gi(x), which can further be defined as a composition of other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between variables. A widely used type of composition is the nonlinear weighted sum, where
$$f(x) = K\left(\sum_i w_i \, g_i(x)\right),$$
where K (commonly referred to as the activation function) is some predefined function, such as the hyperbolic tangent. It will be convenient for the following to refer to a collection of functions g_i as simply a vector g=(g_1, g_2, . . . , g_n).
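The nonlinear weighted sum can be written out directly. In the sketch below the component functions g_i, the weights and the function name `weighted_sum_unit` are illustrative assumptions; only the form K(∑ w_i g_i(x)) and the choice of the hyperbolic tangent for K come from the text.

```python
import math


def weighted_sum_unit(x, g, w, K=math.tanh):
    """Evaluate f(x) = K(sum_i w_i * g_i(x)) for a list of functions g."""
    return K(sum(wi * gi(x) for wi, gi in zip(w, g)))


# Hypothetical component functions g_1, g_2, g_3 and weights w_1, w_2, w_3.
g = [lambda x: x, lambda x: x * x, lambda x: 1.0]
w = [0.5, -0.25, 0.1]

y = weighted_sum_unit(2.0, g, w)  # tanh(0.5*2 - 0.25*4 + 0.1) = tanh(0.1)
```

Because each g_i may itself be a unit of the same form, nesting calls to `weighted_sum_unit` gives the composition of functions that the term “network” refers to.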
FIG. 12 depicts a decomposition of ƒ, with dependencies between variables indicated by arrows. These can be interpreted in two ways. The first view is the functional view: the input x is transformed into a 3-dimensional vector h, which is then transformed into a 2-dimensional vector g, which is finally transformed into ƒ. This view is most commonly encountered in the context of optimization. The second view is the probabilistic view: the random variable F=ƒ(G) depends upon the random variable G=g(H), which depends upon H=h(X), which depends upon the random variable X. This view is most commonly encountered in the context of graphical models. The two views are largely equivalent. In either case, for this particular network architecture, the components of individual layers are independent of each other (e.g., the components of g are independent of each other given their input h). This naturally enables a degree of parallelism in the implementation. Networks such as shown in FIG. 12 are commonly called feedforward, because their graph is a directed acyclic graph.
FIG. 13 shows a recurrent network. Such networks are commonly depicted in the manner shown in FIG. 13A, where ƒ is shown as being dependent upon itself. However, there is an implied temporal dependence which is exemplified in the equivalent FIG. 13B.
The possibility of learning has generated significant interest in neural networks. Given a specific task to solve, and a class of functions F, learning means using a set of observations to find ƒ*∈F which solves the task in some optimal sense. This entails defining a cost function C: F→ℝ such that, for the optimal solution ƒ*, C(ƒ*)≤C(ƒ) ∀ƒ∈F (i.e., no solution has a cost less than the cost of the optimal solution).
The cost function C is an important concept in learning, as it is a measure of how far away a particular solution is from an optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a function that has the smallest possible cost. For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations, otherwise we would not be modelling anything related to the data. It is frequently defined as a statistic to which only approximations can be made. As a simple example consider the problem of finding the model ƒ which minimizes C=E[(ƒ(x)−y)²], for data pairs (x,y) drawn from some distribution 𝒟. In practical situations we would only have N samples from 𝒟 and thus, for the above example, we would only minimize
$$\hat{C} = \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 .$$
Thus, the cost is minimized over a sample of the data rather than the entire data set. When N→∞ some form of online machine learning must be used, where the cost is partially minimized as each new example is seen. While online machine learning is often used when 𝒟 is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online machine learning is frequently used for finite datasets.
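The difference between evaluating the sample cost and partially minimizing it one example at a time can be sketched as follows. The one-parameter model ƒ(x)=w·x, the three data pairs, the learning rate and the function names are assumptions made for the illustration.

```python
def empirical_cost(f, data):
    """Sample estimate of the cost: mean squared error over the N pairs."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


def online_sgd(data, lr=0.05, passes=50):
    """Online learning for the assumed model f(x) = w * x."""
    w = 0.0
    for _ in range(passes):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
            w -= lr * grad                # partial minimization per example
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy samples with y = 2x

w = online_sgd(data)
cost_before = empirical_cost(lambda x: 0.0, data)
cost_after = empirical_cost(lambda x: w * x, data)
```

Each update uses a single pair, so the same loop applies unchanged when examples arrive as a stream rather than as a fixed dataset.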
While it is possible to define some arbitrary, ad hoc cost function, frequently a particular cost will be used, either because it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem (e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost). Ultimately, the cost function will depend on the task we wish to perform. There are three major learning paradigms, each corresponding to a particular abstract learning task. These are supervised learning, unsupervised learning and reinforcement learning. Usually any given type of network architecture can be employed in any of those tasks.
In supervised learning, we are given a set of example pairs (x,y), x∈X, y∈Y and the aim is to find a function ƒ: X→Y in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain. A commonly used cost is the mean-squared error which tries to minimize the average squared error between the network's output, f(x), and the target value y over all the example pairs. When one tries to minimize this cost using gradient descent for the class of neural networks called Multi-Layer Perceptrons, one obtains the common and well-known backpropagation algorithm for training neural networks. Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). This can be thought of as learning with a “teacher,” in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
In unsupervised learning we are given some data x and the cost function to be minimized, which can be any function of the data x and the network's output, ƒ. The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables). As a trivial example, consider the model ƒ(x)=a, where a is a constant, and the cost C=E[(x−ƒ(x))²]. Minimizing this cost will give us a value of a that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between x and y, whereas in statistical modelling, it could be related to the posterior probability of the model given the data. (Note that in both of those examples those quantities would be maximized rather than minimized). Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.
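The statement that the constant model is minimized by the mean can be checked in one step, assuming the expectation is finite so that the cost may be differentiated with respect to a:

```latex
\begin{align*}
C(a) &= E\big[(x - a)^2\big], \\
\frac{dC}{da} &= -2\,E[x - a] = -2\,\big(E[x] - a\big) = 0
\quad\Longrightarrow\quad a = E[x].
\end{align*}
```

Since the second derivative of C with respect to a is 2, which is positive, this stationary point is the minimum.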
In reinforcement learning, data x are usually not given, but generated by an agent's interactions with the environment. At each point in time t, the agent performs an action y_t and the environment generates an observation x_t and an instantaneous cost c_t, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost; i.e., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated. More formally, the environment is modeled as a Markov decision process (MDP) with states s_1, . . . , s_n∈S and actions a_1, . . . , a_m∈A with the following probability distributions: the instantaneous cost distribution P(c_t|s_t), the observation distribution P(x_t|s_t) and the transition P(s_{t+1}|s_t,a_t), while a policy is defined as a conditional distribution over actions given the observations. Taken together, the two define a Markov chain (MC). The aim is to discover the policy that minimizes the cost; i.e., the MC for which the cost is minimal. ANNs are frequently used in reinforcement learning as part of the overall algorithm. Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks.
Reinforcement learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. In economics and game theory, reinforcement learning is considered as a boundedly rational interpretation of how equilibrium may arise.
The environment is typically formulated as a finite-state Markov decision process (MDP), and reinforcement learning algorithms for this context are highly related to dynamic programming techniques. State transition probabilities and reward probabilities in the MDP are typically stochastic but stationary over the course of the problem. See, http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html, expressly incorporated herein by reference.
Reinforcement learning differs from the supervised learning problem in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been mostly studied through the multi-armed bandit problem. Formally, the basic reinforcement learning model, as applied to MDPs, consists of: a set of environment states S; a set of actions A; and a set of scalar “rewards” in ℝ.
At each time t, the agent perceives its state s_t∈S and the set of possible actions A(s_t). It chooses an action a∈A(s_t) and receives from the environment the new state s_{t+1} and a reward r_t. Based on these interactions, the reinforcement learning agent must develop a policy π: S×T→A (where T is the set of possible time indexes) which maximizes the quantity R=r_0+r_1+ . . . +r_n for MDPs which have a terminal state, or the quantity
$$R = \sum_{t=0}^{\infty} \gamma^{t} r_t$$
for MDPs without terminal states (where 0≤γ≤1 is some “future reward” discounting factor).
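Both forms of the return can be computed with the same routine when the reward sequence is finite. The helper name `discounted_return` and the sample rewards are assumptions for this sketch; an infinite sum can of course only be approximated by truncation.

```python
def discounted_return(rewards, gamma=1.0):
    """Return R = sum_t gamma**t * r_t for a finite reward sequence."""
    total = 0.0
    for r in reversed(rewards):  # Horner-style accumulation from the last reward
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(rewards)           # r_0 + r_1 + r_2
discounted = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25
```

With γ=1 the routine reduces to the plain sum used for MDPs with a terminal state.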
After we have defined an appropriate return function to be maximized, we need to specify the algorithm that will be used to find the policy with the maximum return.
The naive brute force approach entails the following two steps: a) For each possible policy, sample returns while following it. b) Choose the policy with the largest expected return. One problem with this is that the number of policies can be extremely large, or even infinite. Another is that returns might be stochastic, in which case a large number of samples will be required to accurately estimate the return of each policy. These problems can be ameliorated if we assume some structure and perhaps allow samples generated from one policy to influence the estimates made for another. The two main approaches for achieving this are value function estimation and direct policy optimization.
Value function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for one policy π (usually either the current or the optimal one). In such approaches one attempts to estimate either the expected return starting from state s and following π thereafter,
V(s)=E[R|s,π],
or the expected return when taking action a in state s and following π thereafter,
Q(s,a)=E[R|s,π,a].
If someone gives us Q for the optimal policy, we can always choose optimal actions by simply choosing the action with the highest value at each state. In order to do this using V, we must either have a model of the environment, in the form of probabilities P(s′|s,a), which allow us to calculate Q simply through
$$Q(s,a) = \sum_{s'} V(s') \, P(s' \mid s, a),$$
or we can employ so-called Actor-Critic methods, in which the model is split into two parts: the critic, which maintains the state value estimate V, and the actor, which is responsible for choosing the appropriate actions at each state.
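The model-based route from V to Q can be sketched for a small discrete problem. The state names, action names, value estimates and transition probabilities below are invented for the illustration, and the formula is implemented exactly as stated in the text, without a separate reward or discount term.

```python
def q_from_v(V, P, s, a):
    """Q(s, a) = sum over s' of V(s') * P(s' | s, a)."""
    return sum(prob * V[s_next] for s_next, prob in P[(s, a)].items())


# Hypothetical state values and transition model P[(s, a)] = {s': probability}.
V = {"A": 0.0, "B": 1.0, "C": 4.0}
P = {
    ("A", "left"): {"B": 1.0},
    ("A", "right"): {"B": 0.5, "C": 0.5},
}

q_left = q_from_v(V, P, "A", "left")    # 1.0 * 1.0
q_right = q_from_v(V, P, "A", "right")  # 0.5 * 1.0 + 0.5 * 4.0
best_action = max(["left", "right"], key=lambda a: q_from_v(V, P, "A", a))
```

Choosing `best_action` in each state is the greedy selection described above for the case where Q is available.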
Given a fixed policy π, estimating E[R|⋅] for γ=0 is trivial, as one only has to average the immediate rewards. The most obvious way to do this for γ>0 is to average the total return after each state. However, this type of Monte Carlo sampling requires the MDP to terminate. The expectation of R forms a recursive Bellman equation: E[R|s_t]=r_t+γE[R|s_{t+1}].
By replacing those expectations with our estimates, V, and performing gradient descent with a squared error cost function, we obtain the temporal difference learning algorithm TD(0). In the simplest case, the set of states and actions are both discrete and we maintain tabular estimates for each state. Similar state-action pair methods are Adaptive Heuristic Critic (AHC), SARSA and Q-Learning. All methods feature extensions whereby some approximating architecture is used, though in some cases convergence is not guaranteed. The estimates are usually updated with some form of gradient descent, though there have been recent developments with least squares methods for the linear approximation case.
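A tabular TD(0) estimate for a fixed policy can be sketched as below. The two-state deterministic chain, the step size α, the discount γ and the episode encoding as (state, reward, next state) triples are assumptions for the illustration; the update is the standard TD(0) rule V(s) ← V(s) + α(r + γV(s′) − V(s)).

```python
def td0(episodes, n_states, alpha=0.5, gamma=0.9):
    """Tabular TD(0): move V(s) toward the one-step target r + gamma * V(s')."""
    V = [0.0] * n_states
    for episode in episodes:
        for s, r, s_next in episode:
            # A terminal transition (s_next is None) contributes no future value.
            future = gamma * V[s_next] if s_next is not None else 0.0
            V[s] += alpha * (r + future - V[s])
    return V


# Hypothetical chain: state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
episode = [(0, 0.0, 1), (1, 1.0, None)]
V = td0([episode] * 200, n_states=2)
```

For this chain the estimates approach V(1)=1 and V(0)=γ·V(1)=0.9, which are the fixed points of the Bellman equation given above.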
The above methods not only all converge to the correct estimates for a fixed policy, but can also be used to find the optimal policy. This is usually done by following a policy π that is somehow derived from the current value estimates, i.e. by choosing the action with the highest evaluation most of the time, while still occasionally taking random actions in order to explore the space. Proofs for convergence to the optimal policy also exist for the algorithms mentioned above, under certain conditions. However, all those proofs only demonstrate asymptotic convergence and little is known theoretically about the behaviour of RL algorithms in the small-sample case, apart from within very restricted settings.
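Choosing the highest-valued action most of the time while occasionally acting at random is commonly implemented as an ε-greedy rule. The sketch below is one such rule; the name `epsilon_greedy`, the value of ε and the list of action values are assumptions, and the text does not prescribe this particular scheme.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)      # seeded generator so that runs are repeatable
q_values = [0.1, 0.9, 0.3]  # hypothetical current estimates for three actions

greedy_choice = epsilon_greedy(q_values, 0.0, rng)
explored = {epsilon_greedy(q_values, 1.0, rng) for _ in range(300)}
```

Intermediate values of ε trade exploration against exploitation; many schemes reduce ε over time, which is a design choice outside the scope of this sketch.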
An alternative method to find the optimal policy is to search directly in policy space. Policy space methods define the policy as a parameterised function π(s, θ) with parameters θ. Commonly, a gradient method is employed to adjust the parameters. However, the application of gradient methods is not trivial, since no gradient information is assumed. Rather, the gradient itself must be estimated from noisy samples of the return. Since this greatly increases the computational cost, it can be advantageous to use a more powerful gradient method than steepest gradient descent. Policy space gradient methods have received a lot of attention in the last 5 years and have now reached a relatively mature stage, but they remain an active field. There are many other approaches, such as simulated annealing, that can be taken to explore the policy space. Other direct optimization techniques, such as evolutionary computation are used in evolutionary robotics.
Training a neural network model essentially means selecting one model from the set of allowed models (or, in a Bayesian framework, determining a distribution over the set of allowed models) that minimizes the cost criterion. There are numerous algorithms available for training neural network models; most of them can be viewed as a straightforward application of optimization theory and statistical estimation. Most of the algorithms used in training artificial neural networks employ some form of gradient descent. This is done by simply taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a gradient-related direction. Evolutionary methods, simulated annealing, expectation-maximization and non-parametric methods are some commonly used methods for training neural networks. Temporal perceptual learning relies on finding temporal relationships in sensory signal streams. In an environment, statistically salient temporal correlations can be found by monitoring the arrival times of sensory signals. This is done by the perceptual network.
The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations. This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical.
The feedforward neural network was the first and arguably simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.
Radial Basis Functions are powerful techniques for interpolation in multidimensional space. An RBF is a function which has built into it a distance criterion with respect to a center. Radial basis functions have been applied in the area of neural networks where they may be used as a replacement for the sigmoidal hidden layer transfer characteristic in Multi-Layer Perceptrons. RBF networks have two layers of processing: In the first, input is mapped onto each RBF in the ‘hidden’ layer. The RBF chosen is usually a Gaussian. In regression problems the output layer is then a linear combination of hidden layer values representing mean predicted output. The interpretation of this output layer value is the same as a regression model in statistics. In classification problems the output layer is typically a sigmoid function of a linear combination of hidden layer values, representing a posterior probability. Performance in both cases is often improved by shrinkage techniques, known as ridge regression in classical statistics and known to correspond to a prior belief in small parameter values (and therefore smooth output functions) in a Bayesian framework. RBF networks have the advantage of not suffering from local minima in the same way as Multi-Layer Perceptrons. This is because the only parameters that are adjusted in the learning process are the linear mapping from hidden layer to output layer. Linearity ensures that the error surface is quadratic and therefore has a single easily found minimum. In regression problems this can be found in one matrix operation. In classification problems the fixed non-linearity introduced by the sigmoid output function is most efficiently dealt with using iteratively re-weighted least squares. RBF networks have the disadvantage of requiring good coverage of the input space by radial basis functions. RBF centers are determined with reference to the distribution of the input data, but without reference to the prediction task. 
As a result, representational resources may be wasted on areas of the input space that are irrelevant to the learning task. A common solution is to associate each data point with its own center, although this can make the linear system to be solved in the final layer rather large, and requires shrinkage techniques to avoid overfitting.
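The regression case, with one Gaussian center per data point and ridge shrinkage on the output weights, can be sketched as follows. The one-dimensional inputs, the unit kernel width, the ridge value, the sine targets and all function names are assumptions for the illustration; a small Gaussian elimination routine stands in for the single matrix operation mentioned above.

```python
import math


def gaussian_rbf(x, c, width=1.0):
    return math.exp(-((x - c) ** 2) / (2.0 * width ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (M[i][n] - sum(M[i][k] * z[k] for k in range(i + 1, n))) / M[i][i]
    return z


def fit_rbf(xs, ys, centers, ridge=1e-6):
    """Output weights from the ridge normal equations (Phi'Phi + ridge*I) w = Phi'y."""
    Phi = [[gaussian_rbf(x, c) for c in centers] for x in xs]
    m = len(centers)
    A = [[sum(Phi[t][i] * Phi[t][j] for t in range(len(xs)))
          + (ridge if i == j else 0.0) for j in range(m)] for i in range(m)]
    b = [sum(Phi[t][i] * ys[t] for t in range(len(xs))) for i in range(m)]
    return solve(A, b)


def predict_rbf(x, centers, w):
    return sum(wi * gaussian_rbf(x, c) for wi, c in zip(w, centers))


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.sin(x) for x in xs]
w = fit_rbf(xs, ys, centers=xs)  # one center per data point
max_err = max(abs(predict_rbf(x, xs, w) - y) for x, y in zip(xs, ys))
```

Only the output weights are fitted, so the problem is linear and is solved in a single step; a larger `ridge` value shrinks the weights more strongly at the cost of a looser fit to the training points.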
Associating each input datum with an RBF leads naturally to kernel methods such as Support Vector Machines and Gaussian Processes (the RBF is the kernel function). All three approaches use a non-linear kernel function to project the input data into a space where the learning problem can be solved using a linear model. Like Gaussian Processes, and unlike SVMs, RBF networks are typically trained in a Maximum Likelihood framework by maximizing the probability (minimizing the error) of the data under the model. SVMs take a different approach to avoiding overfitting by maximizing instead a margin. RBF networks are outperformed in most classification applications by SVMs. In regression applications they can be competitive when the dimensionality of the input space is relatively small.
The self-organizing map (SOM) invented by Teuvo Kohonen performs a form of unsupervised learning. A set of artificial neurons learn to map points in an input space to coordinates in an output space. The input space can have different dimensions and topology from the output space, and the SOM will attempt to preserve these.
Contrary to feedforward networks, recurrent neural networks (RNs) are models with bi-directional data flow. While a feedforward network propagates data linearly from input to output, RNs also propagate data from later processing stages to earlier stages.
A simple recurrent network (SRN) is a variation on the Multi-Layer Perceptron, sometimes called an “Elman network” due to its invention by Jeff Elman. A three-layer network is used, with the addition of a set of “context units” in the input layer. There are connections from the middle (hidden) layer to these context units fixed with a weight of one. At each time step, the input is propagated in a standard feed-forward fashion, and then a learning rule (usually back-propagation) is applied. The fixed back connections result in the context units always maintaining a copy of the previous values of the hidden units (since they propagate over the connections before the learning rule is applied). Thus the network can maintain a sort of state, allowing it to perform such tasks as sequence-prediction that are beyond the power of a standard Multi-Layer Perceptron.
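The role of the context units can be shown with a forward step alone; the learning rule is omitted. The layer sizes, the weight values, the tanh hidden units and the function name `elman_step` are assumptions for this sketch.

```python
import math


def elman_step(x, context, W_in, W_ctx, W_out):
    """One forward step; returns the output and the new context (copy of hidden)."""
    hidden = [
        math.tanh(
            sum(W_in[j][i] * x[i] for i in range(len(x)))
            + sum(W_ctx[j][k] * context[k] for k in range(len(context)))
        )
        for j in range(len(W_in))
    ]
    output = [sum(W_out[o][j] * hidden[j] for j in range(len(hidden)))
              for o in range(len(W_out))]
    # Fixed weight-one back connections: the context becomes a copy of hidden.
    return output, hidden[:]


# Hypothetical weights: 1 input, 2 hidden units, 2 context units, 1 output.
W_in = [[1.0], [-1.0]]
W_ctx = [[0.5, 0.0], [0.0, 0.5]]
W_out = [[1.0, 0.5]]

out1, ctx1 = elman_step([1.0], [0.0, 0.0], W_in, W_ctx, W_out)
out2, ctx2 = elman_step([1.0], ctx1, W_in, W_ctx, W_out)
```

The same input produces a different output on the second step because the context units carry the previous hidden values forward, which is the state that a standard Multi-Layer Perceptron lacks.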
In a fully recurrent network, every neuron receives inputs from every other neuron in the network. These networks are not arranged in layers. Usually only a subset of the neurons receive external inputs in addition to the inputs from all the other neurons, and another disjoint subset of neurons report their output externally as well as sending it to all the neurons. These distinctive inputs and outputs perform the function of the input and output layers of a feed-forward or simple recurrent network, and also join all the other neurons in the recurrent processing.
The Hopfield network is a recurrent neural network in which all connections are symmetric. Invented by John Hopfield in 1982, this network guarantees that its dynamics will converge. If the connections are trained using Hebbian learning then the Hopfield network can perform as robust content-addressable (or associative) memory, resistant to connection alteration.
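Content-addressable recall with Hebbian weights can be sketched for a single stored pattern. The ±1 state encoding, the 1/n scaling, the asynchronous update order, the eight-unit pattern and the function names are assumptions for the illustration.

```python
def hebbian_weights(patterns):
    """Symmetric Hebbian weights with a zero diagonal."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / n
    return W


def recall(W, state, max_sweeps=10):
    """Asynchronous updates until no unit changes (or max_sweeps is reached)."""
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            h = sum(W[i][j] * s[j] for j in range(len(s)))
            new = 1 if h >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s


stored = [1, -1, 1, -1, 1, -1, 1, -1]
W = hebbian_weights([stored])

noisy = stored[:]
noisy[0] = -1  # corrupt two of the eight units
noisy[3] = 1
recovered = recall(W, noisy)
```

Starting from the corrupted state, the dynamics settle back onto the stored pattern, which is the associative memory behavior described above.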
The echo state network (ESN) is a recurrent neural network with a sparsely connected random hidden layer. The weights of the output neurons are the only part of the network that can change and be learned. ESNs are well suited to producing and reproducing temporal patterns.
Long short-term memory (LSTM) is an artificial neural network structure that, unlike traditional RNNs, does not have the problem of vanishing gradients. It can therefore use long delays and can handle signals that have a mix of low and high frequency components.
A stochastic neural network differs from a typical neural network because it introduces random variations into the network. In a probabilistic view of neural networks, such random variations can be viewed as a form of statistical sampling, such as Monte Carlo sampling.
The Boltzmann machine can be thought of as a noisy Hopfield network. Invented by Geoff Hinton and Terry Sejnowski in 1985, the Boltzmann machine is important because it is one of the first neural networks to demonstrate learning of latent variables (hidden units). Boltzmann machine learning was at first slow to simulate, but the contrastive divergence algorithm of Geoff Hinton (circa 2000) allows models such as Boltzmann machines and products of experts to be trained much faster.
Biological studies have shown that the human brain functions not as a single massive network, but as a collection of small networks. This realization gave birth to the concept of modular neural networks, in which several small networks cooperate or compete to solve problems. A committee of machines (CoM) is a collection of different neural networks that together “vote” on a given example. This generally gives a much better result compared to other neural network models. Because neural networks suffer from local minima, starting with the same architecture and training but using different initial random weights often gives vastly different networks. A CoM tends to stabilize the result. The CoM is similar to the general machine learning bagging method, except that the necessary variety of machines in the committee is obtained by training from different random starting weights rather than training on different randomly selected subsets of the training data.
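The voting step of a committee of machines can be sketched independently of how the members are trained. In the sketch below the members are simple threshold functions standing in for networks trained from different random initial weights; these stand-ins, their thresholds and the function name `committee_vote` are assumptions for the illustration.

```python
from collections import Counter


def committee_vote(members, x):
    """Return the label that receives the most votes from the members."""
    votes = [member(x) for member in members]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for three separately trained networks.
members = [
    lambda x: int(x > 0.4),
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.6),
]

decision_low = committee_vote(members, 0.45)   # votes are 1, 0, 0
decision_high = committee_vote(members, 0.55)  # votes are 1, 1, 0
```

An odd number of members avoids ties for two-class problems; averaging continuous outputs instead of counting votes is an equally common variant.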
The associative neural network (ASNN) is an extension of the committee of machines that goes beyond a simple/weighted average of different models. An ASNN represents a combination of an ensemble of feed-forward neural networks and the k-nearest neighbor technique (kNN). It uses the correlation between ensemble responses as a measure of distance between the analyzed cases for the kNN. This corrects the bias of the neural network ensemble. An associative neural network has a memory that can coincide with the training set. If new data become available, the network instantly improves its predictive ability and provides data approximation (self-learns the data) without a need to retrain the ensemble. Another important feature of the ASNN is the possibility to interpret neural network results by analysis of correlations between data cases in the space of models. The method is demonstrated at www.vcclab.org, where it can be used online or downloaded.
A physical neural network includes electrically adjustable resistance material to simulate artificial synapses. Examples include the ADALINE neural network developed by Bernard Widrow in the 1960's and the memristor based neural network developed by Greg Snider of HP Labs in 2008.
Holographic associative memory represents a family of analog, correlation-based, associative, stimulus-response memories, where information is mapped onto the phase orientation of complex numbers operating.
Instantaneously trained neural networks (ITNNs) were inspired by the phenomenon of short-term learning that seems to occur instantaneously. In these networks the weights of the hidden and the output layers are mapped directly from the training vector data. Ordinarily, they work on binary data, but versions for continuous data that require small additional processing are also available.
Spiking neural networks (SNNs) are models which explicitly take into account the timing of inputs. The network input and output are usually represented as series of spikes (delta function or more complex shapes). SNNs have an advantage of being able to process information in the time domain (signals that vary over time). They are often implemented as recurrent networks. SNNs are also a form of pulse computer. Spiking neural networks with axonal conduction delays exhibit polychronization, and hence could have a very large memory capacity. Networks of spiking neurons—and the temporal correlations of neural assemblies in such networks—have been used to model figure/ground separation and region linking in the visual system (see, for example, Reitboeck et al. in Haken and Stadler: Synergetics of the Brain. Berlin, 1989).
Dynamic neural networks not only deal with nonlinear multivariate behaviour, but also include (learning of) time-dependent behavior such as various transient phenomena and delay effects.
Cascade-Correlation is an architecture and supervised learning algorithm developed by Scott Fahlman and Christian Lebiere. Instead of just adjusting the weights in a network of fixed topology, Cascade-Correlation begins with a minimal network, then automatically trains and adds new hidden units one by one, creating a multi-layer structure. Once a new hidden unit has been added to the network, its input-side weights are frozen. This unit then becomes a permanent feature-detector in the network, available for producing outputs or for creating other, more complex feature detectors. The Cascade-Correlation architecture has several advantages over existing algorithms: it learns very quickly, the network determines its own size and topology, it retains the structures it has built even if the training set changes, and it requires no back-propagation of error signals through the connections of the network. See: Cascade correlation algorithm.
A neuro-fuzzy network is a fuzzy inference system in the body of an artificial neural network. Depending on the FIS type, there are several layers that simulate the processes involved in a fuzzy inference like fuzzification, inference, aggregation and defuzzification. Embedding an FIS in a general structure of an ANN has the benefit of using available ANN training methods to find the parameters of a fuzzy system.
Compositional pattern-producing networks (CPPNs) are a variation of ANNs which differ in their set of activation functions and how they are applied. While typical ANNs often contain only sigmoid functions (and sometimes Gaussian functions), CPPNs can include both types of functions and many others. Furthermore, unlike typical ANNs, CPPNs are applied across the entire space of possible inputs so that they can represent a complete image. Since they are compositions of functions, CPPNs in effect encode images at infinite resolution and can be sampled for a particular display at whatever resolution is optimal.
One-shot associative memory networks can add new patterns without the need for re-training. This is done by creating a specific memory structure, which assigns each new pattern to an orthogonal plane using adjacently connected hierarchical arrays. The network offers real-time pattern recognition and high scalability; however, it requires parallel processing and is thus best suited for platforms such as wireless sensor networks (WSN), grid computing, and GPGPUs.
The multi-layer perceptron (MLP) is a universal function approximator, as proven by the Cybenko theorem. Artificial neural network models have a property called ‘capacity’, which roughly corresponds to their ability to model any given function. It is related to the amount of information that can be stored in the network and to the notion of complexity.
In applications where the goal is to create a system that generalizes well to unseen examples, the problem of overtraining has emerged. This arises in overcomplex or overspecified systems when the capacity of the network significantly exceeds the needed free parameters. There are two schools of thought for avoiding this problem. The first is to use cross-validation and similar techniques to check for the presence of overtraining and to select hyperparameters so as to minimize the generalization error. The second is to use some form of regularization. This is a concept that emerges naturally in a probabilistic (Bayesian) framework, where the regularization can be performed by selecting a larger prior probability over simpler models; but also in statistical learning theory, where the goal is to minimize over two quantities: the ‘empirical risk’ and the ‘structural risk’, which roughly correspond to the error over the training set and the predicted error in unseen data due to overfitting.
Supervised neural networks that use an MSE cost function can use formal statistical methods to determine the confidence of the trained model. The MSE on a validation set can be used as an estimate for variance. This value can then be used to calculate the confidence interval of the output of the network, assuming a normal distribution. A confidence analysis made this way is statistically valid as long as the output probability distribution stays the same and the network is not modified.
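As a minimal sketch of the confidence analysis described above, the validation-set MSE can be treated as a variance estimate and converted into an interval around a single network output. The function names and the default multiplier `z=1.96` (the two-sided 95% point of a normal distribution) are assumptions chosen for illustration, not taken from any particular library.

```python
import math

def validation_mse(targets, predictions):
    """Mean squared error of network predictions on a validation set."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def confidence_interval(prediction, mse, z=1.96):
    """Interval around one output, treating the MSE as the error variance.

    Assumes normally distributed errors; z=1.96 gives roughly 95% coverage.
    """
    half_width = z * math.sqrt(mse)
    return prediction - half_width, prediction + half_width

mse = validation_mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
low, high = confidence_interval(2.0, mse)
```

The interval is only meaningful under the conditions stated in the text: the output distribution must stay the same and the network must not be modified after validation.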
By assigning a softmax activation function on the output layer of the neural network (or a softmax component in a component-based neural network) for categorical target variables, the outputs can be interpreted as posterior probabilities. This is very useful in classification as it gives a certainty measure on classifications. The softmax activation function is:
$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{c} e^{x_j}}.$$
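The softmax function can be computed directly from its definition. The sketch below subtracts the largest input before exponentiating, a common numerical-stability step that does not change the result because the factor cancels between numerator and denominator; the function name `softmax` is chosen here for illustration.

```python
import math

def softmax(x):
    """Return softmax outputs y_i = exp(x_i) / sum_j exp(x_j) for a list x."""
    m = max(x)  # shifting by the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The outputs are positive and sum to one, which is what allows them to be read as posterior class probabilities.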
See (each of which is expressly incorporated herein by reference):
“How Each Reinforcer Contributes to Value: “Noise” Must Reduce Reinforcer Value Hyperbolically”, Michael Lamport Commons, Michael Woodford, Edward James Trudeau
“Leadership, Cross-Cultural Contact, Socio-Economic Status, and Formal Operational Reasoning about Moral Dilemmas among Mexican Non-Literate Adults and High School Students”, Michael Lamport Commons, Jesus Francisco Galaz-Fontes, Stanley Jay Morse
“Hierarchical Complexity Scoring System (HCSS) Applied to the Issues of Understanding Terrorism and Successfully Dealing with It”, Michael Lamport Commons, Alice Locicero, Sara Ross, Patrice Marie Miller
“Hierarchical Complexity Scoring System: How to Score Anything (also available in HTML)”, Michael Lamport Commons, Patrice Marie Miller, Eric Andrew Goodheart, Dorothy Danaher-Gilpin
“Review: Human Development and the Spiritual Life: How Consciousness Grows toward Transformation”, Michael Lamport Commons, Joel Funk
“Hierarchical Complexity: A Formal Theory”, Michael Lamport Commons, Alexander Pekker
“Organizing Components into Combinations: How Stage Transition Works”, Michael Lamport Commons, Francis Asbury Richards
“Illuminating Major Creative Innovators with the Model of Hierarchical Complexity”, Michael Lamport Commons, Linda Marie Bresette
“Some Reflections on Postformal Thought”, Helena Marchand
“Development of Behavioral Stages in Animals”, Michael Lamport Commons, Patrice Marie Miller
“A Complete Theory of Tests for a Theory of Mind Must Consider Hierarchical Complexity and Stage”, Michael Lamport Commons, Myra Sturgeon White
“Society and the Highest Stages of Moral Development”, Gerhard Sonnert, Michael Lamport Commons
“A Complete Theory of Empathy Must Consider Stage Changes”, Michael Lamport Commons, Chester Arnold Wolfsont
“A Quantitative Behavioral Model of Developmental Stage Based upon Hierarchical Complexity Theory”, Michael Lamport Commons, Patrice Marie Miller
“The Notion of Events and Three Ways of Knowing: Problems with Mentalistic Explanations, Freewill, Self, Soul, and Intrinsic Motivation”, Michael Lamport Commons
“Stress, Consoling, and Attachment Interviews”, featuring Michael Lamport Commons
“A Short History of the Society for Quantitative Analyses of Behavior”, Michael Lamport Commons
“Hierarchical Complexity of Tasks Shows the Existence of Developmental Stages”, Michael Lamport Commons, Edward James Trudeau, Sharon Anne Stein, Francis Asbury Richards, Sharon R. Krause
Michael Lamport Commons, “Stacked Neural Networks Must Emulate Evolution's Hierarchical Complexity”, World Futures, 64: 444-451, 2008
A. Surendra Rao, “Artificial Neural Network Embedded Kalman Filter Bearing Only Passive Target Tracking”, Proceedings of the 7th Mediterranean Conference on Control and Automation, Haifa, Israel, 1999.
Marcello R. Napolitano, “Kalman Filters and Neural-Network Schemes for Sensor Validation in Flight Control Systems”, IEEE Transactions on Control Systems Technology, Vol. 6, No. 5, pg. 596, September 1998.
U.S. Pat. Nos. 6,347,297; 5,632,006; 5,517,598; 5,383,042; 5,333,125; 5,293,453; 5,177,746; 5,166,938; 5,129,038; and US App. 2009/0271189.
The missing ingredients in efforts to develop neural networks and artificial intelligence (AI) that can emulate human intelligence have been the evolutionary processes of performing tasks at increased orders of hierarchical complexity. Stacked neural networks based on the Model of Hierarchical Complexity could emulate evolution's actual learning processes and behavioral reinforcement. Modern notions of artificial neural networks are mathematical or computational models based on biological neural networks. They consist of an interconnected group of artificial neurons and nodes. They may share some properties of biological neural networks. Artificial neural networks are generally designed to solve traditional artificial intelligence tasks without necessarily attempting to model a real biological system. Computer systems or robots generally do not demonstrate signs of generalized higher adaptivity, and/or general learning—the capacity to go from learning one skill to learning another without dedicated programming.
Traditional neural networks are limited for two broad reasons. The first has to do with the relationship of the neural network tradition to AI. One of the problems is that AI models are based on notions of Turing machines. Almost all AI models are based on words or text. But Turing machines are not enough to really produce intelligence. At the lowest stages of development, they need effectors that produce a variety of responses: movement, grasping, emoting, and so on. They must have extensive sensors to take in more from the environment. Although Carpenter and Grossberg's (1990, 1992) neural networks were intended to model simple behavioral processes, the processes they were to model were too complex. This resulted in neural networks that were relatively unstable and were not highly adaptable. When one looks at evolution, however, one sees that the first neural networks that existed were, for example, in Aplysia, Cnidarians (Phylum Cnidaria), and worms. They were specialized to perform just a few tasks even though some general learning was possible. They had simple tropisms and reflexes as well as reflexive and tropistic responses (including semi-fixed action patterns) to simple reinforcers and punishers. They performed tasks at the earliest stage or stages of evolution and development. The tasks they successfully addressed were at sensory or motor order 1 of hierarchical complexity. The development of neural networks can emulate evolution's approach of starting with simple task actions and building progressively more complex tasks.
Hierarchical stacked computer neural networks (Commons and White, 2006) use Commons' (Commons, Trudeau, Stein, Richards, and Krause, 1998) Model of Hierarchical Complexity. They accomplish the following tasks: model human development and learning; reproduce the rich repertoire of behaviors exhibited by humans; allow computers to mimic higher order human cognitive processes and make sophisticated distinctions between stimuli; and allow computers to solve more complex problems. Despite the contributions these features can make, there remain a number of challenges to resolve in developing stacked neural networks.
Stacked neural networks should preferably be informed by evolutionary biology and psychology, and model animal behavioral processes and functions. Neural networks should start to work at hierarchical complexity order 1 tasks (Sensory or Motor), sensing or acting but not coordinating the two. For example, the task to condition reflexes, and to identify and reflexively or tropistically consume food means that stimuli have to be detected out of a background of noise. Also, certain dangers need to be escaped from. They then should work on their own sufficiently without requiring constant programming attention. They should be stable. Once they prove stable, then they can be programmed into a stack of neural networks that address hierarchical complexity order 2 tasks (Circular Sensory-Motor stage 2), depending on input and reinforcement. One should keep trying various architectures until one gets one that works well and is robust. Order 2 tasks require that two instances of order 1 tasks be coordinated to make possible the simple reinforcement of correct choices in response to simple input signals.
The neural network at its base provides a negative power function discounting for past events to be operative. Negative discounting means that past and future events are weighted less the further from the present behavior. It makes the network more stable and adaptive. By discounting the past, it is more open to change based on new information. Because the updating places more weight on the immediate, it does not succumb so much to overlearning (Commons and Pekker, 2007). There should be a large number of such networks, each designed for a very specific task as well as some designed to be flexible. Then one should make a large group of them at stage 2.
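The discounting described above can be sketched as follows. The text does not give the functional form, so the weight `(1 + age) ** -exponent` and the default `exponent=1.0` are assumptions chosen only to illustrate a negative power function; the function names are likewise hypothetical.

```python
def discount_weight(age, exponent=1.0):
    """Weight of an event `age` time steps away from the present.

    Assumed form: (1 + age) ** -exponent, a negative power function
    that equals 1 for the present and shrinks as age grows.
    """
    return (1.0 + age) ** -exponent

def discounted_value(outcomes, exponent=1.0):
    """Weighted average of outcomes, where outcomes[0] is the most recent."""
    weights = [discount_weight(age, exponent) for age in range(len(outcomes))]
    return sum(w * o for w, o in zip(weights, outcomes)) / sum(weights)

# A reinforcer delivered just now counts for more than one delivered earlier.
recent = discounted_value([1.0, 0.0, 0.0])
older = discounted_value([0.0, 0.0, 1.0])
```

Because recent events dominate the weighted average, an update based on it responds quickly to new information, which is the stability and adaptivity argument made in the text.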
With robots, one would reinforce correct answers at stage 2. At each stage, there should be different networks for different activities and tasks. At stage 1 and 2, very local networks (activities) are provided for each particular motion. This would include successful reflexes, tropisms, and fixed action patterns at stage 1; operant discriminations at stage 2; and conceptual learning at stage 3. These could be frozen by transferring them to standard neural networks. That is, one takes some of them, “declares” them, and thereby develops the hardware for them, so that each time one builds a network needing that functionality one does not need to train it.
Specialized neural networks are developed for all the domains to recognize the reinforcers and simple actions in these domains. Animal and human behavior and sensitivities have more to do with hierarchical complexity than with AI programs. There are unbelievable numbers of stage 1 and 2 mechanisms. The basic problem with traditional layered networks is that training has to have consequences. Consequences must include events that act as reinforcers or punishers. This requires that outcomes activate preferences. These preferences have to be state dependent. If a network is going to need electrical power, it must have a preference for such power. Obtaining and receiving such power should be reinforcing. They must also have consummatory behavior such as recognition of a mate. The actual animal functions are important because intelligence grows out of actual, real-world functions. Cross-species domains collected from readings to date include the following, each of which is a candidate for specialized neural networks: mate selection; attachment and caring; pecking order; prey defense; predator action; way finding; food selection; choice in foraging; food sharing; migration; communication; social cohesion; recognition.
Animals, including humans, pass through a series of ordered stages of development (see “Introduction to the Model of Hierarchical Complexity,” World Futures, 64: 444-451, 2008). Behaviors performed at each higher stage of development always successfully address task requirements that are more hierarchically complex than those required by the immediately preceding order of hierarchical complexity. Movement to a higher stage of development occurs by the brain combining, ordering, and transforming the behavior used at the preceding stage. This combining and ordering of behaviors must be non-arbitrary.
The model identifies fifteen orders of hierarchical complexity of tasks and fifteen stages of hierarchical complexity in development of performance on those tasks. According to this model, individual tasks are classified by their highest order of hierarchical complexity. The model is used to deconstruct tasks into the behaviors that must be learned at each order in order to build the behavior needed to successfully complete a task.
Hierarchical stacked computer neural networks based on Commons et al.'s (1998) Model recapitulate the human developmental process. Thus, they learn the behaviors needed to perform increasingly complex tasks in the same sequence and manner as humans. This allows them to perform high-level human functions such as monitoring complex human activity and responding to simple language (Commons and White, 2003, 2006).
They can consist of up to fifteen architecturally distinct neural networks ordered by order of hierarchical complexity. The number of networks in a stack depends on the hierarchical complexity of the task to be performed. The type of processing that occurs in a network corresponds to the stage that successfully addresses the tasks of that hierarchical complexity in the developmental sequence. In solving a task, information moves through each network in ascending order by stage. Training is done at each stage. The training is done until the network correctly addresses the task in a reasonable amount of the time. Valued consequences are delivered at each layer representing each stage. This is in contrast to Carpenter and Grossberg (1990, 1992) who delivered feedback at just the highest stage.
The task to be performed is first analyzed to determine the sequence of behaviors needed to perform the task and the stages of development of the various behaviors of trial performances. The number of networks in the stack is determined by the highest order behavior that must be performed to complete the task. Behaviors are assigned to networks based on their order of hierarchical complexity. Stacked neural networks are straightforward up to the nominal order. However, a Nominal stage 4 concept cannot be learned without experience of the concrete thing named. There has to be actual reinforcement in relation to recognizing and naming that real object.
The sense of touch, weight, and all sensory stimuli need to be experienced as the concrete “it” that is assigned the nominal concept. Virtual reality software programming techniques might generate such concretely experienced circumstances. The use of holograms may work effectively for such purposes.
Although historically, androids are thought to look like humans, there are other versions, such as R2-D2 and C-3PO droids, which were less human. One characteristic that evolution might predict is eventually they will be independent of people. They will be able to produce themselves. They will be able to add layers to their neural networks as well as a large range of sensors. They will be able to transfer what one has learned (memes) to others as well as offspring in minutes. Old models will have to die. They will have to resist dying. But as older, less capable, and more energy-intensive droids abound, the same evolutionary pressure for replacement will exist. But because evolution will be both in the structure of such droids, that is, the stacked neural networks, the sensors and effectors, and also the memes embodied in what has been learned and transferred, older ones are somewhat immortal. Their experience may be preserved.
We are already building robots for all manufacturing purposes. We are even using them in surgery and have been using them in warfare for seventy years. More and more, these robots are adaptive on their own. There is only a blurry line between a robot that flexibly achieves its goal and a droid. For example, there are robots that vacuum the house on their own without intervention or further programming. These are stage 2 performing robots. There are missiles that, given a picture of their target, seek it out on their own. With stacked neural networks built into robots, they will have even greater independence. People will produce these because they will do work in places people cannot go without tremendous expense (Mars or other planets) or not at all or do not want to go (battlefields). The big step is for droids to have multiple capacities—multi-domain actions. The big problem of moving robots to droids is getting the development to occur in eight to nine essential domains. It will be necessary to make a source of power (e.g., electrical) reinforcing. That has to be built into stacked neural nets, by stage 2, or perhaps stage 3. For droids to become independent, they need to know how to get more electricity and thus not run down. Because evolution has provided animals with complex methods for reproduction, it can be done by the very lowest-stage animals.
Self-replication of droids requires that sufficient orders of hierarchical complexity are achieved and in stable-enough operation for a sufficient basis to build higher stages of performance in useful domains. Very simple tools can be made at the Sentential stage 5 as shown by Kacelnik's crows (Kenward, Weir, Rutz, and Kacelnik, 2005). More commonly by the Primary stage 7, simple tool-making is extensive, as found in chimpanzees. Human flexible tool-making began at the Formal stage 10 (Commons and Miller, 2002), when special purpose sharpened tools were developed. Each tool was experimental, and changed to fit its function. Modern tool making requires Systematic and Metasystematic stage design. When droids perform at those stages, they will be able to make droids themselves and change the designs.
Droids could choose to have various parts of their activity and programming shared with specific other droids, groups, or other kinds of equipment. The data could be transmitted using light or radio frequencies or over networks. The assemblage of a group of droids could be considered a Super Droid. Members of a Super Droid could be in many places at once, yet think things out as a unit. Whether individually or grouped, droids as conceived here will have significant advantages over humans. They can add layers upon layers of functions, including a multitude of various sensors. Their expanded forms and combinations of possible communications result in their evolutionary superiority. Because development can be programmed in and transferred to them at once, they do not have to go through all the years of development required for humans, or for Superions (see “Genetic Engineering and the Speciation of Superions from Humans,” this issue). Their higher reproduction rate, alone, represents a significant advantage. They can be built in probably several months' time, despite the likely size some would be. Large droids could be equipped with remote mobile effectors and sensors to mitigate their size. Plans for building droids have to be altered by either humans or droids. At the moment, humans and their descendants select which machine and programs survive.
One would define the nature of those machines and their programs as representing memes. For evolution to take place, variability in the memes that constitute their design and transfer of training would be built in rather easily. The problems are about the spread and selection of memes. One way droids could deal with these issues is to have all the memes listed that go into their construction and transferred training. Then droids could choose other droids, much as animals choose each other. There then would be a combination of memes from both droids. This would be local “sexual” selection.
This general scenario poses an interesting moral question. For 30,000 years humans have not had to compete with any species. Androids and Superions in the future will introduce competition with humans. There will be even more pressure for humans to produce Superions and then the Superions to produce more superior Superions. This is in the face of their own extinction, which such advances would ultimately bring. There will be multi-species competition, as is often the evolutionary case; various Superions versus various androids as well as each other. How the competition proceeds is a moral question. In view of LaMuth's work (2003, 2005, 2007), perhaps humans and Superions would both program ethical thinking into droids. This may be motivated initially by defensive concerns to ensure droids' roles were controlled. In the process of developing such programming, however, perhaps humans and Superions would develop more hierarchically complex ethics, themselves.
If contemporary humans took seriously the capabilities being developed to eventually create droids with cognitive intelligence, what moral questions should be considered with this possible future in view? The only presently realistic speculation is that Homo sapiens would lose in the inevitable competitions, if for no other reason than that self-replicating machines can respond almost immediately to selective pressures, while biological creatures require many generations before advantageous mutations can be effectively available. True competition between human and machine for basic survival is far in the future. Using the stratification argument presented in “Implications of Hierarchical Complexity for Social Stratification, Economics, and Education”, World Futures, 64: 444-451, 2008, higher-stage functioning always supersedes lower-stage functioning in the long run.
Efforts to build increasingly human-like machines exhibit a great deal of behavioral momentum and are not going to go away. Hierarchical stacked neural networks hold the greatest promise for emulating evolution and its increasing orders of hierarchical complexity described in the Model of Hierarchical Complexity. Such a straightforward mathematics-based method will enable machine learning in multiple domains of functioning that humans will put to valuable use. The uses such machines find for humans remain an open question.
Bostrom, N. 2003. Cognitive, emotive and ethical aspects of decision making. In Humans and in artificial intelligence, vol. 2, Eds. Smit, I., et al., 12-17. Tecumseh, ON: International Institute of Advanced Studies in Systems Research and Cybernetics.
Bostrom, N., and Cirkovic, M., Eds. Forthcoming. Artificial intelligence as a positive and negative factor in global risk. In Global catastrophic risks, Oxford: Oxford University Press.
Carpenter, G. A., and Grossberg, S. 1990. System for self-organization of stable category recognition codes for analog patterns. U.S. Pat. No. 4,914,708, filed (n.d.) and issued Apr. 3, 1990. (Based on Carpenter, G. A. and Grossberg, S. 1987. ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics: Special Issue on Neural Networks 26: 4919-4930.)
Carpenter, G. A., and Grossberg, S. 1992. System for self-organization of stable category recognition codes for analog patterns. U.S. Pat. No. 5,133,021, filed Feb. 28, 1990, and issued Jul. 21, 1992. (Based on Carpenter, G. A. and Grossberg, S. 1987. ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics: Special Issue on Neural Networks 26: 4919-4930.)
Commons, M. L., and Miller, P. M. 2002. A complete theory of human evolution of intelligence must consider stage changes: A commentary on Thomas Wynn's Archeology and Cognitive Evolution. Behavioral and Brain Sciences 25(3): 404-405.
Commons, M. L. and Pekker, A. 2007. A new discounting model of reinforcement. Unpublished manuscript, available from commons@tiac.net
Commons, M. L., Trudeau, E. J., Stein, S. A., Richards, F. A., and Krause, S. R. 1998. The existence of developmental stages as shown by the hierarchical complexity of tasks. Developmental Review 8(3): 237-278.
Commons, M. L., and White, M. S. 2003. A complete theory of tests for a theory of mind must consider hierarchical complexity and stage: A commentary on Anderson and Lebiere target article, The Newell Test for a theory of mind. Behavioral and Brain Sciences 26(5): 20-21.
Commons, M. L., and White, M. S. 2006. Intelligent control with hierarchical stacked neural networks. U.S. Pat. No. 7,152,051, filed Sep. 30, 2002, and issued Dec. 19, 2006.
Kenward, B., Weir, A. A. S., Rutz, C., and Kacelnik, A. 2005. Tool manufacture by naïve juvenile crows. Nature 433(7022): 121. DOI 10.1038/433121a.
LaMuth, J. E. 2003. Inductive inference affective language analyzer simulating artificial intelligence. U.S. Pat. No. 6,587,846, filed Aug. 18, 2000, and issued Dec. 5, 2000.
LaMuth, J. E. 2005. A diagnostic classification of the emotions: A three-digit coding system for affective language. Lucerne Valley, Calif.: Reference Books of America.
LaMuth, J. E. 2007. Inductive inference affective language analyzer simulating artificial intelligence. U.S. Pat. No. 7,236,963, filed Mar. 11, 2003, and issued Jun. 26, 2007.
Reilly, M., and Robson, D. 2007. Baby's errors are crucial first step for a smarter robot. New Scientist, 196(2624): 30.
Unsolicited Communications
Spam is unsolicited and unwanted “junk” email, often of a commercial or distasteful nature, that email users prefer not to receive (as opposed to “clean” email messages that users receive from their colleagues and business associates). To protect users from spam, many email providers have spam filters, which delete unwanted messages immediately, send them to a separate “spam” folder, or send users a digest of all the spam messages, which they can quickly review to make sure there is nothing of interest. These spam filters typically operate by excluding messages that come from certain senders, include certain attachments, or contain certain words, or by permitting messages only from authorized senders. Prior art spam filtering techniques are discussed in several issued US patents. For example, in U.S. Pat. No. 7,299,261, incorporated herein by reference, Oliver discusses an exemplary message classification technique based on verifying the signature on the message (certain email addresses are known sources of spam) and reviewing the content for key information, for example whether it includes a word or phrase that is indicative of spam. In U.S. Pat. No. 7,680,886, incorporated herein by reference, Cooley mentions a machine learning based spam filter. Under Cooley's scheme, messages that the owner of an email account sends are defined to be clean. Messages that the owner receives are initially classified as spam or clean based on preset criteria, but user corrections are taken into account, so that the spam filter is expected to become more accurate over time. Cooley suggests that a Bayesian classifier or a support vector machine can be used as a spam/clean classifier.
In addition, Cooley notes that a message might be passed through a fast, non-machine learning based spam filter before going through a machine learning based spam filter. Because the non-machine learning filter is faster, it can reduce the burden on the machine learning filter by quickly removing the most obvious spam messages and leaving only the more difficult cases to the machine learning filter.
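The two-stage arrangement described above can be illustrated with a minimal Python sketch. It is not taken from Cooley or Oliver. The blocklists, the function and class names, and the choice of a naive Bayes classifier with add-one smoothing as the "Bayesian classifier" are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical blocklists for the fast first stage (illustrative values only).
BLOCKED_SENDERS = {"promo@spam.example"}
BLOCKED_WORDS = {"lottery", "jackpot"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def fast_filter(sender, body):
    """Stage 1: cheap rule check. Returns True for obvious spam."""
    return sender in BLOCKED_SENDERS or any(w in BLOCKED_WORDS for w in tokenize(body))


class NaiveBayesFilter:
    """Stage 2: multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "clean": Counter()}
        self.message_counts = {"spam": 0, "clean": 0}

    def train(self, body, label):
        """Learn from a labeled message; a user correction is one more call."""
        self.word_counts[label].update(tokenize(body))
        self.message_counts[label] += 1

    def _log_score(self, words, label):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["clean"])
        total_words = sum(self.word_counts[label].values())
        total_msgs = sum(self.message_counts.values())
        # Smoothed log prior plus smoothed log likelihood of each word.
        score = math.log((self.message_counts[label] + 1) / (total_msgs + 2))
        for w in words:
            count = self.word_counts[label][w]
            score += math.log((count + 1) / (total_words + len(vocab) + 1))
        return score

    def is_spam(self, body):
        words = tokenize(body)
        return self._log_score(words, "spam") > self._log_score(words, "clean")


def classify(sender, body, learned_filter):
    """Run the fast filter first; only undecided mail reaches stage 2."""
    if fast_filter(sender, body):
        return "spam"
    return "spam" if learned_filter.is_spam(body) else "clean"
```

In this sketch a user correction would simply be another call to `train` with the corrected label, which is one way the classifier could become more accurate over time.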
Typical neural networks are not modeled on the cognitive development of the human brain. However, the inventors have developed a cognitive hierarchical stacked neural network. See, U.S. Pat. No. 7,152,051, expressly incorporated herein by reference.
The simplest prior-art artificial neural networks (ANNs) comprise an interconnected set of artificial neurons. Signals pass between artificial neurons over predetermined connections. Each neuron typically receives signals from a number of other neurons. Each connection between one neuron and another has a weight associated with it that represents the strength of the sending neuron's signal. In more advanced paradigms, the weight can change based on a pattern of activity of signals over the connection, or signals over other connections. This change can be persistent or can revert to the nominal response over time. An activation function associated with the receiving neuron multiplies each signal that it receives from other neurons by the weight of its connection, sums the results, and computes whether the neuron will fire. When the neuron fires, it sends signals that either activate or inhibit other internal neurons or cause the network to output an external response. In more advanced paradigms, the neuron output can be an analog value or time-variant function. Connection weights between neurons are adjusted, e.g., by training algorithms based on the neural network's production of successful outputs. These connection weights comprise the neural network's knowledge or learning.
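As a rough illustration of the neuron model just described, the following Python sketch computes a weighted sum of incoming signals, applies a simple threshold to decide whether the neuron fires, and adjusts the weights after an unsuccessful output. The function names, the threshold value, and the perceptron-style update rule are assumptions chosen for brevity. Prior-art networks use many different activation functions and training algorithms.

```python
def weighted_sum(signals, weights):
    """Multiply each incoming signal by its connection weight and sum."""
    return sum(s * w for s, w in zip(signals, weights))


def fires(signals, weights, threshold=1.0):
    """Step activation: the neuron fires when the weighted sum reaches the threshold."""
    return weighted_sum(signals, weights) >= threshold


def adjust_weights(signals, weights, fired, should_fire, rate=0.1):
    """Perceptron-style update (an assumed rule): move each weight toward the desired output."""
    error = int(should_fire) - int(fired)
    return [w + rate * error * s for s, w in zip(signals, weights)]


# A negative weight acts as an inhibitory connection.
signals = [1, 0, 1]
weights = [0.6, 0.9, -0.5]
fired = fires(signals, weights)  # weighted sum is about 0.1, below the threshold
weights = adjust_weights(signals, weights, fired, should_fire=True)
```

Only the weights on connections that carried a signal change in this update, so the inactive middle connection keeps its weight.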
To increase the capacity of prior-art neural networks to solve problems accurately and to expand their abstract abilities, some prior-art neural networks comprise more than one neural network. Architecturally distinct neural networks are linked to other networks hierarchically, in parallel, in tree structures, or in other configurations. Such linked neural networks allow greater levels of abstraction and multiple views of problems. In prior-art neural networks that are linked hierarchically, information moves up through the system of neural networks, with output from each lower-level neural network cascading up to the level above it. The lower levels identify patterns based on the input stimuli. These patterns are then fed to the higher levels, with input noise reduced and with increasingly narrow representations identified, as output from one neural network moves to the next. In this movement through the series of networks, a winnowing process takes place, with information reduced as decisions are made concerning the identity of the object or concept represented by a pattern. In the process of eliminating the noise in the input stimuli, the complexity, subtlety, and meaning of information may be lost. Neural networks at higher levels operate on information more remote from the raw data than neural networks at lower levels, and their tasks become more abstract. The result is that certain complexity and context, which might be critical for decision-making and data interpretation, are lost. Therefore, when an ANN at one hierarchical level in a stacked network is dedicated to a new task, any aspects of the input that its training does not require it to preserve will be lost from higher-level consideration.
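The winnowing effect described above can be made concrete with a toy Python sketch. The three level functions are hand-written stand-ins for trained neural networks, and their names and thresholds are purely hypothetical. The point is only that each level passes a narrower representation upward, so two different raw inputs can become indistinguishable at the top.

```python
def level_one(raw):
    """Lowest level: reduce noisy raw readings to binary features."""
    return [1 if value > 0.5 else 0 for value in raw]


def level_two(features):
    """Middle level: summarize the features as a count of active ones."""
    return sum(features)


def level_three(active_count):
    """Top level: make the final decision from the summary alone."""
    return "object" if active_count >= 2 else "background"


def run_stack(raw, levels):
    """Cascade the output of each level up to the level above it."""
    signal = raw
    trace = [signal]
    for level in levels:
        signal = level(signal)
        trace.append(signal)
    return trace


levels = [level_one, level_two, level_three]
left = run_stack([0.9, 0.6, 0.1], levels)
right = run_stack([0.1, 0.7, 0.8], levels)
# Both stimuli end as "object"; where the activity occurred is lost after level two.
```

Inspecting `left` and `right` shows that the two stimuli still differ after the first level but are identical from the second level upward, which is the kind of lost context the paragraph describes.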
Motor network control systems, or computers which control external mechanical devices, are known in the art. See, e.g., U.S. Pat. Nos. 6,686,712, 5,576,632, and US App. 2008/0144944, each of which is expressly incorporated herein by reference. Genetic algorithms are search or computation techniques to find exact or approximate solutions to optimization and search problems. See, generally, Wikipedia: Genetic Algorithm, available at en.wikipedia.org/wiki/Genetic_algorithm, last accessed May 18, 2010. Several models and uses of genetic algorithms are known in the art. See, e.g., US App. 2010/0103937, US App. 2010/0094765, US App. 2009/0327178, US App. 2009/0319455, US App. 2009/0307636, US App. 2009/0271341, US App. 2009/0182693, US App. 2009/0100293, US App. 2009/0012768, US App. 2008/0267119, US App. 2008/0140749, US App. 2008/0109392, US App. 2008/0010228, US App. 2007/0251998, US App. 2007/0208691, US App. 2007/0166677, US App. 2007/0133504, US App. 2007/0106480, US App. 2007/0094164, US App. 2007/0094163, US App. 2007/0024850, US App. 2006/0230018, US App. 2006/0229817, US App. 2005/0267851, US App. 2005/0246297, US App. 2005/0198182, US App. 2005/0197979, US App. 2005/0107845, US App. 2005/0088343, US App. 2005/0074097, US App. 2005/0074090, US App. 2005/0038762, US App. 2005/0005085, US App. 2004/0210399, US App. 2004/0181266, US App. 2004/0162794, US App. 2004/0143524, US App. 2004/0139041, US App. 2004/0081977, US App. 2004/0047026, US App. 2004/0044633, US App. 2004/0043795, US App. 2004/0040791, US App. 2003/0218818, US App. 2003/0171122, US App. 2003/0154432, US App. 2003/0095151, US App. 2003/0050902, US App. 2003/0046042, US App. 2002/0156752, U.S. Pat. Nos. 7,698,237, 7,672,910, 7,664,094, 7,657,497, 7,627,454, 7,620,609, 7,613,165, 7,603,325, 7,552,669, and 7,502,764, each of which is expressly incorporated herein by reference.
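For readers unfamiliar with genetic algorithms, the following minimal Python sketch shows the usual loop of selection, crossover, and mutation. It is not drawn from any of the references cited above. The toy fitness function (the number of 1 bits in a bit string), the parameter values, and the use of elitism are assumptions chosen to keep the example short.

```python
import random


def fitness(bits):
    """Toy objective: the number of 1 bits (higher is better)."""
    return sum(bits)


def evolve(length=20, population_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Return the best bit string found and the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Selection: the fitter half of the population may reproduce.
        parents = population[: population_size // 2]
        # Elitism: carry the current best forward unchanged.
        children = [population[0][:]]
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Single-point crossover.
            cut = rng.randint(1, length - 1)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
```

Because the best individual is always copied into the next generation, the values in `history` never decrease. Without elitism the best fitness could drop from one generation to the next.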