Artificial neural networks (ANNs) today provide many established methods for signal processing, control, prediction, and data modeling for complex nonlinear systems. The terminology for describing ANNs is fairly standardized. However, a brief review of the basic ideas and terminology is provided here.
A typical ANN consists of a finite number K of units, which at a discrete time t (where t=1, 2, 3, . . . ) have an activation xi(t) (i=1, . . . , K). The units are mutually linked by connections with weights wji (where i, j=1, . . . , K and where wji is the weight of the connection from the i-th to the j-th unit), which typically are assigned real numbers. A weight wji=0 indicates that there is no connection from the i-th to the j-th unit. It is convenient to collect the connection weights in a connection matrix W=(wji), j, i=1, . . . , K. The activation of the j-th unit at time t+1 is derived from the activations of all network units at time t by
$$x_j(t+1) = f_j\Big(\sum_{i=1,\dots,K} w_{ji}\,x_i(t)\Big), \tag{1}$$
where the transfer function ƒj typically is a sigmoid-shaped function (linear or step functions are also relatively common). In most applications, all units have identical transfer functions. Sometimes it is beneficial to add noise to the activations. Then (1) becomes
$$x_j(t+1) = f_j\Big(\sum_{i=1,\dots,K} w_{ji}\,x_i(t)\Big) + v(t), \tag{1'}$$
where v(t) is an additive noise term.
Some units are designated as output units; their activation is considered as the output of the ANN. Some other units may be assigned as input units; their activation xi(t) is not computed according to (1) but is set to an externally given input ui(t), i.e.
$$x_i(t) = u_i(t) \tag{2}$$
in the case of input units.
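The update rules (1), (1′) and (2) can be illustrated by a minimal sketch in Python. The function name `step`, the choice of tanh as the sigmoid-shaped transfer function, and the use of independent Gaussian noise per unit are assumptions made for illustration only; the text above does not prescribe them.

```python
import math
import random

def step(W, x, inputs=None, f=math.tanh, noise_std=0.0, rng=random):
    """Compute the activations at time t+1 from the activations x at time t.

    W[j][i] is the weight of the connection from unit i to unit j
    (0.0 means "no connection"). `inputs` maps the index of each input
    unit to its externally given value at time t+1 (hypothetical
    convention chosen for this sketch).
    """
    K = len(x)
    inputs = inputs or {}
    x_next = []
    for j in range(K):
        if j in inputs:
            # Equation (2): input units are clamped to the external input.
            x_next.append(inputs[j])
            continue
        net = sum(W[j][i] * x[i] for i in range(K))
        # Equation (1'): optional additive noise; assumed Gaussian here.
        v = rng.gauss(0.0, noise_std) if noise_std > 0.0 else 0.0
        x_next.append(f(net) + v)  # equation (1) when v == 0
    return x_next

# Three units: unit 0 is an input unit, unit 2 is read as the output unit.
W = [[0.0, 0.0, 0.0],
     [0.5, 0.0, -0.3],
     [0.0, 1.0, 0.2]]
x = [0.0, 0.0, 0.0]
for u in (1.0, 0.0, -1.0):
    x = step(W, x, inputs={0: u})
    print("output unit activation:", x[2])
```

Because unit 2 feeds back into unit 1 and into itself, this small example already contains connection cycles; setting those weights to zero would leave a purely feedforward chain.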
Most practical applications of ANNs use feedforward networks, in which activation patterns are propagated from an input layer through hidden layers to an output layer. The characteristic feature of feedforward networks is that there are no connection cycles. In formal theory, feedforward networks represent input-output functions. A typical way to construct a feedforward network for a given functionality is to teach it from a training sample, i.e. to present it with a number of correct input-output-pairings, from which the network learns to approximately repeat the training sample and to generalize to other inputs not present in the training sample. Using a correct training sample is called supervised learning. The most widely used supervised teaching method for feedforward networks is the backpropagation algorithm, which incrementally reduces the quadratic output error on the training sample by a gradient descent on the network weights. The field had its breakthrough when efficient methods for computing the gradient became available, and is now an established and mature subdiscipline of pattern classification, control engineering and signal processing.
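As a rough illustration of gradient descent on the quadratic output error, the following Python sketch trains a feedforward network with one hidden layer. The network size, the tanh hidden units, the linear output unit, the learning rate, the constant input used as a bias, and the target function are all assumptions chosen for the example, not details taken from the text.

```python
import math
import random

def forward(W_hid, w_out, u):
    """Propagate input vector u through the hidden layer to a linear output."""
    h = [math.tanh(sum(w * ui for w, ui in zip(row, u))) for row in W_hid]
    y = sum(w * hi for w, hi in zip(w_out, h))
    return h, y

def quadratic_error(W_hid, w_out, sample):
    """Sum of 0.5 * (output - target)^2 over the training sample."""
    return sum(0.5 * (forward(W_hid, w_out, u)[1] - d) ** 2 for u, d in sample)

def train(W_hid, w_out, sample, rate=0.05, epochs=200):
    """Incremental gradient descent on the weights (modifies them in place)."""
    for _ in range(epochs):
        for u, d in sample:
            h, y = forward(W_hid, w_out, u)
            delta_out = y - d  # dE/dy for E = 0.5 * (y - d)^2
            for j, row in enumerate(W_hid):
                # Output error propagated back through w_out[j] and tanh'.
                delta_hid = delta_out * w_out[j] * (1.0 - h[j] ** 2)
                for i in range(len(row)):
                    row[i] -= rate * delta_hid * u[i]
                w_out[j] -= rate * delta_out * h[j]
    return W_hid, w_out

# Training sample: pairs (input vector, target). The second input component
# is a constant 1.0 acting as a bias (an assumption of this sketch).
sample = [([x, 1.0], x * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
rng = random.Random(0)
W_hid = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(3)]

print("error before training:", quadratic_error(W_hid, w_out, sample))
train(W_hid, w_out, sample)
print("error after training: ", quadratic_error(W_hid, w_out, sample))
```

Each pass changes every weight by a small step against its error gradient, which is why many passes over the training sample are typically needed.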
A particular variant of feedforward networks, radial basis function networks (RBF networks), can be used with a supervised learning method that is simpler and faster than backpropagation. (An introduction to RBF networks is given in the article “Radial basis function networks” by D. Lowe, in: Handbook of Brain Theory and Neural Networks, M. A. Arbib (ed.), MIT Press 1995, p. 779-782.) Typical RBF networks have a hidden layer whose activations are computed quite differently from (1). Namely, the activation of the j-th hidden unit is a function
$$g_j(\lVert u - v_j \rVert) \tag{3}$$
of the distance of the input vector u from some reference vector vj. The activation of output units follows the prescription (1), usually with a linear transfer function. In the teaching process, the activation mechanism for hidden units is not changed. Only the weights of hidden-to-output connections have to be changed in learning. This renders the learning task much simpler than in the case of backpropagation: the weights can be determined off-line (after presentation of the training sample) using linear regression methods, or can be adapted on-line using any variant of mean square error minimization, for instance variants of the least-mean-square (LMS) method.
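The on-line variant can be sketched in a few lines of Python. The Gaussian form of gj, the width parameter, the learning rate, and the function names are assumptions made for illustration; the text only requires that gj depends on the distance in (3) and that the hidden-to-output weights alone are adapted.

```python
import math

def rbf_hidden(u, centers, width=0.3):
    """Hidden activations g_j(||u - v_j||), here with an assumed Gaussian g_j."""
    acts = []
    for v in centers:
        dist_sq = sum((a - b) ** 2 for a, b in zip(u, v))
        acts.append(math.exp(-dist_sq / (2.0 * width ** 2)))
    return acts

def rbf_output(u, centers, w, width=0.3):
    """Linear output unit: weighted sum of the hidden activations."""
    return sum(wj * hj for wj, hj in zip(w, rbf_hidden(u, centers, width)))

def lms_train(sample, centers, rate=0.2, epochs=500, width=0.3):
    """Adapt only the hidden-to-output weights with the LMS rule."""
    w = [0.0] * len(centers)
    for _ in range(epochs):
        for u, d in sample:
            h = rbf_hidden(u, centers, width)
            err = d - sum(wj * hj for wj, hj in zip(w, h))
            # LMS update: move each weight along error times its hidden activation.
            w = [wj + rate * err * hj for wj, hj in zip(w, h)]
    return w

# Three reference vectors v_j placed on the training inputs (one-dimensional).
centers = [[0.0], [0.5], [1.0]]
sample = [([0.0], 0.0), ([0.5], 1.0), ([1.0], 0.0)]
w = lms_train(sample, centers)
for u, d in sample:
    print(u, "target", d, "output", rbf_output(u, centers, w))
```

Since the hidden activations are fixed, the output is linear in the weights w, so the same weights could instead be obtained off-line by ordinary linear regression on the hidden activations.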
If one admits cyclic paths of connections, one obtains recurrent neural networks (RNNs). The hallmark of RNNs is that they can support self-exciting activation over time, and can process temporal input with memory influences. From a formal perspective, RNNs realize nonlinear dynamical systems (as opposed to feedforward networks, which realize functions). From an engineering perspective, RNNs are systems with a memory. It would be a significant benefit for engineering applications to construct RNNs that perform a desired input-output dynamics. However, such applications of RNNs are still rare. The major reason for this rareness lies in the difficulty of teaching RNNs. The state of the art in supervised RNN learning is marked by a number of variants of the backpropagation through time (BPTT) method. A recent overview is provided by A. F. Atiya and A. G. Parlos in the article “New Results on Recurrent Network Training: Unifying the Algorithms and Accelerating Convergence”, IEEE Transactions on Neural Networks, vol. 11, no. 3 (2000), 697-709. The intuition behind BPTT is to unfold the recurrent network in time into a cascade of identical copies of itself, where recurrent connections are re-arranged such that they lead from one copy of the network to the next (instead of back into the same network). This “unfolded” network is, technically, a feedforward network and can be taught by suitable variants of teaching methods for feedforward networks. This way of teaching RNNs inherits the iterative, gradient-descent nature of standard backpropagation, and multiplies its intrinsic cost by the number of copies used in the “unfolding” scheme. Convergence is difficult to steer and often slow, and the single iteration steps are costly. Because of the computational costs, only relatively small networks can be trained.
Another difficulty is that the back-propagated gradient estimates quickly degrade in accuracy (going to zero or infinity), thereby precluding the learning of memory effects of timespans greater than approx. 10 timesteps. These and other difficulties have so far prevented RNNs from being widely used.
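The degradation of the back-propagated gradient can be made concrete with a deliberately reduced example: a single unit with one self-connection of weight w, unfolded over a number of timesteps. The scalar setting, the weight values, and the function name are assumptions for illustration; in a full network the same effect arises from repeated products of weight matrices and transfer function derivatives.

```python
import math

def gradient_through_time(w, x0, steps, f=math.tanh,
                          df=lambda a: 1.0 - math.tanh(a) ** 2):
    """Return d x(t) / d x(0) for t = 1..steps, for x(t+1) = f(w * x(t)).

    By the chain rule the derivative is a product of one factor
    w * f'(w * x(t)) per unfolded timestep.
    """
    x, grad, history = x0, 1.0, []
    for _ in range(steps):
        a = w * x
        grad *= w * df(a)  # factor contributed by this timestep
        x = f(a)
        history.append(grad)
    return history

# Sigmoid-shaped unit with a small weight: the gradient shrinks toward zero.
vanishing = gradient_through_time(w=0.5, x0=0.1, steps=20)

# Linear unit with weight above 1: the gradient grows without bound.
exploding = gradient_through_time(w=2.0, x0=0.1, steps=20,
                                  f=lambda a: a, df=lambda a: 1.0)

print("after 10 steps:", vanishing[9], exploding[9])
print("after 20 steps:", vanishing[-1], exploding[-1])
```

In the first case each factor has magnitude at most 0.5, so after 20 steps the gradient is bounded by 0.5 to the power 20; in the second case each factor is exactly 2, so the gradient doubles at every step.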