Computers can store, modify, and retrieve large amounts of data much more quickly than humans. They are also much more accurate and precise in their computations and less prone to error than most conscientious human beings. However, computers cannot cope with many of the simple tasks that humans perform every day. In particular, they completely fail in generalizing and guessing. Also they have great difficulty working with either partial or noisy information. For this reason scientists have designed parallel distributed processors that consist of a vast network of neuron-like units. These systems are called Artificial Neural Networks. Although they could be built in hardware, typically they are simulated on powerful conventional computers.
The simplest neural network contains an input layer and an output layer. Each layer consists of a row of neuron-like units, or nodes. Each node in the output layer is connected to every node in the input layer through synaptic links. The conductivities of these links are called "weights". In such a network the signal at an output node is computed by summing the products of each input signal and the weight of the corresponding synaptic link connecting to this output node. The output signals, called "activations", are then compared with the "target" values (desired signals) assigned to these nodes. A portion of the error, i.e., the difference between the target value assigned to an output node and the signal of that output node, is used to change the weights in order to reduce the error. The most commonly used method of error minimization is called the "delta rule".
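The output computation and a single weight correction described above can be sketched as follows. This is only an illustrative sketch: the function names, the layout `weights[j][i]` (link from input node i to output node j), and the sample values are assumptions made for the example and do not come from the reference.

```python
def activations(weights, inputs):
    """Signal at each output node: sum of input signals times link weights.

    weights[j][i] is the (assumed) weight of the link from input node i
    to output node j.
    """
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def delta_rule_step(weights, inputs, targets, learning_rate):
    """Apply one delta-rule correction and return (new_weights, errors)."""
    outputs = activations(weights, inputs)
    # Error at each output node: target value minus activation.
    errors = [t - o for t, o in zip(targets, outputs)]
    # Only a portion (learning_rate) of the error is used to change weights.
    new_weights = [
        [w + learning_rate * e * x for w, x in zip(row, inputs)]
        for row, e in zip(weights, errors)
    ]
    return new_weights, errors


# Illustrative values only: two input nodes, two output nodes.
weights = [[0.0, 0.0], [0.0, 0.0]]
inputs = [1.0, 0.5]
targets = [1.0, -1.0]

weights, errors = delta_rule_step(weights, inputs, targets, learning_rate=0.5)
print(errors)                        # error before the correction
print(activations(weights, inputs))  # activations after the correction
```

With these sample values the activations move from 0 toward the targets after one correction, so the error at each output node shrinks without vanishing in a single step.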
In effect, a neural network generates its output vector (a collection of activations of the output nodes) from an input vector (a collection of input signals) and compares the output vector with the target vector (a collection of desired outputs). This process is called "learning". If the output and target vectors are identical, no learning takes place. Otherwise the weights are changed to reduce the error. A neural network learns a mapping function between the input and target vectors by repeatedly observing patterns from a training set and modifying the weights of its synaptic links. Each pass through the training set is called a "cycle". Typically, a learning process (also called "training") consists of many thousands of cycles, and takes from several minutes to several hours to execute on a digital computer.
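The repeated presentation of a training set can be sketched as a loop over cycles. The sketch below assumes a single-layer linear network trained with the delta rule; the function name `train`, the stopping tolerance, the cycle limit, and the three-pattern training set are all hypothetical choices made for illustration.

```python
def train(patterns, n_inputs, n_outputs,
          learning_rate=0.1, tolerance=1e-6, max_cycles=10000):
    """Train a single-layer linear network; return (weights, cycles_used).

    patterns is a list of (input_vector, target_vector) pairs.
    """
    weights = [[0.0] * n_inputs for _ in range(n_outputs)]
    for cycle in range(1, max_cycles + 1):
        sum_sq_error = 0.0
        # One cycle is one pass through the whole training set.
        for inputs, targets in patterns:
            outputs = [sum(w * x for w, x in zip(row, inputs))
                       for row in weights]
            for j, (t, o) in enumerate(zip(targets, outputs)):
                error = t - o
                sum_sq_error += error * error
                # When output equals target the error is zero and the
                # weights are left unchanged (no learning takes place).
                for i, x in enumerate(inputs):
                    weights[j][i] += learning_rate * error * x
        if sum_sq_error < tolerance:
            return weights, cycle
    return weights, max_cycles


# Hypothetical training set generated by the mapping t = 2*x1 - x2.
patterns = [
    ([1.0, 0.0], [2.0]),
    ([0.0, 1.0], [-1.0]),
    ([1.0, 1.0], [1.0]),
]
weights, cycles = train(patterns, n_inputs=2, n_outputs=1)
print(weights, cycles)
```

The mapping here is linear on purpose, so that a single-layer network can represent it exactly; the number of cycles needed depends on the learning rate and the tolerance chosen.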
What is described above is the well-known learning paradigm called "pattern association". In this paradigm the task is to associate a set of input patterns with an output pattern. A set of input patterns can be, for example, a number of curves that belong to the same class while differing from each other. Another learning paradigm is "auto association", in which an input pattern is associated with itself. The goal here is pattern completion. When an incomplete pattern is presented, the auto associator restores it to its original form. Both paradigms require a teaching input in the form of repetitive presentations of a number of sets of input patterns associated with a number of output patterns. This is called "supervised learning".
Simple neural networks consisting of a row of input nodes and a row of output nodes (also described as "single-layer" networks) have been found unable to learn mappings that have very different outputs from very similar inputs. For example, they cannot learn the exclusive-or function. For a neural network to learn such arbitrary patterns, it must have more than two rows of nodes. In such "multi-layer" networks the rows between the input and output layers are called "hidden layers". The most commonly used multi-layer neural networks are known as "backpropagation" systems. In these networks learning takes place by propagating signals in the forward direction and errors in the backward direction. This two-way propagation results in a complex learning process and further increases the training time. It was also discovered that linear networks cannot compute more in multiple layers than they can in a single layer. Because of this, in backpropagation networks nonlinearities are introduced at the hidden and output nodes using sigmoid functions.
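The statement that linear networks gain nothing from additional layers can be illustrated numerically: two successive linear layers give the same output as one layer whose weight matrix is the product of the two. The weight values, the input vector, and the helper names below are assumptions chosen for the example.

```python
import math


def matvec(matrix, vector):
    """Multiply a weight matrix (list of rows) by a signal vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


# Illustrative weights: 2 input nodes -> 3 hidden nodes -> 2 output nodes.
hidden_weights = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
output_weights = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
x = [0.5, -2.0]

# Linear hidden layer: the two layers collapse into one weight matrix.
combined = matmul(output_weights, hidden_weights)
two_layer = matvec(output_weights, matvec(hidden_weights, x))
one_layer = matvec(combined, x)
print(two_layer, one_layer)  # the two results agree

# With a sigmoid at the hidden nodes, the output no longer matches
# the collapsed single-layer network built from the same weights.
hidden = [sigmoid(h) for h in matvec(hidden_weights, x)]
nonlinear = matvec(output_weights, hidden)
print(nonlinear)
```

The last comparison only shows that the product matrix stops reproducing the network once a sigmoid is inserted; it is not a proof of what nonlinear networks can or cannot compute.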
In a single-layer linear network the error function is always smooth and the error surface is bowl-shaped. The delta rule mentioned above uses the "gradient descent" method. In this method each weight change is proportional to the derivative of the error measure (sum of the squares of errors) with respect to that weight, with a negative constant of proportionality (the proportion of the dictated weight change that is applied is called the "learning rate"). This corresponds to performing the steepest descent on a surface in weight space. The height of the surface at any point is equal to the error measure. Because of this, the delta rule has no difficulty locating the minimum point of the bowl-shaped error surface of a single-layer linear network. Multi-layer networks, however, can have more complex error surfaces with many minima. Only one of these is the "global minimum", in which the system reaches an errorless state. The others are called "local minima". As a result, there is a real possibility of a multi-layer network getting "stuck" in a local minimum.
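For a single-layer linear network, the relation between gradient descent and the delta rule can be written out as follows. The symbols are assumptions introduced for this sketch: x_i is the signal at input node i, a_j the activation and t_j the target of output node j, w_ji the weight of the link from input node i to output node j, and \eta the learning rate. The factor 1/2 in the error measure is a common convention assumed here to simplify the derivative; it only rescales the learning rate.

```latex
\begin{align}
E &= \tfrac{1}{2} \sum_{j} (t_j - a_j)^2, \qquad a_j = \sum_{i} w_{ji}\, x_i \\
\frac{\partial E}{\partial w_{ji}} &= -(t_j - a_j)\, x_i \\
\Delta w_{ji} &= -\eta\, \frac{\partial E}{\partial w_{ji}} = \eta\, (t_j - a_j)\, x_i
\end{align}
```

Because E is quadratic in the weights when the network is linear, the surface it defines in weight space is the bowl shape referred to above.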
To reduce the training time, the learning rate is set as high as possible. However, this causes the error measure to oscillate. A "momentum term" is added to the weight change equation so that a high learning rate can be used while avoiding oscillations. The coefficient of the momentum term determines what portion of the previous weight change will be added to the current weight change.
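The weight change with a momentum term can be sketched as below. The function name, the parameter names, and the one-dimensional error measure E(w) = w^2 used to drive it are assumptions made for illustration; the sketch only shows the mechanics of the update and makes no claim about how a particular network behaves.

```python
def momentum_step(weight, gradient, previous_change,
                  learning_rate, momentum):
    """Return (new_weight, change) for one update with a momentum term.

    The change is the gradient-descent step plus a portion (momentum)
    of the previous weight change.
    """
    change = -learning_rate * gradient + momentum * previous_change
    return weight + change, change


# Illustrative run on the assumed error measure E(w) = w**2,
# whose derivative with respect to w is 2*w.
weight, change = 1.0, 0.0
history = []
for _ in range(3):
    gradient = 2.0 * weight
    weight, change = momentum_step(weight, gradient, change,
                                   learning_rate=0.25, momentum=0.5)
    history.append(weight)
print(history)
```

In the third step the derivative is zero, yet the weight still moves because half of the previous change is carried over; this is the contribution of the momentum coefficient described above.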
In what follows, sets of patterns, features, curves, hypersurfaces and objects will be referred to as "patterns", and functions of the neural network such as identification, classification, detection, pattern association or auto association will be referred to as "identification".
A detailed description of artificial neural networks is provided in the reference entitled "Parallel Distributed Processing" by Rumelhart et al. (1986), Vol. 1, the MIT Press, Cambridge, Mass.
During the evolution of artificial neural networks the following three problems have been discovered:
(i) Single-layer networks cannot learn mappings that have very different outputs from very similar inputs.
(ii) Linear networks cannot compute more in multiple layers than they can in a single layer.
(iii) Multilayer nonlinear networks that use the delta rule suffer from the problems of local minima.
It is a principal object of the present invention to provide a neural network architecture and operating method that is not subject to the aforementioned problems and disadvantages.
It is an additional object of the present invention to provide an architecture and operating method that enables a neural network to:
(i) map any input vector to any output vector without the use of hidden layers;
(ii) be free from the problems of local minima;
(iii) reduce the sum of the square of errors over the output nodes to 0.000000 in fewer than ten cycles.
While the specification concludes with claims defining the features of the invention that are regarded as novel, it is believed that the invention, together with further objects thereof, will be better understood from a consideration of the following description in conjunction with the drawing figures.