1. Field of the Invention
This invention relates to (artificial) neural networks (in other words, to parallel processing apparatus comprising or emulating a plurality of simple, interconnected, neural processors, or to apparatus arranged to emulate parallel processing of this kind) and particularly, but not exclusively, to their use in pattern recognition problems such as speech recognition, text-to-speech conversion, natural language translation and video scene recognition.
2. Related Art
Referring to FIG. 1, one type of generalized neural net known in the art comprises a plurality of input nodes 1a, 1b, 1c to which an input data sequence is applied from an input means (not shown), and a plurality of output nodes 2a, 2b, 2c, each of which produces a respective net output signal indicating that the input data sequence satisfied a predetermined criterion (for example, a particular word or sentence is recognized or an image corresponding to a particular object is recognized). Each output node is connected to one or more nodes in the layer below (the input layer) by a corresponding connection including a weight 3a-3i which scales the output of those nodes by a weight factor to provide an input to the node in the layer above (the output layer). Each node output generally also includes a non-linear (compression) stage (not shown).
In many such nets, further intermediate inner or `hidden` layers are included, which receive inputs from a layer below and generate outputs for a layer above. The output of a node in general is a function of its weighted inputs; typically the function is the sum of these inputs, with the subsequent non-linear compression mentioned above. One example of such a net is the well known Multi-Layer-Perceptron (MLP).
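By way of illustration only, the node computation described above (a sum of weighted inputs followed by a non-linear compression) can be sketched as follows. The logistic function is assumed here as the compression stage, and the names `sigmoid`, `layer_output` and `forward` are hypothetical; the text above does not prescribe any particular function or implementation.

```python
import math


def sigmoid(x):
    """Assumed non-linear compression stage: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weights):
    """Compute the outputs of one layer of nodes.

    weights[j][i] is the weight factor on the connection from node i in the
    layer below to node j in the layer above.
    """
    outputs = []
    for row in weights:
        # Each node sums its weighted inputs, then compresses the sum.
        net = sum(w * x for w, x in zip(row, inputs))
        outputs.append(sigmoid(net))
    return outputs


def forward(inputs, layers):
    """Pass an input vector up through a list of weight matrices.

    One matrix gives an input layer feeding an output layer directly;
    additional matrices correspond to hidden layers in between.
    """
    activations = inputs
    for weights in layers:
        activations = layer_output(activations, weights)
    return activations
```

With three input nodes and three output nodes fully connected, a single 3-by-3 matrix holds nine weights, matching the weights 3a-3i of FIG. 1.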
Such nets are trained in a training phase by inputting training data sequences which are known to satisfy predetermined criteria, and iteratively modifying the weight values connecting the layers until the net outputs approximate the desired indications of such criteria. Having been trained on a range of training data, it is then found that such trained networks can operate upon real-world data to perform various processing and recognition tasks.
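A minimal sketch of such a training phase is given below for a single output node, assuming a logistic compression stage and a simple gradient (delta rule) update. The learning rate, tolerance, iteration limit and the logical-OR training data are hypothetical choices made for the example, not values taken from any of the cited work.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, inputs):
    """Output of one node: compressed sum of weighted inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))


def train(patterns, rate=0.5, tolerance=0.1, max_iterations=10000):
    """Iteratively modify the weights until outputs approximate the targets.

    patterns is a list of (inputs, target) pairs known to satisfy, or not
    satisfy, the criterion. Training stops early once the largest error seen
    during a pass over the patterns falls below the tolerance.
    """
    weights = [0.0] * len(patterns[0][0])
    for _ in range(max_iterations):
        worst = 0.0
        for inputs, target in patterns:
            output = predict(weights, inputs)
            error = target - output
            worst = max(worst, abs(error))
            # Slope of the logistic function at the current output.
            slope = output * (1.0 - output)
            for i, x in enumerate(inputs):
                weights[i] += rate * error * slope * x
        if worst < tolerance:
            break
    return weights


# Hypothetical training data: logical OR of the last two inputs.
# The first input is held at 1.0 so that its weight acts as a bias.
OR_PATTERNS = [
    ([1.0, 0.0, 0.0], 0.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 0.0], 1.0),
    ([1.0, 1.0, 1.0], 1.0),
]
```

Calling `train(OR_PATTERNS)` returns a weight list which can then be passed to `predict` for inputs not used in training. Multi-layer nets require the error to be propagated back through the hidden layers, which this single-node sketch omits.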
Since the revival of interest in neural nets in recent years much attention has focussed on nets in which processing is unequivocally parallel and distributed (Rumelhart 1986) and which have recently proved to be admirably suited to tackling problems in signal processing, e.g. (Lynch & Rayner 1989), pattern recognition, e.g. (Hutchinson & Welsh 1989) (Woodland & Smythe 1990), and robotic control, e.g. (Saerens & Soquet 1989). Some attention has also been paid to problems which cannot be seen as signal processing, and in particular various methods of applying neural nets to natural language have been described, from (Rumelhart 1986) and (McClelland & Kawamoto 1986) through to recent papers and reports (Sharkey 1989), (Weber 1989) and (Jagota & Jajubowitz 1989). A difficulty in these cases is how to present inputs to the net. If unlimited data such as text is to be processed by a neural net of these kinds, either it must be input as some set of lower level features--letters or microfeatures as described in e.g. (Rumelhart et al 1986)--or, if whole words or larger features are to be used, the number of input nodes must be very great. In the latter case, too, some retreat from the pure concept of parallel distributed processing must be accepted, since each word can be seen as locally stored.
In other words, the choice is typically between using too few nodes (in which case the network may not train well if features chosen are inappropriate) or too many (in which case the network is tending to act as a simple look up store).
Another problem is that a very large number of iterations can be required for convergence in training, which can consequently be slow and laborious.
In their paper entitled "Learning to understand sentences in a connectionist network", published in the proceedings of the IEEE International Conference on Neural Networks, San Diego, 24-27 Jul. 1988, pages II 215 to II 219, Nolfi and Parisi describe a "Jordan Architecture" net which is trained by back propagation. The net is a kind of multi-layer perceptron in which there are input units, output units and hidden units. Associated with each hidden unit is a corresponding memory unit. Each memory unit makes a temporary copy of each state of its associated hidden unit and then supplies this copy to the hidden unit in the next cycle (when the system processes the next stimulus).
The memory units only store information temporarily. The information stored in the memory units does not appear to correspond to the "new features" specified herein. The information stored in the memory units is not used to modify the input layer in any way.
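The copy-and-feed-back behaviour of the memory units described above can be sketched roughly as follows. This is an illustrative reading of the description only, not the cited authors' implementation: the class name `MemoryNet`, the full connection of every memory unit to every hidden unit, and the logistic compression are assumptions, and the output units are omitted for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class MemoryNet:
    """Hidden layer with one memory unit per hidden unit (hypothetical sketch)."""

    def __init__(self, w_in, w_mem):
        self.w_in = w_in    # w_in[j][i]: input unit i -> hidden unit j
        self.w_mem = w_mem  # w_mem[j][k]: memory unit k -> hidden unit j
        # Memory units start empty before the first stimulus.
        self.memory = [0.0] * len(w_in)

    def step(self, inputs):
        """Process one stimulus and return the new hidden unit states."""
        hidden = []
        for j in range(len(self.w_in)):
            net = sum(w * x for w, x in zip(self.w_in[j], inputs))
            # The copy made in the previous cycle is supplied back here.
            net += sum(w * m for w, m in zip(self.w_mem[j], self.memory))
            hidden.append(sigmoid(net))
        # Each memory unit takes a temporary copy of its hidden unit's state;
        # the previous copy is overwritten, so storage is only temporary.
        self.memory = list(hidden)
        return hidden
```

Presenting the same stimulus twice in succession generally yields different hidden states, because the second cycle also receives the copy stored during the first. Note that the input layer itself is unchanged throughout.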
In another paper at the same conference, at pages II 234-242, Tenorio et al discuss the NETtalk system applied to Spanish and English. In paragraph 5.2 of that paper they discuss the effects of using networks having different numbers of hidden units. Unsurprisingly, when the network has very many hidden units (there being at least as many hidden units as there are training patterns), rather than only a few, there is a dramatic change in the performance of the back propagation algorithm. With many hidden units the network can of course operate as what is effectively a look up table. There is no suggestion either that there is an optimum number of hidden units or that the number of hidden units in a particular network should be altered dynamically or in any other way.
Ekeberg, in a paper entitled "Automatic generation of internal representations in a probabilistic artificial neural network", published in "Neural Networks from Models to Applications", I.D.S.E.T. Paris 1989, at pages 178 to 186, considers adding layers and features to what is initially a single layer feedback perceptron type network. Higher level features code for suitable combinations of simultaneous input unit activity. The higher level features are present in a separate layer, communicating through connections with the input/output layer. Initially the internal layer contains one unit for each in/out unit, that is, the internal code is initially the same as the external. During training, the internal layer is gradually transformed by replacing existing units with units coding higher order co-activity in the in/out layer. Ekeberg describes how the appropriate internal units are chosen: he selects the two internal units with the highest interdependency and replaces them with three more specific ones, one being active when both of the old ones were, and each of the others being active when only one of the old ones was active. After such a replacement, the sample patterns are scanned again to get estimates of the new probabilities involved. This process of replacing two internal units by three is repeated until the task is solved. In the worst case, when no useful regularities are detected, so-called "Grandmother cells" corresponding to the individual training patterns will develop. Thus the network grows into what is effectively a look-up table, which of course is very memory intensive. Ekeberg states that in the normal case, however, a successful representation emerges much sooner.
Ekeberg does not suggest either the possibility or desirability of limiting the replacement of units for the normal case. Ekeberg is also silent as to whether or how the creation of "grandmother cells" can be limited or inhibited with or without any deterioration of the "normal case" performance.
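A single replacement step of the kind Ekeberg describes might be sketched as below. The measure of interdependency is not specified in the summary above, so the absolute difference between the joint activation probability and the product of the individual probabilities is assumed here purely for illustration; the function names and the naming scheme for the new units are likewise hypothetical.

```python
from itertools import combinations


def interdependency(a, b):
    """Assumed measure: |P(a and b) - P(a) * P(b)| estimated over the samples."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    return abs(p_ab - p_a * p_b)


def replace_most_dependent(units):
    """Replace the two most interdependent internal units with three.

    units maps a unit name to its list of 0/1 activations, one entry per
    sample pattern. A new dict is returned; the argument is not modified.
    """
    first, second = max(
        combinations(units, 2),
        key=lambda pair: interdependency(units[pair[0]], units[pair[1]]),
    )
    a, b = units[first], units[second]
    new_units = {k: v for k, v in units.items() if k not in (first, second)}
    # One unit active when both old units were active...
    new_units[first + "&" + second] = [x * y for x, y in zip(a, b)]
    # ...and one unit for each case in which only one old unit was active.
    new_units[first + "&!" + second] = [x * (1 - y) for x, y in zip(a, b)]
    new_units["!" + first + "&" + second] = [(1 - x) * y for x, y in zip(a, b)]
    return new_units
```

Each call increases the number of internal units by one. Repeating the step without any limit is what allows the internal layer to grow towards one unit per training pattern in the worst case.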
In EP 0327817 there is described an associated pattern conversion system in which, during training, connection weights are adjusted between pre-determined, fixed maximum and minimum values. The maxima and minima are fixed in advance, and preferably with only a small range between them, in order that a simple circuit can be used. When a weight reaches its pre-determined maximum or minimum, it is said to be saturated. The weight modification function is a monotonically decreasing function. There is no suggestion of adding internal nodes or modifying the input layer to be responsive to new features derived from higher level features.
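The saturation behaviour described above amounts to clamping each weight to a fixed range after every adjustment. A minimal sketch follows, in which the bounds of -1.0 and 1.0 and the function name `update_weight` are hypothetical; the weight modification function of EP 0327817 itself is not reproduced.

```python
def update_weight(weight, delta, w_min=-1.0, w_max=1.0):
    """Apply an adjustment, keeping the weight within fixed bounds.

    Returns the new weight and a flag that is True when the weight has
    reached its pre-determined maximum or minimum (is saturated).
    """
    new_weight = max(w_min, min(w_max, weight + delta))
    saturated = new_weight == w_min or new_weight == w_max
    return new_weight, saturated
```

For example, `update_weight(0.9, 0.5)` is held at the assumed maximum and reported as saturated, whereas `update_weight(0.0, 0.25)` is not.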