With the proliferation and size of data sets being generated over the past decade or so, there has been much interest in developing tools that can be used to find relationships within data sets, where the data sets are not understood explicitly. It is desirable that the tools with which data can be explored are able to learn data sets consistently every time in a fixed amount of time to allow salient information about the relationships between the input and output to be easily determined.
One tool used to explore data is the feed-forward neural network. Feed-forward neural networks have attracted much attention over the past 40 years or so as they have been used to perform many diverse and difficult tasks with data sets. These include pattern classification and function approximation, because they have the ability to ‘generalise’. Hence neural networks (hereinafter simply referred to as “NN” or “NNs”) can be used in applications like non-linear system modeling and image compression and reconstruction.
NNs are of interest to many fields, including science, commerce, medicine and industry, as they can be given data sets where it is not known what relationships are inherent within the data and the NN can learn how to classify the data successfully.
In some cases the data may not have been subject to any prior classification, and in these circumstances it is common to use unsupervised training, such as self-organising maps, to classify the data. In other cases the data may have been previously broken into data samples that have been classified, and in these circumstances it is common to train a NN to be able to classify the additional unclassified data. In the latter case, a supervised learning algorithm is traditionally used. Classified input data examples have an associated output and during training, the NN learns to reproduce the desired output associated with the input vector. Feed-forward NNs are traditionally trained using supervised training methods.
Artificial NNs are composed of a number of neurons, which are sometimes called units or nodes. They take their inspiration from biological neurons. Neurons are connected together to form networks. Each neuron has input which may be from many other neurons. The neuron produces output in response to the input by either firing or not. The neuron's output may then provide input to many other neurons. This is the basic structure of a feed-forward NN.
Typically neurons form layers. In feed-forward NNs there are three types of layers: input, hidden and output layers. The first layer is the input layer, which has one or more neurons in it. There is also an output layer that may have one or more neurons as well. A NN may also have one or more hidden layers. All neurons in the input layer present their output to the next layer, which may be the output layer or, if there are any hidden layers, the first hidden layer. If there is only one hidden layer, the neurons in the hidden layer will then in turn report their output to the output layer. If there is more than one hidden layer, then those neurons will feed their output into the input of the neurons in the next hidden layer and so on, until the last hidden layer's neurons feed their output into the input of the output layer.
Other network architectures are possible, where the NN is specifically designed to learn a particular data set. This is seen especially in NNs learning sequences of input vectors, which may have feedback loops in the connections. These NNs are called recurrent feed-forward NNs and commonly the output of the NN can be fed back into the input of the NN.
The first biological neuron model was developed by McCulloch and Pitts in 1943. This model became known as the McCulloch-Pitts neuron. The McCulloch-Pitts neuron model or linear threshold gate (hereinafter simply referred to as “LTG” or “LTGs”) is defined as having a number of input connections, each of which has a weight associated with it. The input is defined mathematically as a vector, $x_i \in \{0,1\}^n$, where n is a positive integer indicating the number of inputs into the LTG and i is the index of the input vector. Since there are n input connections, the connection weights can be defined mathematically as a vector, w, where $w \in \mathbb{R}^n$. Each component of the input vector into the LTG is multiplied by its associated weight and the products are summed; this can be expressed mathematically as $x_i \cdot w$, and the result is compared to the LTG's threshold value, T, where $T \in \mathbb{R}$. The output will be 1 if $x_i \cdot w \geq T$; otherwise $x_i \cdot w < T$ and the output is 0. In other words, the LTG uses the step, or Heaviside, function as the activation function of the neuron.
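The LTG can be defined mathematically using the following definitions: $w = \{w_1, w_2, \ldots, w_n\}$ and $x_i = \{x_1, x_2, \ldots, x_n\}$. Let $net_n = x_i \cdot w$, with $x_i \in \{0,1\}^n$ and $w \in \mathbb{R}^n$; then the behaviour of the LTG can be summarised in equation 1.1, as follows:
$$x_i \cdot w < T \rightarrow 0 \quad \text{and} \quad x_i \cdot w \geq T \rightarrow 1 \qquad (1.1)$$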
Thus the output of the LTG, O, is binary {0,1}. The LTG will output 1 if the LTG is activated and 0 if it is not.
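The behaviour summarised in equation 1.1 can be illustrated with a minimal Python sketch. The function names and the particular weights and threshold (chosen here so that the LTG behaves like a logical AND of two inputs) are illustrative assumptions, not values taken from this specification.

```python
from itertools import product


def ltg_output(x, w, threshold):
    """Return 1 if the weighted sum x . w reaches the threshold T, else 0 (equation 1.1)."""
    net = sum(xj * wj for xj, wj in zip(x, w))
    return 1 if net >= threshold else 0


# Hypothetical weights and threshold, picked by hand so the LTG acts as logical AND.
AND_WEIGHTS = (1.0, 1.0)
AND_THRESHOLD = 1.5


def truth_table(w, threshold, n):
    """Evaluate the LTG on every binary input vector of length n."""
    return {x: ltg_output(x, w, threshold) for x in product((0, 1), repeat=n)}


if __name__ == "__main__":
    for x, output in truth_table(AND_WEIGHTS, AND_THRESHOLD, 2).items():
        print(x, "->", output)
```

Only the input vector (1, 1) produces a weighted sum that reaches the threshold, so it is the only input for which this particular LTG is activated.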
The LTG was modified in 1962 with an additional bias input permanently set to 1. The bias input absorbs the threshold value, which is then set to 0. The modified LTG model was renamed the perceptron. The perceptron model allowed the threshold, T, to be moved to the same side of the inequality as $x_i \cdot w$, hence the equations become $x_i \cdot w < T \equiv x_i \cdot w - T < 0$ and $x_i \cdot w \geq T \equiv x_i \cdot w - T \geq 0$. Now, the threshold value can become another input into the neuron, with weight $w_0$, and fixing the input into the neuron to 1 ensures that it is always present, so $T = 1 \cdot w_0$. The weight $w_0$ is called the bias weight. So the equations become: $x_i \cdot w - w_0 < 0$ and $x_i \cdot w - w_0 \geq 0$.
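The equivalence between the threshold form and the bias form can be checked numerically. The sketch below is only an illustration: the helper names are hypothetical, and the bias is modelled as an extra constant input of 1 whose connection carries $-w_0$, which is one way of realising $x_i \cdot w - w_0 \geq 0$.

```python
from itertools import product


def ltg_output(x, w, threshold):
    """Threshold form: fire when x . w >= T."""
    net = sum(xj * wj for xj, wj in zip(x, w))
    return 1 if net >= threshold else 0


def perceptron_output(x, w, w0):
    """Bias form: a constant input of 1 absorbs the threshold, so the comparison is with 0."""
    augmented_x = (1,) + tuple(x)
    augmented_w = (-w0,) + tuple(w)  # assumption: the bias connection contributes -w0
    net = sum(xj * wj for xj, wj in zip(augmented_x, augmented_w))
    return 1 if net >= 0 else 0


def forms_agree(w, threshold):
    """Compare both forms on every binary input vector, taking w0 = T."""
    return all(
        ltg_output(x, w, threshold) == perceptron_output(x, w, threshold)
        for x in product((0, 1), repeat=len(w))
    )
```

For example, `forms_agree((0.5, -1.0, 2.0), 0.75)` evaluates both forms on all eight binary input vectors and reports whether they give the same output.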
In 1960, Rosenblatt focused attention on finding numeric values for weights using the perceptron model. From then until now, finding single numerical values for each of the weights in a neuron has been the established method of training neurons and NNs. There have been no attempts to directly find symbolic relationships between the weights and the thresholds, although it is recognised that the relationships formed by the neurons can be expressed using propositional logic. The rules within the data set that the NN learnt during training are encoded as numeric values, which may render them incomprehensible. There have been attempts to find the rules learnt by the NN from the numbers found for the weights and the thresholds. All these methods are an additional process after training and do not allow the rules to be read directly from the NN.
In 1962, Rosenblatt proved the convergence of the perceptron learning algorithm, which iteratively finds numbers that satisfy linearly separable data sets. The neuron learns by adapting the connection weights so that it will produce a desired output given specific input. Rosenblatt's training rule, as seen in equations 1.2 to 1.4, is that the weights, $w_j$, where $1 \leq j \leq n$ and n is the number of inputs into the perceptron, are modified based on the input, $x_i$, and a positive gain rate, $\eta$, where $0 \leq \eta \leq 1$; t is a time step. Rosenblatt's rule works for binary output. If the output of the perceptron for a particular input is correct, then do nothing:
$$w_j(t+1) = w_j(t) \qquad (1.2)$$
Otherwise, if the output is 0 and should be 1, then:
$$w_j(t+1) = w_j(t) + \eta x_i(t) \qquad (1.3)$$
Or, if the output is 1 and should be 0, then:
$$w_j(t+1) = w_j(t) - \eta x_i(t) \qquad (1.4)$$
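A minimal Python sketch of the training rule in equations 1.2 to 1.4 follows. It rests on several assumptions that are not stated in the text: the threshold is handled as a bias weight on a constant input of 1, the weights start at zero, the gain rate is 0.5, and the training set is the (linearly separable) logical AND function. The function names are illustrative.

```python
def predict(weights, x):
    """Perceptron output in bias form; weights[0] belongs to the constant input 1."""
    net = sum(wj * xj for wj, xj in zip(weights, (1,) + tuple(x)))
    return 1 if net >= 0 else 0


def train_perceptron(samples, eta=0.5, max_epochs=100):
    """Repeatedly apply equations 1.2 to 1.4 until every sample is classified correctly."""
    n = len(samples[0][0])
    weights = [0.0] * (n + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            if predict(weights, x) == target:
                continue  # equation 1.2: output is correct, leave the weights alone
            # equation 1.3 (output 0, should be 1) adds; equation 1.4 subtracts
            sign = 1 if target == 1 else -1
            for j, xj in enumerate((1,) + tuple(x)):
                weights[j] += sign * eta * xj
            mistakes += 1
        if mistakes == 0:
            break  # a full pass with no corrections: training has converged
    return weights


# Assumed training set: logical AND of two binary inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

if __name__ == "__main__":
    trained = train_perceptron(AND_SAMPLES)
    print([predict(trained, x) for x, _ in AND_SAMPLES])
```

Because the data set is linearly separable, the loop stops after a finite number of passes; the same loop applied to XOR would simply run until `max_epochs` is exhausted.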
The idea of iteratively adjusting weights has now become the established method of training feed-forward NNs.
In 1969, it was found that Rosenblatt's learning algorithm would not work for more complex data sets. Minsky and Papert demonstrated that a single layer Perceptron could not solve the famous exclusive or (XOR) problem. The reason why it would not work is because iteration was used to find a single point in the weight-space.
Not all Boolean functions can be learnt by a single LTG. There are $2^n$ combinations of the n input variables, and when combined with the possible output, it means there exist $2^{2^n}$ unique Boolean functions (otherwise known as switching functions). Of the $2^{2^n}$ functions, only some can be represented by a single n-input LTG. Those Boolean functions where the input space is linearly separable can be represented by a single LTG; however, additional LTGs are required to learn Boolean functions which are not linearly separable. XOR is an example of a Boolean function that is not linearly separable and hence cannot be learnt by a single LTG.
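For n = 2 the counting argument can be made concrete with a small brute-force search. The sketch below is illustrative only: it tries a small, arbitrarily chosen grid of weights and thresholds (an assumption), so it shows which functions that grid can realise rather than proving anything about all real-valued weights. The absence of XOR from the result is nevertheless consistent with the text.

```python
from itertools import product


def ltg_output(x, w, threshold):
    """Return 1 if x . w >= T, else 0."""
    net = sum(xj * wj for xj, wj in zip(x, w))
    return 1 if net >= threshold else 0


def representable_functions(n, weight_values, thresholds):
    """Collect every truth table produced by some (w, T) drawn from the given grids."""
    inputs = list(product((0, 1), repeat=n))
    found = set()
    for w in product(weight_values, repeat=n):
        for threshold in thresholds:
            found.add(tuple(ltg_output(x, w, threshold) for x in inputs))
    return inputs, found


# Assumed search grid; small but sufficient for two inputs.
INPUTS, FOUND = representable_functions(2, (-1, 0, 1), (-1.5, -0.5, 0.5, 1.5))
XOR_TABLE = tuple(a ^ b for a, b in INPUTS)

if __name__ == "__main__":
    print("Boolean functions of 2 inputs:", 2 ** (2 ** 2))
    print("realised by a single LTG on this grid:", len(FOUND))
    print("XOR realised:", XOR_TABLE in FOUND)
```

On this grid a single LTG realises 14 of the 16 two-input Boolean functions; the two that are missing are XOR and its complement.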
Using additional layers of LTGs would allow problems that are not linearly separable to be learnt by the NN, however, there was no training rule available that would allow the multiple layers of LTGs to be trained at the time.
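As an illustration of why additional layers help, the following sketch wires three LTGs into two layers so that the network computes XOR. The weights and thresholds are chosen by hand for this example (an assumption); they are not produced by any training rule, which is precisely the capability that was missing at the time.

```python
def ltg_output(x, w, threshold):
    """Return 1 if x . w >= T, else 0."""
    net = sum(xj * wj for xj, wj in zip(x, w))
    return 1 if net >= threshold else 0


# Hand-picked (weights, threshold) pairs; hypothetical values for illustration.
HIDDEN_LAYER = [
    ((1.0, 1.0), 0.5),     # behaves as OR
    ((-1.0, -1.0), -1.5),  # behaves as NAND
]
OUTPUT_LTG = ((1.0, 1.0), 1.5)  # behaves as AND of the two hidden outputs


def xor_network(x):
    """Feed the input through the hidden layer, then through the output LTG."""
    hidden = tuple(ltg_output(x, w, t) for w, t in HIDDEN_LAYER)
    w, t = OUTPUT_LTG
    return ltg_output(hidden, w, t)


if __name__ == "__main__":
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, "->", xor_network(x))
```

Each hidden LTG separates the input space with one line, and the output LTG combines the two half-planes, which a single LTG cannot do on its own.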
As a result, the McCulloch-Pitts model of the neuron was abandoned, as there was no iterative method to find numerical values for the weights and thresholds that would allow multiple layers of LTGs to be trained. This remained the case until backpropagation was developed.
In 1974, Werbos came up with the idea of error backpropagation (or “backpropagation”). Later, Parker in 1985 and Rumelhart, Hinton and Williams in 1986 arrived at the same algorithm, which allowed the multi-layer NN model to be trained to find numerical values for the weights iteratively. This allowed the XOR problem to be solved, as well as many other problems that the single layer perceptron could not solve. The McCulloch-Pitts neuron model was again modified to use the sigmoid function instead of the step function as its activation function. The mathematical definition of the sigmoid function is given in equation 1.5:
$$O = \frac{1}{1 + e^{-k x \cdot w}} \qquad (1.5)$$
The perceptron commonly uses the sigmoid function as its activation function. The term k controls the spread of the curve, and the sigmoid function approximates the step function: as $k \rightarrow \infty$, the output, O, tends to the step function. However, it is possible to use other activation functions such as $\tanh(k x \cdot w)$. This activation function is used if it is required that the NN can output negative numbers, as the range of the function goes from −1 to +1.
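The effect of the term k can be seen numerically. The sketch below is a simple illustration with arbitrarily chosen values of k and of the weighted sum (both assumptions); `net` stands for the quantity written as x·w in equation 1.5.

```python
import math


def sigmoid(net, k=1.0):
    """Equation 1.5 with net = x . w. Very large values of -k * net overflow math.exp."""
    return 1.0 / (1.0 + math.exp(-k * net))


def step(net):
    """Step activation with the threshold already absorbed, so the comparison is with 0."""
    return 1 if net >= 0 else 0


def tanh_activation(net, k=1.0):
    """Alternative activation whose output lies between -1 and +1."""
    return math.tanh(k * net)


if __name__ == "__main__":
    for k in (1, 10, 100):
        row = [round(sigmoid(net, k), 4) for net in (-1.0, -0.1, 0.1, 1.0)]
        print("k =", k, row)
```

As k grows, the sigmoid output for any non-zero weighted sum moves towards 0 or 1, matching the step function. At a weighted sum of exactly 0 the sigmoid stays at 0.5 for every k, which is the one point where it does not match the step output.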
Backpropagation is based on Rosenblatt's learning algorithm, which is described by equations 1.2 to 1.4. It is a supervised learning algorithm and works by applying an input vector to the input layer of the NN. The input layer distributes this input to the first hidden layer. The output of each neuron in a layer is calculated according to equation 1.5, which becomes the input into the subsequent layer. This process of calculating the output (or activation) of a layer of neurons which becomes the input to the subsequent layer is repeated until the output of the NN can be calculated. There will be some error between the actual output and the desired output and the weights are modified according to the amount of error. The error in the output is fed back, or propagated back, through the NN, by adjusting the connection weights from the connections into the output layer to the connections on the hidden layers in turn, in order to reduce the error in the NN. The amount the weights are adjusted is directly proportional to the amount of error in the units.
The backpropagation delta rule is given in equation 1.6, where i is the layer, j is the perceptron from which the connection originates in layer i−1, and k is the perceptron to which the connection goes in layer i:
$$w_{ijk}^{new} = w_{ijk}^{old} + \Delta w_{ijk} \qquad (1.6)$$
where
$$\Delta w_{ijk} = \eta \, \delta_{ijk} \, o_{ijk}$$
$\Delta w_{ijk}$ is the amount by which the weights are modified in an attempt to reduce the error in the numeric values on the weights in the NN. The amount that the weights are modified is based on the output of the neuron, $o_{ijk}$, the gain term, $\eta$, which is also called the learning rate, and the error in the output, $\delta_{ijk}$. The error in the NN is the difference between the actual output and the desired output of the NN.
When the NN is fully trained, it is said to be in a global minimum of the error function as the error in the NN is minimal. Since there are potentially many local minima in the error, the error can be thought of as a surface, which implies it can be a function. However the error function is not known for any NN. The error function can only be calculated empirically as it is based on the difference between the desired output and the actual output for all the input vectors applied to the NN. The term $\delta_{ijk}$ is the first derivative (the derivative is based on the difference in the error in the output) of the error function. It is the error function that is to be minimised as backpropagation tries to minimise the error in the NN. By taking the gradient (first derivative) it is possible to determine how to change the weights to minimise the error in the NN. This is called gradient-descent.
Backpropagation is required to work on a fixed-sized NN, as there are no allowances in the algorithm for adding or removing neurons from the NN. When training a NN to learn a data set, a guess is made at how many layers and how many neurons in each layer are required to learn the data. After training there may be attempts to improve the trained NN's performance by pruning out neurons that are not required. But during training the number of neurons must remain static.
The traditional backpropagation algorithm can be summarised as follows: (a) Initialisation: Define the number of layers and the number of neurons for each layer in the NN and initialise the NN's weights to random values; (b) Apply an input vector from the training set to the NN. Calculate the output, using equation 1.5, for each neuron in the first layer after the input layer, and use this output as input to the next layer. Repeat this process for each layer of the NN until the output is calculated; (c) Modify the weights according to how much error is present in the NN using equation 1.6; and (d) Repeat steps (b) and (c) until the NN is deemed trained. The NN is considered trained when the error falls below some arbitrary value for some number of input vectors in the training set.
While there are many benefits associated with training NNs to learn data sets using backpropagation, backpropagation has its limitations. With backpropagation the NN can take a long time to learn a data set or, worse still, it may never learn a data set at all. In some cases it may not be possible to determine why a NN could not learn the data set and/or it is not possible to distinguish during training whether the NN will ever learn the data set or if it is just taking a long time to learn.
With backpropagation the NN may be too small to learn the data. Traditionally, a NN designer must guess how many neurons to use in each hidden layer and also the number of hidden layers that are required to learn the data set. If the NN is too large then it may not be able to generalise properly. Hence, neurons are sometimes pruned from the NN in an attempt to alleviate this problem. The NN may get stuck in a local minimum of the error space. When the NN has learnt the data set, the NN is in a global minimum of the error space. As the shape of the error function is not known, it has areas of high error and low error. Since backpropagation only moves to minimise the error by examining the first derivative of the error function, it only examines the local region. The aim of training neurons in the hidden layer is to learn different features in the data set. However, when backpropagation propagates error back through the NN, all the weights are modified by some amount, thus possibly reducing each neuron's unique association with particular features in the data set. This is possible since a neuron cannot determine whether other neurons in the same layer are learning the same features. This can cause the weights that have learnt a specific data feature to forget the feature.
The main problem with training NNs with backpropagation is that it is not possible to distinguish which of the above reasons is the cause of the NN not learning a data set. It may be learning the data set but it is just slow, or it may never learn the data set because the NN is too small, or it may be stuck in a local minimum. A further and significant problem with backpropagation is that when the NN has learnt the data set, what the NN has learnt is incomprehensibly encoded in the weights and thresholds as numbers.
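Steps (a) to (d) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the network has one hidden layer of three neurons, k is fixed at 1 in equation 1.5, each neuron carries a bias weight, the error terms use the usual sigmoid-derivative form (which the text does not spell out), and the training data is XOR. All names, the learning rate and the stopping tolerance are illustrative.

```python
import math
import random


def sigmoid(net):
    """Equation 1.5 with k = 1 (assumption)."""
    return 1.0 / (1.0 + math.exp(-net))


def init_network(n_inputs, n_hidden, n_outputs, rng):
    """Step (a): fixed layer sizes, random initial weights (index 0 is the bias weight)."""
    def layer(n_in, n_out):
        return [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)] for _ in range(n_out)]
    return [layer(n_inputs, n_hidden), layer(n_hidden, n_outputs)]


def forward(network, x):
    """Step (b): return the activations of every layer, starting with the input itself."""
    activations = [list(x)]
    for layer in network:
        inputs = [1.0] + activations[-1]
        activations.append(
            [sigmoid(sum(w * v for w, v in zip(neuron, inputs))) for neuron in layer]
        )
    return activations


def backward(network, activations, target, eta):
    """Step (c): propagate the error back and change each weight by eta * delta * o."""
    deltas = [(t - o) * o * (1.0 - o) for t, o in zip(target, activations[-1])]
    for index in range(len(network) - 1, -1, -1):
        layer = network[index]
        inputs = [1.0] + activations[index]
        next_deltas = None
        if index > 0:
            # Error terms for the layer below, computed before these weights change.
            below = activations[index]
            next_deltas = [
                below[j] * (1.0 - below[j])
                * sum(d * neuron[j + 1] for d, neuron in zip(deltas, layer))
                for j in range(len(below))
            ]
        for neuron, d in zip(layer, deltas):
            for j, v in enumerate(inputs):
                neuron[j] += eta * d * v
        deltas = next_deltas


def total_error(network, samples):
    """Sum of squared differences between desired and actual outputs."""
    return sum(
        (t - o) ** 2
        for x, target in samples
        for t, o in zip(target, forward(network, x)[-1])
    )


def train(network, samples, eta=0.5, epochs=5000, tolerance=0.01):
    """Step (d): repeat (b) and (c) until the error is small or the epoch budget runs out."""
    for _ in range(epochs):
        for x, target in samples:
            backward(network, forward(network, x), target, eta)
        if total_error(network, samples) < tolerance:
            break
    return total_error(network, samples)


XOR_SAMPLES = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]

if __name__ == "__main__":
    net = init_network(2, 3, 1, random.Random(0))
    print("error before:", total_error(net, XOR_SAMPLES))
    print("error after: ", train(net, XOR_SAMPLES))
```

Note that the layer sizes are fixed before training starts and that `train` gives no indication of why the error has not yet fallen below the tolerance.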
Due to the difficulties of training NNs with backpropagation, much research has gone into developing alternative algorithms to train feed-forward NNs.
Many algorithms have been developed as an alternative to backpropagation for training feed-forward NNs. There are two classes of alternative algorithms, which are: (1) Algorithms that require a fixed number of neurons or resources in the NN; and (2) Those that allow neurons to be allocated dynamically to the NN.
Most of these algorithms rely on having a fixed-sized NN and as a result suffer the same problems backpropagation experiences. One known method uses genetic algorithms to find the values of the weights. Genetic algorithms may avoid the local minima problem but take an indefinite amount of time to train, and also may not train properly because the NN is too small. Another alternative method is to use Radial Basis Functions (RBF) which uses only a single layer to learn the NN, but requires many more input vectors available to it to learn a data set than backpropagation requires. As a result of the problems associated with fixed-sized NNs, it is useful to allow the NN to grow as required to learn the data set.
Feed-forward NN training algorithms which dynamically add neurons have been introduced as a solution to the problems of pre-defined structure, as they give the flexibility to add neurons only when necessary to ensure features in the data can be learnt. Hence a neuron is added when other neurons cannot learn particular features in the data set and as a result the trained NN can be used more effectively for ascertaining what rules have been learnt by the NN during training. A pre-defined network structure limits a NN's ability to learn data. NNs learn by adapting their weights, which correspond to synaptic weights in biological NNs. As discussed earlier, feed-forward NNs take their inspiration from biological NNs. However, biological NNs dynamically create connections to neurons as required.
There have been two approaches to structurally dynamic algorithms and these are: (1) Those that remove neurons from the NN. Two such approaches to removing neurons from a NN are: (i) Those that work during training such as Rumelhart's Weight Decay, which adds a penalty to the error minimization process; and (ii) The more common approach, those that remove neurons after training, such as Optimal Brain Surgeon, which calculates the impact on global error after removing a weight from the NN; and (2) Those that add neurons to the NN such as Cascade-Correlation Networks (hereinafter “CCN”), Dynamic Node Creation (hereinafter “DNC”), Meiosis and the class of hyperspherical classifiers such as, for example, Restricted Coulomb Energy Classifiers (hereinafter “RCEC”) and Polynomial-Time-Trained Hyperspherical Classifiers (hereinafter “PTTHCs”).
Though there have been many attempts to provide NN training algorithms that work by dynamically allocating neurons into a NN during training, it is considered that none are ideal for classifying data efficiently and/or accurately in a wide variety of circumstances.
The principal reason why NNs are of interest to science and/or industry is their ability to find relationships within data that allow the data to be classified, and then to successfully classify input vectors, or patterns, that the NN was not exposed to during training. This powerful property is often referred to as the NN's ability to ‘generalise’. The input vectors that the NN was not exposed to during training are commonly referred to as unseen patterns or unseen input vectors. For NNs to be able to generalise they require training.
During training a NN learns salient features in the data set it is trained with and can then ‘predict’ the output of unseen input vectors. What the NN can classify depends on what the NN has been trained with.
It is the NN's ability to generalise that allows the NN to deal with noise in the data.
To ensure good generalisation, it is thought that many more training input vectors must be available than the number of weights there are to be trained in the NN.
A NN is deemed trained when it can successfully classify a high ratio of the input vectors it has learnt and also of the test set. However there may only be a limited number of classified data patterns available to train and test the NN with, so it must be considered how to divide the data set. There are a number of approaches to dividing a data set so that the NN can be tested to determine how well it has been trained.
The general method of determining whether a NN is trained is by calculating how much error there is in each input vector when using NNs trained with backpropagation. A skilled person will appreciate the approaches that have previously been used to ascertain the error in a NN, and as such a detailed discussion of same will not be provided herein.
The attributes that can be used as grounds of comparison between training algorithms will, however, now be discussed.
There are a number of factors that may be considered when comparing learning algorithms so there is an objective measure of the performance.
Typically, in comparisons, the following four attributes of learning algorithms are considered: (1) Accuracy: This is the reliability of the rules learnt during training; (2) Speed: This is a measure of how long it takes for an input vector to be classified; (3) Time to learn: This is a measure of how long it takes to learn an input vector; and (4) Comprehensibility: This is the ability to be able to interpret the rules learnt so the rules can be applied in alternative methods. This attribute is difficult to quantify.
Two of these attributes will be further examined, that of the learning algorithm's time required to learn a data set and the comprehensibility of what has been learnt by the NN.
As discussed earlier, training a NN to learn a data set with backpropagation may require a long time, and it is possible that the NN may never learn a data set. It has been said that the time it takes to train a fixed-size NN may be exponential. For this reason, how long it takes to train a NN has become a standard of comparison between alternative training algorithms. An ideal training algorithm would require minimal exposure to training input vectors. The minimum possible exposure to training input vectors in the optimal situation would be to expose the NN to each input vector only once to be fully trained. Such a training algorithm can be referred to as a single pass training algorithm.
Of the four attributes commonly used as a basis for comparison between algorithms that train feed-forward NNs, comprehensibility is the least quantifiable, especially for feed-forward NNs trained as numerical values, as the rules learnt by NNs during training are incomprehensibly encoded as numerical values. One method of being able to extract the rules learnt during training is by performing a sensitivity analysis. A sensitivity analysis can be referred to as a measure of robustness against errors.
Rule extraction is of interest as it gives users confidence in the results produced by the system, and this is especially important when the NN is used in critical problem domains such as medical surgery, air traffic control and monitoring of nuclear power plants, or when theories are deduced from collected data by training NNs, such as in the case of astronomical data.
The rules that are desirable to guarantee comprehensibility are in the form of propositional logic rules relating the input together.
Sensitivity analyses are often performed on NNs, as this is one way of finding out what information has been stored within the NN. Because the rules are encoded, often incomprehensibly, as numeric values, and it is often desirable to find out what rules have been learnt by the NN, a sensitivity analysis is an invaluable tool.
There are two approaches that can be taken when performing a sensitivity analysis on a NN. These are: (1) The effect of modifying the weights; and (2) The effect of applying noisy input to the NN.
If the input space is well known, then it is possible to generate as many data points as necessary, and then find the output of the NN for input vectors chosen by the following three methods: (1) Finding the output for every point in the data space. If the NN is trained with binary data, the data set is necessarily finite; (2) Randomly choosing data points from the input space; or (3) Selecting every nth data point (where n>1) in the input space. This allows an even distribution over the input space.
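The three ways of choosing input vectors can be sketched for a binary input space as follows. The function names are hypothetical, and `probe` accepts any callable standing in for a trained NN; the stand-in model used in the demonstration is an assumption made purely for illustration.

```python
import random
from itertools import islice, product


def every_point(n):
    """Method (1): every point of the n-dimensional binary input space (2**n vectors)."""
    return list(product((0, 1), repeat=n))


def random_points(n, count, rng):
    """Method (2): input vectors drawn at random from the input space."""
    return [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(count)]


def every_nth_point(n, step):
    """Method (3): every step-th vector, taken in the enumeration order of method (1)."""
    return list(islice(product((0, 1), repeat=n), 0, None, step))


def probe(model, points):
    """Record the output of the model (any callable) for each chosen input vector."""
    return {x: model(x) for x in points}


if __name__ == "__main__":
    # Stand-in for a trained NN: fires when at least two inputs are on (assumption).
    stand_in = lambda x: 1 if sum(x) >= 2 else 0
    print(probe(stand_in, every_point(3)))
    print(probe(stand_in, random_points(3, 4, random.Random(1))))
    print(probe(stand_in, every_nth_point(3, 3)))
```

Exhaustive enumeration is only practical for small n, since the number of points doubles with every additional binary input; the other two methods trade completeness for a bounded number of evaluations.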
Data points can also be selected from regions of the input space where it is not known what the desired NN response will be. In this case, it will show how the NN will respond when given unknown data.
Now that it has been examined how to explore the input-space, the weight-space of neurons in a NN will now be examined.
A system has a number of components that are required to perform as specified which in turn allows the system to perform as required. When each component is performing as specified then the components are said to be in their optimal range.
A sensitivity analysis is an examination of the effect of departing from optimal values or ranges for the components in the system. In this case, the optimal ranges are for the weights in a trained NN. The upper and lower limits are established to find the range (or interval) the weights can vary over without changing the behaviour, in this case, of the NN. To perform a sensitivity analysis, each component in the system is tested in turn while all the other components remain static. The component being tested will be set at all possible values to determine how the system performs. During this process upper and/or lower limits are ascertained for the component which allow the system to behave optimally and it can be observed how the system behaves when the component moves out of these ranges. This process is called ranging. The upper and lower limits can be expressed as constraints.
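Ranging can be illustrated on a single LTG. In the sketch below, one weight is swept over a grid of candidate values while the other weight and the threshold stay fixed, and the values that leave the LTG's output unchanged for every binary input are recorded. The LTG (a hand-picked AND gate), the candidate grid and the function names are all assumptions made for illustration; a grid search only approximates the true limits to within the grid spacing.

```python
from itertools import product


def ltg_output(x, w, threshold):
    """Return 1 if x . w >= T, else 0."""
    net = sum(xj * wj for xj, wj in zip(x, w))
    return 1 if net >= threshold else 0


def behaviour(w, threshold):
    """The LTG's output for every binary input vector, used as its behavioural signature."""
    return tuple(ltg_output(x, w, threshold) for x in product((0, 1), repeat=len(w)))


def weight_range(w, threshold, index, candidates):
    """Sweep one weight while the rest stay static; return the (lower, upper) limits found."""
    reference = behaviour(w, threshold)
    unchanged = []
    for value in candidates:
        trial = list(w)
        trial[index] = value
        if behaviour(trial, threshold) == reference:
            unchanged.append(value)
    return (min(unchanged), max(unchanged)) if unchanged else None


if __name__ == "__main__":
    grid = [i * 0.25 for i in range(-12, 13)]  # candidate values from -3.0 to 3.0
    print(weight_range((1.0, 1.0), 1.5, 0, grid))
```

For this AND gate the first weight can move between 0.5 and 1.25 on the grid without changing any output; below 0.5 the input (1, 1) no longer activates the LTG, and at 1.5 the input (1, 0) starts to activate it. These two limits are examples of the constraints referred to above.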
It is considered that known sensitivity analyses do not generate propositional logic rules that relate the input variables together that will make what a NN has learnt comprehensible.
The objective of a sensitivity analysis is to be able to determine the shape of the volume, as this precisely defines the behaviour of a component. However, it has not been possible to find the surfaces of the volume that cause the neuron to activate due to limitations of known NN training methods. The only way it has been possible to examine the surfaces is by determining the range of each of the weights with statistical methods. Knowledge of the actual surfaces of the volume would be ideal since they define the relationships that exist between the weights and from this the ranges of the weights can be determined if desired.
It is highly desirable to be able to determine what a feed-forward NN has learnt during training and as a result much research has been done on trying to ascertain what relationships exist within data and have been learnt by a NN. This has been called comprehensibility and is one attribute that contributes to determining how good a training algorithm is. The methods currently used to extract rules from the NN are performed after training has been completed.
The types of relationships that are desirable and required to be found are given as propositional logic. These requirements can be summarised by the following: (a) One that will define all the numeric solutions that satisfy the training conditions, and thus allows a sensitivity analysis to be performed on the NN easily; and (b) One that will allow the rules learnt by the NN during training to classify the data set to be easily read from the NN.
Of the known training algorithms mentioned above relating to various dynamic algorithms, the only ones that come close to allowing rules to be read directly from the NN are the hyperspherical classifiers, which form OR relationships between the regions. Hence regions cannot be combined with AND, as the regions in the input space either belong to a certain category or do not. If they do not belong in the region then a sphere is added to suppress the activation of neurons that should not activate, hence OR is adequate to express the input space. The radius that defines the hyperspheres tends to 0 as the input space becomes complex and ultimately a hypersphere is added for each input vector. Although the regions defined by the neurons in the hidden layers approximate regions in the input space, they do not define it, except in the worst case where there are as many hyperspheres as data points. PTTHCs attempt to improve the coverage of the input space, and thus improve generalisation performance, at the expense of computational complexity, and hence are much slower.
CCN, Meiosis and DNC all train the weights as numbers and hence it is not easy to determine what relationships have been found within the data during training.
All of these algorithms dynamically allocate neurons to the NN with varying degrees of performance success with regard to generalisation. Some algorithms are better at some data sets than others, and all except the hyperspherical classifiers lose boundary condition information of the weight-space, and hence are not very useful for rule extraction.
Some algorithms learn some data sets more quickly than others; the Meiosis algorithm, for example, is based on annealing and tends to be slower even than backpropagation.
CCN and DNC are reported to have fast training times for specific data sets, but these are not single pass algorithms, as both rely on iteration to reduce the amount of error in the system before neurons are added into the NN.
As yet there has been no NN training algorithm that learns in a single pass that also adds neurons to the NN as required and allows rules to be read directly from the NN.
It is therefore an object of the present invention to provide a NN training method that is both relational and dynamic, in the sense that neurons can be allocated into a NN as required to learn a data set.
A further object of the present invention is to provide a NN training method that can learn a data set in a single pass.
Yet a further object of the present invention is to provide a NN training method that allows rules to be read directly from a NN.