The reference to any prior art in this specification is not, and should not be, taken as an acknowledgement or any form of suggestion that the prior art forms part of the common general knowledge.
A decision machine is a universal learning machine that, during a training phase, determines a set of parameters and vectors that can be used to classify unknown data. For example, in the case of the Support Vector Machine (SVM) the set of parameters consists of a kernel function and a set of support vectors with corresponding multipliers that define a decision hyperplane. The set of support vectors is selected from a training population of vectors.
In the case of a decision machine operating according to one of Principal Component Analysis, Kernel Principal Component Analysis (KPCA), Independent Component Analysis (ICA) and Linear Discriminant Analysis (LDA), a subspace and a corresponding basis is determined that can be used to determine the distance between two different data vectors and thus the classification of unknown data. Bayesian Intrapersonal/Extrapersonal Classifiers classify according to a statistical analysis of the differences between the groups being classified.
Subsequent to the training phase all of these decision machines operate in a testing phase during which they classify test vectors on the basis of the decision vectors and parameters determined during the training phase. For example, in the case of a classification SVM the classification is made on the basis of the decision hyperplane previously determined during the training phase. A problem arises, however, because the complexity of the computations that must be undertaken to make a decision scales with the number of support vectors used and with the number of features to be examined (i.e. the length of the vectors). Similar difficulties are also encountered in the practical application of most other learning machines.
Decision machines find application in many and varied fields. For example, in an article by S. Lyu and H. Farid entitled “Detecting Hidden Messages using Higher-Order Statistics and Support Vector Machines” (5th International Workshop on Information Hiding, Noordwijkerhout, The Netherlands, 2002) there is a description of the use of an SVM to discriminate between untouched and adulterated digital images.
Alternatively, in a paper by H. Kim and H. Park entitled “Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3d local descriptor” (Proteins: structure, function and genetics, 2004 Feb. 15; 54(3):557-62) SVMs are applied to the problem of predicting high resolution 3D structure in order to study the docking of macro-molecules.
In order to develop this method for feature reduction the mathematical basis of an SVM will now be explained. It will however be realised that methods according to embodiments of the present invention are applicable to other decision machines including those mentioned previously.
An SVM is a learning machine that, given m input vectors xi ∈ ℝd drawn independently from the probability distribution function p(x), each with an output value yi, returns an estimated output value f(x)=y for any vector x not in the input set.
The (xi, yi), i=1, . . . , m are referred to as the training examples. The resulting function f(x) determines the hyperplane which is then used to estimate unknown mappings.
FIG. 1 illustrates the above training method. At step 24 the support vector machine receives the vectors xi of a training set, each with a pre-assigned class yi. At step 26 the vector machine transforms the input data vectors xi by mapping them into a multi-dimensional space. Finally, at step 28 the parameters of the optimal multi-dimensional hyperplane defined by f(x) are determined. Each of steps 24, 26 and 28 of FIG. 1 is well known in the prior art.
With some manipulation of the governing equations the support vector machine can be phrased as the following Quadratic Programming problem:

min W(α) = ½αTΩα − αTe  (1)

where

Ωi,j = yiyjK(xi, xj)  (2)

e = [1, 1, 1, 1, . . . , 1]T  (3)

subject to

0 = αTy  (4)

0 ≦ αi ≦ C  (5)

where C is some regularization constant.  (6)
The K(xi,xj) is the kernel function and can be viewed as a generalised inner product of two vectors. The result of training the SVM is the determination of the multipliers αi.
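By way of non-limiting illustration, the construction of the matrix Ω of equation (2) from a kernel function may be sketched as follows. The Gaussian (RBF) kernel and the function names used here are illustrative only and are not part of the specification.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: a generalised inner product of two vectors."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def build_omega(X, y, kernel):
    """Build the matrix Omega with entries y_i * y_j * K(x_i, x_j),
    as in equation (2)."""
    m = len(X)
    omega = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            omega[i, j] = y[i] * y[j] * kernel(X[i], X[j])
    return omega
```

Since any valid kernel is symmetric, the resulting matrix Ω is symmetric, which is what makes the quadratic program of equation (1) well posed.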
Suppose we train an SVM classifier with pattern vectors xi, and that r of these vectors are determined to be support vectors. Denote them by xi, i=1, 2, . . . , r. The decision hyperplane for pattern classification then takes the form
f(x) = Σi=1…r αiyiK(x, xi) + b  (7)
where αi is the Lagrange multiplier associated with pattern xi and K(.,.) is a kernel function that implicitly maps the pattern vectors into a suitable feature space. The b can be determined independently of the αi. FIG. 2 illustrates in two dimensions the separation of two classes by hyperplane 30. Note that all of the x's and o's contained within a rectangle in FIG. 2 are considered to be support vectors and would have associated non-zero αi.
Given equation (7), an unclassified sample vector x may be classified by calculating f(x) and then returning −1 for all values less than zero and 1 for all values greater than zero.
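This classification step may be sketched, by way of example only, as a direct evaluation of equation (7) followed by a sign test. The linear kernel and the function names below are illustrative assumptions, not part of the specification.

```python
import numpy as np

def linear_kernel(x, z):
    """Ordinary inner product, the simplest kernel function."""
    return float(np.dot(x, z))

def classify(x, support_vectors, alphas, labels, b, kernel):
    """Evaluate f(x) of equation (7) over the r support vectors and
    return the class label: 1 if f(x) > 0, otherwise -1."""
    f = sum(a * yi * kernel(x, sv)
            for a, yi, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if f > 0 else -1
```

Note that the cost of each classification grows with the number of support vectors r and with the dimension of x, which is the scaling problem described above.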
FIG. 3 is a flow chart of a typical method employed by prior art SVMs for classifying vectors xi of a testing set. At box 34 the SVM receives a set of test vectors. At box 36 it transforms the test vectors into a multi-dimensional space using the support vectors as parameters in the kernel function. At box 38 the SVM generates a classification signal from the decision surface to indicate the membership status, member of a first class “1” or of a second class “−1”, of each input data vector. Steps 34 through 38 are defined in the literature and by equation (7).
It will be realised that in both the training and testing phases, the computational complexity of the operations needed to define the hyperplane, and to subsequently classify input vectors, is at least in part dependent on the size of the vectors xi. The size of the vectors xi is in turn dependent upon the number of features being examined in the problem from which the xi are derived.
In the early phase of learning machine research and development few problems involved more than 40 features. However, it is now relatively common for problems involving hundreds to tens of thousands of variables or features to be addressed. Consequently the computations required to determine the decision surface, and to perform classification, have increased.
An example of this sort of problem is the classification of undesired email or “spam” versus normal email. If the words or phrases used in the messages are used for classification then the number of features can equal the number of commonly used words. This number for an adult English speaker can easily exceed 5 to 10 thousand words. If we add misspellings of common words and proper and generic names of drugs and other products then this list of features can easily exceed 50 thousand words. The actual features (words or phrases) that are needed to separate spam from normal email may be considerably fewer than the total number of features. For example, the word “to” will not add to the determination of a decision surface, but will be evident in many emails.
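By way of non-limiting illustration, mapping an email message to such a word-based feature vector may be sketched as follows, where each feature is the count of one vocabulary word in the message. The vocabulary and function name are illustrative only.

```python
def word_count_features(message, vocabulary):
    """Map an email message to a feature vector of word counts
    over a fixed vocabulary: one feature per vocabulary word."""
    words = message.lower().split()
    return [words.count(w) for w in vocabulary]
```

With a vocabulary of tens of thousands of words, each message yields a vector of the same length, illustrating why the dimension of the xi, and hence the classification cost, becomes large.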
The problem of dealing with a very large number of features is discussed in a paper by Guyon and Elisseeff, entitled “An introduction to variable and feature selection”, Journal of Machine Learning Research, 3, 1157-1182, 2003. In that paper the authors explain that “There are many potential benefits of variable and feature selection: facilitating data visualization and data understanding, reducing the measurement and storage requirements, reducing training and utilization times, defying the curse of dimensionality to improve prediction performance.” The authors of the article go on to state that they are unaware of any direct method for feature selection in the case of nonlinear learning systems.
It is an object of the invention to provide a method for feature selection that provides one or more of the potential benefits described above.