1. Field of the Invention
The present invention relates to computer-based classification techniques, which are used to identify members of groups of interest within data sets. More specifically, the present invention relates to a method and an apparatus that uses a parallel genetic computational technique to optimize kernel parameters for a support vector machine (SVM), wherein the SVM is subsequently used to classify members of a data set.
2. Related Art
Classification and pattern recognition techniques have wide-reaching applications. A number of life science applications use classification techniques to identify members of groups of interest within clinical data sets. For example, an important life science application is concerned with distinguishing the protein signatures of patients who have some type of cancer from those of patients who do not. This problem stems from the need in clinical trials to test the efficacy of a drug in curing cancer while the cancer is at an early stage. In order to do so, one needs to be able to identify patients who have cancer at an early stage.
Conventional diagnosis techniques are not sufficient for this application. A popular technique (from an area that has become known as “proteomics”) is to analyze mass spectra, which are produced by a mass spectrometer from serum samples of patients. Depending on the type of cancer, the mass spectra of serum samples can show distinct signatures, which are not immediately visible to the naked eye. Several existing data mining techniques are presently used to distinguish the cancer spectra from the normal ones, including Naïve Bayes, Decision Trees, Principal-Components-Analysis based techniques, and Neural Networks.
However, these existing techniques are characterized by false-alarm and missed-alarm probabilities that are not sufficiently small. This is a problem because false alarms can cause patients to experience anxiety, and can cause them to submit to unnecessary biopsies or other procedures, while missed alarms can result in progression of an undetected disease.
Support Vector Machines (SVMs) provide a new approach to pattern classification problems. SVM-based techniques are particularly attractive for the cancer classification problem because they operate robustly for high-dimensional feature data, unlike other techniques, which have resource requirements that are closely coupled with feature dimensions.
SVM-based techniques typically use a “kernel function” to map the data set of interest from a low-dimensional input space to a higher-dimensional feature space. During this process, these techniques typically select a set of parameters for the kernel function so as to minimize the number of misclassifications that arise while classifying the data set.
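As an illustration of what a kernel parameter is, the following minimal sketch evaluates one commonly used kernel, the Gaussian (RBF) kernel. The choice of this kernel, the parameter name gamma, and the helper names rbf_kernel and gram_matrix are assumptions made for illustration only; the text above does not specify a particular kernel function.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2).

    gamma is the kernel parameter (assumed name). Larger values make
    the similarity between two points fall off faster with distance.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, gamma):
    """Pairwise kernel values for a list of input vectors.

    An SVM works with these values rather than with explicit
    coordinates in the higher-dimensional feature space.
    """
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


if __name__ == "__main__":
    samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # made-up inputs
    for gamma in (0.1, 1.0, 10.0):
        print(gamma, gram_matrix(samples, gamma)[0])
```

Changing gamma changes every entry of the kernel matrix, and therefore the decision boundary the SVM learns, which is why the number of misclassifications depends on the parameter values selected.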
Unfortunately, there presently exists no systematic technique for selecting the optimal kernel parameters for an SVM on a given data set. Consequently, SVM kernel parameters are typically optimized through time-consuming manual operations.
Hence, what is needed is a method and an apparatus that optimizes SVM kernel parameters without the problems described above.