1. Field of the Invention
The present invention relates to computer-based classification techniques, which are used to identify members of groups of interest within data sets. More specifically, the present invention relates to a method and apparatus for reducing the execution time for parallel support vector machine (SVM) computations.
2. Related Art
Classification and pattern recognition techniques have wide-reaching applications. For example, a number of life science applications use classification techniques to identify members of groups of interest within clinical datasets. In particular, one important application involves distinguishing the protein signatures of patients who have a certain type of cancer from the protein signatures of patients who do not. This problem stems from the need in clinical trials to test the efficacy of a drug in curing cancer while the cancer is at an early stage. In order to do so, one needs to be able to identify patients who have cancer at an early stage.
Conventional diagnostic techniques are not sufficient for this application. A popular technique (from an area that has become known as “proteomics”) is to analyze mass spectra, which are produced by a mass spectrometer from serum samples of patients. Depending on the type of cancer, the mass spectra of serum samples can show distinct “signatures,” which are not immediately visible to the naked eye. Several existing data mining techniques are presently used to distinguish the cancer spectra from the normal ones, such as Naïve Bayes, Decision Trees, Principal-Components-Analysis-based techniques, Neural Networks, etc.
However, these existing techniques are characterized by false-alarm and missed-alarm probabilities that are not sufficiently small. This is a problem because false alarms can cause patients to experience anxiety, and can cause them to submit to unnecessary biopsies or other procedures, while missed alarms can result in progression of an undetected disease.
Support Vector Machines (SVMs) provide a new approach to pattern classification problems. SVM-based techniques are particularly attractive for the cancer classification problem because they operate robustly on high-dimensional feature data, unlike other techniques whose resource requirements are closely coupled to the number of feature dimensions.
SVMs can be used in large-scale pattern classification problems, in proteomics problems, and in genomics problems. For example, SVMs can be used to analyze a large database of proteomics “fingerprints” to separate patients who responded to a drug in clinical trials from those who did not respond to the drug. SVMs can also be used to analyze a large database of computer telemetry signals to separate servers that experienced a syndrome from those that did not experience the syndrome.
Another application in which SVMs can be used is to accurately predict the likelihood of customer escalation for reported bugs. For this application, classification is performed using data such as customer information, server type, software patch levels, and the nature of the problem. Yet another application uses pattern classification to predict system outage risk. Data such as system configuration information and reliability, availability, and serviceability (RAS) data are used.
During operation, SVMs find a hypersurface in the space of possible inputs such that the positive examples are split from the negative examples. The split in a multi-dimensional feature space is chosen to have the largest distance from the hypersurface to the nearest of the positive and negative examples. The nearest examples, which are vectors in multi-dimensional space, are called “support vectors.”
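The notion of the “nearest examples” can be illustrated with a minimal sketch. It assumes the simplest case, in which the hypersurface is a flat hyperplane w·x + b = 0 that has already been found; the function names, the example points, and the values of w and b are hypothetical and serve only as an illustration.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def nearest_examples(w, b, examples, tol=1e-9):
    """Return the smallest distance to the hyperplane and the examples at that distance."""
    dists = [abs(signed_distance(w, b, x)) for x, _label in examples]
    margin = min(dists)
    nearest = [ex for ex, d in zip(examples, dists) if d - margin <= tol]
    return margin, nearest

# Hypothetical labeled examples: (vector, label).
examples = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]
# Hypothetical separating hyperplane x1 + x2 - 2 = 0 (assumed given, not computed here).
w, b = (1.0, 1.0), -2.0

margin, support = nearest_examples(w, b, examples)
print(margin, support)
```

In this sketch the examples (2, 2) and (0, 0) lie closest to the hyperplane, one on each side, and therefore play the role of support vectors. An actual SVM would search for the w and b that make this smallest distance as large as possible.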
SVMs operate in two stages: (1) during a kernel mapping stage, a kernel function is used to map the data from a low-dimensional input space to a higher-dimensional feature space, and (2) during a subsequent optimization stage, the system attempts to find an optimal separating hypersurface in the feature space.
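The kernel mapping stage can be illustrated with a small sketch. It uses a degree-2 polynomial kernel as an assumed example (no particular kernel is specified above) and shows that evaluating the kernel in a two-dimensional input space gives the same value as an inner product in a three-dimensional feature space. The names `phi`, `dot`, and `poly2_kernel` are hypothetical.

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit map from 2-D input space to 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x.y)^2, evaluated in input space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))   # inner product computed in feature space
implicit = poly2_kernel(x, y)    # same quantity without forming phi
print(explicit, implicit)
```

Both quantities agree up to floating-point rounding, which is why the optimization stage can work with kernel values alone rather than with explicit feature-space coordinates.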
SVM memory requirements scale roughly with the square of the number of input vectors, whereas the central processing unit (CPU) requirements scale with the number of input vectors to the 2.8 power. Therefore, it is advantageous to distribute SVM computations across multiple computing nodes. In such a parallel distributed environment, the data is partitioned into pieces or “chunks” and the SVM program operates on each chunk in parallel on separate computing nodes. The results from each chunk are subsequently aggregated and the SVM is run one more time on the combined results to arrive at the final solution.
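The benefit of chunking follows from the scaling exponents stated above. The following sketch estimates the per-chunk cost relative to the undivided problem and shows one simple way to partition the input vectors. It assumes equally sized chunks and ignores the cost of the final run on the combined results; the function names and the round-robin partitioning scheme are hypothetical.

```python
def relative_chunk_costs(n_chunks, cpu_exponent=2.8, mem_exponent=2.0):
    """Per-chunk CPU and memory cost as a fraction of the undivided problem.

    Assumes equal-sized chunks and the scaling exponents quoted in the text.
    """
    fraction = 1.0 / n_chunks
    return {
        "cpu": fraction ** cpu_exponent,
        "memory": fraction ** mem_exponent,
    }

def partition(vectors, n_chunks):
    """Split the input vectors into n_chunks pieces in round-robin order."""
    return [vectors[i::n_chunks] for i in range(n_chunks)]

costs = relative_chunk_costs(4)
chunks = partition(list(range(10)), 3)
print(costs)
print(chunks)
```

With four chunks, each chunk needs one sixteenth of the memory and roughly 2% of the CPU work of the full problem, so even the sum over all four chunks is far below the cost of a single undivided run.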
Unfortunately, the execution time of the optimization step in the SVM application is unpredictable and can result in one or more chunks which take a disproportionately long time to complete, thereby adversely affecting the completion time of the whole SVM program. In fact, individual computing nodes can get stuck in their optimization step for so long that the distributed problem takes longer to complete than if the entire problem were run on a sequential machine.
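The effect of a single slow chunk can be seen with a small calculation. In this sketch the parallel phase ends only when the slowest chunk finishes, after which the final run on the combined results begins. All timing values are hypothetical and chosen purely for illustration.

```python
def distributed_time(chunk_times, final_pass_time):
    """Wall-clock time when chunks run in parallel, followed by one final run."""
    return max(chunk_times) + final_pass_time

# Hypothetical timings in seconds.
sequential = 200.0                                                  # whole problem on one machine
balanced = distributed_time([10.0, 11.0, 9.0, 10.0], 5.0)           # all chunks well behaved
with_straggler = distributed_time([10.0, 11.0, 9.0, 400.0], 5.0)    # one chunk stuck in optimization

print(balanced, sequential, with_straggler)
```

With these assumed numbers the balanced run takes 16 seconds, whereas the run with one stuck chunk takes 405 seconds, which is longer than the 200-second sequential run.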
Hence, what is needed is a method and an apparatus for limiting the execution time of SVM computations in such situations.