The present invention relates to computer-based classification techniques, which are used to identify members of groups of interest within datasets.
Classification and pattern recognition techniques have wide-reaching applications. A number of life science applications use classification techniques to identify members of groups of interest within clinical datasets. For example, an important life science application is concerned with distinguishing the protein signatures of patients who have some type of cancer from those of patients who do not. This problem stems from the need in clinical trials to test the efficacy of a drug in curing cancer while the cancer is at an early stage. In order to do so, one needs to be able to identify patients who have cancer at an early stage.
Conventional diagnosis techniques are not sufficient for this application. A popular technique (from an area that has become known as “proteomics”) is to analyze mass spectra, which are produced by a mass spectrometer from serum samples of patients. Depending on the type of cancer, the mass spectra of serum samples can show distinct signatures, which are not immediately visible to the naked eye. Several existing data mining techniques are presently used to distinguish the cancer spectra from the normal ones, such as Naïve Bayes, Decision Trees, techniques based on Principal Component Analysis, Neural Networks, etc.
However, these existing techniques are characterized by false-alarm and missed-alarm probabilities that are not sufficiently small. This is a problem because false alarms can cause patients to experience anxiety, and can cause them to submit to unnecessary biopsies or other procedures, while missed alarms can result in progression of an undetected disease.
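The two error probabilities mentioned above can be estimated from a classifier's confusion counts. The following is a minimal sketch only: the function name `alarm_rates` and the counts used in the example are hypothetical and are not taken from any dataset discussed here.

```python
def alarm_rates(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Estimate false-alarm and missed-alarm probabilities from counts.

    tp: cancer spectra classified as cancer
    fp: normal spectra classified as cancer (false alarms)
    tn: normal spectra classified as normal
    fn: cancer spectra classified as normal (missed alarms)
    """
    false_alarm = fp / (fp + tn)   # fraction of normal spectra flagged as cancer
    missed_alarm = fn / (fn + tp)  # fraction of cancer spectra not flagged
    return false_alarm, missed_alarm


# Hypothetical counts, for illustration only.
fa, ma = alarm_rates(tp=45, fp=10, tn=90, fn=5)
print(f"false-alarm probability:  {fa:.2f}")
print(f"missed-alarm probability: {ma:.2f}")
```
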
Support Vector Machines (SVMs) provide a new approach to pattern classification problems. SVM-based techniques are particularly attractive for the cancer classification problem because they operate robustly on high-dimensional feature data, unlike other techniques whose resource requirements are closely coupled to the feature dimensions.
However, the application of SVMs in areas involving huge datasets, such as in proteomics, is constrained by extremely high computation cost, in terms of both the compute cycles needed and the enormous physical memory requirements. For large datasets, which are not unusual in most life sciences problems, a quadratic optimization problem that arises during the training phase of the SVMs requires that one be able to keep in memory an N×N matrix, where N is the number of data vectors. This presents huge challenges for conventional high-end enterprise computer servers when the input datasets contain thousands or tens of thousands of data vectors. In addition, the training time for the algorithm grows polynomially in N. Current state-of-the-art research papers propose using heuristic, data-level decomposition approaches; but these heuristic approaches are often designed with little or no quantitative justification and often yield suboptimal results.
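To give a rough sense of the memory requirement described above, the following sketch computes the storage needed for a dense N×N matrix. It assumes 8-byte double-precision entries and no exploitation of symmetry or sparsity; both are assumptions made for illustration, and the function name `dense_matrix_bytes` is hypothetical.

```python
def dense_matrix_bytes(n_vectors: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed to hold a dense n_vectors x n_vectors matrix.

    Assumes every entry is stored explicitly (no symmetry or sparsity
    savings) at bytes_per_entry bytes each.
    """
    return n_vectors * n_vectors * bytes_per_entry


# Footprint for dataset sizes in the range mentioned in the text.
for n in (1_000, 10_000, 50_000):
    gib = dense_matrix_bytes(n) / 2**30
    print(f"N = {n:>6}: {gib:.2f} GiB")
```

Because the footprint grows with the square of N, a tenfold increase in the number of data vectors increases the storage requirement a hundredfold.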