The present invention relates to techniques for computer-based classification, which can be used to identify members of groups of interest within datasets.
Classification and pattern recognition techniques have wide-reaching applications. For example, a number of life science applications use classification techniques to identify members of groups of interest within clinical datasets. In particular, one important application involves distinguishing the protein signatures of patients with a certain type of cancer from the protein signatures of patients who do not have that cancer. This problem stems from the need in clinical trials to test the efficacy of a drug in curing cancer while the cancer is at an early stage. In order to do so, one needs to be able to identify patients who have cancer at an early stage.
Conventional diagnostic techniques are not sufficient for this application. A popular technique (from an area that has become known as “proteomics”) is to analyze mass spectra, which are produced by a mass spectrometer from serum samples of patients. Depending on the type of cancer, the mass spectra of serum samples can show distinct signatures, which are not immediately visible to the naked eye. Several existing data mining techniques, such as Naïve Bayes, Decision Trees, Principal-Components-Analysis-based techniques, and Neural Networks, are presently used to distinguish the cancer spectra from the normal ones.
However, these existing techniques are characterized by false-alarm and missed-alarm probabilities that are not sufficiently small. This is a problem because false alarms can cause patients to experience anxiety, and can cause them to submit to unnecessary biopsies or other procedures, while missed alarms can result in progression of an undetected disease.
Support Vector Machines (SVMs) provide a new approach to pattern classification problems. SVM-based techniques are particularly attractive for the cancer classification problem because they operate robustly on high-dimensional feature data, unlike other techniques whose resource requirements are closely coupled with the feature dimensions.
However, the application of SVMs in areas involving huge datasets, such as proteomics, is constrained by extremely high computation cost, in terms of both the compute cycles needed and the enormous physical memory requirements.
For example, a quadratic optimization problem arises during the training phase of SVMs for large datasets, which are common in most life sciences problems. Such a quadratic optimization problem typically requires enough memory to accommodate an N×N matrix, where N is the number of data vectors. This creates huge challenges for conventional high-end enterprise computer servers when the input datasets contain thousands or tens of thousands of data vectors. In addition, the training time for the algorithm grows in a manner that is polynomial in N. Current state-of-the-art research papers propose using heuristic, data-level decomposition approaches; but often these heuristic approaches are designed with little or no quantitative justification and produce suboptimal results.
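To give a rough sense of the memory requirement described above, the following minimal sketch estimates the storage needed for a dense N×N matrix. It assumes 8-byte double-precision entries and no exploitation of symmetry or sparsity; the function name kernel_matrix_bytes and the sample values of N are hypothetical and chosen only for illustration.

```python
def kernel_matrix_bytes(n_vectors: int, bytes_per_entry: int = 8) -> int:
    """Return the bytes needed to hold a dense n_vectors x n_vectors matrix.

    Assumption: every entry is stored explicitly (no symmetry or sparsity
    savings) using bytes_per_entry bytes, 8 for double precision.
    """
    return n_vectors * n_vectors * bytes_per_entry


if __name__ == "__main__":
    # Illustrative dataset sizes, from thousands to tens of thousands of vectors.
    for n in (1_000, 10_000, 50_000):
        gib = kernel_matrix_bytes(n) / 2**30
        print(f"N = {n:>6}: {gib:.2f} GiB")
```

Because the storage grows with the square of N, a tenfold increase in the number of data vectors raises the memory requirement a hundredfold under these assumptions.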