1. Field of the Invention
This invention relates to support vector machines for separating data based on multiple characteristics. More particularly, it is directed to an apparatus and method for classifying millions of data points into separate classes with a linear or nonlinear separator generated by a Lagrangian support vector machine.
2. Discussion of the Prior Art
Support vector machines are powerful tools for data classification and are often used for data mining operations. Classification is based on identifying a linear or nonlinear separating surface to discriminate between elements of an extremely large data set containing millions of sample points by tagging each of the sample points with a tag determined by the separating surface. The separating surface depends only on a subset of the original data. This subset of data, which is all that is needed to generate the separating surface, constitutes the set of support vectors. Mathematically, support vectors are data points corresponding to constraints with positive multipliers in a constrained optimization formulation of a support vector machine.
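By way of illustration only, the dependence of the separating surface on the support vectors alone can be sketched as follows. The sketch assumes the standard linear support vector machine relation in which the normal to the separating plane is the multiplier-weighted sum of the tagged data points; the data, the multiplier values and the names `A`, `d`, `u` and `support` are hypothetical and are not taken from this disclosure.

```python
import numpy as np

# Toy data: each row of A is a sample point, d holds its class tag (+1 or -1).
A = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# Hypothetical multipliers, one per constraint (one per sample point),
# assumed to come from an already-solved support vector machine.
u = np.array([1 / 17, 0.0, 0.0, 1 / 17, 0.0, 0.0])

# Support vectors are the points whose multipliers are positive.
support = u > 0

# Normal to the separating plane computed from all points ...
w_all = A.T @ (d * u)
# ... and from the support vectors only; the two agree.
w_sv = A[support].T @ (d[support] * u[support])

print("support vector indices:", np.flatnonzero(support))
print("same normal from subset:", np.allclose(w_all, w_sv))
```

Points with zero multipliers contribute nothing to the sum, which is why discarding them leaves the separating surface unchanged.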
Support vector machines have been used by medical institutions in making diagnostic and prognostic decisions as well as by financial institutions in making credit and fraud detection decisions. For example, support vector machines are used to classify breast cancer patients using a criterion that is closely related to the decision whether or not a patient is prescribed chemotherapy treatment. This criterion is the presence of metastasized lymph nodes (node-positive) or their absence (node-negative).
By using a linear support vector machine, a number of available features are selected to classify patients into node-positive and node-negative patients. The total number of features used to constitute the n-dimensional space in which the separation is accomplished is made up of the mean, standard error and the maximum value of a certain number of cytological nuclear measurements of the size, shape and texture taken from a patient's breast along with the tumor size. A subset of the features is then used in a nonlinear support vector machine to classify the entire set of patients into three prognosis groups: good (node-negative), intermediate (1 to 4 metastasized lymph nodes) and poor (more than 4 metastasized lymph nodes). The classification method is used to assign new patients to one of the three prognostic groups with an associated survival curve and a possible indication of the utilization of chemotherapy or not.
This classification and data mining process, however, is extremely resource intensive, slow and expensive given current classification tools. To separate the millions of sample points into different data sets, linear and quadratic programming solvers are often used that are complicated and cost prohibitive. Unfortunately, these tools are also very slow in processing and classifying the sample points.
What is needed, therefore, is an apparatus and method for simply and quickly solving problems with millions of sample points using standard tools, thereby eliminating the need for complicated and costly optimization tools. This apparatus and method would need to be based on a simple reformulation of the problem (e.g., an implicit Lagrangian formulation of the dual of a simple reformulation of the standard quadratic program of a linear support vector machine). This reformulation would lead to the minimization of an unconstrained differentiable convex function in an m-dimensional space, where m is the number of points to be classified in a given n-dimensional input space. The necessary optimality condition for the unconstrained minimization problem would then be transformed into a simple symmetric positive definite complementarity problem, thereby significantly reducing the computational resources necessary to classify the data.
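By way of illustration only, a reformulation of this kind can be sketched with nothing more than standard linear algebra routines. The sketch below assumes the linear Lagrangian support vector machine iteration as published in the open literature, in which the dual variable u is repeatedly updated as u = Q^{-1}(e + ((Qu - e) - alpha u)_+) with Q = I/nu + HH' and H = D[A -e]. The function name `lsvm_linear`, the parameter names, the step choice alpha = 1.9/nu and the toy data are assumptions of this sketch and are not taken from this disclosure.

```python
import numpy as np

def lsvm_linear(A, d, nu=1.0, max_iter=5000, tol=1e-8):
    """Sketch of a linear Lagrangian SVM.

    A: (m, n) array of sample points; d: (m,) array of tags in {+1, -1}.
    Returns the plane normal w, the offset gamma and the multipliers u.
    """
    m, n = A.shape
    e = np.ones(m)
    # H = D [A  -e], with D the diagonal matrix of tags.
    H = d[:, None] * np.hstack([A, -np.ones((m, 1))])
    # Small (n+1) x (n+1) matrix used in place of the m x m matrix Q.
    S = np.eye(n + 1) / nu + H.T @ H

    def q_mul(v):
        # Q v = v / nu + H (H' v), computed without forming Q.
        return v / nu + H @ (H.T @ v)

    def q_inv(v):
        # Sherman-Morrison-Woodbury identity:
        # Q^{-1} v = nu * (v - H (I/nu + H'H)^{-1} H' v).
        return nu * (v - H @ np.linalg.solve(S, H.T @ v))

    alpha = 1.9 / nu  # step parameter; must satisfy 0 < alpha < 2 / nu
    u = q_inv(e)
    for _ in range(max_iter):
        u_new = q_inv(e + np.maximum(q_mul(u) - e - alpha * u, 0.0))
        step = np.linalg.norm(u_new - u)
        u = u_new
        if step < tol:
            break
    w_gamma = H.T @ u  # stacks w = A'Du and gamma = -e'Du
    return w_gamma[:-1], w_gamma[-1], u

# Toy usage: two well-separated classes in the plane.
A = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, gamma, u = lsvm_linear(A, d)
tags = np.sign(A @ w - gamma)  # tag assigned to each point by the plane
print("tags match:", np.array_equal(tags, d))
```

In this sketch the only linear system solved is of size (n+1) by (n+1), so the work per iteration grows linearly with the number of points m rather than with the square of m, which is what makes a formulation of this type attractive for very large data sets.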