Data mining is the use of automated data analysis techniques to uncover previously undetected or non-preselected relationships among data items. Examples of data mining applications can be found in diverse areas such as database marketing, financial investment, image analysis, medical diagnosis, production manufacturing, forensics, security, defense, and various fields of research.
Computer Aided Detection (CAD) applications pose very interesting data mining problems in the medical field. The ultimate goal of a CAD system is to identify sick patients by analyzing a measurement data set using the available descriptive features. CAD applications present a number of challenges. For instance, typical CAD training data sets are large and extremely unbalanced between positive and negative classes of data items (positive classes of data items being associated with disease states, for example). When searching for descriptive features that can characterize the medical conditions of interest, system developers often deploy a large feature set, which may introduce irrelevant and redundant features. Labeling is often noisy, as labels may be created without corresponding biopsies or other independent confirmations. In the absence of CAD decision support, labeling by humans typically relies on a relatively small number of features, because human decision makers can reasonably integrate only a limited number of independent features. In order to achieve clinical acceptance, CAD systems have to meet extremely high performance thresholds to provide value to physicians in their day-to-day practice.
Nearest Neighbor Vote classification and Full Decision Boundary Based (e.g., Support Vector Machine) classification are popular approaches to real-life data classification applications. In Nearest Neighbor Vote classification, the neighbors (i.e., the data items in the training set that are sufficiently similar or close to the data item to be classified) are found by scanning the entire data set. The predominant class in that neighbor set is assigned to the subject. U.S. Pat. No. 6,941,303 to Perrizo, incorporated herein by reference in its entirety, describes a Nearest Neighbor Vote classification technique that is a variant of the well-known K-Nearest Neighbor (KNN) classification approach. KNN methods are desirable because no residual model “classifier” needs to be built ahead of time (e.g., during a training phase). Models involve approximations and summarizations and therefore are prone to being less accurate.
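The neighbor vote described above can be sketched in a few lines. This is a minimal illustration of a generic K-Nearest Neighbor vote, not the specific technique of U.S. Pat. No. 6,941,303; the function name `knn_vote`, the choice of Euclidean distance, the value of `k`, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_vote(training_set, query, k=3):
    """Classify `query` by majority vote among its k nearest training items.

    training_set: list of (feature_vector, class_label) pairs.
    Hypothetical helper for illustration; Euclidean distance is an assumption.
    """
    # Scan the entire training set: no model is built ahead of time.
    by_distance = sorted(training_set, key=lambda item: math.dist(item[0], query))
    neighbors = by_distance[:k]
    # The predominant class in the neighbor set is assigned to the query.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated classes (made-up values).
training = [
    ((1.0, 1.0), "negative"),
    ((1.2, 0.8), "negative"),
    ((0.9, 1.1), "negative"),
    ((4.0, 4.2), "positive"),
    ((4.1, 3.9), "positive"),
    ((3.8, 4.0), "positive"),
]

label_near_negatives = knn_vote(training, (1.1, 1.0), k=3)
label_near_positives = knn_vote(training, (4.0, 4.0), k=3)
```

Note that every call sorts the whole training set, which reflects the full scan mentioned above and is the reason the approach slows down as the training set grows.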
However, Nearest Neighbor Vote methods are also limited in their ability to properly classify data items in data sets where there is a great disparity in the sizes of the different classes or where the training data set is very large. When the class sizes are vastly different, the voting can be weighted by class size, but the subset of nearest neighbors can still, for instance, contain no data items from the small classes, in which case the neighbor vote produces an inaccurate classification. When the training set is very large, the process of isolating the nearest neighbor set can be prohibitively slow.
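The class-imbalance failure can be illustrated with a toy example. The sketch below assumes one particular weighting scheme (each vote counts as 1 divided by the size of the voter's class), which the text does not specify; the function name `weighted_knn_vote` and the data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def weighted_knn_vote(training_set, query, k):
    """k-nearest-neighbor vote with each vote weighted by 1 / class size.

    The inverse-class-size weighting is an assumed scheme for illustration.
    Returns (predicted_label, set_of_labels_present_in_the_neighbor_set).
    """
    class_sizes = Counter(label for _, label in training_set)
    neighbors = sorted(training_set, key=lambda item: math.dist(item[0], query))[:k]
    scores = defaultdict(float)
    for _, label in neighbors:
        scores[label] += 1.0 / class_sizes[label]
    predicted = max(scores, key=scores.get)
    return predicted, set(scores)

# 200 negatives on a dense grid, and only 2 positives just beyond its edge.
negatives = [((0.1 * i, 0.1 * j), "negative")
             for i in range(-10, 10) for j in range(-5, 5)]
positives = [((1.7, 0.0), "positive"), ((1.8, 0.1), "positive")]
training = negatives + positives

# The query lies between the grid and the positives, but the five nearest
# items all come from the dense negative class.
query = (1.2, 0.0)
predicted, classes_seen = weighted_knn_vote(training, query, k=5)
```

In this example the small class never enters the neighbor set, so it receives zero votes and no weighting by class size can make it win.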
Support Vector Machine (SVM) classification is generally regarded as a technique that produces high-accuracy classification. In classification, a data item to be classified may be represented by a number of features. If, for example, the data item to be classified is represented by two features, it may be represented by a point in 2-dimensional space. Similarly, if the data item to be classified is represented by n features, also referred to as the “feature vector”, it may be represented by a point in n-dimensional space. The training set points to be used to classify that data item are points in n+1 dimensional space (the n feature space dimensions plus the one additional class label dimension). SVM uses a kernel to translate that n+1 dimensional space to another space, usually of much higher dimension, in which the entire global boundary (or the global boundary, once a few “error” training points are removed) is linear. This linear boundary (also referred to as a hyperplane) separates feature vector points associated with data items “in a class” from feature vector points associated with data items “not in the class.” The underlying premise behind SVM is that, for any feature vector space, a higher-dimensional hyperplane exists that defines this boundary. A number of classes can be defined by defining a number of hyperplanes. The hyperplane defined by a trained SVM maximizes a distance (also referred to as a Euclidean distance) from it to the closest points (also referred to as “support vectors”) “in the class” and “not in the class” so that the SVM defined by the hyperplane is robust to input noise. U.S. Pat. No. 6,327,581 to Platt, incorporated by reference herein in its entirety, describes conventional SVM techniques in greater detail. While SVM provides superior accuracy, it tends to be computationally expensive, making the method unsuitable for very large training data sets or data sets having data items with a large number of different attributes.
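The idea of moving to a higher-dimensional space in which the boundary becomes a hyperplane can be illustrated with a small sketch. It is not an SVM trainer: the explicit mapping `feature_map`, the hand-picked hyperplane `w`, `b`, and the toy data are all assumptions for illustration, and a real SVM would typically apply the kernel implicitly and find the hyperplane by optimization.

```python
import math

def feature_map(x):
    """Map a 1-D feature to 2-D: x -> (x, x**2). An assumed, explicit mapping."""
    return (x, x * x)

def signed_distance(w, b, point):
    """Signed Euclidean distance from `point` to the hyperplane w.x + b = 0."""
    dot = sum(wi * pi for wi, pi in zip(w, point))
    return (dot + b) / math.sqrt(sum(wi * wi for wi in w))

# Training items as (feature, label): +1 = "in the class", -1 = "not in the class".
# On the original 1-D line no single threshold separates the two labels.
training = [(-3.0, +1), (-2.0, +1), (-1.0, -1), (0.0, -1),
            (1.0, -1), (2.0, +1), (3.0, +1)]

# In the mapped space the hyperplane x**2 = 2.5 separates them
# (w and b are chosen by hand here, not learned).
w, b = (0.0, 1.0), -2.5

distances = [(x, y, signed_distance(w, b, feature_map(x))) for x, y in training]

# Every point lies on the side of the hyperplane that matches its label.
separated = all(d * y > 0 for _, y, d in distances)

# Margin: distance from the hyperplane to the closest points on either side.
margin = min(abs(d) for _, _, d in distances)
support_vectors = [x for x, _, d in distances if math.isclose(abs(d), margin)]
```

Here the points that attain the margin play the role of the support vectors: moving any other training point slightly would leave the margin of this hyperplane unchanged.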
Conventional data mining techniques have been applied only in certain areas in which the data sets are of a small enough size or small enough dimensionality that analysis can be performed reasonably quickly and cost-efficiently using available computing technology. In other areas, however, mining implicit relationships among the data can be prohibitively time consuming, even utilizing the fastest supercomputers. Such areas include bioinformatics, where analysis of microarray expression data for DNA is required; nanotechnology, where data fusion must be performed; VLSI design, where circuits containing millions of transistors must be tested for accuracy; spatial data, where data representative of detailed images can comprise billions of bits; and Computer Aided Detection from radiological images, where the number of features and the number of training points can both be very large. A need therefore exists for improved data mining techniques that provide both high performance, in terms of achieving accurate results, and computational efficiency, for enabling data mining in large or high-dimensional data sets.