The present invention relates to computer software for classifying data. In particular, the invention uses flattening and addition of attributes to perform classification of sparse high dimensional data to more accurately predict a data class based on data attributes.
Classification is the process of assigning a data object, based on the data object's attributes, to a specific class from a predetermined set. Classification is a common problem studied in the fields of statistics and machine learning. Well-known classification methods include decision trees, statistical methods, rule induction, genetic algorithms, and neural networks.
A classification problem has an input dataset called the training set that includes a number of entries each having a number of attributes (or dimensions). A training set with n possible attributes is said to be n-dimensional. The objective is to use the training set to build a model of the class label based on the attributes, such that the model can be used to classify other data not from the training set. The model often takes the form of a decision tree, which is known in the art.
An example of a typical classification problem is determining a driver's risk for purposes of calculating the cost of automobile insurance. A single driver (or entry) has many associated attributes (or dimensions), such as age, gender, marital status, home address, and the make, model, and type of car. Using these attributes, an insurance company determines the degree of risk the driver poses to the insurance company. That degree of risk is the resultant class to which the driver belongs.
Another example of a classification problem is classifying patients' diagnostic related groups (DRGs) in a hospital, that is, determining a hospital patient's final DRG based on the services performed on the patient. If each service that could be performed on a patient in the hospital is considered an attribute, the number of attributes (dimensions) is large, but most attributes have a "not present" value for any particular patient because not all possible services are performed on every patient. Such an example results in a high-dimensional, sparse dataset.
A problem exists in that artificial ordering induced on the attributes lowers classification accuracy. That is, if two patients each have the same six services performed, but they are recorded in different orders in their respective files, a classification model would treat the two patients as two different cases, and the two patients may be assigned different DRGs.
Another problem in classifying high-dimensional sparse datasets is the high complexity required to build a decision tree. There are often hundreds, even thousands or more, of possible attributes for each entry. Thus, there are hundreds or thousands of possible attributes on which to base each node's splitting criterion in the decision tree. The large number of attributes directly contributes to the high complexity of building a decision tree from each training set.
A goal of the invention is to provide a classification system that overcomes the identified problems.
In one embodiment, the present invention provides a method and apparatus for classifying high-dimensional data. The invention performs classification by storing the data in a computer memory, flattening the data into a boolean representation, and building a classification model based on the flattened data. The classification model can be a decision tree or other decision structure. In one aspect of the invention, large itemsets are used as additional attributes on which to base the decision structure. In another aspect of the invention, clustering is performed to provide additional attributes on which to base the decision structure.
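The flattening step described above, mapping variable-length, order-sensitive attribute lists into fixed-width boolean vectors, might be sketched as follows. This is an illustrative sketch only, not the patented implementation; the function name `flatten` and the sample data are hypothetical.

```python
def flatten(entries):
    """Map each entry (a collection of attribute values, recorded in
    any order) to a boolean vector over the union of all observed
    values: True where the attribute is present, False otherwise."""
    # Build a stable index over every attribute seen in the training set.
    vocab = sorted({a for e in entries for a in e})
    index = {a: i for i, a in enumerate(vocab)}
    vectors = []
    for e in entries:
        v = [False] * len(vocab)
        for a in e:
            v[index[a]] = True  # presence, independent of recording order
        vectors.append(v)
    return vocab, vectors

# Two patients with the same services recorded in different orders
patients = [["xray", "blood_test", "mri"],
            ["mri", "xray", "blood_test"]]
vocab, vecs = flatten(patients)
assert vecs[0] == vecs[1]  # the ordering no longer matters
```

Because each entry becomes a set-membership vector, any artificial ordering in the raw records disappears, and a decision-tree builder can treat each boolean column as a candidate splitting attribute.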
In another embodiment, the invention provides a method and apparatus for classifying high-dimensional data using nearest neighbor techniques. The data is stored in a computer memory, flattened into a boolean representation, and classified based on the m nearest neighbors of an entry.
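A nearest-neighbor classification over the flattened boolean vectors can be sketched as below, using Hamming distance and a majority vote among the m nearest training entries. The names (`hamming`, `classify_knn`) and the sample risk labels are illustrative assumptions, not taken from the specification.

```python
from collections import Counter

def hamming(u, v):
    """Number of positions at which two boolean vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def classify_knn(train_vecs, train_labels, query, m):
    """Assign the majority class among the m nearest (Hamming) neighbors."""
    nearest = sorted(range(len(train_vecs)),
                     key=lambda i: hamming(train_vecs[i], query))[:m]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train = [[True, True, False], [True, False, False],
         [False, False, True], [False, True, True]]
labels = ["low_risk", "low_risk", "high_risk", "high_risk"]
print(classify_knn(train, labels, [True, True, True], m=3))  # → low_risk
```

Because the flattened vectors all share the same fixed width, a simple positional distance such as Hamming distance is well defined even though the raw entries had differing numbers of recorded attributes.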
An advantage of the invention is that flattening the data removes any artificial ordering introduced into the data as a result of non-uniform recording procedures, thus yielding more accurate results.
Another advantage of the present invention is that the use of additional attributes based on large itemsets and clustering improves the accuracy of the resulting decision tree on which classification is based. This is achieved by determining which itemsets are large itemsets, and then using large itemsets as additional attributes on which a tree node's splitting criterion might be based. Clustering may also be used to increase accuracy in building a decision structure.
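The use of large itemsets as additional attributes might be sketched as follows: an Apriori-style frequency count finds itemsets appearing in at least a minimum number of entries, and each such itemset contributes one extra boolean column. The function names, the restriction to pairs, and the support threshold are illustrative assumptions only.

```python
from itertools import combinations
from collections import Counter

def large_itemsets(entries, min_support, size=2):
    """Itemsets of the given size appearing in at least `min_support`
    entries (a simple Apriori-style frequency count, pairs by default)."""
    counts = Counter()
    for e in entries:
        for combo in combinations(sorted(set(e)), size):
            counts[combo] += 1
    return [s for s, c in counts.items() if c >= min_support]

def augment(entries, itemsets):
    """One additional boolean attribute per large itemset: True when
    the entry contains every item of the set."""
    return [[set(s) <= set(e) for s in itemsets] for e in entries]

entries = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "b", "d"]]
big = large_itemsets(entries, min_support=3)   # only ("a", "b") occurs 3 times
extra = augment(entries, big)                  # [[True], [True], [False], [True]]
```

The extra columns would then be appended to the flattened boolean vectors, giving the tree builder co-occurrence patterns, not just single attributes, as candidate splitting criteria.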