1. Field of the Invention
The present invention generally relates to a technique of inductive learning. More specifically, an inductive model is built both “accurately” and “efficiently” by dividing a database of examples into N disjoint subsets of data, and a learning model (base classifier), including a prediction of accuracy, is sequentially developed for each subset and integrated into an evolving aggregate (ensemble) learning model for the entire database. The aggregate model is incrementally updated by each completed subset model. The prediction of accuracy provides a quantitative measure upon which to judge the benefit of continuing processing for remaining subsets in the database or to terminate at an intermediate stage.
2. Description of the Related Art
Modeling is a technique to learn a model from a set of given examples of the form {(x1, y1), (x2, y2), . . . , (xn, yn)}. Each example (xi, yi) consists of a feature vector xi and a class label yi. The values in the feature vector may be either discrete, such as someone's marital status, or continuous, such as someone's age and income. The class label yi is taken from a discrete set of class labels such as {donor, non-donor} or {fraud, non-fraud}.
The learning task is to construct a model y=f(x) that predicts the class label of an example for which the feature vector is known but the true class label is not.
Inductive learning has a wide range of applications that include, for example, fraud detection, intrusion detection, charity donation, security and exchange, loan approval, animation, and car design, among many others.
The present invention teaches a new framework of scalable cost-sensitive learning. An exemplary scenario for discussing the techniques of the present invention is a charity donation dataset from which a subset of the data is to be chosen as individuals to whom to send campaign letters. Assuming that the cost of a campaign letter is $0.68, it should be apparent that it would be beneficial to send a letter only if the solicited person will donate at least $0.68.
That is, a learning model for this scenario must be taught how to choose the individuals to be targeted for letters from a database containing information about those individuals. Because there is a cost associated with each letter, and each individual will either donate some varying amount of money or not donate at all, this model is cost-sensitive. The overall accuracy, or benefit, is the total amount of charity donated minus the total cost of sending the solicitation letters.
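The cost-sensitive decision in the charity scenario can be illustrated with a minimal sketch. The function names `should_solicit` and `total_benefit` are hypothetical, and the use of an estimated donation probability multiplied by an estimated donation amount as the "expected donation" is an assumption made for illustration; only the $0.68 letter cost and the definition of total benefit come from the text.

```python
LETTER_COST = 0.68  # cost of one campaign letter, as stated in the text


def should_solicit(p_donate, estimated_amount, letter_cost=LETTER_COST):
    """Return True if the expected donation is at least the letter cost.

    Assumption: expected donation = P(donate) * estimated donation amount.
    """
    return p_donate * estimated_amount >= letter_cost


def total_benefit(donations, letter_cost=LETTER_COST):
    """Total amount donated minus the total cost of the letters sent.

    `donations` holds the actual donation of each solicited individual
    (0.0 for an individual who does not donate).
    """
    return sum(donations) - letter_cost * len(donations)
```

For example, under these assumptions an individual with a 5% chance of donating $20 has an expected donation of $1.00 and would be solicited, whereas an individual with a 1% chance of donating $20 (expected donation $0.20) would not.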
A second scenario is fraud detection, such as credit card fraud detection. Challenging and investigating a potential fraud is not free: there is an intrinsic cost associated with each fraud case investigation. Assuming that challenging a potential fraud costs $90, it is worthwhile for a credit card company to take action only if the “expected loss” of a fraud (when the same instance is sampled repeatedly) is more than $90.
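The fraud threshold can be sketched in the same way. The names `expected_loss` and `should_challenge` are hypothetical, and estimating the expected loss as the fraud probability multiplied by the transaction amount is an assumption made for illustration; the text specifies only the $90 challenge cost.

```python
CHALLENGE_COST = 90.0  # cost of challenging one potential fraud, per the text


def expected_loss(p_fraud, transaction_amount):
    """Assumed estimate: average loss if the same instance recurred many times."""
    return p_fraud * transaction_amount


def should_challenge(p_fraud, transaction_amount, challenge_cost=CHALLENGE_COST):
    """Challenge only when the expected loss exceeds the cost of challenging."""
    return expected_loss(p_fraud, transaction_amount) > challenge_cost
```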
As should be apparent, there is also a second cost associated with the development of the model that is related to the cost of the computer time and resources necessary to develop a model over a database, particularly in scenarios where the database contains a large amount of data.
Currently, a number of learning algorithms are conventionally used for modeling expected investment strategies in scenarios such as the campaign letter scenario. Examples include the decision tree learner C4.5®, the rule builder RIPPER®, and the naïve Bayes learner.
In a database, each data entry is described by a series of feature values. For the charity donation example, each entry might describe a particular individual's income level, place of residence, place of work, educational background, gender, family status, past donation history, and perhaps other features.
The aforementioned C4.5® decision algorithm constructs a decision tree model from a dataset or a set of examples of the above form. A decision tree is a DAG (or Directed Acyclic Graph) with a single root. To build a decision tree, the learner first picks the most distinguishing feature from the set of features.
For example, the most distinguishing feature might be someone's income level. Then, the examples in the dataset will be “sorted” by their corresponding value of the chosen feature. For example, individuals with lower income will be sorted through a different path than individuals with higher income. This process is repeated until either there are no more features to use or the examples in a node all belong to one single category, such as donor or non-donor.
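The tree-building procedure described above can be sketched as follows. This is a simplified illustration and not the C4.5® algorithm itself: it assumes discrete feature values only and uses plain information gain as the measure of the "most distinguishing" feature. The function names and the dictionary representation of a feature vector are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(examples, feature):
    """Entropy reduction obtained by splitting `examples` on `feature`."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for value in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(examples, features):
    """Return a class label (leaf) or a (feature, branches) tuple (node)."""
    labels = [y for _, y in examples]
    # Stop when all examples share one category or no features remain;
    # the leaf predicts the majority label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Pick the most distinguishing feature (assumed here: highest gain).
    best = max(features, key=lambda f: information_gain(examples, f))
    remaining = [f for f in features if f != best]
    branches = {}
    # "Sort" the examples down a separate path for each value of `best`.
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = build_tree(subset, remaining)
    return (best, branches)


def classify(tree, x):
    """Follow the path matching x's feature values until a leaf is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[x[feature]]  # raises KeyError for an unseen value
    return tree
```

In this sketch each example is a pair of a feature dictionary and a label, for instance `({"income": "high", "past_donor": "no"}, "donor")`. Continuous features such as age would first need to be discretized, a step that is omitted here.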
RIPPER® is another way to build inductive models; its model is a set of IF-THEN rules. The naïve Bayes method uses Bayes' rule to build models.
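As an illustration of how Bayes' rule yields a classifier, the following is a minimal sketch of a naïve Bayes learner for discrete features. It assumes that features are conditionally independent given the class label and applies add-one (Laplace) smoothing; both the smoothing choice and the function names are assumptions for illustration and are not taken from any particular implementation.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(y for _, y in examples)
    value_counts = defaultdict(Counter)   # (feature, label) -> value counts
    feature_values = defaultdict(set)     # feature -> set of observed values
    for x, y in examples:
        for feature, value in x.items():
            value_counts[(feature, y)][value] += 1
            feature_values[feature].add(value)
    return class_counts, value_counts, feature_values


def predict(model, x):
    """Return the label maximizing P(y) * product of P(x_i | y)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(y)
        for feature, value in x.items():
            # Add-one smoothed estimate of P(value | label).
            numerator = value_counts[(feature, label)][value] + 1
            denominator = count + len(feature_values[feature])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get)
```

The score computed for each label is proportional to the posterior probability of that label given the feature vector; the normalizing denominator of Bayes' rule is the same for every label and is therefore omitted.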
Using these conventional methods, a user can experiment with different algorithms, parameters, and feature selections and, thereby, evaluate one or more models to be ultimately used for the intended application, such as selecting the individuals to whom campaign letters will be sent.
A problem recognized by the present inventors is that, in current learning model methods, the entire database must be evaluated before the effects of the hypothetical parameters for the test model are known. Depending upon the size of the database, each such test scenario will require much computer time (sometimes many hours or even days) and cost, and it can become prohibitive to spend so much effort in the development of an optimal model to perform the intended task.
Hence, there is currently no method that efficiently models the cost-benefit tradeoff short of taking the time and computer resources to analyze the entire database and predicting the accuracy of the model whose parameters are undergoing evaluation.