There are many approaches to generating computer models for use in classifying new instance data, where the models are generated on the basis of a training set of instance data—typically historical data. The subject matter falls within the general topic of machine learning, which is the study of computer programs and algorithms that learn from experience. Machine learning approaches are numerous and varied, and include decision tree learning, Bayesian learning (such as the optimal and so-called “naïve” Bayes classifiers and Bayesian networks), neural networks, genetic algorithms and additive models (such as linear or logistic regression, but also including the two Bayes classifiers). For a detailed study of these and other approaches, the reader is referred to ‘Machine Learning’ by Tom M. Mitchell, 1997, published by McGraw-Hill.
To help understand the various approaches of the prior art and the terminology of this topic, it is helpful to give a concrete example of a particular classification problem. We will also use this concrete example to explain the approach of the present invention. Let us suppose that a bank decides whether or not to grant loans to new applicants depending on an estimation of whether or not the loans are likely to be repaid—ie the problem is to classify new loan applications into two classes: 1) likely to be repaid and 2) not likely to be repaid. The bank has historical records of previous loan applications including whether or not those loans were repaid. The bank wishes to use these records to make its decision on new applications. Thus, the bank wishes to generate a model, on the basis of the historical records, which takes information concerning a new loan application as its input and classifies the application into one of the two classes. The main aim in generating such a model is to maximise its predictive ability.
In machine learning terminology, each loan application (new or historical) is called an instance or a case. The information recorded about instances must be defined in some consistent way. Let us suppose that two facts about the applicant are recorded for each loan application: 1) the age of the applicant (either up to 30 years old, or 31 and over); and 2) whether or not the applicant owns their home. This is a highly simplified example, but it will help to understand the concepts involved. The two facts—age and home ownership—are called attributes or predictors. The particular items of information recorded about each attribute are called attribute values. The dataset of historical instances and outcomes (whether or not the loans were repaid) is called the training dataset. The function which the model, generated on the basis of the training dataset, seeks to represent is sometimes called the target function.
Note that the present invention is concerned with generating a model on the basis of the training dataset, rather than performing a full multivariate analysis of the training dataset taking every possible combination of attribute values into account. Such a full analysis, although relatively straightforward in this oversimplified example with only two attributes each having two possible values, is computationally intractable in terms of data processing and storage requirements for realistic problems having many more attributes and attribute values.
Some of the approaches to machine learning generate models from the training dataset using a top-down approach—ie starting with the attributes. For example, decision tree algorithms generally begin with the question “which attribute should be tested at the root of the tree?” To answer this question, each attribute is statistically evaluated to determine how accurately it classifies the training dataset. The best attribute is selected for the root node of the decision tree. A branch node is then created for each possible value of this attribute and the training dataset is divided into subsets corresponding to each particular value of the chosen attribute. For each branch node, the remaining attributes are again statistically evaluated to determine which best classifies the corresponding training data subset, and this attribute is used for the branch node. The process is then repeated until all the attributes and possible values are arranged in the form of a decision tree with the statistically most likely classification at each end branch. Thus, the following decision nodes represent an example decision tree model generated for the loan application problem where the age of the applicant was found to be the most predictive attribute:
1) Is the age of the applicant up to 30 years old?
   If yes, then proceed to node 2);
   if no, then proceed to node 3).
2) Does the applicant own their home?
   If yes, then classification = unlikely to repay loan;
   if no, then classification = likely to repay loan.
3) Does the applicant own their home?
   If yes, then classification = likely to repay loan;
   if no, then classification = unlikely to repay loan.
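The three-node decision tree above can be expressed directly as a classification function. The following sketch is illustrative only; the function name and class labels are not part of the model described above:

```python
def classify_application(age, owns_home):
    """Classify a loan application using the example decision tree.

    Node 1 tests the age attribute; nodes 2 and 3 test home ownership.
    """
    if age <= 30:  # node 1: is the applicant up to 30 years old?
        # node 2: home ownership test for younger applicants
        return "unlikely to repay" if owns_home else "likely to repay"
    else:
        # node 3: home ownership test for older applicants
        return "likely to repay" if owns_home else "unlikely to repay"

# One example per leaf of the tree:
print(classify_application(25, True))    # young home owner
print(classify_application(25, False))   # young, does not own home
print(classify_application(45, True))    # older home owner
print(classify_application(45, False))   # older, does not own home
```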
Decision tree models are hierarchical. One problem with the top-down approach to generating a model is that it divides the training dataset into subsets at each branch node on the basis of attribute values; thus, the statistical evaluations performed at branch nodes lower down the hierarchy take increasingly smaller subsets of the training dataset into account. The training dataset is successively partitioned at each branch node, so that, in the model generation process, decisions made at lower levels are based on statistically less significant sample sizes. Decision tree models therefore tend to be inaccurate and to overfit the training dataset. Overfitting means that a model accurately predicts the training dataset, but is less than optimal in general. This may be because the training dataset is noisy (ie contains some atypical instances), which results in a somewhat skewed model that, through accurately predicting the noisy training dataset, has a less than optimal predictive ability in the general case. Conversely, underfitting means that a model has less than optimal predictive ability with respect to the training dataset: an underfitted model will not accurately predict all of the instances of the training dataset.
Bayesian networks (or Bayesian belief networks) also employ a top-down approach. A Bayesian network represents a joint probability distribution for a set of attributes by specifying a set of conditional independence assumptions together with a set of local conditional probabilities. A single node in the network represents each attribute. Various nodes are connected by directed arcs to form a directed acyclic graph (DAG), and local probability distributions are determined for each node only on the basis of those nodes which are connected to the node in question by an arc of the DAG directed towards the node in question. The absence of an arc of the DAG directed towards the node in question indicates conditional independence. Like decision trees, Bayesian networks are suited to representing causal relationships between the attributes (represented by the DAG), but this structure is often supplied a priori rather than determined from the training dataset.
Other general approaches to generating models, called additive models, use a bottom-up approach in that they start with the training dataset rather than the attributes. All internal parameters of the model are determined on the basis of the complete training dataset. Logistic and linear regression are examples of additive models. Additive models are not hierarchical. With linear regression, the target function is represented as a linear function of the form:

    F(x) = w0 + w1a1(x) + . . . + wnan(x)

where x denotes the new instance; ai(x) denotes the value of the ith attribute of a total of n attributes for instance x; and w0 . . . wn are constants (ie the internal parameters of the model). The parameters w0 . . . wn are estimated by minimising the squared error over all the training dataset. Thus, unlike decision trees and Bayesian networks, all the internal parameters are estimated using the whole dataset. With logistic regression, other methods of determining the internal parameters may be used, though typically they are still estimated using the whole dataset. The optimal and “naïve” Bayes classifiers are also examples of additive models.
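As a sketch of how the parameters w0 . . . wn of such a linear model may be estimated by minimising squared error over the whole training dataset, the following uses the least-squares solver in NumPy; the toy attribute values and outcomes are hypothetical and serve only to show that every parameter is fitted against the complete dataset at once:

```python
import numpy as np

# Hypothetical training data: each row holds (a1(x), a2(x)); y is the outcome.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
y = np.array([0.2, 0.7, 0.5, 0.4])

# Prepend a column of ones so that w0 acts as the constant term.
X = np.hstack([np.ones((A.shape[0], 1)), A])

# Least-squares estimate of (w0, w1, w2) over the entire dataset.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def F(a):
    """Evaluate the fitted linear model F(x) = w0 + w1*a1(x) + w2*a2(x)."""
    return w[0] + w[1] * a[0] + w[2] * a[1]
```

Note that, in contrast to the decision tree procedure, no subset of the training data is ever formed: the single call to the solver uses all four instances to determine all three parameters.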
However, additive models suffer from the assumption that any pattern existing in the training dataset can be represented by linearly weighting and adding the evidence coming from each attribute individually. While this may be valid where the attributes are independent, additive models fail to represent non-linear patterns that depend on combinations of particular attribute values. To give an illustrative example, using the bank loan problem, let us suppose that a pattern exists in the training dataset—namely, when young people own a house, they are likely to default on the loan, perhaps because they already have problems with paying their mortgage. On the other hand, let us suppose that older people who rent are also likely to default on the loan, perhaps because they rent a house because their salary is too low to obtain a mortgage, or perhaps because they previously owned their own home but defaulted on their mortgage repayments. This is a non-linear pattern which cannot be represented with univariate models which determine internal parameters by considering the effect of one attribute at a time. Suppose the distributions of the attributes AGE and HOME over the training dataset of 700 instances are as follows:
AGE       repaid    defaulted    p(repaid | AGE)
≦30       100       200          0.333
>30       210       190          0.525

HOME      repaid    defaulted    p(repaid | HOME)
own       200       100          0.667
not own   110       290          0.275
where p (repaid|AGE) and p (repaid|HOME) denote the posterior probability distribution for repaid loans in the training dataset given the attribute AGE or HOME respectively.
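These posterior probabilities follow directly from the counts: p(repaid | AGE), for example, is the number of repaid loans having a given AGE value divided by the total number of loans having that value. A minimal check (the variable and function names are illustrative only):

```python
# Counts taken from the training dataset of 700 instances.
counts = {
    ("AGE", "<=30"):     {"repaid": 100, "defaulted": 200},
    ("AGE", ">30"):      {"repaid": 210, "defaulted": 190},
    ("HOME", "own"):     {"repaid": 200, "defaulted": 100},
    ("HOME", "not own"): {"repaid": 110, "defaulted": 290},
}

def p_repaid(attribute, value):
    """Posterior probability of repayment given one attribute value."""
    c = counts[(attribute, value)]
    return c["repaid"] / (c["repaid"] + c["defaulted"])

print(round(p_repaid("AGE", "<=30"), 3))     # 0.333
print(round(p_repaid("AGE", ">30"), 3))      # 0.525
print(round(p_repaid("HOME", "own"), 3))     # 0.667
print(round(p_repaid("HOME", "not own"), 3)) # 0.275
```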
An additive model, such as logistic regression, will estimate the internal parameters on the basis of individual attributes—ie using a univariate approach. The model may be represented as follows:
p(repaid | AGE, HOME) =

AGE     HOME       c1*p(repaid | AGE) + c2*p(repaid | HOME)
≦30     own        c1*0.333 + c2*0.667
≦30     not own    c1*0.333 + c2*0.275
>30     own        c1*0.525 + c2*0.667
>30     not own    c1*0.525 + c2*0.275
However, this model will fail to discriminate the non-linear patterns described above that exist in the training dataset. Consider the distinction made between younger people who don't own their own home and older people who don't own their own home. The factor based on home ownership, c2*p(repaid|HOME), is the same for both these groups—ie c2*0.275, but the factor based on age, c1*p (repaid|AGE), suggests that older people who don't own their own home are more likely to repay the loan than younger people who don't own their own home—c1*0.525 compared to c1*0.333. But this may actually be wrong. Let us suppose that the patterns suggested above exist in the training dataset, and thus the actual probabilities that are present in the training dataset are represented by the following table, which shows the probability of repaying the loan (calculated using the numbers of repaid loans over the total number of loans) in each of the four categories:
        own               not own
≦30     10/100 = 0.1      90/200 = 0.45
>30     190/200 = 0.95    20/200 = 0.1
From this table it is clear that younger people who do not own their own home are 4.5 times more likely to repay their loans than older people who do not own their own home. This significant pattern in the data is completely missed by the additive model that suggests that older people who do not own their own home are more likely to repay than younger people who do not own their own home.
Similarly, consider, in the additive model, the distinction made between younger people who don't own their own home and younger people who do own their own home. The factor based on younger age, c1*p(repaid|AGE), is the same for both these groups—ie c1*0.333, but the factor based on home ownership, c2*p (repaid|HOME), suggests that younger people who do own their own home are more likely to repay the loan than younger people who don't own their own home—c2*0.667 compared to c2*0.275. But, given the pattern described above, this is again wrong. It is clear, from the probability table above, that younger people who do not own their own home are 4.5 times more likely to repay their loans than younger people who do own their own home.
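Both failures described above can be verified numerically. The sketch below compares the additive estimate c1*p(repaid | AGE) + c2*p(repaid | HOME) against the actual cell probabilities from the table; the weights c1 = c2 = 0.5 are an arbitrary illustrative choice, but since each ordering depends only on the sign of a single weighted difference, the same reversals occur for any positive c1 and c2:

```python
# Posterior probabilities from the single-attribute tables.
p_age  = {"<=30": 0.333, ">30": 0.525}
p_home = {"own": 0.667, "not own": 0.275}

# Actual repayment probabilities per cell, from the two-way table.
actual = {("<=30", "own"): 0.1,  ("<=30", "not own"): 0.45,
          (">30", "own"): 0.95,  (">30", "not own"): 0.1}

c1 = c2 = 0.5  # arbitrary positive weights, for illustration only

def additive(age, home):
    """Additive estimate of p(repaid | AGE, HOME)."""
    return c1 * p_age[age] + c2 * p_home[home]

# Failure 1: the additive model ranks older non-owners above younger
# non-owners, whereas the training data show the opposite ordering.
print(additive(">30", "not own") > additive("<=30", "not own"))   # True
print(actual[(">30", "not own")] > actual[("<=30", "not own")])   # False

# Failure 2: the additive model ranks young owners above young
# non-owners; again the training data show the opposite ordering.
print(additive("<=30", "own") > additive("<=30", "not own"))      # True
print(actual[("<=30", "own")] > actual[("<=30", "not own")])      # False
```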
In these two cases, the additive model goes wrong because the effect of adding weighted factors for each attribute is to estimate internal parameters on the basis of individual attributes, whereas patterns exist in respect of combinations of particular attribute values. In general, additive models tend to underfit since they cannot represent non-linear patterns.
One object of the present invention is to provide a method and computer program for generating a model, for use in classifying new instance data on the basis of a training set of instance data, which estimates all internal parameters of the model using the evidence of the entire training dataset—ie which does not estimate any internal parameter on the basis of a subset of the training dataset.
Another object of the present invention is to provide a method and computer program, for generating a model for use in classifying new instance data on the basis of a training set of instance data.
Another object of the present invention is to provide a method and computer program, for generating a model for use in classifying new instance data on the basis of a training set of instance data, which is capable of detecting and representing non-linear patterns.
Another object of the present invention is to provide a method and computer program, for generating a model for use in classifying new instance data on the basis of a training set of instance data, which is efficient in terms of processing and memory requirements.
Another object of the present invention is to provide a method and computer program, for generating a model for use in classifying new instance data on the basis of a training set of instance data, which fits the training dataset accurately, without overfitting.