1. Field of the Invention
The present invention generally relates to computer databases and, more particularly, to data mining and knowledge discovery. The invention specifically relates to a method for constructing segmentation-based predictive models, such as decision-tree classifiers, wherein data records are partitioned into a plurality of segments and separate predictive models are constructed for each segment.
2. Background Description
Data mining is emerging as a highly advantageous application of computer databases that addresses the problem of extracting useful information from large volumes of data. As Matheus, Chan, and Piatetsky-Shapiro point out (see C. J. Matheus, P. K. Chan, and G. Piatetsky-Shapiro, “Systems for knowledge discovery in databases,” IEEE Transactions on Knowledge and Data Engineering, Special Issue on Learning and Discovery in Knowledge-Based Databases, Vol. 5, No. 6, pp. 903-913, December 1993):

“The corporate, governmental, and scientific communities are being overwhelmed with an influx of data that is routinely stored in on-line databases. Analyzing this data and extracting meaningful patterns in a timely fashion is intractable without computer assistance and powerful analytical tools. Standard computer-based statistical and analytical packages alone, however, are of limited benefit without the guidance of trained statisticians to apply them correctly and domain experts to filter and interpret the results. The grand challenge of knowledge discovery in databases is to automatically process large quantities of raw data, identify the most significant and meaningful patterns, and present these as knowledge appropriate for achieving the user's goals.”
Because the data-mining/knowledge-discovery problem is broad in scope, any technology developed to address this problem should ideally be generic in nature, and not specific to particular applications. In other words, one should ideally be able to supply a computer program embodying the technology with application-specific data, and the program should then identify the most significant and meaningful patterns with respect to that data, without having to also inform the program about the nuances of the intended application. Creating widely applicable, application-independent data-mining technology is therefore an explicit design objective for enhancing the usefulness of the technology. It is likewise a design objective of database technology in general.
Predictive modeling is an area of data mining and knowledge discovery that is specifically directed toward automatically extracting data patterns that have predictive value. In this regard, it should be discerned that constructing accurate predictive models is a significant problem in many industries that employ predictive modeling in their operations.
For example, predictive models are often used for direct-mail targeted-marketing purposes in industries that sell directly to consumers. The models are used to optimize return on marketing investment by ranking consumers according to their predicted responses to promotions, and then mailing promotional materials only to those consumers who are most likely to respond and generate revenue.
The credit industry uses predictive modeling to predict the probability that a consumer or business will default on a loan or a line of credit of a given size based on what is known about that consumer or business. The models are then used as a basis for deciding whether to grant (or continue granting) loans or lines of credit, and for setting maximum approved loan amounts or credit limits.
Insurance companies use predictive modeling to predict the frequency with which a consumer or business will file insurance claims and the average loss amount per claim. The models are then used to set insurance premiums and to set underwriting rules for different categories of insurance coverage.
On the Internet, predictive modeling is used by ad servers to predict the probability that a user will click through on an advertisement based on what is known about the user and the nature of the ad. The models are used to select the best ad to serve to each individual user on each Web page visited in order to maximize click-through and eventual conversion of user interest into actual sales.
The above applications are but a few of the innumerable commercial applications of predictive modeling. In all such applications, the higher the accuracy of the predictive models, the greater are the financial rewards.
The development of application-independent predictive modeling technology is made feasible by the fact that the inputs to a predictive model (i.e., the explanatory data fields) can be represented as columns in a database table or view. The output(s) of a predictive model can likewise be represented as one or more columns.
To automatically construct a predictive model, one must first prepare a table or view of training data comprising one or more columns of explanatory data fields together with one or more columns of data values to be predicted (i.e., target data fields). A suitable process must then be applied to this table or view of training data to generate predictive models that map values of the explanatory data fields into values of the target data fields. Once generated, a predictive model can then be applied to rows of another database table or view for which the values of the target data fields are unknown, and the resulting predicted values can then be used as a basis for decision making.
Thus, a process for constructing a predictive model is essentially a type of database query that produces as output a specification of a desired data transformation (i.e., a predictive model) that can then be applied in subsequent database queries to generate predictions.
To make predictive modeling technology readily available to database applications developers, extensions to the SQL database query language are being jointly developed by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) to support the construction and application of predictive models within database systems (see, for example, ISO/IEC FCD 13249-6:200x(E), “Information technology—Database languages—SQL Multimedia and Application Packages—Part 6: Data Mining,” Document Reference Number ISO/IEC JTC 1/SC 32N0647, May 21, 2001, http://www.jtc1sc32.org/sc32/jtc1sc32.nsf/Attachments/D9C73B3214960D5988256A530060C50C/$FILE/32N0647T.PDF; for an overview see J. Melton and A. Eisenberg, “SQL Multimedia and Application Packages (SQL/MM),” SIGMOD Record, Vol. 30, No. 4, pp. 97-102, 2001, http://www.acm.org/sigmod/record/issues/0112/standards.pdf). This ISO/IEC standard aims to provide SQL structured types and associated functions for creating data mining task specifications, executing data mining tasks, querying data mining results, and, in cases where the results are predictive models, applying data mining results to row data to generate predictions. For example, the standard requires that both data mining task specifications and data mining results be stored as Character Large Objects (CLOBs). The standard likewise specifies sets of functions to be used for manipulating these database objects. By providing a standard application programming interface (API) for utilizing data mining technology with database systems, the standard is expected to promote wide use of data mining technology by enabling database application developers to readily apply such technology in business applications simply by writing SQL queries. In so doing, the standard effectively makes data mining a component technology of database systems.
Many methods are known for automatically constructing predictive models based on training data. It should be discerned that segmentation-based models afford the flexibility needed to attain high levels of predictive accuracy, and that previously unknown and potentially useful information about a company's operations and customer base can be extracted from corporate databases by first constructing segmentation-based predictive models from the data and then examining those models in detail to identify previously unknown facts.
An example of a segmentation-based predictive model is a decision tree classifier. Well-known procedures exist for constructing such models. The usual method is summarized as follows by Quinlan (see J. R. Quinlan, “Unknown attribute values in induction,” Proceedings of the Sixth International Machine Learning Workshop, pp. 164-168, Morgan Kaufmann Publishers, 1989):

“The ‘standard’ technique for constructing a decision tree classifier from a training set of cases with known classes, each described in terms of fixed attributes, can be summarized as follows:
    If all training cases belong to a single class, the tree is a leaf labeled with that class.
    Otherwise,
        select a test, based on one attribute, with mutually exclusive outcomes;
        divide the training set into subsets, each corresponding to one outcome; and
        apply the same procedure to each subset.”

Details on the individual method steps can be found, for example, in the on-line statistics textbook provided over the Internet as a public service by StatSoft, Inc. Note that each subset of data mentioned in the above method steps is called a segment in the terminology employed herein.
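The recursive procedure quoted from Quinlan can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `build_tree` and `classify`, the dictionary-based tree representation, and the toy cases are all hypothetical, and the test-selection step simply takes the next unused attribute, whereas practical algorithms score candidate tests before choosing one.

```python
from collections import Counter


def build_tree(cases, attributes):
    """Build a tree from cases given as (attribute_dict, class_label) pairs.

    `attributes` lists the attribute names still available for testing.
    """
    labels = [label for _, label in cases]
    # If all training cases belong to a single class, the tree is a leaf.
    # (Assumption: when no attributes remain, fall back to the majority class.)
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}

    # Select a test based on one attribute. Placeholder choice: the first
    # unused attribute. Real algorithms rank candidate tests instead.
    attr, remaining = attributes[0], attributes[1:]

    # Divide the training set into subsets, one per outcome of the test.
    subsets = {}
    for row, label in cases:
        subsets.setdefault(row[attr], []).append((row, label))

    # Apply the same procedure to each subset (each subset is a segment).
    return {
        "test": attr,
        "branches": {value: build_tree(subset, remaining)
                     for value, subset in subsets.items()},
    }


def classify(tree, row):
    """Follow the test outcomes for `row` down to a leaf and return its class."""
    while "leaf" not in tree:
        tree = tree["branches"][row[tree["test"]]]
    return tree["leaf"]


# Invented toy cases, used only to exercise the sketch.
toy_cases = [
    ({"size": "big", "color": "red"}, "x"),
    ({"size": "big", "color": "blue"}, "x"),
    ({"size": "small", "color": "red"}, "y"),
    ({"size": "small", "color": "blue"}, "z"),
]
toy_tree = build_tree(toy_cases, ["size", "color"])
```

In this sketch every internal node and every leaf corresponds to one subset of the training cases, which is what the present description calls a segment.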
Decision trees provide a convenient example of the flexibility and interpretability of segmentation-based predictive models. Table 1 below shows the data definition for a data set commonly known within the predictive modeling community as the “Boston Housing” data (D. Harrison and D. L. Rubinfeld, “Hedonic prices and the demand for clean air,” Journal of Environmental Economics and Management, Vol. 5, pp. 81-102, 1978). Table 2 below shows twelve of the rows from this data set. A complete copy of the data set can be obtained over the Internet from the UCI Machine Learning Repository.
TABLE 1
Data definition for the Boston Housing data set. Data fields have been assigned more intuitive names. The original names appear in the “a.k.a.” column.

Data Field   a.k.a.    Description
PRICE        MEDV      Median value of owner-occupied homes (recoded into equiprobable HIGH, MEDIUM, and LOW ranges)
ON_RIVER     CHAS      Charles River indicator (value is 1 if tract bounds Charles River; else 0)
CRIME_RT     CRIM      Per capita crime rate by town
%BIGLOTS     ZN        Percentage of residential land zoned for lots over 25,000 square feet
%INDUSTY     INDUS     Percentage of non-retail business acres per town
NOXLEVEL     NOX       Concentration of nitric oxides (recoded into equiprobable high, medium, and low ranges)
AVGNUMRM     RM        Average number of rooms per dwelling
%OLDBLDG     AGE       Percentage of owner-occupied units built prior to 1940
DIST2WRK     DIS       Weighted distances to five Boston employment centers
HWYACCES     RAD       Index of accessibility to radial highways
TAX_RATE     TAX       Full-valued property tax rate per $10,000
CLASSIZE     PTRATIO   Pupil-teacher ratio by town
%LOWINCM     LSTAT     Percent lower status of the population
TABLE 2(a)
Twelve sample records from the Boston Housing data set (Part 1 of 3).

      PRICE    ON_RIVER   CRIME_RT   %BIGLOTS   %INDUSTY
 1    HIGH     0          0.006      18.00       2.31
 2    MEDIUM   0          0.027       0.00       7.07
 3    HIGH     0          0.032       0.00       2.18
 4    MEDIUM   0          0.088      12.50       7.87
 5    LOW      0          0.211      12.50       7.87
 6    MEDIUM   0          0.630       0.00       8.14
 7    MEDIUM   0          0.154      25.00       5.13
 8    MEDIUM   0          0.101       0.00      10.01
 9    LOW      0          0.259       0.00      21.89
10    LOW      1          3.321       0.00      19.58
11    LOW      0          0.206      22.00       5.86
12    LOW      1          8.983       0.00      18.10
TABLE 2(b)
Twelve sample records from the Boston Housing data set (Part 2 of 3).

      NOXLEVEL   AVGNUMRM   %OLDBLDG   DIST2WRK
 1    medium     6.58        65.20     4.09
 2    low        6.42        78.90     4.97
 3    low        7.00        45.80     6.06
 4    medium     6.01        66.60     5.56
 5    medium     5.63       100.00     6.08
 6    medium     5.95        61.80     4.71
 7    low        6.14        29.20     7.82
 8    medium     6.71        81.60     2.68
 9    high       5.69        96.00     1.79
10    high       5.40       100.00     1.32
11    low        5.59        76.50     7.96
12    high       6.21        97.40     2.12
TABLE 2(c)
Twelve sample records from the Boston Housing data set (Part 3 of 3).

      HWYACCES   TAX_RATE   CLASSIZE   %LOWINCM
 1     1         296        15.30       4.98
 2     2         242        17.80       9.14
 3     3         222        18.70       2.94
 4     5         311        15.20      12.43
 5     5         311        15.20      29.93
 6     4         307        21.00       8.26
 7     8         284        19.70       6.86
 8     6         432        17.80      10.16
 9     4         437        21.20      17.19
10     5         403        14.70      26.82
11     7         330        19.10      12.50
12    24         666        20.20      17.60
Harrison and Rubinfeld collected and analyzed these data to determine whether air pollution had any effect on house values within the greater Boston area. One approach to addressing this question is to build a model that predicts house price as a function of air pollution and other factors that could potentially affect house prices.
FIG. 1 shows a decision tree generated from the Boston Housing data using the CART algorithm (L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, New York: Chapman & Hall, 1984) as implemented in STATISTICA for Windows (STATISTICA for Windows [Computer program manual], Version 5.5, 1995, StatSoft, Inc., 2300 East 14th Street, Tulsa, Okla., 74104-4442, http://www.statsoft.com). The STATISTICA program was told to construct a decision tree model that predicts PRICE (i.e., the median value of owner-occupied homes broken down into high, medium, and low ranges) using all of the other columns in the data table as potential inputs to the model (i.e., as explanatory data fields).
Each node 1 through 13 in the tree shown in FIG. 1 corresponds to a data segment (i.e., a subset of the data). Illustrated at each node are histograms of the proportions of high-, medium-, and low-priced neighborhoods that belong to the corresponding data segments. The price range that corresponds to each histogram bar is indicated by legend 14. Each node in FIG. 1 is also labeled with the dominant price range within the corresponding segment (i.e., the price range that has the largest histogram bar). Thus, for node 1, the dominant price range is medium, whereas for nodes 2 and 3 the dominant price ranges are high and low, respectively.
Tree branches correspond to tests on the values of the inputs to the predictive model and it is these tests that define the data segments that correspond to each node in the tree. For example, in FIG. 1, node 1 is the root of the tree and it corresponds to the entire set of data. Test 15 (i.e., % LOWINCM≦14.4) defines the data segments that correspond to nodes 2 and 3. Left-going branches in FIG. 1 are followed when the outcome of the corresponding test is “yes” or “true;” right-going branches are followed when the outcome of the test is “no” or “false.” Thus, node 2 corresponds to the subset of data for which % LOWINCM is less than or equal to 14.4, and node 3 corresponds to the subset of data for which % LOWINCM is greater than 14.4. Similarly, node 4 corresponds to the subset of data for which % LOWINCM is less than or equal to 14.4 and AVGNUMRM is less than or equal to 6.527, and so on.
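The way branch tests define segments can be illustrated with a short Python sketch. It encodes only the two tests named above (% LOWINCM at 14.4 and AVGNUMRM at 6.527); the function name `segment_for`, the returned labels, and the dictionary keys (which follow the Table 1 field names) are assumptions made for illustration, and the remaining tests of FIG. 1 are deliberately left out because they are not stated in the text.

```python
def segment_for(record):
    """Route a record through the two FIG. 1 tests described in the text.

    `record` is a dict keyed by Table 1 field names. Only the tests stated
    in the description are encoded; deeper tests are not reproduced here.
    """
    if record["%LOWINCM"] <= 14.4:        # test 15: "yes" follows the left branch
        if record["AVGNUMRM"] <= 6.527:   # second test, below node 2
            return "node 4"
        return "right branch below node 2"
    # "no" follows the right branch to node 3, which is split further
    # by tests that the text does not spell out.
    return "node 3"


# Values taken from records 1, 2, and 5 of Tables 2(b) and 2(c).
record_1 = {"%LOWINCM": 4.98, "AVGNUMRM": 6.58}
record_2 = {"%LOWINCM": 9.14, "AVGNUMRM": 6.42}
record_5 = {"%LOWINCM": 29.93, "AVGNUMRM": 5.63}
```

Each return statement corresponds to one data segment: the conjunction of test outcomes along the path from the root is the membership condition for that segment.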
The leaves of the tree (i.e., nodes 4, 5, 7, 8, 10, 12, and 13) correspond to the subsets of data that are used to make predictions in the decision tree model. In this example, the predictions are the dominant price ranges at the leaves of the tree. Thus, at node 4 the prediction would be “medium,” at node 5 it would be “high,” at node 7 it would be “low,” etc.
FIG. 1 demonstrates the ability of decision tree programs to automatically extract meaningful patterns from collections of data. As the tree model indicates, air pollution does have an effect on house prices, but only for neighborhoods that have a sufficiently large percentage of low-income housing. For all other neighborhoods, house prices are primarily affected by the size of the house, as indicated by the average number of rooms per house in the neighborhood. When air pollution is a factor, but the air pollution level is sufficiently small, then the next most predictive factors that affect house prices are crime rate, the percentage of non-retail industrial land, and the distance to a major center of employment, with the more desirable (i.e., higher-priced) neighborhoods being those with low crime rates (i.e., node 8) and those with sufficiently large percentages of non-retail industrial land located away from centers of employment (i.e., node 13).
To demonstrate that decision tree algorithms are not application-specific, but can be applied to any application simply by providing application-specific data as input, the STATISTICA program was executed again, but this time it was told to predict the air pollution level (i.e., NOXLEVEL) using all of the other data columns as explanatory variables, including PRICE. FIG. 2 shows the resulting tree model. As this tree illustrates, the majority of neighborhoods that have the highest levels of air pollution (i.e., node 28) are those with sufficiently large percentages of non-retail industrial land, sufficiently large percentages of older buildings, and sufficiently high tax rates. Not surprisingly, these factors characterize downtown Boston and its immediate vicinity. The majority of neighborhoods that have the lowest levels of air pollution (i.e., node 26) are those with sufficiently small percentages of non-retail industrial land, sufficiently large percentages of houses on large lots, and that are sufficiently far from centers of employment. These characteristics are typical of outlying suburbs. The majority of neighborhoods that have moderate levels of air pollution (i.e., node 29) are those with sufficiently small percentages of non-retail industrial land, sufficiently small percentages of houses on large lots, and easy access to radial highways that lead into Boston. These characteristics are typical of urban residential neighborhoods favored by commuters.
For both FIGS. 1 and 2, the relationships described above make intuitive sense once the tree models are examined in detail. However, it is important to keep in mind that the STATISTICA program itself has no knowledge of these intuitions nor of the source of data. The program is merely analyzing the data to identify patterns that have predictive value.
Nevertheless, the program produces meaningful results. The decision tree models that are produced as output are useful, concrete, and tangible results that have specific meaning with respect to the input data and the user-specified modeling objectives (i.e., which data field to predict in terms of which other data fields). From a database perspective, the specification of the input data and the modeling objectives constitutes a query, and the decision tree model that is produced as output constitutes a query result.
The usefulness of decision tree algorithms, in particular, and automated predictive modeling technology, in general, derives from the fact that they can perform their analyses automatically without human intervention, and without being told what kinds of relationships to look for. All that they need to be told is which data values are to be predicted, and which data values can be used as inputs to make those predictions. The generic nature of such technology makes the technology extremely useful for the purpose of knowledge discovery in databases. Moreover, it is the generic nature of predictive modeling technology that permits the technology to be incorporated into general-purpose database systems.
Note that, once a decision tree has been constructed—or, for that matter, once any type of predictive model has been constructed—the step of applying that model to generate predictions for an intended application is conventional, obvious, and noninventive to those skilled in the art of predictive modeling.
Although decision tree methods yield models that can be interpreted and understood for the purposes of knowledge discovery, the predictive accuracy of decision tree models can be significantly lower than the predictive accuracies that can be obtained using other modeling methods. This lower accuracy stems from the fact that decision trees are piecewise-constant models; that is, within each data segment defined by the leaves of the tree, the predictions produced by the model are the same for all members of that segment. FIG. 3 illustrates this effect in the case of regression trees, which are decision trees used to predict numerical values instead of categorical values. As FIG. 3 indicates, the output 39 of a piecewise-constant model (such as one produced by conventional decision tree algorithms) is stair-like in nature and is therefore inherently inaccurate when used to model data 38 that exhibits smooth variations in values relative to the inputs of the model.
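The stair-like behavior can be reproduced numerically with a small Python sketch. The setup is an assumption chosen for illustration and is not the data of FIG. 3: a smooth target y = 2x on the interval [0, 1], approximated by a model that predicts the mean target value within each of several equal-width segments. The function names are hypothetical.

```python
def piecewise_constant_fit(xs, ys, n_segments):
    """Fit a piecewise-constant model over equal-width segments of x.

    Assumes max(xs) > min(xs). Returns a function mapping x to the mean
    target value of the training points in the same segment.
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_segments

    def segment_of(x):
        # Clamp so that x == hi (or values outside the range) stay in bounds.
        return min(max(int((x - lo) / width), 0), n_segments - 1)

    sums = [0.0] * n_segments
    counts = [0] * n_segments
    for x, y in zip(xs, ys):
        s = segment_of(x)
        sums[s] += y
        counts[s] += 1
    means = [total / n if n else 0.0 for total, n in zip(sums, counts)]

    def predict(x):
        # Every member of a segment receives the same prediction.
        return means[segment_of(x)]

    return predict


def max_abs_error(predict, xs, ys):
    """Largest absolute difference between predictions and targets."""
    return max(abs(predict(x) - y) for x, y in zip(xs, ys))


# Smooth illustrative target: y = 2x sampled on a regular grid.
xs = [i / 100 for i in range(101)]
ys = [2 * x for x in xs]
err_4 = max_abs_error(piecewise_constant_fit(xs, ys, 4), xs, ys)
err_8 = max_abs_error(piecewise_constant_fit(xs, ys, 8), xs, ys)
```

Under these assumptions, doubling the number of segments reduces the worst-case error but does not eliminate it, whereas a single straight line would fit this target exactly. That gap is the inaccuracy attributed above to piecewise-constant models.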
Data analysts and applied statisticians have long realized this deficiency of decision tree methods and have typically employed such methods only as exploratory tools to “get a feel” for the data prior to constructing more traditional statistical models. In this use of decision tree methods, the resulting decision trees are analyzed to identify predictive explanatory variables that should be considered for inclusion in the final model. Decision trees are also analyzed to identify potential interaction terms (i.e., arithmetic products of explanatory variables) to include in the final model, as well as potential nonlinear transformations that should be performed on the explanatory variables prior to their inclusion in the final model.
In many cases, the models that are produced using the above statistical methodology are, in fact, segmentation-based models, wherein the data are partitioned into pluralities of segments and separate predictive models are constructed for each segment. Such models are analogous to decision trees; however, unlike traditional decision trees, the predictive models associated with the data segments can be multivariate statistical models.
One popular approach for producing segmentation-based models using statistical methodologies involves first segmenting the data using statistical clustering techniques (see, for example, J. A. Hartigan, Clustering Algorithms, John Wiley and Sons, 1975; A. D. Gordon, “A review of hierarchical classification,” Journal of the Royal Statistical Society, Series A, Vol. 150, pp. 119-137, 1987; and J. D. Banfield and A. E. Raftery, “Model-based Gaussian and non-Gaussian clustering,” Biometrics, Vol. 49, pp. 803-821, 1993). Once the data has been segmented, separate multiple regression models are then constructed for each segment. The deficiency of this approach is that the clustering techniques that are typically employed are unsupervised. Specifically, such clustering techniques are concerned with grouping data based on spatial density, spatial proximity, or other similar criteria; they are not concerned with the effects that alternative segmentations have on the predictive accuracies of the models that will later be constructed for each segment. Because of this deficiency, there is no guarantee that the segmentation obtained will be advantageous for predictive modeling purposes.
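The two-stage approach described above can be sketched in Python as follows. This is a deliberately simplified illustration under stated assumptions: one explanatory variable, a basic k-means procedure standing in for the cited clustering techniques, and ordinary least-squares lines as the per-segment models. All function names and the toy data are hypothetical.

```python
def assign(x, centers):
    """Index of the cluster center nearest to x."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))


def kmeans_1d(xs, k, iters=20):
    """Basic one-dimensional k-means. Note that it sees only the inputs xs."""
    s = sorted(xs)
    # Initial centers spread evenly across the sorted inputs.
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[assign(x, centers)].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx if sxx else 0.0
    return slope, my - slope * mx


def fit_segmented(xs, ys, k):
    """Stage 1: cluster on xs alone. Stage 2: one regression per segment."""
    centers = kmeans_1d(xs, k)  # the targets ys are never consulted here
    segments = {}
    for x, y in zip(xs, ys):
        segments.setdefault(assign(x, centers), []).append((x, y))
    models = {j: fit_line(points) for j, points in segments.items()}

    def predict(x):
        # Assumes x falls in a segment that received training points.
        slope, intercept = models[assign(x, centers)]
        return slope * x + intercept

    return centers, predict


# Invented toy data: two clumps of inputs with different linear trends.
toy_xs = [0, 1, 2, 3, 10, 11, 12, 13]
toy_ys = [2 * x if x < 5 else 100 - x for x in toy_xs]
toy_centers, toy_predict = fit_segmented(toy_xs, toy_ys, 2)
```

The sketch makes the stated deficiency visible in the code itself: the segmentation is fixed in stage one before any target value is examined, so segment boundaries fall wherever the inputs happen to clump. In the toy data the clumps coincide with the two trends, but nothing in the procedure ensures that this will be the case.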