1. Field of the Invention
The present invention generally relates to data mining and knowledge discovery in computer databases. More specifically, the invention relates to a predictive model in which nonlinear transformations of input variables are automatically discovered and in which the transformed inputs are then combined via linear regression in a stage-wise manner, with an output of a first stage becoming an input to a second stage, to produce predictions/forecasts.
2. Description of the Related Art
Data mining is emerging as a highly advantageous application of computer databases that addresses the problem of extracting useful information from large volumes of data. Predictive modeling is an area of data mining and knowledge discovery that is specifically directed toward automatically extracting data patterns that have predictive value. Constructing accurate predictive models is a significant problem in many industries that employ predictive modeling in their operations.
For example, predictive models are often used for direct-mail targeted-marketing purposes in industries that sell directly to consumers. The models are used to optimize return on marketing investment by ranking consumers according to their predicted responses to promotions, and then mailing promotional materials only to those consumers who are most likely to respond and generate revenue.
The credit industry uses predictive modeling to predict the probability that a consumer or business will default on a loan or a line of credit of a given size based on what is known about that consumer or business. The models are then used as a basis for deciding whether to grant (or continue granting) loans or lines of credit, and for setting maximum approved loan amounts or credit limits.
Insurance companies use predictive modeling to predict the frequency with which a consumer or business will file insurance claims and the average loss amount per claim. The models are then used to set insurance premiums and to set underwriting rules for different categories of insurance coverage.
On the Internet, predictive modeling is used by ad servers to predict the probability that a user will click through on an advertisement based on what is known about the user and the nature of the ad. The models are used to select the best ad to serve to each individual user on each Web page visited in order to maximize click-through and eventual conversion of user interest into actual sales.
The above applications are but a few of the innumerable commercial applications of predictive modeling. In all such applications, the higher the accuracy of the predictive models, the greater are the financial rewards.
Because the data-mining/knowledge-discovery problem is broad in scope, any technology developed to address this problem should ideally be generic in nature, and not specific to particular applications. In other words, one should ideally be able to supply a computer program embodying the technology with application-specific data, and the program should then identify the most significant and meaningful patterns with respect to that data, without having to also inform the program about the nuances of the intended application.
The development of application-independent predictive modeling technology is made feasible by the fact that the inputs to a predictive model (i.e., the explanatory data fields) can be represented as columns in a database table or view. The output(s) of a predictive model can likewise be represented as one or more columns.
To automatically construct a predictive model, one first prepares a table or view of training data comprising one or more columns of explanatory data fields together with one or more columns of data values to be predicted (i.e., target data fields). A suitable process must then be applied to this table or view of training data to generate predictive models that map values of the explanatory data fields into values of the target data fields. Once generated, a predictive model can then be applied to rows of another database table or view for which the values of the target data fields are unknown, and the resulting predicted values can then be used as a basis for decision making.
Thus, a process for constructing a predictive model is essentially a type of database query that produces as output a specification of a desired data transformation (i.e., a predictive model) that can then be applied in subsequent database queries to generate predictions.
To make predictive modeling technology readily available to database applications developers, extensions to the SQL database query language are being jointly developed by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) to support the construction and application of predictive models within database systems (see, for example, ISO/IEC FCD 13249-6:200x(E), “Information technology—Database languages—SQL Multimedia and Application Packages—Part 6: Data Mining,” Document Reference Number ISO/IEC JTC 1/SC 32N0848, Jun. 20, 2002, (Final Draft International Standard) http://www.jtc1sc32.org/sc32/jtc1sc32.nsf/Attachments/39E375F33B51135788256BDD00835045/$FILE/32N0848.PDF.
For an overview, see J. Melton and A. Eisenberg, “SQL Multimedia and Application Packages (SQL/MM),” SIGMOD Record, Vol. 30, No. 4, pp. 97-102, 2001, http://www.acm.org/sigmod/record/issues/0112/standards.pdf). This ISO/IEC standard aims to provide SQL structured types and associated functions for creating data mining task specifications, executing data mining tasks, querying data mining results, and, in cases where the results are predictive models, applying data mining results to row data to generate predictions.
For example, the ISO/IEC standard requires that both data mining task specifications and data mining results be stored as Character Large Objects (CLOBs), using an encoding format that is consistent with the Predictive Modeling Markup Language (PMML) standard that is being developed separately through the Data Mining Group (http://www.dmg.org). The ISO/IEC standard likewise specifies sets of functions to be used for manipulating these database objects. By providing a standard application programming interface (API) for utilizing data mining technology with database systems, the standard is expected to promote wide use of data mining technology by enabling database application developers to readily apply such technology in business applications simply by writing SQL queries. In so doing, the standard effectively makes data mining a component technology of database systems.
The ISO/IEC data mining standard likewise serves as a clear acknowledgment that predictive model technology produces useful, concrete, and tangible results that have specific meaning with respect to the input data and the user-specified modeling objectives (i.e., which data field to predict in terms of which other data fields). Indeed, if this were not the case, there would be no reason to create an international database standard for utilizing such technology. From a pragmatic database perspective, the specification of the input data and the modeling objectives constitutes a query, and the predictive model that is produced as output constitutes a query result. The processes provided by predictive modeling technology are utilized by the query engine in order to produce the query results.
Decision-tree classifiers provide a convenient illustration of the usefulness of predictive modeling technology. Well-known procedures exist for constructing such models. The usual method is summarized as follows by Quinlan (see J. R. Quinlan, “Unknown attribute values in induction,” Proceedings of the Sixth International Machine Learning Workshop, pp 164-168, Morgan Kaufmann Publishers, 1989):
“The ‘standard’ technique for constructing a decision tree classifier from a training set of cases with known classes, each described in terms of fixed attributes, can be summarized as follows:
    If all training cases belong to a single class, the tree is a leaf labeled with that class;
    Otherwise:
        select a test, based on one attribute, with mutually exclusive outcomes;
        divide the training set into subsets, each corresponding to one outcome; and
        apply the same procedure to each subset.”
Details on the individual method steps can be found, for example, in the on-line statistics textbook provided over the Internet as a public service by StatSoft, Inc.
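By way of illustration only, the recursive procedure quoted above can be sketched in Python as follows. The sketch assumes categorical attributes with one branch per observed value. The function names (majority, build_tree, classify), the dictionary layout of the tree, and the rule used to select a test (fewest misclassified training cases) are assumptions introduced here for illustration and are not taken from Quinlan or from any particular product. The five training cases are rows 1, 3, 9, 10, and 12 of Table 2, restricted to the NOXLEVEL and ON_RIVER fields.

```python
from collections import Counter


def majority(labels):
    """Return the most frequent class label in a list."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(cases, attributes, target):
    """Build a decision tree by the recursive procedure quoted above."""
    labels = [case[target] for case in cases]
    # If all cases share one class (or no attribute is left to test), make a leaf.
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": majority(labels)}

    def misclassified(attr):
        # Number of cases a one-level split on attr would still get wrong.
        groups = {}
        for case in cases:
            groups.setdefault(case[attr], []).append(case[target])
        return sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())

    # Select a test on one attribute (placeholder selection rule, an assumption).
    test = min(attributes, key=misclassified)
    # Divide the training set into subsets, one per outcome of the test.
    subsets = {}
    for case in cases:
        subsets.setdefault(case[test], []).append(case)
    # Apply the same procedure to each subset.
    remaining = [a for a in attributes if a != test]
    return {
        "test": test,
        "default": majority(labels),
        "branches": {v: build_tree(s, remaining, target) for v, s in subsets.items()},
    }


def classify(tree, case):
    """Follow branches until a leaf is reached; unseen outcomes use the default."""
    while "leaf" not in tree:
        outcome = case[tree["test"]]
        tree = tree["branches"].get(outcome, {"leaf": tree["default"]})
    return tree["leaf"]


# Rows 1, 3, 9, 10, and 12 of Table 2 (two explanatory fields only).
rows = [
    {"NOXLEVEL": "medium", "ON_RIVER": 0, "PRICE": "HIGH"},
    {"NOXLEVEL": "low", "ON_RIVER": 0, "PRICE": "HIGH"},
    {"NOXLEVEL": "high", "ON_RIVER": 0, "PRICE": "LOW"},
    {"NOXLEVEL": "high", "ON_RIVER": 1, "PRICE": "LOW"},
    {"NOXLEVEL": "high", "ON_RIVER": 1, "PRICE": "LOW"},
]
tree = build_tree(rows, ["NOXLEVEL", "ON_RIVER"], "PRICE")
print(tree["test"])
print(classify(tree, {"NOXLEVEL": "high", "ON_RIVER": 0}))
```

On these five rows the sketch places the NOXLEVEL test at the root, since that split leaves no training case misclassified. That outcome reflects only the tiny sample used here and is not a finding about the full data set.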
The usefulness of decision tree technology is best illustrated by means of a concrete example. Table 1 below shows the data field definitions for a data set commonly known within the predictive modeling community as the “Boston Housing” data (D. Harrison and D. L. Rubinfeld, “Hedonic prices and the demand for clean air,” Journal of Environmental Economics and Management, Vol. 5, pp 81-102, 1978). Table 2 below shows twelve exemplary rows from this data set. A complete copy of the data set can be obtained over the Internet from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).
TABLE 1
(Data definition for the Boston Housing data set. Data fields have been assigned more intuitive names. The original names appear in the “a.k.a.” column.)

Data Field   a.k.a.    Description
PRICE        MEDV      Median value of owner-occupied homes (recoded into equiprobable HIGH, MEDIUM, and LOW ranges)
ON_RIVER     CHAS      Charles River indicator (value is 1 if tract bounds Charles River; else 0)
CRIME_RT     CRIM      Per capita crime rate by town
%BIGLOTS     ZN        Percentage of residential land zoned for lots over 25,000 square feet
%INDUSTY     INDUS     Percentage of non-retail business acres per town
NOXLEVEL     NOX       Concentration of nitric oxides (recoded into equiprobable high, medium, and low ranges)
AVGNUMRM     RM        Average number of rooms per dwelling
%OLDBLDG     AGE       Percentage of owner-occupied units built prior to 1940
DIST2WRK     DIS       Weighted distances to five Boston employment centers
HWYACCES     RAD       Index of accessibility to radial highways
TAX_RATE     TAX       Full-valued property tax rate per $10,000
CLASSIZE     PTRATIO   Pupil-teacher ratio by town
%LOWINCM     LSTAT     Percent lower status of the population
TABLE 2(a)
(Twelve sample rows from the Boston Housing data set (Part 1 of 3).)

ROW   PRICE    ON_RIVER   CRIME_RT   %BIGLOTS   %INDUSTY
1     HIGH     0          0.006      18.00      2.31
2     MEDIUM   0          0.027      0.00       7.07
3     HIGH     0          0.032      0.00       2.18
4     MEDIUM   0          0.088      12.50      7.87
5     LOW      0          0.211      12.50      7.87
6     MEDIUM   0          0.630      0.00       8.14
7     MEDIUM   0          0.154      25.00      5.13
8     MEDIUM   0          0.101      0.00       10.01
9     LOW      0          0.259      0.00       21.89
10    LOW      1          3.321      0.00       19.58
11    LOW      0          0.206      22.00      5.86
12    LOW      1          8.983      0.00       18.10
TABLE 2(b)
(Twelve sample rows from the Boston Housing data set (Part 2 of 3).)

ROW   NOXLEVEL   AVGNUMRM   %OLDBLDG   DIST2WRK
1     medium     6.58       65.20      4.09
2     low        6.42       78.90      4.97
3     low        7.00       45.80      6.06
4     medium     6.01       66.60      5.56
5     medium     5.63       100.00     6.08
6     medium     5.95       61.80      4.71
7     low        6.14       29.20      7.82
8     medium     6.71       81.60      2.68
9     high       5.69       96.00      1.79
10    high       5.40       100.00     1.32
11    low        5.59       76.50      7.96
12    high       6.21       97.40      2.12
TABLE 2(c)
(Twelve sample rows from the Boston Housing data set (Part 3 of 3).)

ROW   HWYACCES   TAX_RATE   CLASSIZE   %LOWINCM
1     1          296        15.30      4.98
2     2          242        17.80      9.14
3     3          222        18.70      2.94
4     5          311        15.20      12.43
5     5          311        15.20      29.93
6     4          307        21.00      8.26
7     8          284        19.70      6.86
8     6          432        17.80      10.16
9     4          437        21.20      17.19
10    5          403        14.70      26.82
11    7          330        19.10      12.50
12    24         666        20.20      17.60
Harrison and Rubinfeld collected and analyzed these data to determine whether air pollution had any effect on house values within the greater Boston area. One approach to addressing this question is to build a model that predicts house price as a function of air pollution and other factors that could potentially affect house prices.
FIG. 1 shows a decision tree 100 generated from the Boston Housing data using the CART algorithm (L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, New York: Chapman & Hall, 1984) as implemented in STATISTICA for Windows (STATISTICA for Windows [Computer program manual], Version 5.5, 1995, StatSoft, Inc., 2300 East 14th Street, Tulsa, Okla., 74104-4442). The STATISTICA program was told to construct a decision tree model that predicts PRICE (i.e., the median value of owner-occupied homes broken down into high, medium, and low ranges) using all of the other columns in the data table as potential inputs to the model (i.e., as explanatory data fields).
Each node 1 through 13 in the tree shown in FIG. 1 corresponds to a data segment (i.e., a subset of the data). Illustrated at each node are histograms of the proportions of high-, medium-, and low-priced neighborhoods that belong to the corresponding data segments. The price range that corresponds to each histogram bar is indicated by legend 14. Each node in FIG. 1 is also labeled with the dominant price range within the corresponding segment (i.e., the price range that has the largest histogram bar). Thus, for node 1, the dominant price range is medium, whereas for nodes 2 and 3 the dominant price ranges are high and low, respectively.
Tree branches correspond to tests on the values of the inputs to the predictive model, and it is these tests that define the data segments that correspond to each node in the tree. For example, in FIG. 1, node 1 is the root of the tree and it corresponds to the entire set of data. Test 15 (i.e., %LOWINCM ≤ 14.4) defines the data segments that correspond to nodes 2 and 3.
Left-going branches in FIG. 1 are followed when the outcome of the corresponding test is “yes” or “true.” Right-going branches are followed when the outcome of the test is “no” or “false.” Thus, node 2 corresponds to the subset of data for which %LOWINCM is less than or equal to 14.4, and node 3 corresponds to the subset of data for which %LOWINCM is greater than 14.4. Similarly, node 4 corresponds to the subset of data for which %LOWINCM is less than or equal to 14.4 and AVGNUMRM is less than or equal to 6.527, and so on.
The leaves of the tree (i.e., nodes 4, 5, 7, 8, 10, 12, and 13) correspond to the subsets of data that are used to make predictions in the decision tree model. In this example, the predictions are the dominant price ranges at the leaves of the tree. Thus, at node 4 the prediction would be “medium,” at node 5 it would be “high,” at node 7 it would be “low,” etc.
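The way a row of data is routed through the tests to a leaf can be sketched as follows. This is a minimal illustration that encodes only the two tests quoted above and the predictions stated for nodes 4 and 5. The function name segment is hypothetical, node 5 is assumed to be the right-going branch of the AVGNUMRM test, and the subtree below node 3 is deliberately left unexpanded because its tests are not reproduced in the text.

```python
def segment(row):
    """Return (node, prediction) using only the two tests quoted in the text."""
    if row["%LOWINCM"] <= 14.4:        # test 15 is true: left-going branch to node 2
        if row["AVGNUMRM"] <= 6.527:   # true: left-going branch to leaf node 4
            return ("node 4", "medium")
        return ("node 5", "high")      # false: right-going branch (assumed node 5)
    # Test 15 is false: right-going branch to node 3. The tests below node 3
    # are not given in the text, so no prediction is attempted here.
    return ("node 3", None)


# Rows 1, 2, and 5 of Table 2 (only the two fields that the tests examine).
sample = {
    1: {"%LOWINCM": 4.98, "AVGNUMRM": 6.58},
    2: {"%LOWINCM": 9.14, "AVGNUMRM": 6.42},
    5: {"%LOWINCM": 29.93, "AVGNUMRM": 5.63},
}
for row_id, row in sample.items():
    print(row_id, segment(row))
```

For rows 1 and 2 the leaf predictions agree with the PRICE values listed in Table 2(a) (HIGH and MEDIUM, respectively); row 5 falls into the node 3 segment.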
FIG. 1 demonstrates the ability of decision tree programs to automatically extract meaningful patterns from collections of data. As the tree model indicates, air pollution does have an effect on house prices, but only for neighborhoods that have a sufficiently large percentage of low-income housing (i.e., %LOWINCM > 14.4). For all other neighborhoods, house prices are primarily affected by the size of the house, as indicated by the average number of rooms per house in the neighborhood (i.e., AVGNUMRM).
When air pollution is a factor, but the air pollution level is sufficiently small, then the next most predictive factors that affect house prices are crime rate, the percentage of non-retail industrial land, and the distance to a major center of employment, with the more desirable (i.e., higher-priced) neighborhoods being those with low crime rates (i.e., node 8) and those with sufficiently large percentages of non-retail industrial land located away from centers of employment (i.e., node 13).
To demonstrate that decision tree algorithms are not application-specific, but can be applied to any application simply by providing application-specific data as input, the STATISTICA program was executed again, but this time it was told to predict the air pollution level (i.e., NOXLEVEL) using all of the other data columns as explanatory variables, including PRICE. FIG. 2 shows the resulting tree model 200. As this tree illustrates, the majority of neighborhoods that have the highest levels of air pollution (i.e., node 28) are those with sufficiently large percentages of non-retail industrial land, sufficiently large percentages of older buildings, and sufficiently high tax rates.
Not surprisingly, these factors characterize downtown Boston and its immediate vicinity. The majority of neighborhoods that have the lowest levels of air pollution (i.e., node 26) are those with sufficiently small percentages of non-retail industrial land, sufficiently large percentages of houses on large lots, and that are sufficiently far from centers of employment. These characteristics are typical of outlying suburbs. The majority of neighborhoods that have moderate levels of air pollution (i.e., node 29) are those with sufficiently small percentages of non-retail industrial land, sufficiently small percentages of houses on large lots, and easy access to radial highways that lead into Boston. These characteristics are typical of urban residential neighborhoods favored by commuters.
For both FIGS. 1 and 2, the relationships described above make intuitive sense, once the tree models are examined in detail. However, it is important to keep in mind that the STATISTICA program itself has no knowledge of these intuitions nor of the source of data. The program is merely analyzing the data to identify patterns that have predictive value.
Nevertheless, the program produces meaningful results. The decision tree models that are produced as output are useful, concrete, and tangible results that have specific meaning with respect to the input data and the user-specified modeling objectives (i.e., which data field to predict in terms of which other data fields). From a database perspective, the specification of the input data and the modeling objectives constitutes a query, and the decision tree model that is produced as output constitutes a query result.
The usefulness of decision tree algorithms, in particular, and automated predictive modeling technology, in general, derives from the fact that they can perform their analyses automatically without human intervention, and without being told what kinds of relationships to look for. All that they need to be told is which data values are to be predicted, and which data values can be used as inputs to make those predictions. The generic nature of such technology makes the technology extremely useful for the purpose of knowledge discovery in databases. Moreover, it is the generic nature of predictive modeling technology that permits the technology to be incorporated into general-purpose database systems.
Note that, once a decision tree has been constructed—or, for that matter, once any type of predictive model has been constructed—the step of applying that model to generate predictions for an intended application is conventional, obvious, and noninventive to those skilled in the art of predictive modeling.
Although decision tree methods yield models that can be interpreted and understood for the purposes of knowledge discovery, the predictive accuracy of decision tree models can be significantly lower than the predictive accuracies that can be obtained using other modeling methods. This lower accuracy stems from the fact that decision trees are piecewise-constant models; that is, within each data segment defined by the leaves of the tree, the predictions produced by the model are the same for all members of that segment.
FIG. 3 illustrates this effect 300 in the case of regression trees, which are decision trees used to predict numerical values instead of categorical values. As FIG. 3 indicates, the output 39 of a piecewise-constant model (such as one produced by conventional decision tree algorithms) is stair-like in nature and is therefore inherently inaccurate when used to model data 38 that exhibits smooth variations in values relative to the inputs of the model. The strength of decision tree models, however, is that they are quite good at modeling any nonlinearities that might exist, as FIG. 3A demonstrates.
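The stair-like behavior can be made concrete with a small numerical sketch. The data below are hypothetical (twelve points lying exactly on the line y = x, ordered by x), and the helper names stair_predictions and mse are introduced here for illustration. Each leaf is assumed to cover a fixed number of consecutive points and to predict the mean of the values that fall in it.

```python
def stair_predictions(ys, leaf_size):
    """Predict every value by the mean of its leaf (one constant per leaf)."""
    preds = []
    for start in range(0, len(ys), leaf_size):
        leaf = ys[start:start + leaf_size]
        preds.extend([sum(leaf) / len(leaf)] * len(leaf))
    return preds


def mse(ys, preds):
    """Mean squared error between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical smooth data: y = x at x = 0, 1, ..., 11 (already sorted by x).
ys = [float(i) for i in range(12)]

for leaf_size in (4, 2, 1):
    print(leaf_size, mse(ys, stair_predictions(ys, leaf_size)))
# prints:
# 4 1.25
# 2 0.25
# 1 0.0
```

In this sketch a single straight line would fit the data with zero error, whereas the piecewise-constant model reaches zero error only when every point is given its own leaf.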
To overcome the deficiencies of the piecewise-constant aspect of decision trees, Natarajan and Pednault have developed a method for constructing tree-based models with multivariate statistical models in the leaves—specifically, linear regression models and naive-Bayes models (R. Natarajan and E. P. D. Pednault, “Segmented Regression Estimators for Massive Data Sets,” Proceedings of the Second SIAM International Conference on Data Mining (on CD-ROM), Arlington, Va., April 2002), the contents of which are hereby incorporated by reference.
This method is further described in the above-identified copending patent application. FIGS. 3B and 3C show how this segmented regression method works. In the initial model 301 shown in FIG. 3B, a first linear estimation 302 of the data is modeled. As shown in FIG. 3C, the linear model 301 is refined into a linear segmented model 303 by calculating linear estimates 304-307 for a number of segments. The number of segments and the segment boundaries are determined by applying a top-down process for building decision trees in which the tree branches define segment boundaries and the leaves of the decision trees contain linear regression models.
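The idea of placing linear regression models in the leaves can be sketched as follows. This is a simplified illustration under stated assumptions, not the Natarajan and Pednault method itself: it uses a single input variable, hypothetical data, ordinary least squares, and a one-level exhaustive search for a single segment boundary. The names fit_line, sse, and best_split are introduced here for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sse(xs, ys):
    """Sum of squared errors of the least-squares line through (xs, ys)."""
    slope, intercept = fit_line(xs, ys)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def best_split(xs, ys, min_leaf=2):
    """Index that splits the sorted data into two segments with least total error."""
    candidates = range(min_leaf, len(xs) - min_leaf + 1)
    return min(candidates, key=lambda i: sse(xs[:i], ys[:i]) + sse(xs[i:], ys[i:]))


# Hypothetical data: y rises along y = x for x < 4, then falls along y = 7 - x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 0.0]

i = best_split(xs, ys)
print("single line error:", sse(xs, ys))
print("boundary index:", i)
print("left leaf model:", fit_line(xs[:i], ys[:i]))
print("right leaf model:", fit_line(xs[i:], ys[i:]))
```

Here the single line corresponds to the initial model, and the two leaf models correspond to a refinement into segments. A fuller implementation would apply the split search recursively and would need a rule for deciding when to stop splitting.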
However, the above segmented regression method is limited by another deficiency of tree-based predictive modeling methods, which is that one quickly runs out of data as a result of dividing data into numerous subsets that correspond to the leaves of a tree. Less data implies greater estimation errors in the parameters of the leaf models, and these estimation errors can in turn lower the predictive accuracy of the resulting model relative to what could be achieved using other modeling techniques.
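The effect of dividing data among leaves can be illustrated with a simple calculation. The sketch assumes independent observations with a common noise standard deviation sigma, rows spread evenly over the leaves, and a leaf model that estimates only a mean, in which case the standard error of each leaf estimate is sigma divided by the square root of the number of rows per leaf. The function name and the figures used (400 rows, sigma of 1.0) are hypothetical.

```python
import math


def leaf_standard_error(sigma, n_rows, n_leaves):
    """Standard error of a leaf mean when rows are spread evenly over the leaves."""
    rows_per_leaf = n_rows / n_leaves
    return sigma / math.sqrt(rows_per_leaf)


# Hypothetical training table of 400 rows with unit noise.
for n_leaves in (1, 4, 16):
    print(n_leaves, leaf_standard_error(1.0, 400, n_leaves))
# prints:
# 1 0.05
# 4 0.1
# 16 0.2
```

Under these assumptions, each fourfold increase in the number of leaves doubles the estimation error of every leaf parameter.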
Thus, the problem remains in predictive modeling to provide an accurate model via a process that quickly converges, using limited amounts of data.