The present invention relates to predictive systems where the objective of the prediction is to model the probability that a certain event will occur given the parameters of population membership. Predictive models create value by taking available data samples and then applying some modeling technique to the data. Common modeling techniques include linear regression, logistic regression, neural networks, classification and regression tree (CART), and other techniques. A key requirement of each of these methods is that they require a set of functional relationships, or input-output pairs (Z, Y) as the starting point of the modeling process. The present invention addresses the circumstance where such input-output, pairs are not readily available and must be synthesized from distributions of samples that contain the event of interest and samples that do not contain the event of interest. Some examples of how this data requirement impacts other approaches are described below.
Many systems create models by using regression techniques. Whether linear, nonlinear, logistic, neural network or otherwise, all of these techniques require a well-defined set of functional pairs against which the model is fit. The present invention creates predictive segments as a pre-processing step to a regression modeling system or can be used as a fully functional predictive model by itself.
Clustering techniques, such as K-means or vector quantization, define groupings from which density functions can be defined, and hence can be used as means of generating input-output pair's to be used a pre-processing step to a predictive modeling process, such as a regression model. However, shortcomings of clustering techniques, which are addressed by the present invention are (i) clusters may not be predictive; that is, the clustering and differentiation of the input variable space may be different than the clustering and differentiation of the output variable space; (ii) the methods are computationally expensive; that is, they require a large number of iterative calculations to adjust the clusters to convergence (although only against the clustering criteria of the input space, not the output/prediction space); and (iii) determination of the number of clusters is difficult and may require trial and error, particularly given the non-guarantee of the predictability of the clusters; and (iv) the clustering is further complicated by the existence of two distributions, a normalizing distribution, and the differentiated distribution.
The present invention is similar to classification and regression trees (CART) in that it generates progressive levels of segmentation based on the significance of data. However, the significant drawback of CART is that CART assumes that the functional pairs already exist. The present invention can be applied to the circumstance where input-output pairs exist, but more importantly also applies in cases where the functional pairs are not defined as part of the data set. Also, the present invention has the benefit that it produces natural predictive segments of the input variables relative to the output variables.