1. Technical Field
The present invention relates to an improved data processing system. In particular, the present invention relates to a method and system for selecting sample data to test and train predictive algorithms of customer behavior.
2. Description of Related Art
Currently, when artificial intelligence algorithms are used to discover patterns in customer behavior, it is necessary to create training data sets in which the predicted outcome is known, as well as testing data sets in which the predicted outcome is also known, so that the accuracy of a predictive algorithm can be validated. The predictive algorithm may be designed, for example, to predict a customer's propensity to respond to an offer or to buy a product.
The data used to train and test the algorithm are selected using a random selection procedure, such as selection based upon a random number generator, or by some other means, to ensure that both the training and testing data sets are representative of the entire data population being evaluated. Tests of randomness can then be run on each of the attributes in the data sets, e.g., the demographic information of the individuals, to see whether the data sets represent a randomly selected population.
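By way of illustration only, the conventional selection procedure described above might be sketched as follows. The record layout, the `age` attribute, the 70/30 split, and the function names `random_split` and `attribute_gap` are assumptions made for this example and are not taken from any particular known system. A simple comparison of means stands in here for a formal test of randomness.

```python
import random
from statistics import mean


def random_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    rng = random.Random(seed)          # seeded so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def attribute_gap(sample, population, attribute):
    """Absolute difference between the sample mean and the population
    mean of one numeric attribute (a crude representativeness check)."""
    sample_mean = mean(r[attribute] for r in sample)
    population_mean = mean(r[attribute] for r in population)
    return abs(sample_mean - population_mean)


# Hypothetical customer records with a single demographic attribute.
customers = [{"id": i, "age": 20 + (i * 7) % 50} for i in range(1000)]

train, test = random_split(customers)
train_gap = attribute_gap(train, customers, "age")
test_gap = attribute_gap(test, customers, "age")
```

A small gap for each attribute suggests that the attribute is similarly distributed in the sample and in the population. Note that a check of this kind examines only the attributes it is given, here a demographic one.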
While the above approach to selecting testing and training data sets may be suited to some applications, the purchasing behavior of customers is not based only on demographic and psychographic information. Rather, geographic location also influences a customer's purchasing behavior.
People tend to co-locate based on common interests and common backgrounds. That is, people tend to co-locate with other persons with whom they share common characteristics. This effect is known as the “nugget” effect. In much the same way that gold, due to its inert chemistry, is rarely evenly distributed through rock and is thus found in nuggets within a particular geographic formation, people also tend to “nugget” in geographical areas. In the known systems, such “nuggeting” of individuals is not taken into consideration when selecting training and testing data for a predictive algorithm. Thus, bias may be introduced into the testing data, the training data, or both, making either or both data sets nonrepresentative of the overall customer database.
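To make the concern concrete, the following minimal sketch compares the share of records from each geographic area in a sample against the share in the overall population. The `region` field, the two region names, and the helper names `region_shares` and `max_share_gap` are hypothetical and chosen only for this example; the sketch illustrates the kind of bias described above and is not a method drawn from any known system.

```python
from collections import Counter


def region_shares(records, key="region"):
    """Return the fraction of records that falls in each region."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {region: n / total for region, n in counts.items()}


def max_share_gap(sample, population, key="region"):
    """Largest absolute difference, over the population's regions,
    between a region's share in the sample and in the population."""
    pop = region_shares(population, key)
    smp = region_shares(sample, key)
    return max(abs(smp.get(region, 0.0) - share) for region, share in pop.items())


# Hypothetical population split evenly between two regions.
population = [{"region": "north"}] * 50 + [{"region": "south"}] * 50

# A sample that over-represents one region, as might happen when
# customers "nugget" geographically.
biased_sample = [{"region": "north"}] * 40 + [{"region": "south"}] * 10
balanced_sample = [{"region": "north"}] * 25 + [{"region": "south"}] * 25

biased_gap = max_share_gap(biased_sample, population)
balanced_gap = max_share_gap(balanced_sample, population)
```

In this example the biased sample draws 80% of its records from one region that holds only 50% of the population, so its gap is large, while the balanced sample has no gap.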
Therefore, it would be beneficial to have a method and system for selecting a data sample for testing, training, and using discovery-based data mining in a customer relationship marketing predictive system that takes into consideration any geographic bias that may exist in the original customer database and/or in the selected data samples.