1. Technical Field
The present invention relates in general to data analysis and in particular to qualifying sample populations employed in predictive data analysis. Still more particularly, the present invention relates to reducing the number of attributes of a sample population employed in generating a predictive model based on the sample population.
2. Description of the Related Art
Contemporary data collection focuses on a wide array of subjects, such as customer data collected by various industries, medical information collected for development of diagnostics and treatment protocols, and data relating to insurable or potentially insurable activities collected for insurance risk assessment. Collection of such data has become so routine and pervasive that “data mining” is frequently required to separate useful information from dross.
Data collection is frequently undertaken for the purpose of developing predictive models. That is, by statistical analysis of the characteristics of a sample population, attempts are made to derive models which may predict, with reasonable accuracy, whether an individual subject will exhibit a characteristic or group of characteristics of interest based on known characteristics of that subject. For instance, marketing firms may attempt to develop predictive models for determining which individuals within a target population are most likely to respond to a particular promotional campaign.
Contemporary data collection generally proceeds more or less indiscriminately. That is, those engaged in collection of data typically collect as much data regarding each individual subject as possible, without regard to the ultimate usefulness of the data in, for example, developing a predictive model. This may result from uncertainty regarding which characteristics are most useful for a particular purpose and/or the simplistic conviction that more data will produce better results. More frequently, however, indiscriminate data collection results instead from the use of a data set for more than one purpose, spreading the cost of the data collection among multiple projects.
One effect of indiscriminate data collection on the development of predictive models is the inefficiency and error introduced by data samples having a large number of attributes. A sample population may include data for five hundred or more characteristics of each individual subject within the sample population. Attempting to generate a predictive model based on that many individual characteristics is computationally inefficient. Furthermore, as the number of characteristics or attributes employed in generating the predictive model increases, the probability that the sample population is skewed by one or more of those characteristics or attributes also increases.
It would be desirable, therefore, to provide a mechanism for preprocessing a sample population to reduce the number of attributes or characteristics employed in generating a predictive model. It would further be advantageous if the mechanism eliminated characteristics which might skew the sample population and thereby degrade the accuracy of a predictive model generated from such data.