The invention relates generally to a process of creating predictive models and, more particularly, to a method of creating predictive models with incomplete data using genetic algorithms. The invention may be employed, for example, to create propensity models.
Customer relationship management (CRM) has become the key to growth in today's highly competitive market. Information about customers' interests and their earning and spending behavior is useful for companies to identify which section of the market they are catering to. Such information also helps the companies to predict the relevant aspects of a customer's behavior, for example, how likely the customer is to respond to an offer, how much the customer is likely to borrow etc. Companies therefore maintain databases of their customers replete with such information and conduct surveys and create customer response sheets to gather data in a database to build predictive models.
The method of obtaining data from the customers plays a major role in deciding the level of completeness of the data set. For example, a data set obtained from the details given by customers that are necessary pre-requisites to opening an account will typically be complete. But a data set obtained from data provided on other bases, such as human resource profiling sheets completed by customers may be incomplete because the information is not mandatory. Finally, data sets most susceptible to gaps and faults will typically be those obtained by conducting surveys amongst customers because nothing is mandatory in such surveys.
However, the only basis for conducting needed market analysis is data. The quality of the data used will generally be reflected in the quality of the resulting analysis. Therefore, to conduct a comprehensive analysis, it is indispensable that data sets be used that are complete as possible. Omission of any customer from the analysis could simply translate to loss of valuable business and/or loss of accuracy in the resulting analysis.
A typical customer database may be presented as a table of rows and columns where each row corresponds to a customer while the columns correspond to different information about the customer, such as account level information provided by the company, personal information provided by the customer, or a behavior or bureau score provided by a credit scoring agency. The table may contain blank cells where data is missing. It is generally desirable to complete otherwise compensate for the missing data. The problem is essentially to impute some value in these blank cells so as to provide the maximum amount of information in the database and thereby to enable a good predictive model to be built from that information.
Missing values can be imputed using methods such as mean imputation, hot deck imputation, cold deck imputation, regression analysis, propensity score analysis and multiple imputation. Mean imputation, hot deck imputation and cold deck imputation are relatively naïve and inappropriate to be used profitably in a large data sheet. Regression analysis also typically provides extremely inaccurate results if the data do not follow a linear model, especially in large data sheets. This also rules out predictive mean matching for large data sets. Propensity score analysis too is generally inaccurate for large data sets. Multiple imputation is a rigorous process which involves finding multiple estimates for each missing value from several samples of complete data and then combining all these estimates to get the final value to be imputed. However, in many cases, multiple imputation may necessitate too much overhead, investment and computation cost to justify its use for missing value problems.
The problem of missing value imputation is a small part of the entire process of predictive modeling, but the quality of the model is dependent on the information used to build the model. Therefore, there is a need for a method capable of providing good imputation for the missing values and yet does not require a separate and exhaustive process.