In the United States, a “credit score” is a number that represents the creditworthiness of a person, which can correspond to the likelihood that the person will pay his or her debts. Lenders, such as banks, credit card companies, and other financial and governmental institutions, use credit scores to evaluate the risk posed by lending money to consumers. Credit scores offer an effective and less expensive way for such lenders to provide credit and for consumers to obtain it.
A number of companies are involved in determining credit scores for consumers (which can be individuals, businesses, or any other entities). One of the best known and most widely used scores is the FICO score, which was developed by Fair Isaac Corporation, Minneapolis, Minn., USA, and is calculated statistically using information from a consumer's credit files. The credit score typically provides a snapshot of risk that lenders use to help make lending decisions. For example, consumers having higher FICO scores might be offered better interest rates as well as higher amounts of credit, such as mortgages, automotive loans, business loans, etc. The credit score is based on payment history (e.g., whether or not the consumer pays his/her bills on time), credit utilization (i.e., the ratio of current revolving debt (e.g., credit card balances) to the total available revolving credit or credit limit), length of credit history (i.e., a consumer that has a longer credit history may have a higher credit score), types of credit used (e.g., installment, revolving, consumer finance, mortgage), and recent searches for credit. Other factors, such as money owed because of a court judgment or tax lien, or having one or more newly opened consumer finance credit accounts, can also have an impact on the credit score. The credit scores can range between 300 and 850, where a lower score indicates poorer creditworthiness of a consumer and a higher score indicates better creditworthiness of a consumer.
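As a concrete illustration of the credit utilization ratio defined above, the following minimal sketch divides total revolving debt by total revolving credit limit. The function name and the sample balances and limits are hypothetical, and the sketch shows only the ratio itself, not how any particular credit score weighs it.

```python
def credit_utilization(revolving_balances, credit_limits):
    """Return total revolving debt divided by total revolving credit limit."""
    total_limit = sum(credit_limits)
    if total_limit <= 0:
        raise ValueError("total credit limit must be positive")
    return sum(revolving_balances) / total_limit


# Hypothetical consumer with two credit cards (illustrative figures only).
balances = [500.0, 1500.0]
limits = [2000.0, 8000.0]

utilization = credit_utilization(balances, limits)
print(f"{utilization:.0%}")  # prints "20%"
```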
Various algorithmic approaches are used in the credit-score determination industry to determine a credit score for a consumer based on the available data (such as the data described above). One of these approaches uses a classification and regression tree (“CART”) algorithm. The CART algorithm involves a classification tree analysis and a regression tree analysis, both of which are decision tree learning algorithms that map observations about an item to conclusions about the item's target value. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. The classification tree analysis is a decision tree learning algorithm where the predicted outcome can represent a class to which a particular data item belongs (e.g., a homeowner, a renter, etc.). The regression tree analysis is a decision tree learning algorithm where the predicted outcome can represent a real number (e.g., height, age, etc.). A segmentation tree defines logic for dividing a population into two or more subpopulations. Since the relationship between predictors and the target outcome can often vary between subpopulations, developing different models for different subpopulations frequently results in a more powerful scoring system compared to using a single model for the entire population. To assess a decision tree as a segmentation tree, a predictive model is developed for every leaf node, and a hold-out sample for each leaf node is scored with the corresponding model. That is, a model is trained for each leaf node, and the full scoring system is calibrated so that the relationship between score and the predicted outcome is consistent across all leaf node models, which allows total system performance for the full population to be calculated. The total system performance of each decision tree being evaluated is then compared.
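The evaluation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by any particular scoring vendor: the data are synthetic, the field names (`x`, `group`, `region`, `y`) and candidate trees are hypothetical, each leaf model is a one-variable least-squares line, and total system performance is taken to be mean squared error on the hold-out sample. Because every leaf model here predicts the outcome on the same scale, the calibration step is omitted.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx if sxx else 0.0
    return mean_y - b * mean_x, b


def system_mse(train, holdout, leaf_of):
    """Train one model per leaf, score the hold-out sample, return overall MSE."""
    models = {}
    for leaf in {leaf_of(r) for r in train}:
        leaf_points = [(r["x"], r["y"]) for r in train if leaf_of(r) == leaf]
        models[leaf] = fit_line(leaf_points)
    squared_error = 0.0
    for r in holdout:
        a, b = models[leaf_of(r)]
        squared_error += (r["y"] - (a + b * r["x"])) ** 2
    return squared_error / len(holdout)


# Synthetic population (assumption): the predictor/outcome relationship
# depends on "group" but not on "region".
rng = random.Random(0)
records = []
for i in range(400):
    group, region = i % 2, (i // 2) % 2
    x = rng.random()
    y = (2.0 * x if group == 0 else 2.0 - 2.0 * x) + rng.gauss(0, 0.1)
    records.append({"x": x, "group": group, "region": region, "y": y})
train, holdout = records[:200], records[200:]

# Hypothetical candidate segmentation trees, each mapping a record to a leaf.
candidates = {
    "no split": lambda r: 0,
    "split on region": lambda r: r["region"],
    "split on group": lambda r: r["group"],
}
results = {name: system_mse(train, holdout, f) for name, f in candidates.items()}
best = min(results, key=results.get)

for name, mse in results.items():
    print(f"{name}: hold-out MSE = {mse:.3f}")
print("best candidate:", best)
```

In this sketch each candidate tree is just a function from a record to a leaf label, so comparing trees reduces to comparing one performance number per tree.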
However, such classification and regression tree algorithms are optimized to directly predict the target variable and cannot identify subpopulations in which the relationship between the predictors and the target differs. As a result, these trees are unlikely to yield the best-performing scoring system. Further, while there are thousands of potential candidate trees, only a small number of them can be evaluated with this approach, because configuring and subsequently evaluating each candidate tree is a manual process.
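The limitation can be illustrated with a deliberately constructed example. In the sketch below, the data, the `group` attribute, and the helper names are all hypothetical. The target rises with the predictor in one group and falls with it in the other, yet both groups have the same mean target value, so a standard regression-tree split criterion (reduction in target variance) assigns essentially no gain to splitting on the group, even though the two groups call for very different models.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_reduction(rows, go_left):
    """Regression-tree split gain: parent variance minus weighted child variance."""
    left = [y for x, g, y in rows if go_left(x, g)]
    right = [y for x, g, y in rows if not go_left(x, g)]
    weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(rows)
    return variance([y for _, _, y in rows]) - weighted


def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sum((x - mean_x) ** 2 for x, _ in points)


# Hypothetical rows of (x, group, y): y = 2x in group 0 and y = 2 - 2x in group 1.
xs = [i / 10 for i in range(11)]
rows = [(x, 0, 2 * x) for x in xs] + [(x, 1, 2 - 2 * x) for x in xs]

# Gain a regression tree would assign to splitting on the group attribute.
gain_group = variance_reduction(rows, lambda x, g: g == 0)

# Relationship between predictor and target inside each group.
slope_group0 = slope([(x, y) for x, g, y in rows if g == 0])
slope_group1 = slope([(x, y) for x, g, y in rows if g == 1])

print("split gain on group:", gain_group)
print("slope in group 0:", slope_group0)
print("slope in group 1:", slope_group1)
```

Here the split gain is zero up to floating-point error while the within-group slopes are approximately +2 and -2. Real data are rarely this extreme, so the example should be read only as a demonstration of the mechanism.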