Big data refers to a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big data comprises high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization. The trend toward larger data sets is due to the increasing abundance of information, both information that is recorded directly and information that can be derived from analysis of large sets of related data.
Scorecards are mathematical models that attempt to provide a quantitative estimate of the probability that a consumer will display a defined behavior (e.g., accepting another product when offered one, defaulting on a loan, declaring bankruptcy, or reaching a lower level of delinquency). Scorecards are built and optimized to evaluate the credit files of a homogeneous population (e.g., files with delinquencies, files that are very young, or files that contain very little information). Many traditional empirically derived scoring systems use between 10 and 20 variables.
A widespread use of scorecards is credit scoring. Credit scoring typically uses observations on individuals who defaulted on their loans together with observations on a large number of consumers who have not defaulted. Estimation techniques such as logistic regression or probit regression are commonly applied to this historical data to estimate the probability of default for each observation, although other techniques can be used. The resulting credit score model can then predict the probability of default for new individuals described by the same observation characteristics or variables (e.g., age, income, home ownership). The default probabilities are then scaled to a “credit score.” This score ranks individuals by riskiness without explicitly identifying their probability of default.
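The flow described above (estimate a probability of default from historical observations, then scale it to a score) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and does not come from any particular scoring system: the eight training rows are invented toy values, the three features loosely mirror the age, income, and home ownership examples, the hand-written gradient descent stands in for whatever estimation routine a real system would use, and the scaling constants (600 points at 50:1 good-to-bad odds, 20 points to double the odds) are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic regression weights by batch gradient descent on the log-loss.

    weights[0] is the intercept; weights[1:] align with the feature columns.
    """
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y  # derivative of the log-loss w.r.t. the linear score
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def probability_of_default(weights, x):
    """Predicted probability of default for one applicant."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def scale_to_score(pd, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Map a default probability to a score using log-odds scaling.

    Assumed convention: base_score corresponds to base_odds (good:bad),
    and every doubling of the odds adds pdo points.
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    odds_good = (1.0 - pd) / pd
    return offset + factor * math.log(odds_good)

# Invented toy history: [age (scaled 0-1), income (scaled 0-1), owns home (0/1)].
# Label 1 = defaulted, 0 = did not default.
rows = [
    [0.2, 0.1, 0], [0.3, 0.2, 0], [0.4, 0.3, 0], [0.5, 0.3, 1],
    [0.5, 0.6, 1], [0.6, 0.7, 1], [0.7, 0.8, 1], [0.3, 0.2, 1],
]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

weights = fit_logistic(rows, labels)

# Score two hypothetical new applicants with the same variables.
risky_pd = probability_of_default(weights, [0.2, 0.1, 0])
safe_pd = probability_of_default(weights, [0.7, 0.8, 1])
risky_score = scale_to_score(risky_pd)
safe_score = scale_to_score(safe_pd)

print(f"risky applicant: PD={risky_pd:.3f}, score={risky_score:.0f}")
print(f"safe applicant:  PD={safe_pd:.3f}, score={safe_score:.0f}")
```

Because the scaling is a monotone function of the odds, the score preserves the ranking of applicants by estimated risk while hiding the underlying probability, which matches the property noted above. A production scorecard would normally rely on a vetted statistics library and far more data than this sketch does.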