Mathematical and analytical models may be used to recognize hidden predictive patterns in a data set. The kinds of problems for which a model may be used include clustering, classification, estimation and association of data in the data set. There are several types of models that are commonly used, such as probabilistic radial functions (such as probabilistic neural networks, generalized regression neural networks and Gaussian radial basis functions), decision trees (such as K-D trees and neural trees), neural networks, Kohonen networks and other associative algorithms.
Each datum in a data set generally is defined by a vector of one or more input fields and one or more output fields. Given a kind of problem and a kind of model, input fields that affect the solution to the defined problem are identified and standardized. Any target output fields also are identified and standardized. A training data set to be used to generate or train the model then is prepared. The training data set typically is a subset of a database or large data set, and generally is created using stratified sampling of the large data set. For example, a large customer database containing several million records may be sampled to create a training set of several thousand entries that is generally representative of the overall customer base.
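As an illustration of the conventional approach, stratified sampling may be sketched as follows. This is a minimal sketch under stated assumptions, not a prescribed implementation: the function name `stratified_sample`, the `stratum_of` callback and the use of a fixed sampling fraction per stratum are hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction of records from every stratum.

    records    -- iterable of records (any Python objects)
    stratum_of -- hypothetical callback mapping a record to its stratum key
    fraction   -- fraction of each stratum to keep, between 0 and 1
    """
    rng = random.Random(seed)

    # Group the records by stratum.
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)

    # Sample each stratum in proportion to its size.
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample
```

With 9,000 records in one stratum, 1,000 in another and a fraction of 0.01, the sketch keeps 90 and 10 records respectively, so the sample mirrors the proportions of the full data set.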
A database often contains sparse, i.e., under-represented, conditions which might not be represented in a training data set if the training data set is created by stratified sampling. A model trained using such a stratified sample ultimately would not represent the sparse conditions. The sparse conditions may be important conditions for the model, especially if the model is intended to evaluate risks, such as credit risk or fraud.
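The loss of a sparse condition can be seen from the arithmetic of proportional allocation. The sketch below is illustrative only; the function name `proportional_allocation` and the record counts are assumptions made for the example and do not come from any particular database.

```python
def proportional_allocation(stratum_counts, fraction):
    """Return how many records a proportional sample takes from each stratum."""
    return {name: round(count * fraction)
            for name, count in stratum_counts.items()}


# Hypothetical database: one million records, of which 50 represent fraud.
allocation = proportional_allocation({"ordinary": 999_950, "fraud": 50}, 0.005)
# The "fraud" stratum is allocated 0 records, so the sample omits it entirely.
```

Under these assumed counts, a 0.5 percent sample of about 5,000 records contains no instance of the sparse condition, and a model trained on the sample cannot represent it.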
Sparse conditions may be represented in a training set by using a data set which includes essentially all of the data in a database, without stratified sampling. A series of samples, or “windows,” is used to select portions of the large data set for phases of training. In general, the first window of data should be a reasonably broad sample of the data. After the model is initially trained using the first window of data, subsequent windows are used to retrain the model. For some model types, the model is modified in order to provide it with some retention of training obtained using previous windows of data. Neural networks and Kohonen networks may be used without modification. Models such as probabilistic neural networks, generalized regression neural networks, Gaussian radial basis functions, and decision trees, including K-D trees and neural trees, are modified to provide them with properties of memory to retain the effects of training with previous training data sets. Such a modification may be provided using clustering. Parallel training models which partition the training data set into disjoint subsets are modified so that the partitioner is trained only on the first window of data, whereas subsequent windows are used to train the models to which the partitioner applies the data in parallel.
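Training over a series of windows may be sketched as follows. This is a toy illustration under stated assumptions, not the modification applied to any of the model types named above: `PrototypeModel` is a hypothetical stand-in that keeps one running-mean prototype per output label, chosen only because its running sums retain the effect of earlier windows. The names `windows` and `train_in_windows` are likewise hypothetical.

```python
from itertools import islice


def windows(records, size):
    """Yield consecutive windows (lists) of at most `size` records."""
    iterator = iter(records)
    while True:
        window = list(islice(iterator, size))
        if not window:
            return
        yield window


class PrototypeModel:
    """Toy model: one running-mean prototype per output label."""

    def __init__(self):
        self.sums = {}    # label -> per-field sum of input vectors seen so far
        self.counts = {}  # label -> number of input vectors seen so far

    def train(self, window):
        # Accumulating sums and counts lets each new window refine the
        # prototypes without discarding what earlier windows contributed.
        for inputs, label in window:
            if label not in self.sums:
                self.sums[label] = [0.0] * len(inputs)
                self.counts[label] = 0
            self.sums[label] = [s + v for s, v in zip(self.sums[label], inputs)]
            self.counts[label] += 1

    def predict(self, inputs):
        # Return the label whose prototype (mean vector) is nearest.
        def squared_distance(label):
            n = self.counts[label]
            return sum((v - s / n) ** 2
                       for v, s in zip(inputs, self.sums[label]))
        return min(self.counts, key=squared_distance)


def train_in_windows(model, records, size):
    """Train on the first window, then retrain on each subsequent window."""
    for window in windows(records, size):
        model.train(window)
    return model
```

In this sketch the records are assumed to be ordered so that the first window is a reasonably broad sample, for example by shuffling the data set beforehand; a sparse condition that first appears in a later window simply adds or adjusts a prototype.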
An advantage of this method of training a model using a large training data set is that the first window of data essentially completely trains the model. Subsequent windows of data are less likely to modify the model if the first window is a substantially broad sample of the entire data set. Time for additional training is incurred primarily due to sparse conditions in the data set, which are precisely those conditions which training with a large data set is intended to capture. If some of these sparse conditions are identified prior to training, data representing these sparse conditions may be re-introduced several times into the training process to ensure that the model represents these conditions.
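Re-introducing known sparse conditions into the training process may be sketched as follows. The schedule shown, in which the sparse records are presented again after every few windows, is an assumption made for illustration; the function name `with_reintroduced` and the parameter `every` are hypothetical.

```python
def with_reintroduced(windows, sparse_records, every=2):
    """Yield each window in turn; after every `every` windows, yield the
    known sparse records again as an extra window."""
    sparse = list(sparse_records)
    for index, window in enumerate(windows, start=1):
        yield list(window)
        if sparse and index % every == 0:
            yield list(sparse)
```

Each window yielded by this generator may be passed to the model's training step in sequence, so the sparse records are seen several times over the course of training rather than once.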