Data mining is a technique by which hidden patterns may be found in a group of data. True data mining doesn't just change the presentation of data, but actually discovers previously unknown relationships among the data. Data mining is typically implemented as software in or in association with database systems. Data mining includes several major steps. First, data mining models are generated based on one or more data analysis algorithms. Initially, the models are “untrained”, but are “trained” by processing training data and generating information that defines the model. The generated information is then deployed for use in data mining, for example, by providing predictions of future behavior based on specific past behavior.
One type of modeling that is useful in building data mining models is neural network modeling. Generally, a neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so that it more accurately generates the expected output for each input sample.
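As an illustration only, the weight adjustment described above can be sketched for a single unit with two weighted input connections. The function names (`train_unit`, `predict`), the sigmoid activation, the learning rate, and the logical-OR training samples are all assumptions chosen for this sketch; they are not taken from any particular data mining system.

```python
import math
import random


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Output of one unit: sigmoid of the weighted sum of its inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_unit(samples, epochs=2000, rate=0.5, seed=0):
    """Adjust connection weights so the output moves toward each target.

    samples is a list of (inputs, target) pairs. The update rule is plain
    gradient descent on the cross-entropy loss (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = predict(weights, bias, inputs) - target
            # Each weight moves against its share of the error.
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias


# Hypothetical training data: the logical OR of two binary inputs.
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train_unit(OR_SAMPLES)
for inputs, target in OR_SAMPLES:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```

After training, the unit's output for each input sample lies on the correct side of 0.5, which is the sense in which the adjusted weights "define" the trained model.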
Traditionally, neural network models are trained using batch methods, in which large amounts of data are used to train the models. However, problems arise with these batch-training methods because the size of the data sample to be used for training must be specified. For large datasets, if all the rows of data in the dataset are used, the computation of necessary information, such as gradient and cost function information, becomes prohibitively expensive. One solution to this problem is to sample the data in the dataset and only use the sample to train the model. However, this presents a problem because the proper sample size must be chosen for best results. If the sample chosen is too large, the computation is still too expensive, while if the sample chosen is too small, the trained model is not adequately predictive of the dataset. Thus, the sample size must be chosen intelligently. Because each model or type of model requires different sample sizes, there is no fixed sample size that will work properly for all cases.
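The trade-off described above can be made concrete with a small sketch. It assumes a hypothetical one-weight linear model and a synthetic dataset (neither comes from the source), and compares the gradient computed from random samples of several sizes against the gradient computed from every row. Each gradient evaluation visits every row it is given, so its cost grows with the sample size, while small samples give a noisier estimate. The sketch only illustrates the problem; it does not choose a sample size automatically.

```python
import random


def make_rows(n, seed=0):
    """Synthetic dataset (an assumption): y is about 3 * x plus small noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        rows.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))
    return rows


def cost_and_gradient(rows, w):
    """Mean squared error of the model y = w * x, and its gradient in w.

    Both sums visit every row, so the work is proportional to len(rows).
    """
    n = len(rows)
    cost = sum((w * x - y) ** 2 for x, y in rows) / n
    grad = sum(2.0 * (w * x - y) * x for x, y in rows) / n
    return cost, grad


def gradient_error_by_sample_size(rows, sizes, w=0.0, seed=1):
    """For each sample size, report how far the sampled gradient is
    from the gradient computed over the whole dataset."""
    rng = random.Random(seed)
    _, full_grad = cost_and_gradient(rows, w)
    report = []
    for size in sizes:
        sample = rng.sample(rows, size)  # sampling without replacement
        _, grad = cost_and_gradient(sample, w)
        report.append((size, abs(grad - full_grad)))
    return report


rows = make_rows(10000)
for size, err in gradient_error_by_sample_size(rows, [10, 100, 1000, 10000]):
    print(f"rows used: {size:6d}   gradient error: {err:.6f}")
```

When the sample is the whole dataset, the error is zero up to floating-point rounding but the cost is at its maximum; smaller samples are cheaper but their error depends on the data and the model, which is why no single fixed size suits every case.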
A need arises for an automated technique that determines the size of the sample that is to be used in training a neural network data mining model.