In situations where one has access to massive amounts of data, the cost of building a statistical model to characterize the data can be significant if not insurmountable. The accuracy of the model and the cost of building the model are competing interests associated with building a statistical model. That is, while the use of a larger data set may provide a more accurate model than a smaller set of the data, the analysis of data tends to become increasingly inefficient and expensive with larger data sets. Because of the computational complexity associated with analyzing large data sets, a common practice is to build a model on the basis of a sample of the data. However, the choice of the size of the sample to use is far from clear.
Various methodologies have been proposed to employ progressive samples to analyze data in order to find an adequate sample size for which a model of reasonable quality can be constructed. A learning curve method (also known as progressive sampling) is one approach to evaluate the relationship between the accuracy of a model and the cost of learning the model. The basic idea of a learning curve method is to iteratively apply a learning method to larger and larger subsets of the data until the increasing costs of learning from larger subsets outweigh the increasing benefit of accuracy. FIG. 1 illustrates a typical learning curve, illustrating the relationship between benefit and cost. As shown in FIG. 1, the learning curve has a steeply sloping portion early in the curve, a more gently sloping middle portion, and a plateau late in the curve.
There are three main components of a learning curve method. The first component is the data policy, which is the sampling schedule by which one uses portions of the data set to train a model. The second component is the training policy (or induction algorithm), which defines how one applies a training method to the data. The final component is the stopping criterion, which is how one determines that the cost associated with further training exceeds the benefit of improved performance.