Machine learning can be employed to generate a model that is capable of classifying documents. As one example, Sahami et al. describe an approach for training a Bayesian model to classify email messages as junk (spam) or not junk. M. Sahami et al. “A Bayesian approach to filtering junk e-mail,” AAAI'98 Workshop on Learning for Text Categorization.
Within the technical art of machine classification, obtaining a useful set of training examples presents a challenge. It is important to expose the classifier to a wide variety of training examples, so that the classifier can properly generalize its classification function. In the some contexts, it can be difficult to efficiently obtain a training set. For example, many datasets are not stored in a format that is readily accessible, in contrast to emails, which are largely text-based. Furthermore, some datasets are vast but only include a small number of positive or negative examples, making the task of finding suitable training examples one of finding a needle in a haystack.