A great deal of attention has been given to automated document-classification techniques. In particular, as the volume of digital data has exploded in recent years, significant demand has arisen for techniques to organize, sort and/or identify such data in a manner that allows it to be useful for a specified purpose.
Automated classification of digital information has application in a number of different practical situations, including text classification (e.g., determining whether a particular e-mail message is spam based on its textual content) and the like. A variety of different techniques for automatically classifying documents exist.
One kind of classification technique uses a supervised classifier, such as a Support Vector Machine (SVM) or a Naïve Bayes classifier. Generally speaking, a supervised classifier takes as input the feature vectors for a number of labeled training samples, i.e., samples labeled as to whether or not they belong to a category. Then, based on such training information, the classifier generates a function for mapping an arbitrary feature vector into a decision as to whether or not the corresponding document belongs in the category. When a new unlabeled document (or, more specifically, its feature vector) is input, the function is applied to determine whether the document belongs in the category. Unfortunately, the present inventor has discovered that such supervised classifiers often are too slow, particularly when many documents are to be classified.
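By way of illustration only, the train-then-apply flow described above can be sketched as a minimal multinomial Naïve Bayes classifier with add-one smoothing. The function names (`featurize`, `train`, `classify`), the whitespace tokenization and the toy spam data are assumptions made for this sketch; they are not taken from any particular classifier discussed herein.

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words feature vector: token -> count (hypothetical tokenizer)."""
    return Counter(text.lower().split())


def train(samples):
    """Build a model from (text, label) pairs; label is True if in the category.

    Assumes at least one positive and one negative training sample.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in samples:
        word_counts[label].update(featurize(text))
        doc_counts[label] += 1
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Apply the learned function to a new, unlabeled document."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    features = featurize(text)
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Log prior for the label.
        score = math.log(doc_counts[label] / total_docs)
        for token, count in features.items():
            if token not in vocab:
                continue  # tokens never seen in training are ignored
            # Add-one (Laplace) smoothed likelihood of the token.
            p = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += count * math.log(p)
        scores[label] = score
    return scores[True] > scores[False]


# Toy labeled training set (invented for illustration).
training = [
    ("win free money now", True),
    ("free prize claim now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday with the team", False),
]
model = train(training)
print(classify(model, "claim your free money"))
print(classify(model, "monday team meeting"))
```

Note that `classify` must score every feature of every incoming document against the model, which is one reason the per-document cost of such classifiers grows with the number of documents to be classified.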
Another approach that has been suggested is to classify documents by constructing a search-engine query, where the query itself functions as the definition for a particular category; once created, such a query is submitted to the corresponding search engine, and all the returned results are then automatically assigned to the category. See, e.g., A. Anagnostopoulos, et al., “Effective and Efficient Classification on a Search-Engine Model,” CIKM'06, Nov. 5-11, 2006, Arlington, Va., USA. While potentially faster than supervised classification techniques, such an approach generally is not as accurate. For example, even the Anagnostopoulos article itself notes that such a technique achieves only 86% or 90% of the accuracy of the best SVM classifier, even under the conditions chosen for the authors' own experiments.
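By way of illustration only, the query-as-category idea can be sketched with a toy inverted index standing in for the search engine. The helper names (`build_index`, `run_query`), the simple all-of/none-of query form and the sample documents are assumptions made for this sketch; it does not reproduce the query-construction method of the Anagnostopoulos article.

```python
from collections import defaultdict


def build_index(docs):
    """Toy inverted index: token -> set of document ids (hypothetical engine)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def run_query(index, all_of, none_of=()):
    """Return ids of documents containing every term in all_of and none in none_of."""
    if not all_of:
        return set()
    # set.intersection returns a new set, so the index itself is never mutated.
    hits = set.intersection(*(index.get(term, set()) for term in all_of))
    for term in none_of:
        hits -= index.get(term, set())
    return hits


# Sample corpus and category definition (invented for illustration).
docs = {
    1: "free money offer",
    2: "free lunch on monday",
    3: "money transfer receipt",
    4: "claim your free money prize",
}
index = build_index(docs)

# Each category is defined solely by its query.
category_queries = {
    "spam": {"all_of": ["free", "money"], "none_of": ["receipt"]},
}

# Every document returned for a query is assigned to that query's category.
assignments = {
    name: run_query(index, **query) for name, query in category_queries.items()
}
print(assignments)
```

In this sketch, classification reduces to a handful of index lookups per category rather than a scoring pass over each document, which suggests why such an approach can be faster; its accuracy, however, is limited by how well a single query can capture the category.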