The present invention relates to classifying data. More particularly, the present invention relates to methods and apparatus for classifying documents, including textual documents, using Support Vector Machine (SVM) classifiers.
A discussion of the need for data classification and some earlier approaches to it, including rule-based systems, Bayesian networks and Neural Networks, can be found in Dumais, et al., U.S. Pat. No. 6,192,360.
The increasing use of statistical methods in the sciences in recent decades spurred the search for improved statistical approaches to classifying large amounts of data. Some wanted to read handwritten documents, which meant that they needed to classify a great many samples of handwritten words. Others wanted to separate photographed faces from their background. Medical researchers wanted to separate lab characteristics of benign and malignant tumors. The rise and spread of the World Wide Web in the 1990s greatly increased the dissatisfaction with existing statistical methods for searching and classifying large amounts of text. Beginning in the early 1990s, Vladimir Vapnik and others devised a new statistical approach to large classification problems. This method is referred to as a Support Vector Machine (SVM), a mathematical technique with which it is possible to teach a computer to classify large amounts of data more reliably than ever before. Mathematicians and scientists continue to search for ways to improve these new learning machines.
SVMs are powerful tools for solving large-scale classification problems such as those sometimes seen when classifying text. Many text classification problems, however, share a common characteristic: although they involve a large number of pages and a large number of words, any given page contains only a tiny percentage of the words found in the English language, and a very small percentage of those seen in the entire group of pages. That is, if a single document in a large set of related documents is searched for every word found anywhere in the set, far more words will be missing from the document than will be found in it. When the documents are represented as a matrix with one row per document and one column per word, most entries are therefore zero; this is referred to as a sparse data matrix. Classification problems of this kind, with a huge number of documents, a large number of words, and a sparse data matrix, require a classification engine that can achieve fast processing performance.
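By way of illustration only, the sparsity described above can be seen in a minimal sketch that stores, for each document, only the words actually present. The function names `build_sparse_rows` and `density`, the whitespace tokenization, and the three sample documents are assumptions made for this example and are not taken from any particular classifier.

```python
from collections import Counter


def build_sparse_rows(documents):
    """Return (vocabulary, rows); each row maps a column index to a word count.

    Only words that occur in a document are stored for that document, so
    absent words (the zeros of the matrix) take up no memory.
    """
    vocabulary = {}  # word -> column index, assigned in order of first appearance
    rows = []
    for text in documents:
        counts = Counter(text.lower().split())  # illustrative whitespace tokenizer
        row = {}
        for word, count in counts.items():
            column = vocabulary.setdefault(word, len(vocabulary))
            row[column] = count
        rows.append(row)
    return vocabulary, rows


def density(vocabulary, rows):
    """Fraction of cells in the document-by-word matrix that are nonzero."""
    total_cells = len(rows) * len(vocabulary)
    nonzero_cells = sum(len(row) for row in rows)
    return nonzero_cells / total_cells if total_cells else 0.0


# Hypothetical sample documents, chosen so that no word is shared.
documents = [
    "support vector machines classify text",
    "web pages contain many words",
    "tumors are benign or malignant",
]
vocabulary, rows = build_sparse_rows(documents)
print(len(vocabulary), density(vocabulary, rows))
```

In this toy set each document holds 5 of the 15 distinct words, so two thirds of the cells are already zero. In a large collection of Web pages the vocabulary grows far faster than the length of any one page, and the fraction of nonzero cells becomes correspondingly smaller.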
For many such problems, however, SVM handling of the data requires more computer memory than is ordinarily available, as is the case for the text-intensive problems often found in categorizing large numbers of text pages on the Web. One solution that has improved the ability of computers to learn to classify such data is ‘chunking’, in which the problem is broken up into smaller, more manageable pieces that are within the processing capability of commonly available computer resources. Among the chunking decomposition techniques used to reduce such problems for SVMs are the classifier techniques known as SMO and SVM Light. However, the speed improvement made possible with these techniques is often modest, especially when designing classifiers, such as those needed for Web pages, for which some of the largest and most difficult text problems are found. As a consequence, it is desirable to build a faster classifier engine for Web pages and other large text problems. Classification problems other than text, but of similar size and similar data characteristics, also benefit from increased machine learning speed.
Speed is very desirable, but existing techniques, including the decomposition techniques referred to as SVM Light and SMO, do not facilitate fast learning for classifiers of the large numbers of documents found in many classification problems. Hence there is a need for an SVM classifier design that matches the accuracy of existing classifier engines but is significantly faster, so as to reduce the training time of Support Vector Machines.