Many problems in information processing involve the selection or classification of items in a large data set. For example, web-based companies such as Yahoo! must frequently classify web pages as belonging to one group or another, e.g., as commercial or non-commercial.
Currently, large amounts of data can be cheaply and automatically collected. However, labeling of the data typically involves expensive and fallible human participation. For example, a single web crawl by a search engine, such as Yahoo! or Google, indexes billions of web pages. Only a very small fraction of these web pages can be hand-labeled by human editorial teams and assembled into topic directories. The remaining web pages form a massive collection of unlabeled documents.
Co-pending application Ser. No. 10/949,821, entitled “A Method And Apparatus For Efficient Training Of Support Vector Machines,” filed Sep. 24, 2004, the entirety of which is incorporated herein by reference, describes a modified finite Newton algorithm for training linear support vector machines (SVMs) on sparse datasets with a potentially very large number of examples and features. Such datasets are generated frequently in domains like document classification. However, the system and method described in that application incorporate only labeled data in the finite Newton algorithm (abbreviated L2-SVM-MFN). Large-scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data.
A system and method that extend linear SVMs to semi-supervised classification problems involving large, and possibly very high-dimensional but sparse, partially labeled datasets are desirable. In many information retrieval and data mining applications, linear methods are strongly preferred because of their ease of implementation, interpretability and empirical performance. The preferred embodiments of the system and method described herein address this and other needs.