Existing relational learning techniques may be applied to text classification, including the classification of documents such as web pages. Typically, a relational learning technique starts with a classification method, such as linear classification, and improves the classification using the text information provided. In particular, hyperlinks among web documents may provide useful information for improving the accuracy of document classification. For example, hyperlink information has been used to refine the classes of graph neighbors, seeded with a text-only classifier, by applying an EM-like technique that significantly improved Yahoo Directory classification accuracy. See S. Chakrabarti, B. Dom, and P. Indyk, Enhanced Hypertext Categorization Using Hyperlinks, in SIGMOD'98, 1998. Other techniques have been applied for aggregating neighborhood class assignments. See S. Macskassy and F. Provost, Classification in Networked Data: A Toolkit and a Univariate Case Study, Technical Report CeDER-04-08, Stern School of Business, New York University, 2004, which analyzes classification performance with various configurations of local classifiers, relational classifiers, and collective inference methods for propagating evidence through the graph. Also see D. Jensen, J. Neville, and B. Gallagher, Why Collective Inference Improves Relational Classification, in KDD'04, 2004, for a related study. Methods originating in inductive logic programming have also been applied to classification with hyperlinks. See M. Craven and S. Slattery, Relational Learning with Statistical Predicate Invention: Better Models for Hypertext, Machine Learning, 43:97-119, 2001, for the use of a combination of FOIL and Naive Bayes for classification in the WebKB data.
Many of these link-based relational learning models either implement a procedure that does not solve an optimization problem, and therefore does not necessarily converge, or require approximate Bayesian inference because the underlying Bayesian formulation is non-convex. A different approach is needed for combining link and text information, one that leads to a well-formed convex optimization solution that can be efficiently computed. Although some theoretical aspects of combining link and text information have been discussed in recent work, the combinations discussed do not lead to implementable algorithms suitable for large-scale text classification problems. See, for instance, A. Argyriou, M. Herbster, and M. Pontil, Combining Graph Laplacians for Semi-supervised Learning, in NIPS'05, 2006.
What is needed is a system and method for combining link and text information in an implementable solution suitable for large-scale text classification problems. Such a system and method should be able to train a classifier for online applications that classify very large numbers of documents, such as web pages accessible through the World Wide Web.