With the widespread adoption of the World Wide Web, more and more hypertext documents have become available on the Web. Examples of such data include organizational and personal web pages (e.g., the WebKB benchmark data set, which contains university web pages), research papers (e.g., data in CiteSeer), online news articles, and customer-generated media (e.g., blogs). Compared with data in traditional information management, data on the Web contain, in addition to content, links such as hyperlinks from a student's homepage pointing to the homepage of her advisor, paper citations, sources of a news article, and comments of one blogger on posts from another blogger, among others. Performing information management tasks on such structured data raises many challenges.
For the classification problem of web pages, a simple approach treats web pages as independent documents. The advantage of this approach is that many off-the-shelf classification tools can be directly applied to the problem. However, this approach relies only on the content of web pages and ignores the structure of links among them.
Link structures provide invaluable information about properties of the documents as well as relationships among them. For example, in the WebKB dataset, the link structure provides additional insights about the relationships among documents (e.g., links often point from a student to her advisor or from a faculty member to his projects). Since some links among these documents imply inter-dependence among the documents, the usual i.i.d. (independent and identically distributed) assumption of documents does not hold. Hence, traditional classification methods that ignore the link structure may not be suitable.
On the other hand, it is difficult to rely only on link structures and ignore content information. For example, in the Cora dataset, the content of a research article abstract largely determines the category of the article. To improve the performance of web page classification, therefore, both link structure and content information should be taken into consideration. To achieve this goal, a simple approach is to convert one type of information to the other. For example, in spam blog classification, outlinks from a blog have been treated as a feature similar to the content features of the blog. In document classification, content similarity among documents has been converted into weights in a Markov chain. However, link and content information have different properties. For example, a link is an actual piece of evidence that represents an asymmetric relationship, whereas content similarity is usually defined conceptually for every pair of documents in a symmetric way. Therefore, directly converting one type of information to the other usually degrades the quality of the information. Conversely, an approach that simply considers link information and content information separately and then combines them in an ad hoc way ignores the inherent consistency between link and content information and therefore fails to combine the two seamlessly.
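The difference between the two kinds of information can be illustrated with a minimal sketch. The two toy documents, the single directed link, and the bag-of-words cosine measure below are illustrative assumptions and are not taken from any of the data sets mentioned above.

```python
import math
from collections import Counter

# Hypothetical toy documents (assumed for illustration only).
docs = {
    "student": "graduate student research machine learning",
    "advisor": "professor research machine learning courses",
}

# Hypothetical directed link set: the student page links to the advisor page.
links = {("student", "advisor")}

def cosine(a, b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

# Content similarity is defined for every pair and is symmetric.
sim_sa = cosine(docs["student"], docs["advisor"])
sim_as = cosine(docs["advisor"], docs["student"])

# A link is observed evidence and is directional.
link_sa = ("student", "advisor") in links
link_as = ("advisor", "student") in links

print(sim_sa, sim_as, link_sa, link_as)
```

In this sketch the similarity has the same value in both directions, while the link exists in only one direction, so flattening either kind of information into the other discards part of what it expresses.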
Link information has been incorporated using co-citation similarity, but this may not fully capture the link structure. In FIG. 1, for example, web pages B and E co-cite web page C, implying that B and E are similar to each other. In turn, A and D should be similar to each other, since A and D cite the similar web pages B and E, respectively. Under co-citation similarity alone, however, the similarity between A and D is zero unless other information is considered. In short, the World Wide Web contains rich textual content that is interconnected via complex hyperlinks. This huge database violates the assumption, held by most conventional statistical methods, that each web page is an independent and identically distributed sample. It is thus difficult to apply traditional mining or learning methods to web mining problems, e.g., web page classification, in a way that exploits both the content and the link structure. Conventional methods typically exploit the link structure or the content information alone, or combine the two in an ad hoc manner.
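The limitation of co-citation similarity described for FIG. 1 can be reproduced with a short sketch. The edge list below is an assumption inferred from the description (A cites B, D cites E, and both B and E cite C); the function names are hypothetical, and the similarity is taken here as the number of pages that two pages both cite.

```python
# Hypothetical edge list inferred from the description of FIG. 1.
edges = [("A", "B"), ("D", "E"), ("B", "C"), ("E", "C")]

def outlinks(edge_list):
    """Map each page to the set of pages it links to."""
    out = {}
    for src, dst in edge_list:
        out.setdefault(src, set()).add(dst)
        out.setdefault(dst, set())  # make sure every page has an entry
    return out

def cocitation_similarity(out, p, q):
    """Count the pages that both p and q link to."""
    return len(out[p] & out[q])

out = outlinks(edges)
sim_BE = cocitation_similarity(out, "B", "E")  # B and E both cite C
sim_AD = cocitation_similarity(out, "A", "D")  # A and D share no cited page

print(sim_BE, sim_AD)  # prints "1 0"
```

Under these assumptions B and E receive a nonzero similarity, but A and D receive zero even though they cite pages that are themselves similar, which is the gap the paragraph above points out.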