The subject application generally relates to content within linked web documents. In one particular example, unlabeled content within the linked documents is categorized via co-trained label expansion. It will be appreciated that the described techniques may find application in other systems and/or other methods.
In today's information age, individuals have access to a quantity and breadth of information never before possible. The information can be presented via interlinked web pages containing articles posted by users, which are accessed via the Internet. The articles can contain a wide range of content including text, images, video, etc. related to particular topics. Each article can be assigned one or more metadata tags to indicate a particular topic and/or subject matter related to the content within the article.
In this manner, articles can be labeled based at least in part upon such tags to facilitate subsequent organization and retrieval thereof. Manual labeling of content, however, is both time consuming and expensive. Thus, labeled content generally represents only a fraction of the total amount of information available on the Internet in general and linked documents in particular. If information is not labeled, alternative and generally inefficient search methods can be employed to try to identify relevant information.
In one example, a search engine is used as a low cost alternative although results may be difficult or impossible to navigate. For instance, one web page with relevant content can be identified along with hundreds of other web pages containing irrelevant content. This problem is exacerbated by the voluminous sources of information available at an enormous number of web sites. This number continues to grow at a rate of around 60 million new pages annually. Such growth makes it impractical for all the information to be continuously reviewed and appropriately labeled. Thus, much of this content is uncategorized and therefore can be cumbersome to access.
This can also be true for information within linked documents. In linked documents, hyperlinks within the text of one document point to a disparate document that expounds upon the linked text. Such interlinking can provide a convenient cross-reference to content/terms referred to within an article. Wikipedia is a popular example of linked documents and accounts for about 10 million articles written collaboratively by volunteers around the world. Almost all of the articles are created and revised by users who access the Wikipedia website according to certain policies and guidelines. Much of this content can remain uncategorized as the number of articles and contributors is greater than the resources available to categorize such information. Thus, it is difficult if not impossible to identify the content within all the Wikipedia articles.
Systems and methods are needed to categorize content, such as linked documents, available on the Internet to facilitate trouble-free access to relevant information.