The tremendous growth of the Web (the World Wide Web Internet service) over the past few years has made a vast amount of information available to users. However, the sheer scale of the Web has also posed several problems in organizing, retrieving and utilizing this information. Web directories such as Yahoo!™, Lycos™ and Netscape's dmoz have been very useful for searching this vast amount of information. Directory-based methods of searching the Web can give better precision but lower recall than search engines. One reason for the low recall is that the directories are typically created and maintained manually. The human effort required to classify material and keep the directories up to date cannot keep pace with the exponential growth of the Web. Therefore, automatic categorization of Web-based information resources into these directories is required. Consequently, several machine learning techniques have been applied to the task of categorizing Web sites. Web directories are also helpful in many other information retrieval applications, such as information filtering and information extraction.
Several techniques have been developed to generate hierarchies from the search results returned by search engines. The generated hierarchies present an overview of the returned results and provide a mechanism to refine the search results. These techniques employ clustering algorithms to generate the hierarchies. Several other Web mining applications, such as query disambiguation, also require the clustering of Web documents.
Web mining applications typically represent Web documents in terms of certain features and then use machine learning techniques such as classification and clustering for further processing. Several approaches have used the text content of Web documents for this purpose, but text classifiers that perform quite well on simple text documents have been observed to perform poorly on Web documents. Web documents contain structured information nested inside different HTML tags, as well as hyperlinks. These documents therefore carry much more information than simple text documents. On the other hand, they also contain a lot of “noise” due to poor editing and spamming. Consequently, other features have been suggested for the purpose of classification and clustering of Web documents.
S. Chakrabarti et al, “Enhanced hypertext categorization using hyperlinks”, Proceedings of the ACM International Conference On Management Of Data (ACM-SIGMOD), pages 307-318, Seattle, US, 1998, discloses the use of hyperlinks for classification, tested on a set of Web pages and on US patent documents in which citations were treated as links. Text from neighboring documents and system-predicted class categories (topics) for the neighboring documents were used for document classification. The ‘neighboring’ documents were the documents having links to the document under consideration. A significant improvement was observed over a ‘baseline’ reference case that relied only on the text content of the target documents. Chakrabarti et al also tested a more naive way of using the linked documents, treating the words in the linked documents as if they were local, but this approach increased the error rate over the baseline case.
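The contrast between the two ways of using linked documents can be illustrated with a minimal Python sketch. It is not the method of Chakrabarti et al: the function names, the whitespace tokenizer and the `NEIGHBOR_CLASS=` feature prefix are all assumptions made here for illustration. The first function merges neighbor words into the local bag of words (the naive variant); the second keeps the local words separate and adds one feature per predicted neighbor class.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for the sketch.
    return text.lower().split()


def naive_features(local_text, neighbor_texts):
    """Treat words of linked documents as if they were local words."""
    features = Counter(tokenize(local_text))
    for text in neighbor_texts:
        features.update(tokenize(text))
    return features


def class_augmented_features(local_text, neighbor_classes):
    """Keep local words, and add one feature per predicted neighbor class.

    The 'NEIGHBOR_CLASS=' prefix is a hypothetical naming convention that
    keeps neighbor-derived evidence apart from the local vocabulary.
    """
    features = Counter(tokenize(local_text))
    for label in neighbor_classes:
        features["NEIGHBOR_CLASS=" + label] += 1
    return features
```

In the naive variant, off-topic words from a linked page are indistinguishable from the target document's own words, which is one plausible reading of why the error rate rose in the reported experiment.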
E. J. Glover et al, “Using Web Structure for Classifying and Describing Web Pages”, Proceedings of the International World Wide Web Conference (WWW2002), Honolulu, HI, 2002, discloses the use of anchor text and text neighboring the anchor text for classifying a target document after performing entropy-based feature selection. The anchor text of a link is the displayed text between the <A HREF="URL"> and </A> tags.
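As an illustration of the anchor-text definition only, the following Python sketch collects the displayed text of each link on a page using the standard library `html.parser` module. The class and function names are hypothetical, and the sketch does not attempt the neighboring-text extraction or the entropy-based feature selection described by Glover et al.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> element currently open
        self._chunks = []    # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


def extract_anchor_text(html):
    """Return a list of (href, anchor text) pairs found in html."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `extract_anchor_text('<A HREF="http://example.com/x">the <b>example</b> page</A>')` yields one pair whose anchor text is `the example page`; markup nested inside the link is ignored and only the displayed text is kept.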
J. M. Pierre, “Practical Issues for Automated Categorization of Web Sites”, September 2000, discloses the use of meta-tags appearing on Web pages for hypertext classification. Pierre suggests that meta-tags, whenever present, are more useful than the text content of the pages for classification. At the time of writing this patent specification, an online version of Pierre's paper is available from: www.ics.forth.gr/is1/SemWeb/proceedings/session3-3/html_version/semanticweb.html
G. Attardi et al, in “Automatic Web Page Categorization by Link and Context Analysis”, Proceedings of 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, 1999 (THAI-99), pages 105-119, propose an approach for document classification where a Web document is represented by the anchor text of the links that refer to the given document along with some contextual ‘hints’ extracted from the document that includes the link (hints such as page title, section title and list descriptions). The content of the Web document is ignored. Attardi et al report improved accuracy for classification of Web pages.
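The idea of representing a document by what other pages say about it, rather than by its own content, can be sketched in Python as follows. This is a simplified illustration and not the implementation of Attardi et al: the input layout (a dictionary of source pages with `title` and `links` keys) and the function name are assumptions, and only the page title is used as a contextual hint, with section titles and list descriptions omitted.

```python
from collections import defaultdict


def build_link_context_representation(pages):
    """Represent each target URL by text taken from the pages that link to it.

    pages maps a source URL to a dict with:
      'title' : title of the source page (used as a contextual hint)
      'links' : list of (target_url, anchor_text) pairs on that page
    The content of the target document itself is never consulted.
    """
    parts_by_target = defaultdict(list)
    for source_url, page in pages.items():
        for target_url, anchor_text in page["links"]:
            parts_by_target[target_url].append(anchor_text)
            parts_by_target[target_url].append(page["title"])
    # Join the collected fragments into one pseudo-document per target.
    return {target: " ".join(parts) for target, parts in parts_by_target.items()}
```

The resulting pseudo-documents could then be passed to any text classifier in place of the target pages' own text.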
Although representation of documents using anchor text and some specific ‘contextual’ text has been proposed in the art, there remains a need for improved methods for characterizing and generating representations of Web-based information resources, such as for categorizing Web pages and for efficient information retrieval.