Many search engine services, such as Google and Overture, provide for searching of information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of base web pages to identify all web pages that are accessible through those base web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service may generate a relevance score for each identified web page to indicate how related the information of the web page may be to the search request. The search engine service then displays to the user links to those web pages in an order that is based on their relevance.
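By way of illustration only, the mapping of keywords to web pages and the relevance ordering described above can be sketched as follows. The function names, the example URLs, and the scoring rule (a count of matched search terms) are assumptions made for this sketch and are not the method of any particular search engine service.

```python
from collections import defaultdict


def build_keyword_index(pages):
    """Map each keyword to the set of page URLs whose keywords include it."""
    index = defaultdict(set)
    for url, keywords in pages.items():
        for keyword in keywords:
            index[keyword.lower()].add(url)
    return index


def search(index, query):
    """Return URLs ordered by a toy relevance score (number of matched terms)."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url in index.get(term, ()):
            scores[url] += 1
    # Highest score first; ties are broken by URL so the order is stable.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical pages with keywords already extracted by a crawler.
pages = {
    "example.com/tennis": ["court", "tennis", "match"],
    "example.com/law": ["court", "battles", "lawsuit"],
}
index = build_keyword_index(pages)
results = search(index, "court battles")
```

In this sketch, the page matching both search terms is listed ahead of the page matching only one. An actual relevance score would typically weigh many more signals.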
Although search engine services may return many web pages as a search result, presenting the web pages in relevance order may make it difficult for a user to find the web pages of particular interest to that user. Since the web pages that are presented first may be directed to popular topics (e.g., when the ordering is based on Google's PageRank), a user who is interested in an obscure topic may need to scan many pages of the search result to find a web page of interest. To make it easier for a user to find web pages of interest, the web pages of a search result could be presented in a hierarchical organization based on some classification or categorization of the web pages. For example, if a user submits a search request of “court battles,” the search result may contain web pages that can be classified as sports-related or legal-related. The user may prefer to be presented initially with a list of classifications of the web pages so that the user can select the classification of web pages that is of interest. For example, the user might be first presented with an indication that the web pages of the search result have been classified as sports-related and legal-related. The user can then select the legal-related classification to view web pages that are legal-related. In contrast, since sports web pages are more popular than legal web pages, a user might have to scan many pages to find legal-related web pages if the most popular web pages are presented first. Alternatively, the user may be presented with a hierarchy of classifications. The user may select a classification when the user submits a search request. In this case, the search engine would limit the search to web pages within the selected classification.
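As a minimal sketch of presenting a search result by classification, the following groups an already-ranked result list under classification labels. The labels, the example URLs, and the function name are hypothetical, and the sketch assumes that each web page has already been assigned a classification.

```python
from collections import defaultdict


def group_by_classification(ranked_results):
    """Group (url, classification) pairs by classification, keeping rank order."""
    groups = defaultdict(list)
    for url, classification in ranked_results:
        groups[classification].append(url)
    return dict(groups)


# Hypothetical result for the search request "court battles",
# ordered by popularity, so sports-related pages come first.
ranked_results = [
    ("example.com/nba-finals", "sports"),
    ("example.com/tennis-open", "sports"),
    ("example.com/supreme-court-case", "legal"),
]
groups = group_by_classification(ranked_results)

# The user selects the legal-related classification directly
# instead of scanning past the more popular sports-related pages.
legal_pages = groups["legal"]
```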
It would be impractical to manually classify the millions of web pages that are currently available. Although automated classification techniques have been used to classify text-based content, those techniques are not generally applicable to the classification of web pages. Web pages have an organization that includes noisy content, such as an advertisement or a navigation bar, that is not directly related to the primary topic of the web page. Because conventional text-based classification techniques would use such noisy content when classifying a web page, these techniques would tend to produce incorrect classifications of web pages. Moreover, although many attempts have been made to classify web pages, they have generally not been able to effectively classify web pages into hierarchical classifications. A major reason for the inability to effectively classify the web pages is that some of the classifications have very few web pages. Because of the sparseness of web pages in certain classifications, it can be difficult to identify a large enough training set of web pages for training of a classifier for those classifications.
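The sparseness problem described above can be illustrated by counting the labeled web pages available for each classification and flagging those that fall below a minimum training set size. The labels and the threshold are assumptions chosen for this sketch only.

```python
from collections import Counter


def sparse_classifications(labels, min_examples):
    """Return the classifications that have fewer than min_examples labeled pages."""
    counts = Counter(labels)
    return sorted(name for name, count in counts.items() if count < min_examples)


# Hypothetical labels of a small set of manually classified web pages.
labels = ["sports"] * 5 + ["legal"] * 2 + ["legal/maritime"]
sparse = sparse_classifications(labels, min_examples=3)
```

In this sketch, the two legal classifications have too few labeled web pages to train on, while the sports classification does not.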