Many search engine services, such as Google and Overture, allow users to search for information that is accessible via the Internet. These search engine services let users search for display pages, such as web pages, that may be of interest to them. After a user submits a search request that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, a search engine service may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service may generate a relevance score for each web page to indicate how relevant the information of the web page may be to the search request, based on the closeness of each match, web page popularity (e.g., Google's PageRank), and so on. The search engine service then displays to the user links to those web pages in an order that is based on their relevance scores.
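The keyword-to-page mapping and relevance ordering described above can be illustrated with a minimal sketch. The page URLs, keyword lists, popularity values, and the scoring formula (fraction of search terms matched plus a popularity value) are all hypothetical assumptions chosen for illustration; they are not the method of any particular search engine service.

```python
from collections import defaultdict

# Hypothetical toy corpus: page URL -> keywords already extracted from the page
# (for example, from its headline or metadata).
PAGES = {
    "example.com/tennis": ["court", "battles", "tennis"],
    "example.com/law": ["court", "battles", "lawsuit"],
    "example.com/cooking": ["recipes", "pasta"],
}

# Hypothetical popularity values standing in for a link-based popularity score.
POPULARITY = {
    "example.com/tennis": 0.9,
    "example.com/law": 0.2,
    "example.com/cooking": 0.5,
}


def build_index(pages):
    """Build the mapping of keywords to the web pages that contain them."""
    index = defaultdict(set)
    for url, keywords in pages.items():
        for keyword in keywords:
            index[keyword.lower()].add(url)
    return index


def search(index, popularity, request):
    """Return matching page URLs ordered by an assumed relevance score."""
    terms = request.lower().split()
    matches = defaultdict(int)
    for term in terms:
        for url in index.get(term, ()):
            matches[url] += 1

    def relevance(url):
        # Assumed score: fraction of search terms matched plus popularity.
        return matches[url] / len(terms) + popularity.get(url, 0.0)

    return sorted(matches, key=relevance, reverse=True)


index = build_index(PAGES)
results = search(index, POPULARITY, "court battles")
```

With this toy data, both the tennis page and the law page match the search terms equally well, so the more popular tennis page is ordered first.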
Although search engine services may return many web pages as a search result, presenting the web pages in rank order may make it difficult for a user to actually find the web pages of particular interest. Since the web pages that are presented first may be directed to popular topics, a user who is interested in an obscure topic may need to scan many pages of the search result to find a web page of interest. To make it easier for a user to find web pages of interest, the web pages of a search result could be presented in a hierarchical organization based on some classification or categorization of the web pages. For example, if a user submits a search request of “court battles,” the search result may contain web pages that can be classified as sports-related or legal-related. The user may prefer to be presented initially with a list of the classifications of the web pages so that the user can select the classification that is of interest. For example, the user might first be presented with an indication that the web pages of the search result have been classified as sports-related and legal-related. The user can then select the legal-related classification to view the web pages that are legal-related. In contrast, if the most popular web pages are presented first, a user might have to scan many pages of the search result to find legal-related web pages, since sports web pages are more popular than legal web pages.
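The classification-based presentation described above can be sketched as follows. The sketch assumes that each web page of the search result already carries a classification label; how that label is produced is not shown. The function name, URLs, and labels are hypothetical.

```python
def group_by_classification(ranked_results):
    """Group (url, classification) pairs by classification.

    The input is assumed to be in rank order. Rank order is preserved
    within each group, and groups appear in order of their first page.
    """
    groups = {}
    for url, classification in ranked_results:
        groups.setdefault(classification, []).append(url)
    return groups


# Hypothetical rank-ordered search result for the request "court battles".
ranked = [
    ("example.com/tennis-final", "sports-related"),
    ("example.com/basketball", "sports-related"),
    ("example.com/supreme-court", "legal-related"),
    ("example.com/squash", "sports-related"),
]

groups = group_by_classification(ranked)
classifications = list(groups)      # what the user would be shown first
legal_pages = groups["legal-related"]  # shown after the user selects a classification
```

In this example the legal-related page is third in rank order, but it is reachable in one selection once the classifications are presented first.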
It would be impractical to manually classify the millions of web pages that are currently available. Although automated classification techniques have been used to classify text-based content, those techniques are not generally applicable to the classification of web pages. Web pages typically include noisy content, such as an advertisement or a navigation bar, that is not directly related to the primary topic of the web page. Because conventional text-based classification techniques would treat such noisy content the same as the primary content when classifying a web page, these techniques would tend to produce incorrect classifications of web pages.
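The effect of noisy content on a text-based classifier can be illustrated with a deliberately simple sketch. The term-counting classifier, the category term lists, and the sample text below are hypothetical assumptions; conventional classification techniques are more sophisticated, but they can be misled in a similar way when noisy content is counted along with the primary content.

```python
from collections import Counter

# Hypothetical indicator terms for each classification.
CATEGORY_TERMS = {
    "legal": {"court", "lawsuit", "judge", "appeal"},
    "sports": {"tennis", "score", "league", "match"},
}


def classify(text):
    """Pick the classification whose indicator terms occur most often."""
    counts = Counter(text.lower().split())
    totals = {
        label: sum(counts[term] for term in terms)
        for label, terms in CATEGORY_TERMS.items()
    }
    return max(totals, key=totals.get)


# Primary content of a hypothetical legal web page.
main_content = "the judge dismissed the lawsuit and the court denied the appeal"

# Noisy content, such as a navigation bar and an advertisement, on the same page.
noisy_content = "tennis score league match tennis score league match tickets"

label_from_main = classify(main_content)
label_from_whole_page = classify(main_content + " " + noisy_content)
```

Here the primary content alone is classified as legal, but the whole page is classified as sports because the noisy content contributes more indicator terms than the primary content does.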
It would be desirable to have a classification technique for web pages that would base the classification of a web page on the primary topic of the web page and give little weight to noisy content of the web page.