Many search engine services, such as Google and Yahoo, support searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to them. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service identifies web pages that may be related to the search request based on how well the keywords of a web page match the words of the query. The search engine service then displays links to the identified web pages to the user, ordered by a ranking that may be determined by their relevance to the query, popularity, importance, and/or some other measure.
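The crawl, keyword mapping, and ranking steps described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method of any particular search engine service: the in-memory `TOY_WEB` dictionary, its URLs, and the function names are hypothetical, keywords are taken only from a headline and metadata, and ranking is simply the count of matching query terms.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> headline, metadata keywords, outgoing links.
TOY_WEB = {
    "root.example/": {
        "headline": "World news today",
        "meta": ["news"],
        "links": ["root.example/politics", "root.example/health"],
    },
    "root.example/politics": {
        "headline": "Summit on trade treaty",
        "meta": ["politics", "treaty"],
        "links": ["root.example/"],
    },
    "root.example/health": {
        "headline": "Diet and exercise tips",
        "meta": ["diet", "health"],
        "links": [],
    },
    # Not linked from any root page, so a crawl from the root never reaches it.
    "orphan.example/": {"headline": "Unlinked diet page", "meta": ["diet"], "links": []},
}


def crawl(roots, web):
    """Breadth-first traversal returning every URL reachable from the root pages."""
    seen, order, queue = set(), [], deque(roots)
    while queue:
        url = queue.popleft()
        if url in seen or url not in web:
            continue
        seen.add(url)
        order.append(url)
        queue.extend(web[url]["links"])
    return order


def extract_keywords(page):
    """Assumed extraction rule: headline words plus metadata words, lowercased."""
    return set(page["headline"].lower().split()) | {m.lower() for m in page["meta"]}


def build_index(roots, web):
    """Build the keyword -> set-of-URLs mapping for all crawled pages."""
    index = defaultdict(set)
    for url in crawl(roots, web):
        for keyword in extract_keywords(web[url]):
            index[keyword].add(url)
    return index


def search(index, query):
    """Rank pages by how many query terms match their keywords (ties broken by URL)."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url in index.get(term, ()):
            scores[url] += 1
    return sorted(scores, key=lambda u: (-scores[u], u))


index = build_index(["root.example/"], TOY_WEB)
results = search(index, "trade treaty")
```

In this sketch, a query for "diet" would return only the health page, because the orphan page is unreachable from the root list and is therefore never indexed. A real service would combine relevance with popularity, importance, or other measures, as noted above.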
Whether the web pages of a search result are of interest to a user depends, in large part, on how well the keywords identified by the search engine service represent the primary topic of a web page. Because a web page may contain many different types of information, it may be difficult to discern the primary topic of a web page. For example, many web pages contain advertisements that are unrelated to the primary topic of the web page. A web page from a news web site may contain an article relating to an international political event and may also contain “noise information” such as an advertisement for a popular diet, an area containing legal notices, and a navigation bar. It has traditionally been very difficult for a search engine service to identify what information on a web page is noise information and what information relates to the primary topic of the web page. As a result, a search engine service may select keywords based on noise information, rather than on the primary topic of the web page. For example, a search engine service may map a web page that contains a diet advertisement to the keyword “diet,” even though the primary topic of the web page relates to an international political event. When a user then submits a search request that includes the search term “diet,” the search engine service may return the web page that contains the diet advertisement, which is unlikely to be of interest to the user.
Many information retrieval and mining applications, such as the search engine services described above, depend in part on the ability to divide a web page into blocks and to classify the functions of those blocks. These applications include classification, clustering, topic extraction, content summarization, and ranking of web pages. The classification of the function of a block can also be used in fragment-based caching, in which caching policies are based on individual fragments, and to highlight blocks that may be of interest to users. Such classification is particularly useful when a web page is displayed on a small screen, such as that of a mobile device.
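To make the dependence on block function concrete, the diet-advertisement example above can be sketched as follows. The sketch assumes that a page has already been divided into blocks and that each block already carries a function label; how those labels are obtained is not specified here. The label names, the `NOISE_FUNCTIONS` set, and the `keywords` helper are hypothetical and chosen only for illustration.

```python
# Hypothetical function labels treated as noise information.
NOISE_FUNCTIONS = {"advertisement", "legal_notice", "navigation"}

# A news page already divided into blocks, each with an assumed function label.
news_page = [
    {"function": "article", "text": "Leaders sign treaty at international summit"},
    {"function": "advertisement", "text": "Try the popular new diet"},
    {"function": "legal_notice", "text": "Copyright and terms of use"},
    {"function": "navigation", "text": "Home World Sports"},
]


def keywords(blocks, skip_noise):
    """Collect lowercased words from blocks, optionally skipping noise blocks."""
    words = set()
    for block in blocks:
        if skip_noise and block["function"] in NOISE_FUNCTIONS:
            continue
        words.update(block["text"].lower().split())
    return words


# Without block functions, the advertisement contributes the keyword "diet".
naive = keywords(news_page, skip_noise=False)

# With block functions, only the article block contributes keywords.
filtered = keywords(news_page, skip_noise=True)
```

Under these assumptions, the page would be mapped to "diet" only in the naive case, which is the mismatch between keywords and primary topic described above.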