The Internet allows users to access millions of electronic documents, such as electronic mail messages, web pages, memoranda, design specifications, electronic books, and so on. Because of the large number of documents, it can be difficult for users to locate documents of interest. To locate a document, a user may submit search terms to a search engine. The search engine identifies documents that may be related to the search terms and then presents indications of those documents as the search result. When a search result is presented, the search engine may attempt to provide a summary of each document so that the user can quickly determine whether a document is really of interest. Some documents may have an abstract or summary section that can be used by the search engine. Many documents, however, do not have abstracts or summaries. The search engine may automatically generate a summary for such documents. The usefulness of the automatically generated summaries depends in large part on how effectively a summary represents the main concepts of a document.
Many traditional information retrieval summarization algorithms have been adapted to automatically generate summaries of web pages from their content. For example, Luhn proposed an algorithm that calculates the significance of a sentence to a document based on the keywords of the document that are contained within the sentence. Luhn's algorithm selects the sentences with the highest significance to form the summary of the document. As another example, latent semantic analysis (“LSA”) algorithms generate an LSA score for each sentence of a document using singular value decomposition. The sentences with the highest scores are selected to form the summary of the document. Unfortunately, the summaries generated by adapting these conventional algorithms to web pages are not particularly accurate. The main reason for the inaccuracies may be that many web pages contain content directed to several different topics (e.g., different news articles and advertisements), whereas many conventional algorithms were designed to generate a summary of a document having a single primary topic.
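The keyword-based sentence scoring attributed to Luhn above can be illustrated with a minimal sketch. It is a simplified variant written for illustration, not Luhn's exact procedure: it treats the most frequent non-stopword terms of the document as keywords and scores each sentence as the square of its keyword count divided by its length. The stopword list, the function names, and the parameters `num_keywords` and `num_sentences` are all assumptions introduced here.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "that", "for", "on", "with", "was", "are"}


def split_sentences(text):
    """Naively split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and return its word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def luhn_style_summary(text, num_keywords=5, num_sentences=1):
    """Return the highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)

    # Keywords: the most frequent non-stopword terms in the whole document.
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOPWORDS
    )
    keywords = {w for w, _ in counts.most_common(num_keywords)}

    def score(sentence):
        # Simplified significance: (keyword hits)^2 / sentence length.
        words = tokenize(sentence)
        if not words:
            return 0.0
        hits = sum(1 for w in words if w in keywords)
        return hits * hits / len(words)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]
```

Because the keywords are drawn from the whole document, a page that mixes several unrelated topics yields a keyword set that represents none of them well, which is consistent with the inaccuracy described above.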
More recent algorithms use the hyperlink structure of the web to generate more accurate summaries of web pages. In particular, many of these techniques use the content of the web pages that link to a web page to generate a summary for that web page. The underlying assumption is that a web page author who includes a link in their web page is likely to provide an accurate (albeit possibly short) summary of the content of a referenced web page. These hyperlink-based algorithms may use the text of the hyperlink itself and the text surrounding the hyperlink to generate a summary. Some algorithms that use the text surrounding the hyperlink may extract a certain number of words (e.g., 25) before and after a hyperlink or may extract a complete sentence or paragraph surrounding a hyperlink.
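The extraction of a fixed number of words before and after a hyperlink, as described above, might look like the following sketch. It uses Python's standard `html.parser` module; the class name `AnchorContextParser`, the function `anchor_contexts`, the `window` parameter, and the placeholder URL in the usage example are assumptions made for illustration, not part of any particular algorithm.

```python
from html.parser import HTMLParser


class AnchorContextParser(HTMLParser):
    """Collect visible words and the word positions of each <a> element."""

    def __init__(self):
        super().__init__()
        self.words = []     # all visible words, in document order
        self.anchors = []   # (href, start_index, end_index) into self.words
        self._open = None   # (href, start_index) of the anchor being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = (dict(attrs).get("href"), len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.anchors.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        self.words.extend(data.split())


def anchor_contexts(html, target_url, window=25):
    """Return anchor text plus up to `window` words on each side of every
    hyperlink in `html` that points at `target_url`."""
    parser = AnchorContextParser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text

    contexts = []
    for href, start, end in parser.anchors:
        if href != target_url:
            continue
        contexts.append({
            "anchor": " ".join(parser.words[start:end]),
            "before": " ".join(parser.words[max(0, start - window):start]),
            "after": " ".join(parser.words[end:end + window]),
        })
    return contexts
```

As a usage example, calling `anchor_contexts('Today, I visited the <a href="https://example.org/white-house">White House</a> with my mother.', "https://example.org/white-house", window=3)` yields the anchor text "White House" with "I visited the" before it and "with my mother." after it. The window is purely positional, so the sketch returns whatever words happen to surround the link, whether or not they describe the referenced page.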
These hyperlink-based or anchor-based algorithms, however, have difficulty distinguishing hyperlinks whose surrounding text accurately describes the referenced web page from those whose surrounding text does not. For example, a web page may contain the sentence “Today, I visited the <link>White House</link> with my mother.” The text surrounding this link describes the author's visit rather than the content of the White House web page, so it provides an inaccurate description of that page. As a result, these hyperlink-based algorithms often generate summaries that are inaccurate.