The Internet allows users to access millions of electronic documents, such as electronic mail messages, web pages, memoranda, design specifications, electronic books, and so on. Because of the large number of documents, it can be difficult for users to locate documents of interest. To locate a document, a user may submit search terms to a search engine. The search engine identifies documents that may be related to the search terms and then presents indications of those documents as the search result. When a search result is presented, the search engine may attempt to provide a summary of each document so that the user can quickly determine whether a document is really of interest. Some documents may have an abstract or summary section that can be used by the search engine. Many documents, however, do not have abstracts or summaries. The search engine may automatically generate a summary for such documents. Automatic text summarization techniques can also be used to summarize many types of documents other than web pages, such as business reports, political reports, technical reports, news articles, chapters of books, and so on. The usefulness of the automatically generated summaries depends in large part on how effectively a summary represents the main concepts of a document.
Many different algorithms have been proposed for automatic text summarization. Luhn proposed an algorithm that calculates the significance of a sentence to a document based on keywords of the document that are contained within the sentence. In particular, a Luhn-based algorithm identifies a portion of the sentence that is bracketed by keywords that are not more than a certain number of non-keywords apart. The significance of a sentence as calculated by a Luhn-based algorithm is a score that reflects the density of keywords within the bracketed portion. The Luhn-based algorithm may calculate the score of a sentence as the square of the number of keywords contained in the bracketed portion divided by the total number of words within the bracketed portion. The sentences with the highest scores are selected to form the summary. (See H. P. Luhn, The Automatic Creation of Literature Abstracts, 2 IBM J. OF RES. & DEV. No. 2, 159-65 (April 1958).)
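By way of illustration only, the Luhn-based scoring described above may be sketched as follows. The function names, the `max_gap` parameter, and the whitespace tokenization are assumptions made for this sketch. The set of keywords is supplied by the caller, since how keywords are chosen is not specified here, and a sentence containing several bracketed portions is given the score of its best portion, which is likewise an assumption.

```python
def luhn_score(words, keywords, max_gap=4):
    """Score one tokenized sentence by keyword density.

    A bracketed portion starts and ends with a keyword, and consecutive
    keywords inside it are separated by at most `max_gap` non-keywords
    (assumed threshold). Score = (keywords in portion)^2 / (words in portion).
    """
    positions = [i for i, w in enumerate(words) if w.lower() in keywords]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Close enough: the keyword extends the current portion.
            count += 1
        else:
            # Gap too large: score the finished portion, start a new one.
            best = max(best, count * count / (prev - start + 1))
            start = pos
            count = 1
        prev = pos
    # Score the final portion.
    return max(best, count * count / (prev - start + 1))


def luhn_summary(sentences, keywords, top_n=1, max_gap=4):
    """Return the `top_n` highest-scoring sentences in document order."""
    scored = [(luhn_score(s.split(), keywords, max_gap), i)
              for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:top_n]
    return [sentences[i] for _, i in sorted(top, key=lambda t: t[1])]
```

For example, in the sentence "the quick summary of text uses keyword density for scoring" with keywords "summary", "text", "keyword", and "density", the bracketed portion runs from "summary" to "density", contains four keywords among six words, and scores 16/6. The sketch does not strip punctuation, so a practical tokenizer would need to handle that.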
Other summarization algorithms use latent semantic analysis (“LSA”), which applies singular value decomposition to generate an LSA score for each sentence of a document. An LSA summarization technique may generate a word-sentence matrix for the document that contains a weighted term-frequency value for each word-sentence combination. The singular value decomposition of the matrix may be represented by the following:

$$A = U \Sigma V^{T} \qquad (1)$$

where A represents the word-sentence matrix, U is a column-orthonormal matrix whose columns are left singular vectors, Σ is a diagonal matrix whose diagonal elements are non-negative singular values sorted in descending order, and V is an orthonormal matrix whose columns are right singular vectors. After decomposing the matrix into U, Σ, and V, an LSA summarization technique uses the right singular vectors to generate the scores for the sentences. The sentences with the highest scores are selected to form the summary. (See Y. H. Gong & X. Liu, Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis, in PROC. OF THE 24TH ANNUAL INTERNATIONAL ACM SIGIR, New Orleans, La., 19-25 (2001).)
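By way of illustration only, an LSA-style scoring may be sketched with NumPy as follows. Several choices here are assumptions of the sketch rather than details taken from the cited work: raw term counts stand in for the weighted term-frequency values, sentences are tokenized on whitespace, and each sentence is scored by the length of its vector in the space of the top `rank` right singular vectors, each weighted by its singular value. Other ways of deriving scores from the right singular vectors are possible.

```python
import numpy as np


def build_word_sentence_matrix(sentences):
    """Return (vocabulary, A), where A[i, j] counts word i in sentence j.

    Raw counts are a simplification of weighted term frequencies.
    """
    tokenized = [s.lower().split() for s in sentences]
    vocabulary = sorted({w for words in tokenized for w in words})
    index = {w: i for i, w in enumerate(vocabulary)}
    A = np.zeros((len(vocabulary), len(sentences)))
    for j, words in enumerate(tokenized):
        for w in words:
            A[index[w], j] += 1.0
    return vocabulary, A


def lsa_sentence_scores(A, rank):
    """Score each sentence (column of A) from the right singular vectors.

    Assumed scoring rule: score_j = sqrt(sum_k (sigma_k * Vt[k, j])**2)
    over the `rank` largest singular values. The rule does not depend on
    the arbitrary sign of each singular vector.
    """
    # Vt has the right singular vectors as rows; singular values descend.
    _, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
    r = min(rank, len(singular_values))
    weighted = singular_values[:r, None] * Vt[:r, :]
    return np.sqrt((weighted ** 2).sum(axis=0))


def lsa_summary(sentences, rank=2, top_n=1):
    """Return the `top_n` highest-scoring sentences in document order."""
    _, A = build_word_sentence_matrix(sentences)
    scores = lsa_sentence_scores(A, rank)
    best = sorted(np.argsort(-scores)[:top_n])
    return [sentences[j] for j in best]
```

When `rank` equals the full number of singular values, this score reduces to the Euclidean length of each sentence's column in A, so the truncation to a smaller `rank` is what makes the score reflect only the dominant latent topics.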
Although both Luhn and LSA summarization techniques generally generate effective summaries, it is difficult to accurately assess, either objectively or subjectively, the effectiveness of a summary at representing the main concepts of the document.