In information retrieval, text summarization is widely used to help users quickly evaluate the relevance of documents or navigate through a corpus. Text summarization methods can basically be categorized into the following four approaches:

(a) listing the first natural paragraph, or a number of sentences at the beginning of an article, as the summary (e.g., Infoseek, Yahoo!, etc.). This method is very simple, but it cannot give an overview of the article;

(b) listing the sentences that match the search query (Lotus' site, Beijing Daily's site, etc.). This method generates a summary that relates directly to the search query, but it cannot give an overview of the article either;

(c) using a template. This method searches a document for certain patterns and fills the matched contents into a pre-defined template. It can generate a very coherent summary, but it is applicable only to fixed styles and fixed fields, and is difficult to generalize;

(d) identifying the most important clauses, sentences, or paragraphs by using statistical techniques. Taking the identification of important sentences as an example, this method generally comprises four steps: (1) analyzing the section and chapter structure of a document and segmenting it into paragraphs and sentences; (2) dividing the sentences into words; (3) evaluating the importance of each of the words and sentences; and (4) outputting the sentences with higher evaluation scores as a summary of the document.
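The four steps of the statistical approach can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the method of any particular system: paragraphs are assumed to be separated by blank lines, sentences are split with a simple regular expression, word importance is taken to be raw term frequency, and sentence importance is the average importance of its words. The names `split_sentences`, `tokenize`, `summarize`, and the `STOPWORDS` list are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on"}


def split_sentences(document):
    """Step (1): segment into paragraphs (blank-line separated), then sentences."""
    sentences = []
    for paragraph in re.split(r"\n\s*\n", document):
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if sentence:
                sentences.append(sentence)
    return sentences


def tokenize(sentence):
    """Step (2): divide a sentence into lowercase words, dropping stopwords."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def summarize(document, k=2):
    """Steps (3) and (4): score words and sentences, return the top-k sentences.

    Returns (index, sentence) pairs in document order, so the caller can see
    where in the document each selected sentence came from.
    """
    sentences = split_sentences(document)
    tokens = [tokenize(s) for s in sentences]

    # Step (3a): word importance = term frequency over the whole document
    # (an assumed scoring rule; other statistics could be substituted).
    word_score = Counter(w for sentence_tokens in tokens for w in sentence_tokens)

    # Step (3b): sentence importance = mean importance of its words.
    def sentence_score(sentence_tokens):
        if not sentence_tokens:
            return 0.0
        return sum(word_score[w] for w in sentence_tokens) / len(sentence_tokens)

    scores = [sentence_score(t) for t in tokens]

    # Step (4): pick the k highest-scoring sentences (ties broken by position),
    # then restore document order.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:k]
    return [(i, sentences[i]) for i in sorted(top)]
```

For example, calling `summarize` on a four-sentence document returns the selected sentences together with their positions, which makes it easy to inspect how far apart the chosen sentences lie in the original text.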
Among the above methods, the statistics-based method first extracts representative document segments using statistical techniques, and then outputs the segments with higher evaluation scores as the summary of the document. A summary generated in this way can better represent the content of the article, and for this reason the method has been widely used.
In most cases, however, the sentences with higher evaluation scores are scattered across various parts of the document and may not relate directly to one another. As a result, a summary formed by simply concatenating these sentences usually has relatively poor readability.