1. Field of Invention
This invention is related to generating text summaries and to compressing text summaries generated by other text compression or summary generating systems and methods.
2. Description of Related Art
Users of information systems must typically absorb large amounts of information to accomplish their information acquisition goals.
In response, vendors of corporate information systems have attempted to increase the efficiency of business processes by providing greater quantities of information to users ever more quickly. As the number of information suppliers and the frequency of reporting have increased, users find themselves deluged with information.
In response, commercial, academic and government researchers have developed summary generating systems to decrease the amount of information that a user must absorb. Summarizing documents or content portions always results in a loss of information. Successful summarization therefore requires choosing the information to delete so as to minimize information loss while preserving the meaning of the remaining portions and maximizing the meaning of the resultant summary.
For example, some conventional summary generating methods pick out isolated words or phrases from a text and print them out in sequential order. These conventional summary generating methods give an indication of some of the entities or events described by the text, but neither the point of the text nor the meaning of the individual words or phrases in context is recoverable. The structure and readability of the original sentences are not preserved, and the resulting summary frequently contains unresolved and/or incorrectly resolved pronouns and other referential items. Since these conventional summary generating methods alter the grammar of sentences in the text, the readability of the text is degraded. These conventional summary generating systems may also omit punctuation in phrases or sentences, making the summary difficult to understand.
Other conventional summary generating methods select sentences for inclusion in the summary based on statistical criteria, including the position of a sentence in a paragraph, the position of a paragraph in a document, and statistical information about the frequency of co-location patterns of lexical items in the document. Therefore, the selected sentences do not necessarily follow each other coherently. Referential integrity is not necessarily preserved, which may result in referential ambiguities. The resulting summary is therefore difficult to read. Conventional methods that use sentence extraction and keyword extraction techniques are better able to produce informative summaries. However, such methods pose the problem of how to choose the sentences or phrases to extract.
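The statistical sentence selection described above may be illustrated by the following minimal sketch. It is not taken from any particular conventional system. The function names, the use of average word frequency as the statistical criterion, and the 1/(i+1) position weight are all assumptions chosen for illustration only.

```python
import re
from collections import Counter
from typing import List


def score_sentences(sentences: List[str]) -> List[float]:
    """Score each sentence by average word frequency plus a position bonus.

    Both criteria are illustrative assumptions, not a published formula.
    """
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(word for words in tokenized for word in words)
    scores = []
    for i, words in enumerate(tokenized):
        # Average document-wide frequency of the words in this sentence.
        freq_score = sum(freq[w] for w in words) / len(words) if words else 0.0
        # Earlier sentences receive a larger bonus.
        position_score = 1.0 / (i + 1)
        scores.append(freq_score + position_score)
    return scores


def extract_summary(sentences: List[str], k: int) -> List[str]:
    """Return the k highest-scoring sentences in their original order."""
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "Acme bought Widgets Inc.",
    "It paid cash.",
    "Acme said Widgets will keep its name.",
]
summary = extract_summary(sentences, 2)
```

In this example the first and third sentences are selected. Each sentence is scored independently of its neighbors, so nothing in such a procedure checks that a pronoun in a selected sentence still has its antecedent in the summary.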
Corston-Oliver describes several text compaction methods that operate on a sentence-by-sentence level in “Text Compaction for Display on Very Small Screens”, Corston-Oliver, S., in North American Chapter of the Association for Computational Linguistics (NAACL) 2001 Language Technologies Workshops Jun. 3–4, 2001. These methods include language dependent character removal strategies, white space compaction using initial word capitalization and the normalization of items such as company names, dates, personal proper nouns and numbers.
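One of the strategies named above, white space compaction using initial word capitalization, may be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation described by Corston-Oliver. The function name `compact_whitespace` is hypothetical, and the sketch does not attempt character removal or normalization.

```python
def compact_whitespace(sentence: str) -> str:
    """Remove inter-word spaces, marking word boundaries by capitalization.

    Only the first character of each word is upper-cased; the rest of the
    word is left unchanged.
    """
    words = sentence.split()
    return "".join(word[:1].upper() + word[1:] for word in words)


compacted = compact_whitespace("the meeting moved to friday")
# compacted == "TheMeetingMovedToFriday"
```

The output is shorter than the input but is no longer a grammatical sentence in the ordinary written form.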
Another conventional summarization system is described in “Producing Intelligent Telegraphic Text Reduction to Provide an Audio Scanning Service for the Blind” in Intelligent Text Summarization, AAAI Spring Symposium Series, Stanford, Calif., 1998, pp. 111–117. However, these text compaction strategies do not preserve the grammaticality of the sentences of the text, and preserving grammaticality would make the result more readable. This property of texts is referred to as grammatical readability.
Many of these problems are addressed in commonly assigned copending U.S. patent application Ser. No. 09/689,779, entitled “System and Method for Generating Text Summaries”, incorporated herein by reference in its entirety. In the '779 application, a structural representation of discourse is created according to a theory of discourse analysis, a rank is determined, and nodes having a rank less than or equal to the determined rank are output as a summary. Summaries are provided based on the selective display of text building units from the structural representation of discourse. The techniques discussed in the '779 application preserve referential integrity, coherency and punctuation. However, these techniques for generating text summaries cannot generate summaries shorter than the actual lengths of the highest ranked text building units in the structural representation of discourse.
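The rank-based selection summarized above, and its length limitation, may be illustrated by the following minimal sketch. It is not the implementation of the '779 application. The `Node` class, the `summarize` function, the convention that rank 0 is the highest rank, and the example tree are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical node of a discourse tree; leaves carry text building units."""
    rank: int
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def summarize(node: Node, max_rank: int) -> List[str]:
    """Collect, in document order, the text of nodes with rank <= max_rank."""
    selected = []
    if node.text and node.rank <= max_rank:
        selected.append(node.text)
    for child in node.children:
        selected.extend(summarize(child, max_rank))
    return selected


tree = Node(rank=0, children=[
    Node(rank=0, text="The board approved the merger."),
    Node(rank=1, children=[
        Node(rank=1, text="The vote was close."),
        Node(rank=2, text="It followed a long debate."),
    ]),
])

shortest = summarize(tree, 0)
longer = summarize(tree, 1)
```

Even at the most restrictive rank threshold, the summary in this sketch is the full text of the highest ranked unit. Lowering the threshold further cannot produce a shorter summary, which corresponds to the limitation noted above.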