This invention relates to word processors, and more particularly, to document summarizers for word processors.
Many people are faced with the daunting task of reading large amounts of electronic textual materials. In the computer age, people are inundated with papers, memos, e-mail messages, reports, web pages, schedules, reference materials, test results, and so on. Unfortunately, many documents do not begin with summaries. Creation of summaries is tedious, requiring the author to re-read the document, identify major themes, and distill the main points of the document into a concise summary. Most authors never bother.
Summarizing a document is even more difficult and time-consuming for a reader. The reader must first read the entire document (or at least skim it) to understand the contents. The reader must then attempt to extract the document's key points from unimportant details.
The problems associated with handling large volumes of un-summarized documents are particularly acute for MIS (Management Information Systems) personnel. These individuals are confronted daily with tasks of organizing, managing, and retrieving documents from large databases. Imagine this typical scenario. An MIS staff member receives a cryptic request to locate all documents that pertain to a topic believed to have been discussed in several company memos written about three to four years ago. To accommodate this search request, the MIS staff member must first perform a word search for the topic, and then laboriously peruse each hit document in an effort to find the mysterious memos. Without summaries, the staff member is forced to read large portions, if not all, of each document before concluding whether the document is relevant or irrelevant. Being forced to read unnecessary text wastes many hours of the staff member's time.
The problem is less critical, but still troubling, for individual users who are browsing the Internet or another network to find documents on a related topic. Upon locating a document, the user must either read the document online to determine whether it is relevant (at the cost of additional online expenses), or download the document for later review (at the risk of retrieving an irrelevant document).
To help address these problems, computer-implemented document summarizers have been developed to automatically summarize text-based documents for the readers. The document summarizers examine an existing document, and attempt to create an abstract or summary from the existing text.
Early development on document summarizers centered on statistical approaches to creating summaries. One statistical approach is described in an article by H. P. Luhn, entitled "The Automatic Creation of Literature Abstracts," which was published in April 1958 in the IBM Journal at pages 159-165. The Luhn technique assigns to each sentence a "significance" factor derived from an analysis of its words. This factor is computed by ascertaining a cluster of words within a sentence, counting the number of significant words contained in the cluster, and dividing the square of this number by the total number of words in the cluster. The sentences are then ranked according to their significance factor, with one or several of the highest ranking sentences being selected to form the abstract.
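The significance factor described above can be illustrated with a minimal sketch. It assumes the cluster has already been identified and that the set of significant words is supplied by the caller; the function name `luhn_significance` and the sample data are hypothetical and do not come from the Luhn article.

```python
def luhn_significance(cluster, significant_words):
    """Return (number of significant words)^2 / (total words in the cluster)."""
    if not cluster:
        return 0.0
    hits = sum(1 for word in cluster if word.lower() in significant_words)
    return hits * hits / len(cluster)


# Hypothetical cluster and significant-word set, for illustration only.
cluster = ["automatic", "creation", "of", "literature", "abstracts"]
significant = {"automatic", "literature", "abstracts"}
score = luhn_significance(cluster, significant)  # 3 * 3 / 5
```

Squaring the count rewards clusters that are dense in significant words: a cluster of five words with three significant ones scores higher than a cluster of ten words with the same three.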
Most, if not all, of the document summarizers in use today appear to employ the Luhn technique. Examples of such summarizers include a Text Summariser from BT (formerly British Telecom), Visual Recall from Xsoft Corporation (a subsidiary of Xerox), and In Text from Island Software.
Another approach to summarizing documents is described in an article by Kenji Ono, et al., entitled "Abstract Generation Based on Rhetorical Structure Extraction," which was published in Proceedings of the 15th International Conference on Computational Linguistics, Vol. 1, at pages 344-348, for a conference held Aug. 5-9, 1994 in Kyoto, Japan. Their approach involved a linguistic analysis which constructed rhetorical structures representing relations between various chunks of sentences in the body of the section. The rhetorical structure is represented by two levels: intra-paragraph, which analyzes the text according to sentence units, and inter-paragraph, which analyzes the text using paragraph units. Extraction of the rhetorical structure is accomplished using a detailed and sophisticated five-step procedure. The Ono technique is unnecessarily complicated for many situations where a rudimentary summary is all that is desired.
In addition, this technique is highly genre-dependent, producing good summaries only when the text is rich in superficial markers of its discourse structure. It thus works relatively well on the academic prose examined by Ono et al., but will fail on documents written in less formal prose.
When the summaries are created, conventional document summarizers present the results to the reader in one of two formats. The first format is to underline or otherwise highlight the sentences that are deemed to be part of the summary. The second format is to show only the abstracted sentences in paragraph or bullet format, without the accompanying text of the document.
One common problem with the conventional document summarizers is that they are reader-based. These summarizers do not consider summary creation and presentation from the perspective of the author.
Accordingly, there remains a need to provide an author-oriented summarizer for a word processor that helps authors automatically create summaries for their writings, and one which will produce a summary for any text which is presented to it.
This invention concerns a document summarizer which is particularly helpful in assisting authors in preparing summaries for documents, as well as aiding readers in their review of un-summarized documents. For a given text, the document summarizer first performs a statistical analysis to generate a list of ranked sentences for consideration in the summary. The summarizer counts how frequently content words appear in a document and produces a table correlating the content words with their corresponding frequency counts. A sentence score for each sentence is derived by summing the frequency counts of the content words in the sentence and dividing that sum by the number of the content words in the sentence. The sentences are then ranked in order of sentence scores, with higher ranking sentences having comparatively higher sentence scores and lower ranking sentences having comparatively lower sentence scores.
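The statistical analysis described above might be sketched as follows. This is an illustration under stated assumptions, not the summarizer's actual implementation: the stop-word list used to decide what counts as a content word, the naive sentence splitter, and all function names are hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list; the text does not say how content words are identified.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "are", "to", "it"}


def content_words(text):
    """Lowercase the text and return its words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def split_sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def rank_sentences(document):
    """Return (score, sentence) pairs sorted from highest to lowest score."""
    freq = Counter(content_words(document))  # table: content word -> frequency count
    scored = []
    for sentence in split_sentences(document):
        words = content_words(sentence)
        # Sum of the frequency counts divided by the number of content words.
        score = sum(freq[w] for w in words) / len(words) if words else 0.0
        scored.append((score, sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


doc = "Cats chase mice. Mice fear cats. Dogs sleep."
ranked = rank_sentences(doc)
```

In the sample document, "cats" and "mice" each occur twice, so the two sentences that mention them score 5/3, while "Dogs sleep." scores 1 and ranks last. Dividing by the number of content words keeps long sentences from outranking short ones merely because they contain more words.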
Concurrent with the statistical analysis, in the same pass through the document, the document summarizer performs a cue-phrase analysis by consulting a pre-compiled list of words and phrases which serve either as indicators of discourse relationships between adjacent sentences in a document or as indicators of the overall importance of a particular sentence in a document. The cue-phrase analysis compares the sentence string to this pre-compiled list of cue phrases. Associated with each cue phrase are conditions which are used to determine whether a sentence containing that cue phrase will be used in a summary.
For instance, the list might contain words and phrases which depend on the surrounding context of the document for the sentence to be properly understood. A sentence that begins, "That is why . . ." or "In contrast to this . . . ," depends on statements made in the preceding sentence(s). The summarizer establishes a condition that a sentence containing a dependent word or phrase may only be included in the summary if the neighboring context on which the word or phrase depends is also included in the summary.
The pre-compiled list also contains cue phrases whose presence in a sentence will result in that sentence being excluded from the summary, no matter how large its statistically-derived score might be. For instance, a sentence which contains the phrase "as shown in Fig. . . ." should not be included in a summary because the referenced figure will not be present.
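A cue-phrase lookup of this kind could be sketched as below. The three sample phrases are taken from the text; the table layout, the condition labels ("dependent", "prohibited", "free"), the simple substring match, and the function name are assumptions made for illustration.

```python
# Hypothetical cue-phrase table mapping each phrase to its associated condition.
CUE_PHRASES = {
    "that is why": "dependent",
    "in contrast to this": "dependent",
    "as shown in fig": "prohibited",
}


def classify_sentence(sentence):
    """Return 'prohibited', 'dependent', or 'free' for one sentence."""
    text = sentence.lower()
    conditions = {cond for phrase, cond in CUE_PHRASES.items() if phrase in text}
    # Exclusion takes priority: a prohibited phrase bars the sentence outright.
    if "prohibited" in conditions:
        return "prohibited"
    if "dependent" in conditions:
        return "dependent"
    return "free"


labels = [
    classify_sentence("That is why the test failed."),
    classify_sentence("The results, as shown in Fig. 3, are stable."),
    classify_sentence("The test failed."),
]
```

Because the check only compares the sentence string against a fixed list, it can run in the same pass that gathers the word frequency counts.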
Following the statistical and cue-phrase analysis phases, the summarizer creates a summary containing the higher ranked sentences. The summary may also include a conditioned sentence (such as one that contains a dependent word or phrase) if the conditions established for inclusion of the sentence have been satisfied. However, the summary never includes prohibited sentences.
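One way the selection step might work is sketched below. It assumes each sentence already carries a score and a condition label ("free", "dependent", or "prohibited"), and it simplifies the dependency condition to "the immediately preceding sentence must also be included". The function name, the labels, and the `size` parameter are hypothetical.

```python
def build_summary(sentences, scores, labels, size):
    """Pick up to `size` sentences by score, honoring cue-phrase conditions.

    sentences: sentences in document order
    scores:    sentence scores, same order
    labels:    'free', 'dependent', or 'prohibited' for each sentence
    """
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = set()
    for i in order:
        if len(chosen) >= size:
            break
        if labels[i] == "prohibited":
            continue  # never included, whatever the score
        if labels[i] == "dependent":
            # Assumed condition: include only together with a 'free' predecessor.
            prev = i - 1
            if prev < 0 or labels[prev] != "free":
                continue
            if prev not in chosen and len(chosen) + 2 > size:
                continue  # no room for both sentences
            chosen.add(prev)
        chosen.add(i)
    # Present the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "The pump failed.",
    "That is why the line stopped.",
    "Flow is as shown in Fig. 2.",
    "Repairs took a day.",
]
scores = [1.0, 3.0, 5.0, 2.0]
labels = ["free", "dependent", "prohibited", "free"]
summary = build_summary(sentences, scores, labels, size=2)
```

In this sample, the highest scoring sentence is skipped because it refers to a figure, and the dependent sentence is admitted only because its preceding sentence is pulled in with it.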
The summarizer inserts the summary at the beginning of the document before the start of the text, or in a new document, based on the user's choice. This placement is convenient and useful to the author. The author is then free to revise the summary as he/she wishes.