1. Field of the Invention
The invention relates to methods and apparatus for automatically producing summaries of documents and, in particular, to methods and apparatus that use simple natural language processing and rely on statistical properties of text to produce such summaries.
2. Description of Related Art
Document summaries assist in the review of documents because entire documents do not need to be read. Additionally, document summaries can reduce translation costs when reviewing one or more documents in a foreign language because only the document summary, not the entire document, needs to be translated. After reviewing the translated document summary, the reviewer can determine whether the entire document should be translated.
When documents are not provided with a summary, it is necessary to produce a summary of the document to obtain the benefits discussed above. It is desirable to produce document summaries automatically so that people do not have to read an entire document to produce a summary. Such automatic summaries should accurately reflect the main theme(s) in the document to assist people in deciding correctly whether to read (and/or have translated) the entire document.
Two basic computational techniques exist for automatic document summarization. The first intensively uses natural language processing and semantic network creation. The second uses simple natural language processing and then relies on statistical properties of the text.
The first technique is computationally expensive. Additionally, creating semantically correct summaries is difficult and error prone. Typically, a domain must be known in advance in order to perform adequate semantic modeling. Consequently, such techniques cannot be used on ordinary text of unrestricted content.
"Automatic Text Processing"; Gerald Salton; Addison-Wesley; 1989 discloses a summarization technique of the second type. Text words from a corpus of documents are isolated. Words used in titles, figures, captions and footnotes are flagged as title words. The frequency of occurrence of the remaining text words within the document corpus is determined. Word weights are determined based on the location and frequency of occurrence of the words in individual documents and in the document corpus. Phrase weights are determined for co-occurring words (phrases) in sentences of a document. The sentences in each document are then scored based on the weights of the words and phrases in each sentence. A number of top-scoring sentences are then selected from a document to produce a summary having a predetermined length.
The technique described by Salton uses a corpus of documents to calibrate the weights of the words and phrases. Thus, the term weights are not customized to an individual document. This could result in the inclusion of sentences in a document's summary that do not assist in describing a theme of the document. Additionally, an overly narrow cross section of the document may be extracted because the technique of Salton creates the document summary from a single pass through the document. The selection of only the highest-scoring sentences may lead to a disjointed summary that does not convey a cogent theme.
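For illustration only, the corpus-calibrated, single-pass scoring of the second type of technique discussed above might be sketched as follows. This is a simplified sketch and not the procedure disclosed by Salton: the dictionary layout of a document, the names `tokenize`, `summarize` and `TITLE_BONUS`, the particular term-frequency and inverse-document-frequency formula, and the treatment of a phrase as an adjacent word pair occurring in more than one sentence are all assumptions made for the example. Stop-word removal is omitted.

```python
import math
import re
from collections import Counter

TITLE_BONUS = 2.0  # assumed multiplier for words flagged as title words


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def summarize(doc, corpus, num_sentences=2):
    """Return the top-scoring sentences of `doc`, in original order.

    `doc` and each corpus entry are dicts with a "title" string and a
    "sentences" list; this layout is a hypothetical choice for the sketch.
    """
    title_words = set(tokenize(doc["title"]))
    sent_tokens = [tokenize(s) for s in doc["sentences"]]

    # Number of corpus documents in which each word occurs.
    doc_freq = Counter()
    for d in corpus:
        doc_freq.update(set(tokenize(" ".join(d["sentences"]))))

    # Number of occurrences of each word within this document.
    term_freq = Counter(w for toks in sent_tokens for w in toks)

    def word_weight(word):
        # Assumed weighting: frequent in the document, rare in the corpus.
        idf = math.log((1 + len(corpus)) / (1 + doc_freq[word])) + 1.0
        bonus = TITLE_BONUS if word in title_words else 1.0
        return term_freq[word] * idf * bonus

    # Assumed phrase definition: an adjacent word pair, counted once per
    # sentence in which it appears.
    pair_counts = Counter()
    for toks in sent_tokens:
        pair_counts.update(set(zip(toks, toks[1:])))

    def score(toks):
        word_score = sum(word_weight(w) for w in set(toks))
        pairs = set(zip(toks, toks[1:]))
        phrase_score = sum(pair_counts[p] for p in pairs if pair_counts[p] > 1)
        return word_score + phrase_score

    # Single pass: rank all sentences once and keep the highest scoring.
    ranked = sorted(range(len(sent_tokens)),
                    key=lambda i: score(sent_tokens[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [doc["sentences"][i] for i in chosen]
```

Because the inverse-document-frequency term in this sketch depends on the whole corpus, the same sentence can receive a different score when the corpus changes, which reflects the lack of per-document customization noted above.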