1. Field of the Invention
The invention relates to a method for the automatic generation of a summary of a text by means of a computer.
2. Description of the Related Art
From the European Patent document EP 0 751 470 A1, a method for the automatic summarization of a text is known. Feature probabilities are thereby determined that enable an automatic summarization.
Today, it is difficult and laborious to select, from a flood of information, the information that is important according to predeterminable personal criteria. Even after the selection, a nearly inexhaustible mass of data, e.g. in the form of articles, often remains. Since it is easy to acquire and manage large quantities of data with the aid of a computer, the idea suggests itself of also using the computer for the preparation or the selection of information. Such an automatic reduction of information should allow a user to read a significantly smaller amount of data in order to obtain the information that is relevant to that user.
A particular type of information reduction is the summarization of texts.
From the publication by J. Kupiec, J. Pedersen and F. Chen, "A Trainable Document Summarizer," Xerox, Palo Alto Research Center, 1995, a method is known for the summarization of texts that uses heuristic features with a discrete value range. The probability that a sentence from the text belongs to the summary, under the condition that a heuristic feature has a particular value, is estimated from a training set of summaries.
An object of the invention is the automatic generation of a summary from a predetermined text, whereby this summary should reproduce in short form the essential content of the text.
This and other objects and advantages of the invention are achieved by a method for the automatic generation of a summary of a text by a computer, in which for each sentence a probability is calculated that the sentence belongs to the summary, in that, for each word in the sentence, the relevance measure is determined from a lexicon that contains application-specific words with a predetermined relevance measure for each of these words, and all relevance measures cumulatively yield the probability that the sentence belongs to the summary; all sentences of the text are sorted according to the probabilities; and corresponding to a predeterminable reduction measure, for the summarization the best sentences are displayed in a sequence given by the text.
The inventive method enables a summarization of a text in that for each sentence of this text a probability that the sentence belongs to the summary is calculated. For each word in the sentence, the relevance measure is determined from a lexicon that contains all relevant words, with a predetermined relevance measure for each of these words. The cumulation of all relevance measures yields the probability that the sentence belongs to the summary. All sentences are thereupon sorted according to their probability. A predeterminable reduction measure that indicates what percentage of the original text is represented in the summary serves for the selection of the number of sentences given by this reduction measure from the sorted representation. If the most important x percent of sentences have been selected, these are displayed as a summary of the text in their original sequence given by this text.
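The steps described above (score each sentence from the lexicon, sort, select according to the reduction measure, display in original order) can be illustrated by the following minimal sketch. It is not the claimed implementation: the function names, the regular-expression sentence splitter, the lowercase word matching, and the interpretation of the reduction measure as a fraction between 0 and 1 are assumptions made for illustration only.

```python
import re


def split_sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    # Assumption: words are runs of letters, compared in lowercase.
    return re.findall(r"[a-z]+", sentence.lower())


def score(sentence, lexicon):
    # Cumulate the relevance measures of all words found in the lexicon;
    # words missing from the lexicon contribute nothing.
    return sum(lexicon.get(w, 0.0) for w in words(sentence))


def summarize(text, lexicon, reduction):
    # reduction: fraction of sentences to keep (hypothetical parameter name).
    sentences = split_sentences(text)
    keep = max(1, round(len(sentences) * reduction))
    # Sort sentence indices by descending score.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i], lexicon),
                    reverse=True)
    # Restore the sequence given by the text for the selected sentences.
    chosen = sorted(ranked[:keep])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    sample = ("The striker scored a goal. The weather was mild. "
              "Fans left early. The goal decided the match.")
    sport_lexicon = {"goal": 0.9, "striker": 0.7, "match": 0.6, "scored": 0.5}
    for line in summarize(sample, sport_lexicon, 0.5):
        print(line)
```

In this sketch the cumulated value is used only to rank the sentences; how the relevance measures in the lexicon are chosen is left open, as in the text.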
An advantageous development of the inventive method is the introduction of an individual word frequency in addition to the relevance measure. This individual word frequency indicates how often the respectively indicated word occurs in the entire text to be summarized. Taking into account the relevance measure and this newly introduced individual word frequency, the probability that the respective sentence is contained in the summary can be indicated by the following rule:

$$\mathrm{WK}(\text{sentence}) = \frac{1}{N} \cdot \sum_{i=1}^{N} \mathrm{tf} \cdot \mathrm{rlv} \qquad (1)$$
whereby
WK(sentence) is the probability that a sentence belongs to the summary,
N is the total number of words that occur in the sentence,
i is a count variable (i=1,2, . . . , N) for all the words in the sentence,
tf is the frequency of the occurrence of the respective word under consideration in the entire text being summarized (individual word frequency), and
rlv is the relevance measure for the respective word in the sentence.
It should be noted that only the words that occur in the lexicon, with their relevance measure rlv known from the lexicon, are decisive. If a word that does not occur in the lexicon occurs n times in the text, this word does not increase the probability that the sentence belongs to the summary.
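Rule (1) can be made concrete with a short sketch. The tokenization (lowercase runs of letters), the function names, and the treatment of a word missing from the lexicon as having relevance 0 are assumptions for illustration; the last of these reflects the note that such words do not increase the probability. Each of the N word positions in the sentence contributes tf multiplied by rlv.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: words are runs of letters, compared in lowercase.
    return re.findall(r"[a-z]+", text.lower())


def sentence_probability(sentence, text, lexicon):
    # tf: how often each word occurs in the entire text being summarized.
    tf = Counter(tokenize(text))
    tokens = tokenize(sentence)
    n = len(tokens)  # N: total number of words in the sentence
    if n == 0:
        return 0.0
    # Sum tf * rlv over all N words; rlv is 0 for words not in the lexicon.
    total = sum(tf[w] * lexicon.get(w, 0.0) for w in tokens)
    return total / n


if __name__ == "__main__":
    text = "Goal after goal fell. The crowd cheered."
    lexicon = {"goal": 0.5}
    # "goal" occurs twice in the text (tf = 2) and twice in the sentence:
    # (2*0.5 + 0 + 2*0.5 + 0) / 4
    print(sentence_probability("Goal after goal fell.", text, lexicon))
```

As written, the value is not bounded by 1 when tf is large, so the sketch treats it as a ranking score rather than a normalized probability.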
A development of the inventive method is the use of an application-specific lexicon. This has the result that the summary is carried out with a predeterminable subject-matter-specific filter. Thus, for example, a lexicon specified for sport articles will, in a text to be summarized, evaluate sport-related words with a higher relevance than a lexicon that is specialized for summaries of economics contributions. It is thus advantageously possible to provide specific knowledge concerning predeterminable categories by means of lexica corresponding to the respective categories.
In addition, it is advantageous to allocate a text to one or more categories. This can be carried out automatically by using specific predeterminable words in the theme-related lexica as selection criteria for an allocation to the respective subject area. If several categories (subject areas), i.e. various viewpoints or, respectively, filters, are possible for the summarization of a text, different summaries, one for each category, can be produced automatically.
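The allocation to categories and the production of one summary per category could be sketched as follows. The data layout (one lexicon and one list of selection words per category), all names, and the simplification that each summary consists of the single best-scoring sentence are assumptions for illustration, not details taken from the method itself.

```python
import re


def tokenize(text):
    # Assumption: words are runs of letters, compared in lowercase.
    return re.findall(r"[a-z]+", text.lower())


def allocate_categories(text, selection_words):
    # A text is allocated to every category for which at least one of the
    # category's selection words occurs in the text.
    present = set(tokenize(text))
    return [cat for cat, keys in selection_words.items()
            if present & set(keys)]


def best_sentence(text, lexicon):
    # Simplification: the summary is the single highest-scoring sentence.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    return max(sentences,
               key=lambda s: sum(lexicon.get(w, 0.0) for w in tokenize(s)))


def summaries_by_category(text, lexica, selection_words):
    # One summary per allocated category, each using that category's lexicon.
    return {cat: best_sentence(text, lexica[cat])
            for cat in allocate_categories(text, selection_words)}


if __name__ == "__main__":
    text = "The club won the final. Its shares rose sharply. Rain fell all day."
    lexica = {
        "sport": {"club": 0.8, "final": 0.9},
        "economics": {"shares": 0.9, "rose": 0.4},
        "politics": {"election": 0.9},
    }
    selection_words = {
        "sport": ["final"],
        "economics": ["shares"],
        "politics": ["election"],
    }
    print(summaries_by_category(text, lexica, selection_words))
```

The same text thus yields a different summary under the sport lexicon than under the economics lexicon, while no summary is produced for a category whose selection words do not occur in the text.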
In the present application, the terms application-specific lexicon, subject-related lexicon, and theme-related lexicon are used as alternative terms for the lexicon according to the invention.