1. Field of the Invention
The present invention relates to a document summarizing apparatus, a document summarizing method and a recording medium storing a document summarizing program, more specifically to a document summarizing apparatus, a document summarizing method and a recording medium storing a document summarizing program for creating a summary holding the overview of a group of a plurality of documents.
2. Description of the Related Art
A variety of document summarizing technologies have been studied, and some have been put to practical use. However, almost all of the document summarizing technologies of the related art target a single document. In practice, there is a need to summarize a plurality of documents in order to grasp their overview. Methods developed for summarizing only one document are not applicable to a collection of documents and result in an inappropriate summary.
Examples of popular methods in the related art include a method of picking up important sentences and a method of abstracting. In the method of picking up important sentences, a score is given to each sentence of the document based on the frequency of appearance of words, the location of the sentence in the document or in a paragraph, the usage of proper nouns, and so on. Sentences with higher scores are picked up until the number of sentences or the total length of the summary reaches a preselected value, and the selected sentences are enumerated to create a summary. If such a method is applied to a plurality of documents, sentences selected from one of the documents in a group will stand for the whole group and may not be appropriate as a summary thereof.
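The sentence-scoring approach described above can be illustrated with a minimal sketch. This is not taken from any cited system: the function names, the naive sentence splitter, the average-word-frequency score, and the fixed bonus for the leading sentence are all assumptions chosen only to show the general idea of scoring sentences and keeping the highest-scoring ones up to a preselected count.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphanumeric tokens; no stemming or proper-noun detection.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def extract_summary(text, max_sentences=2, lead_bonus=1.0):
    """Score each sentence and return the top ones in document order.

    Score (assumption) = average document frequency of the sentence's words,
    plus a bonus for the first sentence as a crude location feature.
    """
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    scored = []
    for idx, sentence in enumerate(sentences):
        words = tokenize(sentence)
        if not words:
            continue
        score = sum(freq[w] for w in words) / len(words)
        if idx == 0:
            score += lead_bonus
        scored.append((score, idx, sentence))
    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    # Re-emit the selected sentences in their original order.
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]
```

Because every selected sentence comes verbatim from one source text, applying such a routine to a concatenation of several documents tends to return sentences that speak for only one of them, which is the limitation noted above.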
There is thus a need for summarizing a plurality of documents. Summarizing technologies for a plurality of documents include the following:
(1) Enumeration of Keywords
The keyword enumeration method enumerates the words that appear most frequently in a document cluster. One example is the classification technology documented in the paper by Cutting, et al., “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections”, SIGIR-92 (1992). Inventions based on this method include Japanese Published Unexamined Patent Application No. Hei 5-225256 and U.S. patent application Ser. No. 5,442,778. A preselected number of keywords that appear frequently in the group of documents are enumerated.
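Keyword enumeration of this kind can be sketched as follows. The sketch is illustrative only and does not reproduce the Scatter/Gather implementation: the stop-word list, the tokenizer, and the function name `top_keywords` are assumptions.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption, not from the cited work).
STOPWORDS = frozenset({"a", "an", "the", "in", "on", "of", "and", "to", "is", "was"})


def top_keywords(documents, k=5):
    """Return the k most frequent non-stop-words across a group of documents."""
    counts = Counter()
    for doc in documents:
        for word in re.findall(r"[a-z0-9]+", doc.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    # Descending frequency, then alphabetical order so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]
```

The result is a bare list of words with no indication of how they relate to one another.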
(2) Generation of Sentences Based on the Extracted Meanings
A method of sentence synthesis based on extracted meanings is described in the paper by McKeown and Radev, “Generating Summaries of Multiple News Articles”, SIGIR-95 (1995); one example thereof is SUMMONS (SUMMarizing Online NewS articles). This technology fills slots in a given template with information extracted from a plurality of documents. The information embedded in the template is used as the conceptual structure for generating a summary of a pattern matched with the syntax.
(3) Synthesis of Following-up Articles
The technology described in the paper by Funasaka, Yamamoto and Masuyama, “Summarizing relational news articles by reducing redundancy”, Natural Language Processing, 114-7 (1996) generates a summary of a plurality of documents by reducing the redundancy among a plurality of following-up news articles and synthesizing them. Following-up news articles, in general, may contain some paragraphs describing the course of an event as background. This description of the background is redundant if there is already an article on the background. Accordingly, reducing the redundancy between articles and synthesizing them may generate a summary without redundancy.
(4) Synthesis of a Plurality of Sentences
In this method, a summary is synthesized by identifying sentences that share the same meaning among articles on the same event (for example, news articles from a plurality of news companies describing the same affair).
The document summarizing apparatus disclosed in Japanese Published Unexamined Patent Application No. Hei 10-134066 gathers paragraphs (of online news of other news companies) similar to a specified paragraph (of online news). The gathered paragraphs are then broken down into sentences, and similar sentences are regrouped. Here, similar sentences may be defined as those for which the number of pattern-matched words is greater than a threshold value, for example, “Typhoon #5, landing in Kyushu” and “a large typhoon #5 lands in Kyushu”.
A representative sentence is generated for each of these groups. The manners of generating a representative sentence may comprise, for example, selecting one sentence from the group, generating a common set of blocks, or generating a union set. In the example above, the common set may be “Typhoon #5, landing in Kyushu” and the union set may be “a large typhoon #5 lands in Kyushu”.
A method disclosed in the paper by Shibata, et al., “Merging a Plurality of Documents”, Association of Natural Language Processing 120-2 (1997) also identifies common sentences sharing similar meanings in news articles from a plurality of news companies describing the same affair, and synthesizes a set therefrom. The manners of synthesis comprise an “AND” set (common set of elements) and an “OR” set (union set of elements).
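The grouping of similar sentences and the common-set and union-set forms of a representative sentence can be illustrated with a small sketch using the typhoon example. It is a simplification under stated assumptions: words are compared as tokens, the `NORMALIZE` table is a hypothetical stand-in for real morphological analysis (so that "landing" and "lands" match), and the function names are invented for illustration rather than taken from the cited application or paper.

```python
import re

# Hypothetical normalization table standing in for morphological analysis.
NORMALIZE = {"landing": "land", "lands": "land"}


def tokens(sentence):
    # Keep an optional leading '#' so that "#5" survives as one token.
    words = re.findall(r"#?\w+", sentence.lower())
    return [NORMALIZE.get(w, w) for w in words]


def is_similar(a, b, threshold=3):
    # Similar when the number of matched words is greater than the threshold.
    return len(set(tokens(a)) & set(tokens(b))) > threshold


def common_set(a, b):
    # "AND" set: words of a that also occur in b, in a's order.
    in_b = set(tokens(b))
    return [w for w in tokens(a) if w in in_b]


def union_set(a, b):
    # "OR" set: every distinct word of a, then any new words from b.
    result, seen = [], set()
    for w in tokens(a) + tokens(b):
        if w not in seen:
            seen.add(w)
            result.append(w)
    return result
```

With the two example sentences, the common set contains only the words of the shorter sentence, while the union set also carries "a" and "large" from the longer one.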
However, the technologies of the prior art suffer from the following problems:
(1) The enumeration of keywords cannot indicate the relational dependencies between words, since the words appear independently. The reader has to guess the meaning behind them from the order of the keywords and from background knowledge. In order to guess what the collection of documents says, the reader is required to have some knowledge of the subject field or of the event described in the collected documents.
(2) The generation of sentences from the extracted meanings is strictly limited to a narrow class of documents. This method presupposes a definite type of content, such as articles on an act of terrorism (“who attacked what, where, when and how, the victims and demolished buildings are . . . ”). A meaning template must be predefined for each kind of affair. This method may be used only for articles on the same affair. Consequently, it may not be applicable to a collection of documents gathered as the result of a search or of clustering.
(3) The synthesis of following-up articles deals with the parent article and following articles of the same affair. Therefore this method is not applicable to a group of documents gathered as the result of search or of clustering.
(4) The synthesis of a plurality of sentences is applicable only to the articles on the same affair. Therefore this method is not applicable to a group of documents gathered as the result of search or of clustering.
The present invention has been made in light of these problems. The present invention provides a document summarizing apparatus that generates a comprehensive summary when processing a group of documents of relatively diverse contents.
Also, the present invention provides a document summarizing method, which is applicable to a group of documents of relatively diverse contents for generating a comprehensive summary therefrom.
In addition, the present invention provides a computer-readable recording medium carrying a document summarizing program, which may be used with a computer to generate a comprehensive summary about a group of documents of relatively diverse contents.
In order to solve the problems as described above, a document summarizing apparatus according to the present invention for generating a summary of a set of documents, comprises: a sentence analyzing unit that analyzes the syntax (structure) of sentences contained in the documents specified to be processed to generate an analysis graph describing the relational dependencies between words; an analysis graph scoring unit that scores the analysis graph generated by the sentence analyzing unit based on importance; an analysis graph score accumulating unit that stores the analysis graphs scored by said analysis graph scoring unit to combine the analysis graphs having the same concept to increase the scores given to the combined analysis graphs according to the combined contents; and a sentence synthesizing unit that selects graphs with higher scores from the group of analysis graphs stored in said analysis graph score accumulating unit when the analysis graphs have been generated from all specified documents to be processed and accumulated in said analysis graph score accumulating unit, in order to synthesize a summarizing sentence based on the selected analysis graphs.
In the document summarizing apparatus as disclosed in the present invention, once a plurality of documents are specified to be processed, the sentence analyzing unit analyses the syntax of sentences contained in each of specified documents to generate an analysis graph describing the relational dependencies between words. The analysis graph scoring unit then scores the generated analysis graphs based on importance. The scored analysis graphs will be stored in the analysis graph score accumulating unit. When storing graphs, the analysis graph score accumulating unit combines graphs having the same concept to accumulate the score given to the combined analysis graphs according to the combined contents.
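The accumulation and selection steps described above can be pictured with a minimal sketch. It rests on several assumptions that the text does not specify: an analysis graph is represented as a set of hypothetical (head, relation, dependent) triples written by hand rather than produced by a parser, two graphs are treated as having "the same concept" only when their triple sets are identical, and combining graphs simply adds their scores. The class and method names are invented for illustration.

```python
from collections import defaultdict


class GraphScoreAccumulator:
    """Stores scored analysis graphs and merges graphs with the same concept."""

    def __init__(self):
        # Maps a graph (frozenset of triples) to its accumulated score.
        self._scores = defaultdict(float)

    def add(self, graph, score):
        # graph: iterable of (head, relation, dependent) triples (assumed form).
        # Identical triple sets share one entry, so their scores add up.
        self._scores[frozenset(graph)] += score

    def top(self, n):
        # Highest accumulated score first; ties broken by the sorted triples.
        ranked = sorted(
            self._scores.items(),
            key=lambda kv: (-kv[1], sorted(kv[0])),
        )
        return ranked[:n]


# Usage: graphs from several documents are added as they are generated,
# and the highest-scoring graphs are selected once all documents are processed.
accumulator = GraphScoreAccumulator()
typhoon = [("land", "subject", "typhoon"), ("land", "place", "kyushu")]
trains = [("delay", "subject", "train")]
accumulator.add(typhoon, 2.0)                   # from a first document
accumulator.add(trains, 3.0)                    # from a second document
accumulator.add(list(reversed(typhoon)), 2.0)   # same concept, third document
selected = accumulator.top(1)
```

In this sketch the typhoon graph, seen in two documents, overtakes the trains graph seen once with a higher individual score. Turning the selected graphs back into a summarizing sentence is outside the scope of the sketch.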
In order to solve the problems as described above, a document summarizing method according to the present invention comprises the steps of: analyzing the syntax of sentences contained in the documents specified to be processed to generate an analysis graph describing the relational dependencies between words; scoring the generated analysis graph based on importance; storing the scored analysis graphs to combine the analysis graphs having the same concept one with another; increasing the scores given to the combined analysis graphs according to the combined contents; and synthesizing a summarizing sentence based on the selected analysis graphs by selecting graphs with higher scores from the group of stored analysis graphs when the analysis graphs have been generated and accumulated from all specified documents to be processed.
In the document summarizing method as disclosed in the present invention, when a plurality of documents are specified to be processed, analysis graphs will be generated from the sentences contained in the specified documents and a summary will be synthesized based on the analysis graphs with higher importance.
In order to solve the problems as described above, a computer-readable recording medium carrying a document summarizing program for generating by a computer a summary from a set of documents, according to the present invention, comprises a document summarizing program for use with a computer, including: a sentence analyzing unit that analyzes the syntax of sentences contained in the documents specified to be processed to generate an analysis graph describing the relational dependencies between words; an analysis graph scoring unit that scores the analysis graph generated by the sentence analyzing unit based on importance; an analysis graph score accumulating unit that stores the analysis graphs scored by said analysis graph scoring unit to combine the analysis graphs having the same concept to increase the scores given to the combined analysis graphs according to the combined contents; and a sentence synthesizing unit that selects graphs with higher scores from the group of analysis graphs stored in said analysis graph score accumulating unit when the analysis graphs have been generated from all specified documents to be processed and accumulated in said analysis graph score accumulating unit, in order to synthesize a summarizing sentence based on the selected analysis graphs.
The functions of a document summarizing apparatus according to the present invention can be implemented on a computer by running on the computer the document summarizing program carried on the recording medium as described above.