1. Field of the Invention
The present invention relates to an apparatus and method for summarizing machine-readable documents written in a natural language, etc. It is mainly intended to generate digests of relatively long manuals, reports, etc., and to support the selection and reading of documents.
2. Description of the Related Art
Two principal technologies relate to the present invention: generating a digest by extracting sentences using keywords in a document as clues, and detecting topic passages in a document. These conventional technologies are described below.
First, the digest generation technology is described. Broadly, conventional digest generation uses one of two methods. The first method detects major parts in a document and generates a digest by extracting those parts. The major parts are usually extracted in units of logical elements such as sections, paragraphs, sentences, etc. In the description below, these units are collectively referred to as “sentences”.
The second method prepares in advance patterns of information to be extracted for a digest, and generates a digest by extracting phrases and words in the document meeting the requirements of one of the patterns, or generates a digest by using sentences matching the pattern.
The first method is further classified according to the clue used to evaluate the importance of sentences. The following three methods are typical.
(1) A method of utilizing the use frequency and distribution of words in a document as clues.
(2) A method of utilizing the rhetorical structure and used position of sentences as clues.
(3) A method of evaluating the importance of sentences based on the sentence structure.
Method (1) first evaluates the importance of words (phrases) contained in a document, and then evaluates the importance of sentences according to how many keywords are contained in a sentence. Then, a digest is generated by selecting key sentences based on the evaluation result.
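As an illustration only, the keyword-based selection of method (1) might be sketched as follows. The stop list, the function names `tokenize` and `select_key_sentences`, and the parameters `num_keywords` and `num_sentences` are hypothetical and do not come from any of the cited publications; plain use frequency is assumed as the measure of word importance.

```python
import re
from collections import Counter

# Hypothetical stop list standing in for words that play a structural role only.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "by", "for"}


def tokenize(text):
    """Lowercase the text and return its words, excluding the stop list."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def select_key_sentences(sentences, num_keywords=3, num_sentences=1):
    """Pick the sentences that contain the most high-frequency keywords."""
    # Step 1: evaluate word importance by use frequency in the whole document.
    counts = Counter(w for s in sentences for w in tokenize(s))
    keywords = {w for w, _ in counts.most_common(num_keywords)}
    # Step 2: score each sentence by the number of keyword occurrences in it.
    scored = [(sum(w in keywords for w in tokenize(s)), i)
              for i, s in enumerate(sentences)]
    # Step 3: keep the best-scoring sentences (earlier sentences win ties).
    best = sorted(scored, key=lambda p: (-p[0], p[1]))[:num_sentences]
    # Return the chosen sentences in document order.
    return [sentences[i] for _, i in sorted(best, key=lambda p: p[1])]
```

With plain frequency as the only clue, a sentence that merely repeats a frequent word scores highly, which is one reason the weighting schemes described next are used in practice.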
There are several well-known methods of evaluating the importance of words: a method of utilizing the use frequency of words in a document; a method of weighting the use frequency of words by the difference between their use frequency in the document and that in a more general collection of sentences; and a method of weighting the use frequency of words by their position, for example, by assigning higher importance to a word in titles or headings.
Here, usually the target words are limited to independent words (particularly nouns) only in the case of Japanese, and content words in the case of English. The independent word and the content word are both words with a substantial meaning, such as nouns, adjectives, verbs, etc., and are distinguished from words used to play a structural role only, such as particles, prepositions, formal nouns, etc. Although the formal definition of an independent word in Japanese is a word which itself can compose an independent clause, here the independent word is defined using the above distinction.
These digest generation methods include, for example, the following. In the Japanese Laid-open Public Patent Publication No. 6-259424 “Document Display Apparatus and Digest Generator Apparatus, and Digital Copying Apparatus” and a document by the inventor of that invention (Masayuki Kameda, “Extraction of Major Keywords and Key Sentences by Pseudo-Keyword Correlation Method”, in the Proceedings of the Second Annual Meeting of Association for Natural Language Processing, pp. 97 to 100, March 1996), a digest is generated by extracting parts including many words appearing in the headings as important parts relating to the headings.
In the Japanese Laid-open Public Patent Publication No. 7-36896 “Method and Apparatus for Generating Digest”, major expression seeds are selected based on the complexity (length of a word, etc.) of an expression (word, etc.) used in a document, and a digest is generated by extracting sentences including more seeds having a high importance.
In the Japanese Laid-open Public Patent Publication No. 8-297677 “Method of Automatically Generating a Digest of Topics”, topical terms are detected based on the use frequency of words in a document, and a digest is generated by extracting sentences containing many major topical terms.
In the Japanese Laid-open Public Patent Publication No. 2-254566 “Automatic Digest Generator Apparatus”, words having a high use frequency are detected as keywords, and a digest is generated by extracting parts where the keywords are used in the first place, or parts containing many keywords, sentences which are used at the beginning of semantic paragraphs automatically detected, etc.
Next, the method of detecting topic passages in a document is described below. Roughly speaking, there are the following two methods.
(1) A method based on the lexical cohesion of a topic due to words repeatedly used in a document
(2) A method of determining a rhetorical structure based on the coherence relation between sentences indicated by conjunctions, etc.
For method (1) based on the lexical cohesion, first, the Hearst method (Marti A. Hearst, “Multi-paragraph Segmentation of Expository Text”, in the Proceedings of the 32nd Annual Meeting of Association for Computational Linguistics, pp. 9 to 16, 1994) is briefly described below.
This method (hereinafter called the “Hearst method”) is one of those that automatically detect a break in the flow of topics based on the linguistic phenomenon that an identical word is used repeatedly in related parts of text (lexical cohesion). The Hearst method first calculates the lexical similarity of every pair of adjacent blocks of text, which are set up before and after a certain position in a document and have a fixed size of about one paragraph (approximately 120 words). The lexical similarity is calculated by a cosine measure as follows:

$$\mathrm{sim}(b_l, b_r) = \frac{\sum_t W_{t,b_l}\, W_{t,b_r}}{\sqrt{\sum_t W_{t,b_l}^2 \, \sum_t W_{t,b_r}^2}} \qquad (1)$$

where $b_l$ and $b_r$ indicate a left block (a block on the backward side of a document) and a right block (a block on the forward side of the document), respectively, and $W_{t,b_l}$ and $W_{t,b_r}$ indicate the use frequency of a word $t$ in the left and right blocks, respectively. $\sum_t$ on the right-hand side of equation (1) denotes summation over the different words $t$.
The more vocabulary the two blocks have in common, the greater the similarity score of equation (1) becomes (maximum 1). Conversely, if there is no common vocabulary, the similarity score takes its minimum value of 0. That is, a greater value of the similarity score indicates a higher possibility that a common topic is handled in both the blocks, while a smaller value of the similarity score indicates a higher possibility that the point between the blocks is a topic boundary.
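The cosine measure of equation (1) can be written out directly. The following is a minimal sketch, assuming each block has already been reduced to a list of words; the function name `block_similarity` and the handling of an empty block are assumptions made for illustration.

```python
import math
from collections import Counter


def block_similarity(left_words, right_words):
    """Cosine similarity of the word-frequency vectors of two blocks (equation (1))."""
    w_left, w_right = Counter(left_words), Counter(right_words)
    # Numerator: sum over words t of W(t, b_l) * W(t, b_r).
    # Only words present in both blocks contribute a non-zero term.
    dot = sum(w_left[t] * w_right[t] for t in w_left.keys() & w_right.keys())
    # Denominator: square root of the product of the two sums of squares.
    norm = math.sqrt(sum(c * c for c in w_left.values())
                     * sum(c * c for c in w_right.values()))
    # Assumption: an empty block is treated as sharing no vocabulary.
    return dot / norm if norm else 0.0
```

Two blocks with identical word frequencies give a score of 1, and two blocks with no word in common give 0, matching the range stated above.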
The Hearst method evaluates equation (1) at fixed intervals (20 words) from the beginning of a document to the end, and recognizes a position having a minimal value as a topic boundary. At this time, the following adjustment is performed in order to ignore fine fluctuations of the similarity score. First, a part surrounding the point mp having a minimal value (hereinafter called a “minimal point”) is extracted so that the part includes both a part where the similarity score decreases monotonically on the left side of the minimal point and a part where the similarity score increases monotonically on the right side of the minimal point.
Then, based on the similarity scores Clp, Cmp and Crp at the start point lp, the minimal point, and end point rp, respectively, of the extracted part, a value ds (depth score), which indicates the fluctuation steepness of the similarity score at the minimal point, is calculated as follows:
$$ds = (C_{lp} - C_{mp}) + (C_{rp} - C_{mp}) \qquad (2)$$
Then, only when ds exceeds a threshold h calculated as follows, is the minimal point recognized as a topic boundary.
$$h = C_0 - \sigma/2 \qquad (3)$$
where $C_0$ and $\sigma$ are the mean value and the standard deviation, respectively, of the similarity over an entire document. According to this method, the more steeply the similarity decreases at a part, the more likely that part is considered to be a topic boundary. Hearst also shows another method of detecting a topic boundary by keeping track of active chains of repeated terms, so that a point at which the bulk of one set of chains ends and another set of chains begins is identified as a topic boundary.
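The depth-score adjustment of equations (2) and (3) can be sketched as follows, assuming the similarity scores have already been computed at successive positions and are given as a list. The function names `depth_score` and `topic_boundaries` are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form is meant.

```python
import statistics


def depth_score(scores, mp):
    """Depth score ds of equation (2) at the minimal point with index mp."""
    # Walk left while the score keeps rising: this is the part where the
    # similarity decreases monotonically toward the minimal point.
    lp = mp
    while lp > 0 and scores[lp - 1] >= scores[lp]:
        lp -= 1
    # Walk right while the score keeps rising away from the minimal point.
    rp = mp
    while rp < len(scores) - 1 and scores[rp + 1] >= scores[rp]:
        rp += 1
    return (scores[lp] - scores[mp]) + (scores[rp] - scores[mp])


def topic_boundaries(scores):
    """Indices of minimal points whose depth score exceeds the threshold h."""
    # Threshold of equation (3): mean minus half the standard deviation.
    h = statistics.mean(scores) - statistics.pstdev(scores) / 2
    boundaries = []
    for mp in range(1, len(scores) - 1):
        is_minimal = scores[mp] < scores[mp - 1] and scores[mp] <= scores[mp + 1]
        if is_minimal and depth_score(scores, mp) > h:
            boundaries.append(mp)
    return boundaries
```

In a score sequence with one deep valley and one shallow dip, only the deep valley passes the threshold, which is the intended effect of ignoring fine fluctuations.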
As another method of detecting topic passages, a method that uses as a clue a sentence beginning with a clause containing the Japanese topic-marking particle “wa” is also widely known (Japanese Laid-open Patent Publication No. 7-160711 “Topic Structure Detection Method and Apparatus for Written Language Text”). A method that combines this method with one similar to the second version of the Hearst method is also widely known (Gen Mochizuki, Takeo Honda and Manabu Okumura, “Text Segmentation Using a Multiple Regression Analysis and a Cluster Analysis”, in the Proceedings of the Second Annual Meeting of the Association of Natural Language Processing, pp. 325 to 328, March 1996).
However, the conventional digest generation methods have the following problems.
With a method that determines the keywords of a document and generates a digest by extracting sentences containing many keywords, it is difficult to generate a digest of a long document, especially one composed of several parts of text concerning different topics. Since different sets of keywords are required for parts concerning different topics, simple keyword extraction based on the use frequency of a term in an entire document is not appropriate. If a digest is generated based on a set of keywords that are used frequently in one part of text but infrequently in another part, the resulting digest may include sentences of no importance extracted from the part where the keywords are used infrequently.
In order to solve this problem, it is necessary to detect topic passages in a document. However, there is no method of directly detecting large topic passages based on lexical cohesion, which is another problem.
In the conventional technologies, when topic passages are detected based on lexical cohesion in a manner similar to the Hearst method, detection has been attempted only for topic passages of several paragraphs or, at most, one newspaper article. Larger topic passages, such as chapters, were detected using characteristic patterns in the physical appearance of a document, such as the characteristic layout of chapters (hereinafter called “document patterns”).
For example, in the Japanese Laid-open Patent Publication No. 2-254566, a series of formal paragraphs (paragraphs formally separated by indentations, etc.) having close contextual relation are automatically detected as semantic paragraphs, and a digest is generated based on two types of keywords: keywords extracted based on use frequency in an entire document and those extracted based on the use frequency in each semantic paragraph. However, in this method, the semantic paragraphs never go beyond the breaks of a larger logical element of a document, such as a chapter, clause, etc. This is because breaks of a larger logical element, which are detected by a document pattern, are given priority over dividing points of semantic paragraphs, and there is no more process to combine larger logical elements.
Even in the detection of topics, since the major clue in the detection of semantic paragraphs is a term used repeatedly within the range of two adjacent formal paragraphs, it is difficult to detect a larger topic passage. Although the position information of a term used in the first place is also used, it is not sufficient for judging lexical cohesion due to terms repeatedly used at long intervals, etc.
Clauses belonging to the same chapter sometimes have different semantic cohesion. In this case, a method of precisely detecting larger topic passages is required. In addition, since a document pattern is a rule regarding a specific kind of document, applying it to the summarization of various kinds of documents requires an empirical rule to be prepared for each kind of document, which is another problem.
It is an object of the present invention to provide a general-purpose digest generator apparatus and a method of automatically detecting the topic structure of a document based on phenomena observed in documents in general, such as lexical cohesion, and generating a digest corresponding to the topic structure.
In the first aspect of the present invention, the digest generator apparatus comprises a structure detection unit, an extractor unit, a selector unit and an output unit.
The structure detection unit detects the hierarchical structure of topics in a given document, and the extractor unit extracts keywords regarding each detected topic. The selector unit selects key sentences from topic passages based on the use condition of the keywords, and generates a digest using the key sentences. The output unit outputs the generated digest.
In the second aspect of the present invention, the digest generator apparatus comprises an extractor unit, a generator unit and an output unit.
The extractor unit evaluates whether or not a word is characteristic of a process target topic passage based on both the use frequency of the word in the process target topic passage in a given document and the use frequency of the word in a longer topic passage containing the process target topic passage, and extracts keywords from the target topic passage based on the evaluation result. The generator unit generates a digest based on the use condition of the extracted keywords, and the output unit outputs the generated digest.
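One way to picture the evaluation in the second aspect is to compare a word's relative frequency in the target passage with its relative frequency in the longer passage that contains it. The sketch below uses a simple ratio test for this comparison; the ratio test itself, the function name `characteristic_keywords`, and the thresholds `min_ratio` and `min_count` are assumptions chosen for illustration, not the evaluation defined by the invention.

```python
from collections import Counter


def characteristic_keywords(passage_words, parent_words, min_ratio=1.5, min_count=2):
    """Return words relatively more frequent in the passage than in its parent.

    parent_words is the word list of the longer passage and is expected to
    include the words of the target passage. min_ratio and min_count are
    illustrative thresholds only.
    """
    local, parent = Counter(passage_words), Counter(parent_words)
    keywords = []
    for word, count in local.items():
        local_rate = count / len(passage_words)
        parent_rate = parent[word] / len(parent_words)
        # A word is treated as characteristic when it is used often enough
        # and its local rate clearly exceeds its rate in the longer passage.
        if count >= min_count and local_rate >= min_ratio * parent_rate:
            keywords.append(word)
    return sorted(keywords)
```

A word that is frequent throughout the longer passage is rejected by this test even if it is also frequent in the target passage, because its local rate does not stand out.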
In the third aspect of the present invention, the digest generator apparatus comprises an extractor unit, a generator unit and an output unit.
The extractor unit extracts local keywords from a topic passage for digest generation, and extracts global keywords from a longer topic passage containing the input topic passage. The generator unit generates a digest based on the use condition of both the extracted local keywords and global keywords. The output unit outputs the generated digest.
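The combined use of local and global keywords in the third aspect might be sketched as follows. The sketch assumes that keywords are simply the most frequent words of each passage and that a sentence's score is the number of distinct local keywords plus the number of distinct global keywords it contains; both choices, and the names `top_words`, `score_sentences`, `n_local` and `n_global`, are hypothetical.

```python
from collections import Counter


def top_words(words, n):
    """Return the n most frequent words as a set (assumed keyword criterion)."""
    return {w for w, _ in Counter(words).most_common(n)}


def score_sentences(sentences, passage_words, parent_words, n_local=2, n_global=2):
    """Score tokenized sentences by the local and global keywords they contain."""
    local_kw = top_words(passage_words, n_local)    # from the target passage
    global_kw = top_words(parent_words, n_global)   # from the longer passage
    scores = []
    for sentence in sentences:
        words = set(sentence)
        # A word that is both a local and a global keyword counts twice.
        scores.append(len(words & local_kw) + len(words & global_kw))
    return scores
```

Each sentence is given as a list of words, and the digest would then be built from the highest-scoring sentences.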
In the fourth aspect of the present invention, the digest generator apparatus comprises a cohesion calculator unit, a major part specifying unit, a generator unit and an output unit.
The cohesion calculator unit calculates a lexical cohesion in the neighborhood of each position in a given document, and the major part specifying unit removes areas having a lower cohesion from a process target, and extracts areas having a higher cohesion as major parts. The generator unit generates a digest using the major parts, and the output unit outputs the generated digest.
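The step performed by the major part specifying unit can be illustrated with a short sketch, assuming the cohesion values have already been computed at successive positions and are given as a list. The function name `major_parts` and the default threshold (the mean cohesion) are assumptions for illustration; the text above does not fix how the threshold is chosen.

```python
def major_parts(cohesion, threshold=None):
    """Return (start, end) index ranges, end exclusive, with cohesion >= threshold."""
    if not cohesion:
        return []
    if threshold is None:
        # Assumed default: positions at or above the mean cohesion are kept.
        threshold = sum(cohesion) / len(cohesion)
    parts, start = [], None
    for i, c in enumerate(cohesion):
        if c >= threshold and start is None:
            start = i                      # a high-cohesion area begins
        elif c < threshold and start is not None:
            parts.append((start, i))       # the area ends before position i
            start = None
    if start is not None:
        parts.append((start, len(cohesion)))
    return parts
```

The low-cohesion positions between the returned ranges are the areas removed from the process target.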