1. Field of the Invention
The present invention relates to an apparatus and method for summarizing machine-readable documents written in a natural language, etc. In particular, it is intended to help a user read a long document, such as a manual, report, book, etc., on a computer display by generating a summary which fits in approximately one screen and provides the user with the essence of the document content.
2. Description of the Related Art
A principal text summarization technology currently in practical use generates a summary by detecting and extracting key sentences in a document. This technology is further classified into several methods according to the clue used to evaluate the importance of a sentence. The following two methods are typical.
(1) A method utilizing the appearance frequency and distribution of words in a document as clues; and
(2) A method utilizing the coherence relation between sentences and the appearance position of a sentence as clues.
The first method first determines the importance of words (phrases) in a document and evaluates the importance of each sentence according to the number of important words contained in the sentence. Then, the method selects key sentences based on the evaluation result and generates a summary.
Several methods for determining the importance of a word in a document are well known: a method of utilizing the appearance frequency (number of times of use) of the word in a document without modification; a method of weighting the appearance frequency of the word with the difference between that frequency and the appearance frequency of the word in a more general document collection, etc.; and a method of weighting the appearance frequency of the word with the appearance position of the word, for example, treating a word which appears in a heading as important, etc.
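The first method can be illustrated with a minimal sketch. It is only an assumption-laden illustration, not the method of any of the cited documents: the names `tokenize`, `word_weights`, `key_sentences`, the `STOP_WORDS` set and the `heading_boost` factor are all hypothetical, and a sentence is scored here by the summed weight of its distinct words rather than by a plain count of important words.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "was", "is", "and", "of"}

def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def word_weights(sentences, heading="", heading_boost=2.0):
    """Weight each word by its appearance frequency in the document,
    multiplied by heading_boost if the word also appears in the heading."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    heading_words = set(tokenize(heading))
    return {w: c * (heading_boost if w in heading_words else 1.0)
            for w, c in counts.items()}

def key_sentences(sentences, weights, top_n=1):
    """Return the top_n highest-scoring sentences in document order."""
    def score(sentence):
        return sum(weights.get(w, 0.0) for w in set(tokenize(sentence)))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]

sentences = [
    "The summarizer extracts key sentences.",
    "Key sentences contain frequent words.",
    "The weather was fine.",
]
weights = word_weights(sentences, heading="Key sentences")
summary = key_sentences(sentences, weights, top_n=1)
```

Because a single set of weights is computed over the whole document, such a sketch has no notion of topics and treats the document as one undivided text.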
Text summarization methods of this kind include, for example, the following.
Japanese Patent Application Laid-open No. 6-259424 “Document Display Apparatus and Digest Generation Apparatus and Digital Copying Apparatus” and a piece of literature by the inventor of the invention (Masayuki Kameda, “Extraction of Major Keywords and Key Sentences by Pseudo-Keyword Correlation Method”, in the Proceedings of the Second Annual Meeting of the Association for Natural Language Processing, pp. 97-100, March 1996) generate a summary by extracting, as important parts deeply related to a heading, those parts which include many words appearing in the heading.
Japanese Patent Application Laid-open No. 7-36896 “Method and Apparatus for Generating Digest” extracts major expressions (word, etc.) as seed from a document based on the complexity of an expression (length of a word, etc.) and generates a summary by extracting sentences including more of the major expression seed.
Japanese Patent Application Laid-open No. 8-297677 “Method of Automatically Generating Digest of Topics” detects “topical terms” based on the appearance frequency of words in a document and generates a summary by extracting sentences containing many major “topical terms”.
The second method judges the (relative) importance of sentences based on the coherence relation between sentences, such as sequence, contrast, exemplification, etc., or the position of sentences in a document, etc., and selects important sentences.
This method is introduced in pieces of literature, such as Japanese Patent Application Laid-open No. 6-12447 “Digest Generation Apparatus”, Japanese Patent Application Laid-open No. 7-182373 “Document Information Search Apparatus and Document Search Result Display Method” and a piece of literature by the inventors of these inventions (Kazuo Sumita, Tetsuro Chino, Kenji Ono and Seiji Miike, “Automatic Abstract Generation based on Document Structure Analysis and Its Evaluation as a Document Retrieval Presentation Function”, in the Journal of the Institute of Electronics Information and Communication Engineering, Vol. J78-D-II, No. 3, pp. 511-519, March 1995), and a piece of literature by another author (Kazuhide Yamamoto, Shigeru Masuyama and Shozo Naito, “GREEN: An Experimental System Generating Summary of Japanese Editorials by Combining Multiple Discourse Characteristics” in the IPSJ SIG Notes, Information Processing Society of Japan, NL-99-3, January 1994), etc.
These text summarization technologies are effective for a single-topic text, such as a newspaper article, editorial, thesis, etc., but they have difficulty generating a summary of a long text which comprises several parts on different topics.
According to the first method, it is difficult to determine the importance of words in such a multi-topic text because important words should differ for each topic.
According to the second method, the coherence relation between sentences, which is expressed by a conjunction, etc., is local. Therefore, it is difficult to judge the relative importance among large textual units, such as those beyond a section, because they are usually connected only by weak and vague relations or are arranged almost at random from the viewpoint of coherence relations.
Under these circumstances, a technology for generating a summary in combination with a technology for detecting topic passages in a document has been developed to solve this problem.
For example, a piece of literature by the inventor of the present invention (Yoshio Nakao, “Digest Generation based on Automatic Detection of Semantic Hierarchic of a Text”, in the Proceedings of a Workshop held alongside the Fourth Annual Meeting of the Association for Natural Language Processing, pp. 72-79, March 1998) and a prior Japanese Patent Application No. 10-072724 “Digest Generation Apparatus and Method thereof” (corresponding U.S. application Ser. No. 09/176,197) disclose a technology for detecting the hierarchical structure of topics in a document and extracting sentences containing many words characteristic of each topic.
Japanese Patent Application Laid-open No. 11-45278 “Document Processing Apparatus, Storage Medium recording Document Process Program and Document Process Method” discloses an idea of dividing an entire document into several sub-documents, detecting the break of a topic flow by checking the lexical similarities between the sub-documents and generating a summary for each topic.
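The idea of detecting a break in the topic flow from the lexical similarity between sub-documents can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited application: cosine similarity over word counts, the `threshold` value and the names `bag_of_words`, `cosine` and `topic_breaks` are all hypothetical choices.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count the alphabetic words of a sub-document, ignoring case."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def topic_breaks(blocks, threshold=0.1):
    """Return each index i such that a topic break is assumed to lie
    just before blocks[i], i.e. adjacent blocks share too few words."""
    vectors = [bag_of_words(b) for b in blocks]
    return [i + 1 for i in range(len(vectors) - 1)
            if cosine(vectors[i], vectors[i + 1]) < threshold]

blocks = [
    "cats purr and cats sleep",
    "cats sleep all day",
    "stocks fell as markets closed",
    "markets and stocks recovered",
]
breaks = topic_breaks(blocks)
```

In this toy input the second and third sub-documents share no words, so the only break reported lies between them; a summary could then be generated separately for the blocks on each side.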
Although this literature discloses the method of detecting a change of topics only briefly and at an abstract level, it is considered to be a variant of the prior art, such as a piece of literature by Salton et al. (Gerard Salton, Amit Singhal, Chris Buckley and Mandar Mitra, “Automatic Text Decomposition using Text Segments and Text Themes”, in Proc. of Hypertext '96, pp. 53-65, the Association for Computing Machinery, March 1996).
Although it does not aim to generate a summary of a long document, Japanese Patent Application Laid-open No. 2-254566 also presents a text summarization method based on topic passage detection. It detects semantic paragraphs by connecting a series of structural paragraphs (paragraphs structurally distinguished by an indentation, etc.) based on their content relevance, and generates a summary using keywords with a high appearance frequency extracted both from the entire text and from each semantic paragraph.
However, there is a problem related to the textual coherence of a summary. To make a very short summary of less than 1% of a source text, only a small number of sentences can be extracted from among many important sentences. Therefore, a summary generated simply by extracting important sentences may become merely a collection of unrelated sentences. Furthermore, an important point with originality is by nature new information, and a reader needs some introduction in order to understand it.
At this point, some appropriate mechanisms are required for improving textual coherence of a summary and for making a summary understandable in addition to the conventional text summarization technology described above.
In addition, there is another problem related to the readability of a summary. A summary of a long text naturally becomes long. For example, a summary of a book of one hundred pages will be one page even at a high compression rate of 1%. A one-page summary is much shorter than such a long source text, but is too long for a user to read easily without some breaks indicating turns of topics or discussions. Even for an entire expository text, a piece of literature by Yaari (Yaakov Yaari, “Texplore-exploring expository texts via hierarchical”, in Proceedings of the Workshop on Content Visualization and Intermedia Representations (CVIR '98), Association for Computational Linguistics, August 1998) proposed a method for visualizing a hierarchical structure of topics with generated headers to assist a reader in exploring the content of the text. Such assistance is all the more strongly required for a summary, whose purpose is to help a user understand the content quickly.
At this point, the Japanese Patent Application Laid-open No. 6-12447, described above, also discloses a technology for generating a summary for each chapter or section, which is detected using the rendering features of a logical document element, such as the tendency of a section header to comprise a decimal number followed by capitalized words. However, a method that detects a large textual unit based on rendering features is not expected to have wide coverage. In other words, since the rendering features of logical elements vary according to document types, there is a problem that heuristic rules for detection must be prepared for every document type. Moreover, the logical structure of a text does not always correspond to its topic structure, especially in a case where a section comprises an overview clause followed by other clauses that can be divided into several groups according to the subtopics they discuss.
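The limited coverage of a rendering-feature heuristic can be seen in a minimal sketch. The regular expression and the name `looks_like_section_header` are hypothetical and are not taken from the cited application; the rule merely assumes that a header is a decimal number followed by a word starting with a capital letter.

```python
import re

# Assumed rule: a decimal number such as "2", "2.1" or "1." at the start of
# the line, then whitespace, then a word beginning with a capital letter.
SECTION_HEADER = re.compile(r"^\d+(?:\.\d+)*\.?\s+[A-Z]")

def looks_like_section_header(line):
    """Return True if the line matches the assumed header pattern."""
    return bool(SECTION_HEADER.match(line.strip()))

candidates = [
    "2.1 Related Work",           # matched: numbered header
    "1. Field of the Invention",  # matched: numbered header
    "Chapter Two: Related Work",  # missed: header in a different style
    "1995 Was a good year.",      # matched by mistake: ordinary sentence
]
results = [looks_like_section_header(c) for c in candidates]
```

The rule misses a header written in another style and accepts an ordinary sentence that happens to begin with a number, which is why a separate set of rules would be needed for each document type.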
To avoid these problems, the present invention uses a method of detecting the hierarchical topic structure of a source text not by rendering features but by linguistic features that a text has in general, and provides a mechanism for improving the readability of a summary based on the hierarchical topic structure of the source text.