1. Field of the Invention
The present invention relates to an apparatus for summarizing an electronic document written in a natural language. It has been developed to support the selection of, and access to, a large volume of retrieved documents, and to support the access, restructuring (repeated use), and management of a large volume of accumulated documents.
Recently, documents have increasingly been stored on electronic media, and an explosively increasing number of documents are accessed and repeatedly used on computers through new document communications media such as the Internet and intranets. Under these circumstances, technological development is accompanied by a larger volume and a larger variety of technical documents, thereby increasing the demand for accumulating and repeatedly using a large volume of documents.
With such a large volume of documents, the effectiveness of each document should be quickly determined so that a document appropriate to the purpose can be selected. To attain this, it is necessary to display a list of documents together with information indicating the contents of each document. Such information can be a title or an abstract of a document. However, the title may not actually represent the contents of the document, or an abstract may be missing. When a document is accessed online, the number of characters that can be displayed is limited. Therefore, an abstract may not be appropriately displayed because it contains too many characters. Thus, a technology for automatically generating an appropriate summary is in strong demand.
To use documents efficiently and repeatedly, a large volume of documents should be properly classified and arranged when accumulated. Here, appropriate summarization is required to quickly understand the contents of a new document to be classified, to obtain an outline of each classification so that the administrator of the accumulated documents can improve the classification system, and to inform a user unfamiliar with the classification system of the actual classification.
The feature of the present invention is to adjust a summarization result using the document summarization apparatus depending on the focused concept and the known concept of the user.
2. Description of the Related Art
There have been two major methods of generating the summary of a document in the conventional document summarization technology. The first method is to recognize and extract an important portion in a document (normally the logical elements of a document such as a sentence, a paragraph, a section, etc., and hereinafter referred to as a sentence), and generate a summary. The second method is to prepare a pattern of information to be extracted as a summary and make a summary after extracting words or phrases in the document according to the condition of the pattern or extracting sentences according to the pattern. Since the second method is little related to the present invention, the first method is described below.
The first method is further divided into a few submethods depending on the key used to evaluate the importance of a sentence. Typical keys are:
1. occurrence and distribution of words in a document; and
2. coherence relation between sentences and position where a sentence appears.
(The importance of a sentence can also be evaluated by the syntax pattern of a sentence, but this method is omitted here because it hardly relates to the present invention.)
In method 1, that is, the method depending on the occurrence and distribution of words in a document, the importance of each word (phrase) contained in a document is normally determined first, and then the importance of each sentence is evaluated depending on the number of important words contained in the sentence. Important sentences are then selected and a summary is generated. The importance of a word is calculated from the occurrence of the word in the document, which can be weighted by taking into account the deviation of that occurrence from the occurrence of the word in a common document set, or the position where the word appears (for example, a word appearing in a title is regarded as an important word). Normally, a focused word is an independent word in Japanese (especially a noun), and a content word in English. An independent word and a content word each refer to a word having a substantial meaning, such as a noun, adjective, or verb, as distinguished from syntactic words such as prepositions and auxiliaries. Formally, an independent word in Japanese is defined as a word that can form part of an independent section in a sentence. This differs slightly from the description above, but the purpose of limiting a focused word to an independent word is as described above.
For example, method 1 is described in the following documents.
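As an illustration only, the general idea of method 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the procedure of any cited publication: the stop list `STOP_WORDS`, the function names, and the `title_boost` weight are all hypothetical, and a simple stop list stands in for real content-word detection.

```python
import re
from collections import Counter

# Hypothetical stop list standing in for "syntactic words"; a real system
# would use part-of-speech tagging to pick out content words.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "and", "to", "it", "by"}

def content_words(text):
    """Lowercase the text and keep only words that are not in the stop list."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize_by_frequency(sentences, title="", max_sentences=1, title_boost=2.0):
    """Rank sentences by the summed importance of their distinct content words."""
    counts = Counter(w for s in sentences for w in content_words(s))
    title_words = set(content_words(title))
    # Word importance: occurrence count, weighted up when the word is in the title.
    importance = {w: c * (title_boost if w in title_words else 1.0)
                  for w, c in counts.items()}
    scores = [sum(importance[w] for w in set(content_words(s))) for s in sentences]
    # Pick the top-scoring sentences, then restore document order.
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:max_sentences]
    return [sentences[i] for i in sorted(top)]
```

Comparing word occurrence against a common document set, which the text also mentions, is omitted here to keep the sketch short.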
In the Japanese Laid-open Patent Publication (Tokkaihei) No. 06-259424 "Document Display Apparatus, Document Summarization Apparatus, and Digital Copy Apparatus" and the following document 1 by the same author, a summary is generated by extracting a portion containing a number of words contained in the title as an important portion related to the title.
Document 1: Masayuki Kameda, "Extraction of Important Keyword and Important Sentence by Pseudokeyword Correlation Method", disclosed in the second annual meeting, Association for Natural Language Processing, pp. 97-100, March 1996.
In the Japanese Laid-open Patent Publication (Tokkaihei) No. 07-36896 "Document Summarization Method and Apparatus", a seed for an important representation is selected based on the complexity (word length, etc.) of the representation (word, etc.) in a document, and a summary is generated by extracting a sentence containing a larger number of important seeds.
In the Japanese Laid-open Patent Publication (Tokkaihei) No. 08-297677 "Automatic Method of Generating Summary of Subject", words of main subjects are recognized in order from the highest occurrence of a word in a document, and a summary is generated by extracting a sentence containing a larger number of important subject words.
In the Japanese Laid-open Patent Publication (Tokkaihei) No. 06-215049 "Document Summarization Apparatus", a summary is generated by extracting a sentence or paragraph whose feature vector is similar to that of the entire document, applying the vector space model often used in determining the relevance between a retrieval result and a query sentence. In a vector space model, a dimension (axis) is assigned to each keyword or each meaning element of a word, and a feature of a document or a query sentence is represented by a feature vector indicating the existence or occurrence of each word in the document or the query sentence.
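The vector space idea described above can be illustrated with a short Python sketch. It assumes one dimension per surface word, raw occurrence counts as vector values, and cosine similarity as the similarity measure; the cited publication may use different features and measures, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Sparse feature vector: one dimension per word, value = occurrence count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def most_representative(sentences):
    """Return the sentence whose vector is closest to the whole-document vector."""
    doc_vec = term_vector(" ".join(sentences))
    return max(sentences, key=lambda s: cosine(term_vector(s), doc_vec))
```

The same `cosine` function could compare a document vector with a query-sentence vector, which is the retrieval use mentioned in the text.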
In method 2 depending on the coherence relation between sentences and the position of the sentence, an important sentence is selected by determining the (relative) importance of the sentence based on the conjunction (also referred to as the coherence relation) of sentences such as `and`, `but`, `then`, etc., and the position where a sentence appears in a document. This method is described in, for example, the Japanese Laid-open Patent Publication (Tokkaihei) No. 07-182373 "Document Information Retrieval Apparatus and Document Retrieval Result Display Method" and the following document 2 by the same applicant and document 3 by other applicants.
Document 2: Kazuo Sumita, Tetsuo Tomono, Kenji Ono, and Seiji Miike. "Automatic abstract generation based on document structure analysis and its evaluation as document retrieval presentation function". Transactions of the Institute of Electronics, Information and Communication Engineers, Vol.J78-D-II, No. 3, pp.511-519, March 1995 (in Japanese).
Document 3: Kazuhide Yamamoto, Shigeru Masuyama, and Shozo Naito. "GREEN: An experimental system generating summary of Japanese editorials by combining multiple discourse characteristics". IPSJ SIG Notes NL-99-3, Information Processing Society of Japan, January 1994 (in Japanese).
In addition to the technology of generating a summary of an entire document as described above, there is a technology of presenting a user-focused portion to support the determination of the effectiveness of each document. Well-known examples are the method of displaying the portion surrounding a retrieved word, referred to as keyword in context (KWIC), and similar methods of displaying the vicinity of a retrieved word, all of which are widely used.
There also is a method of presenting only a specific portion depending on a user's purpose, such as a portion describing the background of a study in a thesis, the first paragraph of a newspaper article, etc. Examples of this method are described in the Japanese Laid-open Patent Publication (Tokkaihei) No. 07-182373, document 3, and documents 4 and 5 by another applicant. However, in these technologies, a portion assigned a special function in the logical structure of a document is selected using field-specific document configuration and wording as clues. Therefore, a user-focused portion is not specifically selected, nor can a portion closely related to the user-focused portion be presented.
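For illustration, a keyword-in-context display can be sketched as below. This is a minimal sketch assuming whitespace tokenization and a fixed context width counted in words; the function name `kwic` and the `width` parameter are hypothetical.

```python
def kwic(text, keyword, width=3):
    """Return each occurrence of keyword with `width` words of context per side."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if w.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, w, right))
    return hits
```

Each hit is a `(left context, matched word, right context)` tuple. Only the physical neighborhood of the keyword is shown, regardless of sentence or paragraph boundaries.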
Document 4: Noriko Kando. "Functional structure analysis of research articles selected from three specialties: Automatic category assignment." Library and Information Science, No.31, pp. 25-38, 1993 (in Japanese).
Document 5: Noriko Kando. "Functional structure analysis of the research articles and its applications." Journal of Japan Society of Library Science, Vol.40, No.2, pp.49-61, June 1994 (in Japanese).
Factors that lower the readability of a summary include redundant representations, words unknown to the user, unresolved anaphoric expressions (such as `it`, `this`, `that`), etc.
Among the above listed factors, redundant representations can be reduced by deleting excess modifier elements using heuristics based on wording characteristics, the correlation between modifier elements and modified elements, and the distance between a modifier element and a modified element. For example, the above described document 3 presents a heuristic, for summarizing a Japanese newspaper article, of deleting the first modifier element when two or more elements modify the same noun. The following document 6 by the same authors presents another heuristic of deleting the introduction of a successive article in a series of relevant articles if 70% or more of the nouns in the introduction occur in the introductions of the former articles.
Document 6: Takahiro Funasaka, Kazuhide Yamamoto, and Shigeru Masuyama. "Relevant newspaper articles summarization by redundancy reduction." IPSJ SIG Notes NL-114-7, Information Processing Society of Japan, July 1996 (in Japanese).
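The 70% noun-overlap heuristic described above can be sketched as follows. The sketch assumes that the nouns have already been extracted (for example, by a morphological analyzer) and that each noun occurrence is counted; the function name and the `threshold` parameter are hypothetical, and document 6 may count nouns differently.

```python
def is_redundant_introduction(intro_nouns, earlier_intro_nouns, threshold=0.7):
    """True if at least `threshold` of the nouns in an introduction already
    occurred in the introductions of earlier articles."""
    nouns = list(intro_nouns)
    if not nouns:
        return False  # nothing to compare, so keep the introduction
    seen = set(earlier_intro_nouns)
    repeated = sum(1 for n in nouns if n in seen)
    return repeated / len(nouns) >= threshold
```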
It is obvious that definitions and descriptions of words, if any, should be included in a summary to solve the problem of unknown words.
For an anaphoric expression, its antecedent is searched for, and either the anaphoric expression is replaced with the antecedent or a portion containing the antecedent is included in the summary so that the summary can be easily understood. The antecedent of an anaphoric expression can be identified by a method referred to as a centering method. This method makes a list of centers, that is, elements of a sentence that are probable antecedents of anaphoric expressions in the subsequent sentences. The probability that an element is an antecedent is calculated mainly from its syntactic role in the sentence, such as subject, direct object, etc. Then, the method resolves an anaphoric expression by selecting the most probable element from the list under the restriction of agreement in number, gender, etc. In a similar method, a center is also referred to as a focus. However, no technology can obtain a perfect result. The centering methods are described in the following documents.
Document 7: Megumi Kameyama. A property-sharing constraint in centering. In Proceedings of the 24th Annual Meeting of Association for Computational Linguistics, pp.200-206, 1986.
Document 8: Susan E. Brennan, Marilyn W. Friedman, and Carl J. Pollard. A centering approach to pronouns. In Proceedings of the 25th Annual Meeting of Association for Computational Linguistics, pp. 155-162, 1987.
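The selection step described above can be illustrated with a heavily simplified Python sketch. It is not the full centering algorithm of documents 7 and 8: the role ranking in `ROLE_RANK`, the `Mention` structure, and the agreement features are assumptions made for this example, and the mentions are assumed to come from a prior syntactic analysis.

```python
from dataclasses import dataclass

# Assumed salience ranking of syntactic roles (higher = more probable antecedent).
ROLE_RANK = {"subject": 3, "direct_object": 2, "other": 1}

@dataclass
class Mention:
    text: str
    role: str      # a key of ROLE_RANK
    number: str    # "sg" or "pl"
    gender: str    # "m", "f", or "n"

def resolve_pronoun(pronoun, previous_sentence_mentions):
    """Pick the highest-ranked center that agrees with the pronoun, or None."""
    centers = sorted(previous_sentence_mentions,
                     key=lambda m: ROLE_RANK.get(m.role, 0), reverse=True)
    for m in centers:
        # Agreement restriction: number and gender must match.
        if m.number == pronoun.number and m.gender == pronoun.gender:
            return m
    return None
```

Returning `None` when no center agrees reflects the point made in the text that such methods do not always find an antecedent.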
According to the above described Japanese Laid-open Patent Publications (Tokkaihei) No. 07-182373 and No. 07-44566 "Abstract Generation Apparatus" by the same applicants, the position of the definition of an unknown word and the antecedent of an anaphoric expression are estimated, and a hypertext link is set from the original word or anaphoric expression to the estimated position for the user's convenience.
To select an effective document from a large volume of documents, it is important to inform a user how the author of a document treats a topic relevant to the user-requested information. This helps the user determine the relevance of the document. In a retrieval system, user-requested information is often represented as a query sentence or a query expression using keywords. However, user-requested information is not fully described in those forms. A document containing a word in a query sentence or a query expression does not necessarily supply the user-requested information. For example, when a patent gazette is searched using the keyword `translation`, the retrieval result may contain a large number of patents about the translation of machine language although the user wants information about patents relating to the translation of sentences in a natural language. In this case, presenting the word `translation` in its context may correctly support the selection of a document. The above described KWIC can be used for this purpose, but it is difficult to grasp the flow of the logic because only the physical vicinity is indicated, and a concise summary suited to the purpose cannot be easily prepared.
From this point of view, the conventional summarization technology described above takes into account only the importance of a sentence in a document when determining whether or not the sentence is included in a summary. Therefore, a user's request is not considered. As a result, if a keyword matches an unimportant portion of a document, such as an example in a linguistic document, an automatically generated summary of the retrieved document confuses the user because it does not contain the portion relevant to the user's request.
Described below is a further problem with linguistic documents. In a linguistic document, the formal nature of a language is discussed, and the content of an example given in the document does not have to be related to the linguistic discussion. For example, the Japanese sentence "An elephant has a long nose." is a frequently cited linguistic example. When a user searches for information about animals, a document containing such an example can be retrieved. Since the document is a linguistic document, words relating to animals occur rarely when the frequency of words in the document is checked, and `elephant` is judged not to be an important word. If an automatically generated summary based on the frequency distribution is displayed as a retrieval result, such examples are hardly ever contained in the summary, which confuses the user. That is, when the keyword `elephant` is input, such a linguistic document may be retrieved, but the keyword is not contained in the display (automatically generated summary) of the retrieval result, and the user cannot understand why the document was retrieved. On the other hand, when only the vicinity of the keyword is displayed, only the example portion is displayed and the user cannot understand what the document is about.
Another problem with the conventional summarization technology is that it includes no unit for generating a summary depending on a user's knowledge level. Since the knowledge level differs from user to user, definitions and descriptions of technical terms should be included in a summary according to each user's knowledge level. Otherwise, a user with a high knowledge level may find the summary redundant, while a user with a low knowledge level may find it difficult to understand.