Extracting the important parts of a document is one of the essential processes in document summarization. The extraction basically comprises quantitatively assigning an importance to each sentence of the document and extracting the sentences having high importance. Conventional techniques of document summarization processing are described in “Automated Text Summarization: A survey” by M. Okumura and E. Nanba, in Journal of Natural Language Processing, Vol. 6 No. 6, July 1999. This literature enumerates seven features used for evaluating importance, including (1) occurrence frequencies of terms in a document, (2) positional information within the document, (3) document title information and (4) text structure obtained through an analysis of the relation between sentences. In particular, the occurrence frequencies of terms in a document are regarded as a basic feature because content terms that occur frequently in the document tend to indicate the topic of the document. Specific methods that utilize this information include a method that gives a weight to each term in accordance with its frequency of occurrence within an input document and defines the importance of each sentence based on the sum of the weights of the terms contained in that sentence, and a method that weights each term using not only the occurrence frequencies of terms but also the number of documents containing each term within a set of documents.
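The first frequency-based method described above can be sketched as follows. This is a minimal illustration only, not the implementation of any cited system: the tokenizer, the function names and the sample sentences are assumptions, and no distinction is made between content terms and function words.

```python
from collections import Counter

def tokenize(sentence):
    # Naive tokenizer (an assumption): split on whitespace,
    # strip common punctuation, and lowercase.
    words = (w.strip(".,;:!?").lower() for w in sentence.split())
    return [w for w in words if w]

def score_sentences(sentences):
    # Term weight = occurrence frequency of the term in the whole document.
    term_weight = Counter(t for s in sentences for t in tokenize(s))
    # Sentence importance = sum of the weights of the terms it contains.
    return [sum(term_weight[t] for t in tokenize(s)) for s in sentences]

# Hypothetical three-sentence document.
sentences = [
    "Vectors represent documents.",
    "Documents contain terms.",
    "Terms occur in documents and terms form vectors.",
]
scores = score_sentences(sentences)
print(scores)
```

In this sketch the third sentence receives by far the largest score, largely because it contains the most terms, so the sum grows with sentence length.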
The above-referenced literature also describes a dynamic document summarization technique. When presenting a retrieval result to a user, the technique presents, as a summary, the important parts of the document that are related to the user's query, and thereby helps the user determine quickly and accurately whether the retrieved document matches the query. The literature further describes one conventional method for this purpose, namely a method for retrieving important sentences that reflect relatedness to a query. In this method, the importance calculated from the occurrence frequencies of terms within the document is added to a score obtained from the frequencies with which terms in the query occur within the target document.
A process for determining document similarity is essential to automatic document classification and document retrieval, in particular, similarity-based retrieval for retrieving documents that are similar to a user-specified document. In the process for determining document similarity, a document is often represented in vector form. In the following description, a vector that is generated from an entire document is called a document vector, a vector generated from a part of a document is called a document segment vector, and, particularly, a vector generated from a sentence is called a sentence vector. Various methods for defining the element values of a document vector are well known; for example, a method that compares the occurrence frequency of each term in the concerned document with a predetermined value to give 1 or 0 to each vector element, a method that uses the occurrence frequency itself, and a method that gives a value obtained by multiplying the occurrence frequency by the logarithm of the inverse of the ratio of the number of documents in which the corresponding term occurs to the total number of documents. Such document representation methods are commonly used in the vector space model.
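The three element-value definitions mentioned above can be compared side by side in a short sketch. The function name, the whitespace tokenization, the fixed vocabulary and the use of "greater than or equal to" for the threshold comparison are assumptions made for illustration.

```python
import math
from collections import Counter

def document_vectors(docs, vocab, threshold=1):
    n_docs = len(docs)
    counts = [Counter(d.lower().split()) for d in docs]
    # Number of documents in which each term occurs.
    doc_freq = {t: sum(1 for c in counts if c[t] > 0) for t in vocab}

    # (a) 1 or 0, by comparing the frequency with a predetermined value.
    binary = [[1 if c[t] >= threshold else 0 for t in vocab] for c in counts]
    # (b) The occurrence frequency itself.
    tf = [[c[t] for t in vocab] for c in counts]
    # (c) Frequency times log(total documents / documents containing the term).
    tfidf = [
        [c[t] * math.log(n_docs / doc_freq[t]) if doc_freq[t] else 0.0
         for t in vocab]
        for c in counts
    ]
    return binary, tf, tfidf

# Hypothetical toy collection.
docs = ["apple apple banana", "banana cherry", "apple cherry cherry cherry"]
vocab = ["apple", "banana", "cherry"]
binary, tf, tfidf = document_vectors(docs, vocab)
print(binary[0], tf[0], tfidf[0])
```

Note that with definition (c) a term occurring in every document receives a weight of zero, since the logarithm of 1 is 0.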
Such document vectors indicate which terms occur in the document and how often those terms occur. Since the concept of a document is considered to be represented by which terms occur in the document and how often they occur, the direction of the obtained document vector can be regarded as representing the document concept. In addition, the occurrence frequencies of terms in the document are related to the vector norm. The squared norm of the obtained document vector can be regarded as representing the strength or energy of the concerned document.
Similarity measured by the cosine between two vectors is often used to determine the similarity between two documents represented by those vectors. This similarity is defined as the value obtained by dividing the inner product of the two vectors by the product of their norms. Since the direction of the document vector represents a concept as described above, such similarity does not reflect the difference in energy between the documents but reflects only the difference between their concepts.
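A small sketch may make the distinction between direction and energy concrete. The vectors below are hypothetical term-frequency vectors over a three-term vocabulary, and the helper names are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def squared_norm(v):
    # Regarded above as the strength or energy of a document.
    return dot(v, v)

def cosine(u, v):
    # Inner product divided by the product of the two norms.
    denom = math.sqrt(squared_norm(u)) * math.sqrt(squared_norm(v))
    if denom == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot(u, v) / denom

short_doc = [1, 2, 0]
long_doc = [3, 6, 0]    # same direction as short_doc, three times the frequencies
other_doc = [0, 2, 1]   # different direction, same squared norm as short_doc

print(squared_norm(short_doc), squared_norm(long_doc))
print(cosine(short_doc, long_doc), cosine(short_doc, other_doc))
```

Here `short_doc` and `long_doc` differ in squared norm by a factor of nine yet have a cosine of 1 up to rounding, while `short_doc` and `other_doc` have the same squared norm but a cosine of about 0.8.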
It is an objective of the invention to provide a document analysis method for extracting important sentences from a given document and/or determining the similarity of two documents, and a method for representing documents that is suited to the document analysis method.
In extracting important sentences from a document, a sentence whose concept is close to the central concept of the concerned document should be given high priority. Accordingly, it is essential to determine the central concept of the document and to obtain the relationship between the concept of each sentence and that of the entire document. However, in the conventional methods, where sentence importance is defined by a sum of the weights of the terms, it is not always clear to what degree the sentence importance reflects the central concept of the document. Consequently, longer sentences tend to be extracted as important sentences merely because they are long. In addition, since the conventional methods do not obtain the relationship between the concept of each sentence and that of the entire document, there is no assurance that a sentence whose concept is close to the central concept of the concerned document will always be extracted.
In extracting sentences that are important and related to a query from a document, a method is often adopted that obtains the frequencies with which terms in the query occur in the target sentences. In this case, the score is zero if the query and a target sentence share no term. In practice, even if the query and the target sentence have no term in common, it is desirable that a non-zero relatedness be obtained if one of a pair of terms that co-occur frequently in the document is included in the query and the other is included in the target sentence. For example, assume a document containing a paragraph that introduces the relationship between “Tokyo” and “Ginza”. When a user issues a query including “Tokyo”, it is desirable for the system to be able to present to the user sentences including “Ginza” as well as sentences including “Tokyo”.
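The zero-score problem can be reproduced with a few lines. This is only a sketch of the conventional term-overlap score; the function names and the two example sentences are hypothetical.

```python
def tokens(text):
    # Naive tokenizer (an assumption): whitespace split, strip ". ,", lowercase.
    return [w.strip(".,").lower() for w in text.split()]

def query_term_frequency(query, sentence):
    # Number of term occurrences in the sentence that also appear in the query.
    query_terms = set(tokens(query))
    return sum(1 for t in tokens(sentence) if t in query_terms)

query = "Tokyo"
sentences = [
    "Tokyo is the capital of Japan.",
    "Ginza is a famous shopping district.",
]
scores = [query_term_frequency(query, s) for s in sentences]
print(scores)
```

The sentence about “Ginza” scores zero because it shares no term with the query, regardless of how often “Tokyo” and “Ginza” co-occur elsewhere in the document.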
In determining the similarity between two documents, the conventional methods represent a document using a single vector. Such methods have had the problem that the concept represented by the vector is ambiguous and the problem that the spread of the concept cannot be represented. For example, assume that a, b, c, and d each represent a certain term. A document containing the combinations a-b and c-d should be distinguished from another document containing the combinations a-c and b-d because the two documents appear to represent different concepts. However, with the conventional vector representation, the vectors of the two documents would be the same, which makes it difficult to distinguish the two documents. Besides, since a document is usually composed of many sentences and each sentence has its own concept, the concept represented by the document has a spread. It is difficult, however, to represent such a spread of the document concept with a single vector. Thus, since the document concept is not represented precisely in the conventional methods, it has been difficult to obtain the similarity between documents correctly in conventional document retrieval and classification processing.
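The a-b/c-d example can be checked directly. In this sketch each combination is treated as one sentence, which is an assumption made for illustration; the helper names are likewise hypothetical.

```python
from collections import Counter

def document_vector(sentences, vocab):
    # Single term-frequency vector for the whole document.
    counts = Counter(t for s in sentences for t in s)
    return [counts[t] for t in vocab]

def sentence_vectors(sentences, vocab):
    # One term-frequency vector per sentence.
    return [[s.count(t) for t in vocab] for s in sentences]

vocab = ["a", "b", "c", "d"]
doc1 = [["a", "b"], ["c", "d"]]   # combinations a-b and c-d
doc2 = [["a", "c"], ["b", "d"]]   # combinations a-c and b-d

print(document_vector(doc1, vocab), document_vector(doc2, vocab))
print(sentence_vectors(doc1, vocab))
print(sentence_vectors(doc2, vocab))
```

Both documents collapse to the same single vector, while their sentence vectors differ, which shows what information is lost when a document is represented by one vector.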