1. Field of the Invention
The present invention relates to a method of and apparatus for automatically extracting important items, e.g., words, phrases, and/or sentences, from a document, and more particularly to such a method and apparatus wherein the importance of the items is determined from eigenvectors and eigenvalues of a square sum matrix of segments of the document.
2. Description of the Related Art
Considerable research and development have been carried out in the field of document and information search on automatically extracting important items, e.g., words, phrases and sentences, from a document. Techniques for such automatic extraction are roughly divided into heuristic and statistical approaches.
The heuristic approach uses document headline information, in-document positional information, and cue expressions. The document headline information method is based on the concept that “the document title or headline, briefly expressing document content, includes important terms.” The important terms are obtained by excluding unimportant terms, such as articles and prepositions, from the terms in the headline or title. This method is premised on the existence of a title or headline, and is not applicable to a document which does not include a title or headline.
The in-document positional information method relies on the fact that an important sentence is intentionally written in the initial part of a newspaper article or the like, so important terms are extracted from a sentence at the front part of the article. This method can be used only if it is known in advance where the important part of a document is, as in a newspaper article.
The cue expression method is premised on an important sentence beginning with a particular phrase, e.g., “as a result.” In the cue expression method, such a particular phrase is detected by natural-language processing, so the range of extracted important terms is limited to sentences including such phrases. The cue expression method cannot be applied to works or paragraphs that do not contain such a cue expression.
In the well-known statistical approach, an important term is defined as a frequently occurring term in an object document. This method uses in-document occurrence frequency (tf) as a measure of importance. A problem with this method is that a high-frequency term in a document is not necessarily an important term. A so-called tf-idf model has been developed to solve the problem. The tf-idf model is based on the concept that a term occurring in many documents is less important, so that the importance of a term is inversely proportional to the number of documents including the term, and that the importance of the term in a particular document is directly proportional to the number of occurrences of the term in the document. The expression “tf-idf” is defined as the product of tf and idf, where idf is the inverse of df, and df is the number of documents including the term in a corpus in which the object document is included. This model is a well-known approach. However, because the definition is based on the product of in-corpus term importance and in-document term importance, there still remains the problem of how to accurately define in-document importance.
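By way of illustration only, the tf-idf definition given above can be sketched in Python as follows. The function name tf_idf, the pre-tokenized three-document corpus, and the plain whitespace-style tokens are assumptions made for this sketch and are not part of any cited method. The sketch uses idf = 1/df exactly as stated above; practical systems often use a logarithmic variant instead.

```python
from collections import Counter

def tf_idf(corpus, doc_index):
    """Score each term of corpus[doc_index] as tf * idf, with idf = 1 / df.

    corpus is a list of documents, each a list of term strings
    (hypothetical pre-tokenized input assumed for this sketch).
    """
    # df: number of documents in the corpus that include the term
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each term at most once per document

    # tf: number of occurrences of the term in the object document
    tf = Counter(corpus[doc_index])

    # tf-idf as the product of tf and the inverse of df
    return {term: tf[term] / df[term] for term in tf}

# Hypothetical toy corpus; the object document is corpus[0].
corpus = [
    ["the", "eigenvector", "of", "the", "matrix", "eigenvector"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "of", "the", "house"],
]
scores = tf_idf(corpus, 0)
```

In this toy corpus, "the" and "eigenvector" both occur twice in the object document, but "the" appears in all three documents and is therefore scored lower than "eigenvector", which appears in only one. The in-document factor is still a raw frequency, which is the limitation noted above.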
When a document is given as mentioned above, the key question is how the in-document importance of each term is determined. The calculation of in-document importance is premised on using only the information contained in the given document. The foregoing term importance within a corpus is a quantity related to the probability that a term occurs in one document. In-document importance, on the other hand, must be obtained within one document, and should therefore be a measure of the extent to which the term represents the document content, i.e., the document concept. Accordingly, during extraction of important terms/phrases from a document, terms/phrases representative of concepts of the document take top priority. For this reason, the central concepts of a document must be extracted in a way that makes clear the relationship between each term/phrase and those central concepts. In the conventional methods, however, it is not necessarily clear to what degree an extracted important term/phrase reflects the central concepts of a document. Accordingly, it often happens that terms/phrases irrelevant to a document concept are regarded as important, or that terms/phrases which merely have a high frequency are extracted as important terms/phrases.