(1) Field of the Invention
The present invention relates to a related term extraction apparatus, a related term extraction method, and a computer-readable recording medium having a related term extraction program recorded thereon, all of which are suitable for use in extracting related terms from mass-storage document data.
(2) Description of the Related Art
The most common existing practice for extracting related terms from document data is to manually extract terms that can be considered related and compile a list of the thus-extracted terms, or to prepare a related-term list by using a manually prepared thesaurus.
Several techniques for preparing a related-term list, which will be described later, have already been proposed as methods of automatically extracting related terms through use of a computer without requiring manual extraction operations.
One of the techniques involves preparation of a related-term list on the basis of the occurrence frequency of two related terms, i.e., the frequency of two terms cooccurring with each other in document data. The range within which two terms are defined as cooccurring with each other is set to various values, e.g., a range of within a few words, a range of within tens of words, a duration of within one minute, or a range of within one paragraph.
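The cooccurrence counting described above can be sketched as follows. This is a minimal illustration rather than any particular prior-art implementation: the function name `cooccurrence_counts`, the whitespace tokenization, and the choice of a window measured in words are all assumptions made for the example.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count unordered term pairs that appear within `window` tokens of each other.

    `window` plays the role of the cooccurrence range (here: a number of words).
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at most `window` tokens from position i.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # Sort the pair so that (a, b) and (b, a) share one counter.
                counts[tuple(sorted((left, right)))] += 1
    return counts

# Hypothetical sample text, tokenized naively on whitespace.
tokens = "the cat sat on the mat while the cat slept".split()
counts = cooccurrence_counts(tokens, window=2)
```

Pairs with the highest counts would then be treated as candidate related terms. A range expressed as a paragraph or a time span would need a different way of grouping the tokens.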
Besides the technique of simply aggregating the frequencies of two terms cooccurring with each other and determining the terms having high cooccurrence frequencies to be related terms, the following techniques are also employed.
Specifically, in a technique that has already been proposed, a set of keywords (or a group of terms) is determined beforehand, and the frequencies of each keyword occurring with other terms are aggregated. A related-term list is prepared through such aggregation operations.
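A keyword-driven aggregation of this kind might look like the sketch below. The keyword set, the symmetric window, and the name `keyword_cooccurrences` are hypothetical choices for illustration; the text above does not specify them.

```python
from collections import Counter, defaultdict

def keyword_cooccurrences(tokens, keywords, window=5):
    """For each predetermined keyword, count the terms found within `window` tokens."""
    related = defaultdict(Counter)
    for i, term in enumerate(tokens):
        if term not in keywords:
            continue  # only predetermined keywords start a count
        start = max(0, i - window)
        # Terms on both sides of the keyword occurrence.
        neighbours = tokens[start:i] + tokens[i + 1 : i + 1 + window]
        for other in neighbours:
            if other != term:
                related[term][other] += 1
    return related

# Hypothetical sample text and keyword set.
tokens = "stock prices fell as bond prices rose and stock volume fell".split()
related = keyword_cooccurrences(tokens, keywords={"stock", "bond"}, window=2)
```

Sorting each keyword's counter by count (for example with `related["stock"].most_common()`) gives a related-term list per keyword.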
In another technique that has already been put forth, a document (or a written item) for which a related-term list will be prepared is subjected to morphological analysis, so that the part of speech of each term is determined. Subsequently, functional words are removed from the document, or the frequencies of only each content word cooccurring with other terms are aggregated. A related-term list is prepared through such aggregation operations.
In still another technique that has already been put forth, the frequencies of terms cooccurring with a specified term in a document are examined, and both the terms that cooccur with the specified term at high frequencies and the terms that cooccur with it at low frequencies are removed while the related-term list is being prepared.
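The frequency-band filtering just described can be sketched as follows. The thresholds `low` and `high`, the function name, and the sample counts are all hypothetical; choosing suitable cut-off levels is left open by the description.

```python
def filter_by_frequency(cooccurring, low, high):
    """Keep only terms whose cooccurrence count c satisfies low <= c <= high."""
    return {term: c for term, c in cooccurring.items() if low <= c <= high}

# Hypothetical counts of terms cooccurring with one specified term.
counts_with_target = {"the": 120, "of": 95, "engine": 14, "piston": 9, "zebra": 1}

# Drop very frequent terms (likely function words) and very rare ones (likely noise).
kept = filter_by_frequency(counts_with_target, low=2, high=50)
```

With these assumed thresholds only "engine" and "piston" survive. Different values of `low` and `high` would keep a different set of terms.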
In yet another technique that has already been put forth, terms having special relationships are determined through syntax analysis, and the frequencies of the thus-determined terms cooccurring with each other are aggregated. A related-term list is prepared through such aggregation operations.
Apart from the technique of using the frequency of two terms cooccurring with each other, as is, as the criterion for determining whether or not these terms are related to each other, there has already been proposed another technique (hereinafter referred to as "technique A") which uses a value called mutual information.
Here, the mutual information (or transferred information) represents the difference between the information transferred as a result of ascertaining the occurrence of an event "x" and the conditional information transferred as a result of ascertaining the occurrence of the event "x" on condition that another event "y" has occurred. Mathematically, the mutual information is defined for a pair of events xi, yi, where xi designates an input message and yi designates an output message. Taking p(xi, yi) as the joint probability of occurrence of events xi and yi; p(xi|yi) as the probability of an event xi occurring on condition that an event yi has occurred; p(yi|xi) as the probability of an event yi occurring on condition that an event xi has occurred; p(xi) as the probability of occurrence of an event xi; and p(yi) as the probability of occurrence of an event yi, the mutual information (or transferred information) T(xi|yi) relating to the pair of events xi, yi is given by Equation 1 provided below.

$$T(x_i \mid y_i) = \log \frac{p(x_i \mid y_i)}{p(x_i)} = \log \frac{p(y_i \mid x_i)}{p(y_i)} = \log \frac{p(x_i, y_i)}{p(x_i)\,p(y_i)} \qquad (1)$$
It is also conceivable that the degree of association of a specified term xi with a corresponding term yi can be obtained by calculating the mutual information T(xi|yi) from the expression defined by Eq. 1, and that a related-term list can be prepared from the values thus obtained.
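As a rough sketch of such a calculation, the function below estimates the mutual information of a term pair from counts. It assumes that Eq. 1 takes the standard form log[p(xi, yi) / (p(xi) p(yi))], that the probabilities are estimated as relative frequencies over `total` observation windows, and that the logarithm is base 2. The function name and the sample counts are hypothetical.

```python
import math

def mutual_information(pair_count, x_count, y_count, total):
    """Estimate T(x|y) = log2( p(x, y) / (p(x) * p(y)) ) from raw counts.

    pair_count: windows in which x and y cooccur
    x_count, y_count: windows in which x (resp. y) occurs
    total: total number of windows observed
    """
    if pair_count == 0:
        return float("-inf")  # the pair never cooccurs; log of zero
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: x in 40 windows, y in 50, both together in 20, out of 1000.
mi = mutual_information(pair_count=20, x_count=40, y_count=50, total=1000)
```

A positive value indicates that the two terms cooccur more often than they would if they were independent. A value near zero indicates roughly independent occurrence.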
Manual preparation of a related-term list is laborious and adds to preparation cost. Further, for the related-term list to deal with new terms, such a manual-extraction technique requires preparation of a new related-term list each time new terms appear.
Even the computer-based method that relies on a predetermined set of keywords still requires the keywords to be determined beforehand.
The method that deletes functional words or extracts only content words requires information on the part of speech of each term, which must be acquired through a technique such as morphological analysis.
The method that eliminates related terms of high and low occurrence frequencies encounters difficulty in identifying the terms to be eliminated, that is, in determining whether a term's occurrence frequency is above a certain level or below a certain other level.
The technique that requires syntax analysis becomes troublesome to an extent corresponding to the labor required for the syntax analysis.
In the techniques that require morphological analysis or syntax analysis, the analysis must also perform sufficiently well. Further, to ensure sufficient performance, a dictionary or a grammatical database must be continually updated.
Technique A, which prepares a related-term list through use of the expression for the mutual information T(xi|yi) shown in Eq. 1, is not necessarily required to determine beforehand the items to be subjected to related-term search operations or to process a document through morphological analysis. However, since technique A depends on the sequence in which terms appear, it can prepare only a related-term list that depends on that sequence, thus posing the problem that the user encounters considerable difficulty in understanding the related-term list prepared from the mutual information.