1. Field of the Invention
The present invention relates to a computerized information retrieval apparatus, and more particularly to a keyword extraction apparatus for extracting, from text data stated in Japanese, keywords, i.e., terms useful for retrieving information from those Japanese text data.
2. Description of the Prior Art
In compiling Japanese text data into a database, keywords are extracted from the Japanese text data and the extracted keywords are assigned to the Japanese text data as annexed information. An increase in the volume of Japanese text data to be compiled into a database necessitates more efficient extraction and assignment of such keywords. To meet this need, there have been developed keyword extraction apparatuses for computerized automatic extraction of keywords. Such apparatuses are known as keyword extraction systems.
Conventional apparatuses of this kind for extracting keywords from Japanese texts include, for example, the keyword extraction apparatus disclosed in the Japanese Koukai No. 3-135669, which extracts keywords according to the presumed importance of each keyword based on the frequency of its appearance.
Referring to FIG. 20, this conventional keyword extraction apparatus for Japanese texts performs the following processing.
First, the Japanese text data to be processed are divided into individual words (step 20-1). This processing is known as sentence segmentation. A dictionary for word-by-word division is used for this division into words.
Next, phrases are segmented on the basis of words or the like which mark off `bun-setsu` (this Japanese term is substantially equivalent to "phrase", and will hereinafter be referred to as "phrase" or "phrases" instead) (step 20-2), and phrases which seem to be the "subject phrase", the "object phrase" and the like are extracted as "important phrases" (step 20-3). To identify "important phrases", attention is paid to the notation of the postpositional particle at the end of each phrase. More specifically, phrases ending with が ("ga"), は ("wa"), を ("wo"), で ("de"), や ("ya") or も ("mo") in Japanese `hiragana` (one of the two syllabaries) are regarded as "important phrases".
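The particle test of step 20-3 can be sketched as a plain suffix check. This is only an illustrative sketch, not the implementation of the cited apparatus: the function names are hypothetical, the phrases are assumed to have been segmented already, and the word 計算機 ("computer") in the usage example is an assumed spelling chosen for illustration.

```python
# Particles listed in the text: ga, wa, wo, de, ya, mo.
IMPORTANT_PARTICLES = ("が", "は", "を", "で", "や", "も")


def is_important_phrase(phrase: str) -> bool:
    """Return True when the phrase ends with one of the listed particles.

    The check looks only at the written form (notation) of the phrase ending.
    """
    return phrase.endswith(IMPORTANT_PARTICLES)


def important_phrases(phrases):
    """Keep only the phrases that pass the particle test, in original order."""
    return [p for p in phrases if is_important_phrase(p)]


if __name__ == "__main__":
    # Hand-segmented phrases (an assumption; no segmenter is included here).
    print(important_phrases(["計算機が", "データを", "処理する"]))
```

Because the test depends purely on the final characters, a phrase ending in によって ("niyotte") is rejected even when it plays the same role as one ending in で ("de").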
Then, out of the phrases extracted as "important phrases", keywords are extracted (step 20-4). More specifically, nouns in the "important phrases" and nouns having appeared twice or more in the Japanese text data to be processed are extracted as keywords. However, nouns identified as unnecessary words (such as nouns consisting of one character each, or wholly of numerals, or including `hiragana` letters) are excluded from keywords.
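The selection rule of step 20-4 can be sketched as follows. The sketch assumes that nouns have already been identified by an earlier step; the function names and the sample nouns are hypothetical, and the numeral test uses Python's `str.isdigit`, which is only one possible reading of "wholly of numerals".

```python
import re
from collections import Counter

# Unicode hiragana block (U+3041 to U+309F).
HIRAGANA = re.compile(r"[\u3041-\u309f]")


def is_unnecessary(noun: str) -> bool:
    """Unnecessary words: one character, all digits, or containing hiragana."""
    return (
        len(noun) == 1
        or noun.isdigit()
        or HIRAGANA.search(noun) is not None
    )


def extract_keywords(nouns_in_important_phrases, all_nouns):
    """Nouns from important phrases plus nouns seen twice or more,
    minus the unnecessary words."""
    counts = Counter(all_nouns)
    candidates = set(nouns_in_important_phrases)
    candidates.update(n for n, c in counts.items() if c >= 2)
    return {n for n in candidates if not is_unnecessary(n)}


if __name__ == "__main__":
    all_nouns = ["計算機", "データ", "処理", "処理", "川", "2024", "もの"]
    important = ["計算機", "データ", "川"]
    print(extract_keywords(important, all_nouns))
```

In this usage example 処理 is kept because it appears twice, while 川 is dropped as a one-character noun even though it occurs in an important phrase.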
The extracted keywords are then weighted according to their importance and narrowed down (step 20-5). That is, the degree of importance of each keyword extracted at step 20-4 is calculated from its frequency of appearance and its position in the Japanese text data, and only those keywords whose degrees of importance are above a certain level are selected as true keywords.
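The weighting of step 20-5 might look like the sketch below. The text does not give the actual formula, so everything here is an assumption made for illustration: each occurrence adds a weight, occurrences near the start of the text (the hypothetical `lead_length` parameter) count more, and the weights and threshold are arbitrary.

```python
from collections import defaultdict


def score_keywords(occurrences, lead_length=10, lead_weight=2.0, body_weight=1.0):
    """Accumulate a score per keyword from (keyword, word_position) pairs.

    Occurrences whose position falls before lead_length receive lead_weight;
    all others receive body_weight. Both weights are illustrative assumptions.
    """
    scores = defaultdict(float)
    for word, position in occurrences:
        scores[word] += lead_weight if position < lead_length else body_weight
    return dict(scores)


def select_keywords(scores, threshold):
    """Keep the keywords whose score is above the threshold."""
    return {word for word, score in scores.items() if score > threshold}


if __name__ == "__main__":
    occurrences = [("計算機", 0), ("計算機", 25), ("データ", 30)]
    scores = score_keywords(occurrences)
    print(scores)
    print(select_keywords(scores, threshold=2.0))
```

Since the score is a running sum, every extra occurrence raises it regardless of how long the text is.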
In the above-described keyword extraction apparatus according to the prior art, attention is paid to the notations of the Japanese postpositional particles or the like in extracting "important phrases" from Japanese text data. More specifically, such postpositional particles as が ("ga"), は ("wa"), を ("wo"), で ("de"), や ("ya") and も ("mo") are looked for, and if any of these particles is at the end of a phrase, that phrase is identified as an "important phrase".
By this technique, in a Japanese clause `` (computer "de syori suru") meaning "processed by a computer", the phrase `` (computer "de") having the postpositional particle で ("de") is identified as an "important phrase". On the other hand, in another Japanese clause `` (computer "niyotte syori suru") also meaning "processed by a computer", the phrase `` (computer "niyotte") is not perceived as an "important phrase" because によって ("niyotte") is not a postpositional particle. In a Japanese sentence `` (computer "ga" data "wo syori suru") meaning "A computer processes data," which is an active sentence, the phrase (computer "ga") is identified as an "important phrase". However, in another Japanese sentence in the passive voice `` meaning "Data are processed by a computer," i.e., having a meaning similar to that of the active sentence quoted above, `` (data "ga") is identified as an "important phrase" but `` (computer "niyotte") is not.
In this manner, the conventional keyword extraction apparatus takes note of Japanese notations in identifying "important phrases". For this reason, it involves the problem that phrases of similar meanings appearing in sentences of similar meanings are sometimes extracted as keywords and at other times not.
Moreover, the conventional keyword extraction apparatus calculates the degrees of importance according to the frequencies of appearance in extracting keywords. In extracting keywords according to the frequencies of appearance, phrases with higher frequencies of appearance should generally be selected as keywords. Therefore, in the conventional keyword extraction apparatus, if a given phrase appears more than once, a value representing its importance is cumulatively added every time it appears, and the degree of importance of that phrase is calculated accordingly. For instance, in a Japanese sentence 日本の川の中で一番長い川は信濃川です ("Nihon no kawa no naka de ichiban nagai kawa wa Shinano gawa desu") meaning "The longest river in Japan is the Shinano River," the word 川 ("kawa" or "gawa") meaning "river" appears three times in total. However, as 信濃川 ("Shinano gawa") means "the Shinano River," counting the appearance of 川 three times results in overestimation of the importance of this word.
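The over-counting in the river example can be reproduced with a short sketch. The kanji spelling of the sentence is reconstructed from the romanization given above and is therefore an assumption, as is the hand-made word list; the sketch only contrasts two ways of counting and does not claim to show how any particular apparatus counts.

```python
# Sentence reconstructed from the romanization (an assumption):
# "Nihon no kawa no naka de ichiban nagai kawa wa Shinano gawa desu"
SENTENCE = "日本の川の中で一番長い川は信濃川です"

# Hand-made segmentation in which 信濃川 ("Shinano gawa") stays one word.
TOKENS = ["日本", "の", "川", "の", "中", "で", "一番", "長い", "川", "は", "信濃川", "です"]


def count_by_notation(text: str, word: str) -> int:
    """Count every place the written form occurs, even inside longer words."""
    return text.count(word)


def count_by_word(tokens, word: str) -> int:
    """Count only the tokens that are exactly the given word."""
    return sum(1 for token in tokens if token == word)


if __name__ == "__main__":
    print(count_by_notation(SENTENCE, "川"))  # counts the 川 inside 信濃川 too
    print(count_by_word(TOKENS, "川"))        # ignores the 川 inside 信濃川
```

Counting by notation yields three occurrences, matching the over-estimate described above, whereas counting whole words yields two.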
Thus, the keyword extraction apparatus according to the prior art takes note of simple frequencies of appearance in calculating the degrees of importance with the consequence that importance is assessed irrespective of the theme of the text.
Furthermore, the conventional keyword extraction apparatus calculates the degrees of importance of keywords according to their frequencies of appearance irrespective of the length of the Japanese text data to be processed. For instance, where certain keywords appear in Japanese text data, higher degrees of importance are assigned to keywords with higher frequencies of appearance irrespective of whether the Japanese text data consist of 100 words or 1,000 words. However, the quantity of information represented by Japanese text data is generally considered to expand with an increase in the length of the Japanese text data. Therefore, the length of the Japanese text data is regarded as being proportional to the number of words constituting the data. Thus, the same keyword is considered to contain more important information, irrespective of the frequency of its appearance, when it appears in Japanese text data consisting of 100 words than when appearing in Japanese text data comprising 1,000 words.
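The length argument can be made concrete with a small sketch that divides the frequency of appearance by the number of words in the text. This normalization is an assumption introduced only to illustrate the point; the text does not state a formula, and the function name and the count of five appearances are hypothetical.

```python
def relative_frequency(count: int, total_words: int) -> float:
    """Frequency of appearance divided by text length, measured in words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count / total_words


if __name__ == "__main__":
    # The same keyword appearing five times in texts of different lengths.
    short_text = relative_frequency(5, 100)
    long_text = relative_frequency(5, 1000)
    print(short_text, long_text)
```

With the raw count alone the two texts are indistinguishable; after dividing by length, the keyword weighs ten times more in the 100-word text than in the 1,000-word text.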
As stated above, since the conventional keyword extraction apparatus treats the frequencies of appearance of keywords irrespective of the length of the Japanese text data, it involves the problem that the calculated degrees of importance of keywords may not represent their real importance.
An object of the present invention is to provide, in view of the problems pointed out above, a keyword extracting apparatus for automatically extracting keywords on the basis of the frequency of appearance of each term in Japanese text data and information indicating the meaning of each term in the Japanese text data, and extracting keywords accurately representing the theme of the text.