1. Field of the Invention
The present invention relates to a keyword extraction apparatus, a keyword extraction method and a computer readable recording medium storing a keyword extraction program, which are used in a system for retrieving a document written in natural language to automatically extract keywords from the document beforehand for creating an index of the document in terms of keywords and, at the time of retrieval, to extract a keyword from an input sentence for retrieving the document through collation of the keyword.
2. Description of the Related Art
As a method of retrieving documents in electronic form, it has been hitherto known to previously assign keywords to a document in the form of an index and, at the time of retrieval, to search the document by collating a designated keyword with the keywords assigned to the document. This method has problems in that manually assigning keywords to a document requires a lot of time and labor, and the retrieval cannot work if the keywords assigned by a person who has engaged in creating the index differ from keywords designated by persons who are going to perform retrieval.
For lessening time and labor required to assign keywords, methods of automatically extracting keywords from documents in electronic form have been proposed.
FIG. 64 is a block diagram showing a conventional keyword extraction system disclosed in, for example, Japanese Unexamined Patent Publication No. 8-30627. In FIG. 64, denoted by 6401 is a character type discriminating portion for discriminating types of individual characters in an input text and then transferring the discriminated types to character type storage means 6402. The character type storage means 6402 stores the types and corresponding positions of the individual characters in the input text which have been discriminated by the character type discriminating portion 6401. Denoted by 6403 is an effective-character-type character string cutting portion for cutting out, based on the information stored in the character type storage means 6402, all effective-character-type character strings, each of which is a maximal run of consecutive characters belonging to any of four effective character types, i.e., katakana (the angular Japanese syllabary, as distinct from the cursive hiragana), kanji (Chinese characters), alphabetic characters and numerals.
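As a rough illustration of the character type discriminating portion 6401 and the cutting portion 6403, the following Python sketch classifies each character and cuts out every maximal run of effective characters. The function names, the Unicode ranges and the sample string are assumptions made for this illustration only; none of them is taken from the cited publication.

```python
EFFECTIVE_TYPES = {"katakana", "kanji", "alphabet", "numeral"}

def char_type(ch):
    """Classify one character (assumed Unicode ranges, not from the source)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isascii() and ch.isdigit():
        return "numeral"
    return "other"

def cut_effective_strings(text):
    """Return (start position, string) for each maximal run of effective characters."""
    runs, start = [], None
    for i, ch in enumerate(text):
        if char_type(ch) in EFFECTIVE_TYPES:
            if start is None:
                start = i          # a new run begins here
        elif start is not None:
            runs.append((start, text[start:i]))
            start = None           # the run ended at a non-effective character
    if start is not None:
        runs.append((start, text[start:]))
    return runs

# Hypothetical sample whose character types are hiragana, kanji, kanji,
# hiragana, followed by three katakana characters.
sample = "お絵描きモード"
runs = cut_effective_strings(sample)
```

For the hypothetical sample, the sketch yields two runs, one of kanji and one of katakana, with the hiragana characters acting as separators.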
Denoted by 6406 is a character-type boundary discriminating portion for discriminating all boundary positions between different character types of all the effective-character-type character strings based on the information stored in the character type storage means 6402, and then transferring the discriminated positions to character-type segmentation point storage means 6407. The character-type segmentation point storage means 6407 stores every boundary position, at which the character type changes from one to another, discriminated by the character-type boundary discriminating portion 6406.
Denoted by 6409 is affix storage means for storing affixes of high frequency. 6410 is an affix discriminating portion for discriminating all affixes in a character string and then transferring the discriminated affixes to affix segmentation point storage means 6411. The affix segmentation point storage means 6411 stores, as affix segmentation points, positions before and behind all the affixes discriminated by the affix discriminating portion 6410.
Denoted by 6413 is basic word storage means for storing, as basic words, nouns of high frequency. 6414 is a basic-word discriminating portion for discriminating all basic words in a character string and then transferring the discriminated basic words to basic-word segmentation point storage means 6415. The basic-word segmentation point storage means 6415 stores, as basic-word segmentation points, positions before and behind all the basic words discriminated by the basic-word discriminating portion 6414.
Denoted by 6412 is a partial-character-string cutting portion for cutting out partial character strings based on the character-type segmentation points stored in the character-type segmentation point storage means 6407, the affix segmentation points stored in the affix segmentation point storage means 6411, or the basic-word segmentation points stored in the basic-word segmentation point storage means 6415.
Denoted by 6404 is a noun discriminating portion which, when the character succeeding an effective-character-type character string cut out by the effective-character-type character string cutting portion 6403 is hiragana, compares the succeeding hiragana with hiragana character strings stored in noun-succeeding-hiragana storage means 6405, and then deletes the effective-character-type character string when the head portion of the succeeding hiragana does not match any of the hiragana character strings stored in the noun-succeeding-hiragana storage means 6405.
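The noun discrimination just described can be sketched in Python as follows. This is only a minimal sketch: the function names, the (start position, string) representation of a cut-out string, and the contents of the follower set are hypothetical and stand in for the noun-succeeding-hiragana storage means 6405.

```python
def is_hiragana(ch):
    """True when ch lies in the hiragana block (assumed range)."""
    return "\u3041" <= ch <= "\u309f"

def discriminate_nouns(text, runs, noun_followers):
    """Drop every cut-out string whose succeeding hiragana is not registered.

    runs           -- list of (start position, string) pairs cut out of text
    noun_followers -- hiragana strings assumed to be able to follow a noun
    """
    kept = []
    for start, string in runs:
        rest = text[start + len(string):]
        if rest and is_hiragana(rest[0]):
            # Compare the head of the succeeding hiragana with each registered string.
            if not any(rest.startswith(h) for h in noun_followers):
                continue  # not followed like a noun: delete the string
        kept.append((start, string))
    return kept

# Hypothetical follower set and sample text.
followers = {"は", "が", "を", "の", "に"}
sample = "お絵描きモード"
kept = discriminate_nouns(sample, [(1, "絵描"), (4, "モード")], followers)
```

In this hypothetical run the first string is deleted because the hiragana after it is not in the follower set, while the second string, which has no succeeding hiragana, is kept.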
Denoted by 6416 is a basic-word deleting portion for deleting the partial character string which matches with any of the basic words stored in the basic word storage means 6413.
Denoted by 6417 is a necessary keyword storage means for storing keyword character strings designated beforehand. 6418 is a necessary keyword cutting portion which, when character strings matching with the character strings stored in the necessary keyword storage means 6417 appear in a text, cuts out all those character strings and adds them to keywords.
The operation of the conventional keyword extraction system will be described below. The description will be made on the case of entering a text "{character pullout} (oekaki mohdo=painting mode)", for example.
First, the character type discriminating portion 6401 discriminates types of individual characters in an input text, and the character type storage means 6402 stores the types and corresponding positions of the individual characters in such a way that the first character is hiragana, the second character is kanji, the third character is kanji, the fourth character is hiragana, and so on.
Next, the effective-character-type character string cutting portion 6403 cuts out "{character pullout}" and "{character pullout}". Since there are no differences in character type within the partial character strings of "{character pullout}" and "{character pullout}", character-type segmentation points are not stored in the character-type segmentation point storage means 6407. Also, since no affixes are included in the partial character strings of "{character pullout}" and "{character pullout}", affix segmentation points are not stored in the affix segmentation point storage means 6411. Further, since no basic words are included in the partial character strings of "{character pullout}" and "{character pullout}", basic-word segmentation points are not stored in the basic-word segmentation point storage means 6415.
Then, since "{character pullout}" and "{character pullout}" include none of the character-type segmentation points, the affix segmentation points and the basic-word segmentation points, the partial-character-string cutting portion 6412 eventually cuts out the two partial character strings "{character pullout}" and "{character pullout}".
Subsequently, since hiragana "{character pullout}" succeeding to "{character pullout}" is not registered in the noun-succeeding-hiragana storage means 6405, the noun discriminating portion 6404 deletes "{character pullout}". On the other hand, since there is no hiragana succeeding to "{character pullout}", "{character pullout}" is not deleted in the noun discriminating portion 6404. The basic-word deleting portion 6416 then deletes the basic word which matches with any of those stored in the basic word storage means 6413. If "{character pullout}" is assumed here not to be a basic word, "{character pullout}" would not be deleted.
Next, the necessary keyword cutting portion 6418 cuts out "{character pullout}" from the text "{character pullout}" stored in the necessary keyword storage means 6417 and adds it to keywords. Finally, "{character pullout}" and "{character pullout}" are output.
When "{character pullout}" or "{character pullout}" is designated as a retrieval key at the time of retrieval, the document including the original text "{character pullout}" is retrieved.
In retrieval with the thus-constructed keyword extraction system disclosed in Japanese Unexamined Patent Publication No. 8-30627, a hit occurs only when the character string designated as a keyword completely matches any of the keywords assigned to a document. In retrieval, however, words having similar meaning and pronunciation but different expressions (in written language) must often be taken into account. For example, "{character pullout} (oekaki=painting)" may be entered as a retrieval key rather than "{character pullout}" at the time of retrieval. Thus the keyword extraction system disclosed in Japanese Unexamined Patent Publication No. 8-30627 has a problem that retrieval cannot be effected unless there is a complete match between character strings.
To cope with the problem caused by words having the similar meaning and pronunciation but different expressions, a document retrieval method and apparatus are proposed in Japanese Unexamined Patent Publication No. 8-137892. In the document retrieval method and apparatus proposed in Japanese Unexamined Patent Publication No. 8-137892, when a character string designated upon retrieval is a compound word, the compound word is divided into individual words composing it and synonym expressions for the compound word are created in combinations of synonyms for each of the divided words by using a synonym dictionary.
FIG. 65 is a block diagram of the conventional document retrieval method and apparatus disclosed in Japanese Unexamined Patent Publication No. 8-137892. In FIG. 65, denoted by 6501 is a control unit comprised of a CPU and memory, 6502 is an input unit such as a keyboard or mouse through which the user enters a retrieval keyword and performs retrieval operation, 6503 is a display unit for displaying the retrieval keyword entered through the input unit 6502, the retrieval operation instructed by the user, and retrieved results, 6504 is an external storage unit for storing data to be retrieved, 6505 is a synonym dictionary in which synonym information for retrieved keywords is stored, and 6506 is a segmentation dictionary in which the retrieved keywords are stored. A character string designated for retrieval is segmented based on words registered in the segmentation dictionary 6506.
The operation of the conventional document retrieval method will be described below. FIG. 66 is a flowchart illustrating a flow of processing disclosed in Japanese Unexamined Patent Publication No. 8-137892. The following description will be made on the case of designating, for example, "{character pullout} (bunsho kensaku=document retrieval)*{character pullout} (wahku sutehshon=work station)" (where "*" indicates logical product) as a retrieval formula. It is assumed that "{character pullout}" and "{character pullout}" are registered in the segmentation dictionary. Also, the synonym dictionary is assumed to store such information that "{character pullout}" and "{character pullout} (tekisuto=text)" are synonyms, "{character pullout}" and "{character pullout} (sahchi=search)" are synonyms, and "{character pullout}" and "WS" are synonyms.
In step 6612, a value in a synonym-dictionary usage flag buffer, which sets whether or not the synonym dictionary is used, is checked. Assuming here that the buffer value is set to "1" indicating the use of the synonym dictionary, the processing follows the path indicated by Y.
Next, in step 6613, the retrieval formula is segmented into a character string to be retrieved and a logical formula. Then, in step 6614, the character string to be retrieved is compared with words in the segmentation dictionary for segmentation of a keyword. Subsequently, in step 6615, synonyms which correspond to each of the segmented keywords are extracted from the synonym dictionary.
It is determined in step 6616 whether or not the processing for all keywords has been completed, and the processing of steps 6614 and 6615 is repeated until all keywords are processed.
Next, in step 6617, the synonyms corresponding to the segmented keywords are combined with each other to create retrieval keywords.
Subsequently, in step 6618, the created retrieval keywords are joined by putting a logical sum ("+") between each adjacent pair. As a result, for "{character pullout}", a retrieval formula "({character pullout}+{character pullout}+{character pullout}+{character pullout})" is created in step 6619.
It is then checked in step 6620 whether or not a logical formula storage buffer is empty. The processing now returns to step 6614 to repeat the same processing as explained above for the next character string to be retrieved, i.e., "{character pullout}".
For "{character pullout}", a retrieval formula "({character pullout}+WS)" is created in step 6619.
Although it is checked again in step 6620 whether or not the logical formula storage buffer is empty, the processing now follows the path indicated by Y because no character string remains to be processed. As a result, for the designated retrieval formula "{character pullout}*{character pullout}", "({character pullout}+{character pullout} {character pullout}+{character pullout}+{character pullout})*({character pullout}+WS)" is created as a retrieval formula for use in actual retrieval.
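The flow of steps 6613 to 6619 can be approximated by the Python sketch below. It is a minimal sketch under stated assumptions: the function names are hypothetical, longest-match segmentation is assumed because the cited publication's segmentation rule is not given here, and English stand-ins replace the Japanese words of the example.

```python
from itertools import product

def segment(term, words):
    """Split term at dictionary words (longest match assumed); other text stays whole."""
    parts, pending, i = [], "", 0
    while i < len(term):
        match = max((w for w in words if term.startswith(w, i)), key=len, default="")
        if match:
            if pending:
                parts.append(pending)
                pending = ""
            parts.append(match)
            i += len(match)
        else:
            pending += term[i]
            i += 1
    if pending:
        parts.append(pending)
    return parts

def expand_formula(formula, words, synonyms):
    """Expand each '*'-separated term into a '+'-joined group of synonym combinations."""
    groups = []
    for term in formula.split("*"):
        choices = [[w] + synonyms.get(w, []) for w in segment(term, words)]
        keys = ["".join(combo) for combo in product(*choices)]
        groups.append("(" + "+".join(keys) + ")")
    return "*".join(groups)

# Hypothetical English stand-ins for the dictionaries in the example.
words = {"document", "retrieval"}
synonyms = {"document": ["text"], "retrieval": ["search"], "workstation": ["WS"]}
expanded = expand_formula("documentretrieval*workstation", words, synonyms)
# expanded == "(documentretrieval+documentsearch+textretrieval+textsearch)*(workstation+WS)"
```

The number of keys in each group is the product of the number of alternatives per segmented word, which is why the formula grows quickly as compounds get longer.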
However, the document retrieval method and apparatus disclosed in Japanese Unexamined Patent Publication No. 8-137892 are designed to perform retrieval for character strings created by all possible combinations of different expressions, and hence have a problem that a longer time is required for retrieval as the number of combinations increases.
As another related art for creation of different expressions, Japanese Unexamined Patent Publication No. 3-15980 discloses a different expression and synonym developing method.
FIG. 67 is a block diagram of the different expression and synonym developing method for retrieval of character strings which is disclosed in Japanese Unexamined Patent Publication No. 3-15980. In FIG. 67, denoted by 6711 and 6713 are conversion rule tables for storing conversion rules which instruct a relevant character string in an input character string to be replaced by another character string, and 6712 is a synonym dictionary in which words having the similar meaning but different expressions are collected. Denoted by 6700 is a keyboard, 6701 and 6703 are different expression developing processes for developing a character string into character strings having the similar pronunciation and meaning but different expressions, and 6702 is a synonym developing process for developing a character string into character strings having the similar meaning by using a synonym dictionary 6712.
FIG. 68 shows an outline of the different expression and synonym developing process. A character string 6801 designated by the user is first subjected to different expression development, and a synonym development is then performed on a group of developed character strings 6802 by using the synonym dictionary 6712. After that, another different expression development is performed on a group of character strings 6803 resulting from the synonym development, whereby a group of character strings 6804 is obtained as a final development result. The example of FIG. 68 represents the case where the user designates a character string "{character pullout} (takujougata intafohn=desktop interphone)" under the condition that each of the conversion tables stores rules for converting "{character pullout}(foh)" into "{character pullout} (ho)" and "{character pullout} (gata)" into "{character pullout} (gata)", and the synonym dictionary stores information that "{character pullout}" and "{character pullout}" are synonyms.
Thus, the method disclosed in Japanese Unexamined Patent Publication No. 3-15980 is designed to avoid a retrieval omission by developing various representations of different expressions and synonyms. However, because the disclosed method creates all possible different expressions, it is required to collate an input character string with all the different expressions created by the above-mentioned processing in order to determine whether or not there occurs a match for each word.
The conventional keyword extraction methods for use in retrieval of documents have had problems below because of their constructions described above.
First, in such a conventional automatic keyword extraction process as disclosed in Japanese Unexamined Patent Publication No. 8-30627, character strings appearing in a sentence to be processed are cut out, as they are, to be used as keywords which are assigned in the form of an index to a document. The conventional automatic keyword extraction process cannot therefore perform retrieval for words having the similar meaning and pronunciation but different expressions.
Although techniques to permit retrieval for words having similar meaning and pronunciations but different expressions are disclosed in Japanese Unexamined Patent Publication No. 8-137892 and No. 3-15980, those techniques require a word designated for retrieval to be collated with all possible combinations of individual words composing the designated word which have the similar pronunciation and meaning but different expressions. Thus, there has been a problem that a long time is required for retrieval processing.
Assuming, for example, that words having the similar meaning and pronunciation but different expressions are "{character pullout} (sahbah=server)" for "{character pullout} (sahba=server)" and "{character pullout}", "{character pullout}", "{character pullout}" for "{character pullout}" (each kirikae=switching), a total of eight keywords, i.e., "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", and "{character pullout}" have been created and collated for a keyword "{character pullout}".
Secondly, where a keyword contains a word which succeeds to a prefix and has different expressions, it has been required to create all combinations of the presence/absence of the prefix and the different expressions of the word succeeding to the prefix, and then collate an input keyword with all those combinations.
Assuming, for example, that there are three words having the similar meaning and pronunciation but different expressions, i.e., "{character pullout}", "{character pullout}" and "{character pullout}", for "{character pullout}" (each kirikae=switching), a total of eight keywords, i.e., "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", and "{character pullout}" have been created and collated for a keyword "{character pullout} (zenkirikae=full switching)". Thus, the necessity of collating an input keyword with all of the created keywords has raised a problem that a long time is required for retrieval processing.
Thirdly, where a keyword contains a word which precedes a suffix and has different expressions, it has been required to create all combinations of the presence/absence of the suffix and the different expressions of the word preceding the suffix, and then collate an input keyword with all those combinations.
Assuming, for example, that there are three words having the similar meaning and pronunciation but different expressions, i.e., "{character pullout}", "{character pullout}" and "{character pullout}", for "{character pullout}" (each kirikae=switching), a total of eight keywords, i.e., "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", and "{character pullout}" have been created and collated for a keyword "{character pullout} (kirikaego=after switching)". Thus, the necessity of collating an input keyword with all of the created keywords has raised a problem that a long time is required for retrieval processing.
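The count of eight in each of the three examples above follows from multiplying the number of alternatives per position. The Python sketch below reproduces that arithmetic; the function name is hypothetical, and the romanized strings with the labels v1 to v3 are placeholders for the variant spellings, not the actual characters.

```python
from itertools import product

def enumerate_keys(*slots):
    """Concatenate one choice from every slot, in every possible way."""
    return ["".join(choice) for choice in product(*slots)]

# Hypothetical stand-ins: the original spelling plus three variant spellings.
switching = ["kirikae", "kirikae_v1", "kirikae_v2", "kirikae_v3"]
server = ["sahba", "sahbah"]

compound = enumerate_keys(server, switching)        # 2 x 4 compound-word keys
prefixed = enumerate_keys(["", "zen"], switching)   # prefix absent or present
suffixed = enumerate_keys(switching, ["", "go"])    # suffix absent or present

print(len(compound), len(prefixed), len(suffixed))  # prints: 8 8 8
```

Every additional word or affix position multiplies the number of keys again, so the collation cost grows multiplicatively with the length of the compound.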
Fourthly, the conventional automatic keyword extraction process as disclosed in Japanese Unexamined Patent Publication No. 8-30627 is designed to set a limit on the length of keywords and to delete keywords whose length exceeds the limit. However, such a design may cause a problem of uneven keyword extraction in that, among keywords which have the similar meaning but different expressions and which differ in length, some keywords are extracted but others are deleted.
Assuming, for example, that "{character pullout} (konpyuhta=computer)" and "{character pullout} (konpyuhtah=computer)" are registered as words having the similar meaning and pronunciation but different expressions, and a limit of the keyword length is set to be less than 15 characters, "{character pullout}{character pullout} (konpyuhta ahkitekuchah=computer architecture)" is extracted, but "{character pullout}{character pullout} (konpyuhtah ahkitekuchah=computer architecture)" is deleted.
Stated otherwise, when combinations of a compound word are created in accordance with the method disclosed in Japanese Unexamined Patent Publication No. 8-137892 to cope with retrieval for words having the similar meaning and pronunciation but different expressions, there has been a problem of uneven keyword extraction that, even upon the same retrieval key being designated, documents containing "{character pullout}{character pullout}" are retrieved, but documents containing "{character pullout}{character pullout}" are not retrieved.
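The effect of the length limit can be illustrated with the short Python sketch below. The function name is hypothetical, and the katakana strings are an assumed rendering of the romanized example (they are not reproduced from the source); only the limit of fewer than 15 characters comes from the text.

```python
def apply_length_limit(keywords, limit=15):
    """Keep only keywords shorter than `limit` characters."""
    return [k for k in keywords if len(k) < limit]

# Assumed katakana renderings of the two spellings of "computer architecture".
short_form = "コンピュータ" + "アーキテクチャー"    # 6 + 8 = 14 characters
long_form = "コンピューター" + "アーキテクチャー"   # 7 + 8 = 15 characters

kept = apply_length_limit([short_form, long_form])
```

Under these assumptions only the 14-character spelling survives, although both spellings denote the same term.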
Fifthly, with the conventional keyword extraction process disclosed in Japanese Unexamined Patent Publication No. 8-30627, because character strings appearing in a sentence to be processed are cut out, as they are, to be used as keywords, words having the similar meaning and pronunciation but different expressions are extracted as separate words. Accordingly, there has been a problem that precise frequency totalization which is necessary for, e.g., a keyword weighting process, cannot be achieved for the words having the similar meaning and pronunciation but different expressions.
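The frequency problem can be seen in a few lines of Python. The sketch below uses the standard `collections.Counter`; the function name and the romanized stand-ins for the two spellings of "server" are hypothetical.

```python
from collections import Counter

def count_keywords(keywords):
    """Count keywords exactly as extracted, with no unification of spellings."""
    return Counter(keywords)

# Three occurrences of the same word, written in two different ways.
counts = count_keywords(["sahba", "sahbah", "sahba"])
# counts["sahba"] is 2 and counts["sahbah"] is 1; no entry carries the total of 3.
```

A weighting step that relies on these counts would therefore underestimate how often the word actually occurs.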
Sixthly, in compound words such as "{character pullout}. {character pullout} (yuza intafehsu=user interface)", for example, symbolic characters such as "·" and "/" may be put between individual words composing the compound word; e.g., "{character pullout}. {character pullout}" and "{character pullout}. {character pullout}", in addition to different expressions for each of the individual words composing the compound word; i.e., "{character pullout}" and "{character pullout}". It is therefore required to unify the expression format for compound words.
The conventional keyword extraction process disclosed in Japanese Unexamined Patent Publication No. 8-30627 includes a method of deleting "·" and "/" to unify the expression format for compound words, but it cannot deal with different expressions for each word which have the similar meaning and pronunciation, as described above. Also, Japanese Unexamined Patent Publication No. 8-137892 and No. 3-15980 disclose methods of creating combinations of different expressions for each word which have the similar meaning and pronunciation, but cannot deal with a process needed to unify the expression format for compound words. Accordingly, even if the above conventional techniques are combined with each other, an input keyword must be collated with all possible combinations of different expressions of individual words composing a compound word; hence a problem of requiring a long time for retrieval processing still remains.
Assuming, for example, that "{character pullout} (yuhza=user)" has a different expression "{character pullout} (yuhzah=user)" which has the similar meaning and pronunciation, and "{character pullout} (intafehsu=interface)" has a different expression "{character pullout} (intafeisu=interface)", four expressions "{character pullout}{character pullout}", "{character pullout}{character pullout}", "{character pullout}{character pullout}", and "{character pullout}{character pullout}" would be produced for "{character pullout}. {character pullout}" even if the above conventional techniques are combined with each other. Accordingly, a problem of requiring collation with all those different expressions is encountered.
Seventhly, in the methods disclosed in Japanese Unexamined Patent Publication No. 3-15980 and No. 8-137892, different expressions of a retrieval key, which have the similar meaning and pronunciation, are created at the time of retrieval in combinations of different expressions for each word and character string. As a result, a large number of retrieval keys to be collated are produced and a retrieval speed is reduced.
Furthermore, the methods disclosed in Japanese Unexamined Patent Publication No. 3-15980 and No. 8-137892 have a risk that an improper retrieval key may be produced, in particular when a short word is replaced. For example, because the method disclosed in Japanese Unexamined Patent Publication No. 3-15980 holds a rule that "{character pullout} (tah)" is a different expression of "{character pullout} (ta)", "{character pullout} (intahfohn=interphone)" is created as a different expression of "{character pullout} (intafohn=interphone)" in the step of creating a different expression of "{character pullout} (intafohn=interphone)". However, the rule that "{character pullout} (tah)" is a different expression of "{character pullout} (ta)" can be applied to "{character pullout}", but not to "{character pullout} (takushih=taxi)", for example. It is therefore desirable to avoid short words and instead to store relatively long words, such as compound words, in the different expression dictionary used for replacing one expression with another. Hitherto, there have been no techniques to assist in constructing a different expression dictionary that meets this demand. As a result, a large number of retrieval keys are produced, and a keyword extraction method that realizes high-speed document retrieval has not been achieved.
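The risk of a short replacement rule can be illustrated with the following Python sketch. The function name is hypothetical, and romanized strings stand in for the kana, so the substring matching is only an approximation of how a rule would match actual characters.

```python
def apply_rule(word, source, target):
    """Return the rewritten word, or None when the rule does not apply."""
    return word.replace(source, target) if source in word else None

# A short rule fires on both words, including the one it should not touch.
intended = apply_rule("intafohn", "ta", "tah")   # "intahfohn"
improper = apply_rule("takushih", "ta", "tah")   # "tahkushih", an improper key

# A longer dictionary entry (assumed here) leaves the unrelated word alone.
safe = apply_rule("takushih", "intafohn", "intahfohn")   # None
```

Storing the longer entry trades coverage for precision: it rewrites only the word it was written for, which is the behavior the text says a different expression dictionary should have.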