Analogousness (similarity) in meaning between words is used for various purposes in natural language processing technology.
For example, one technique of translation processing prepares a large number of bilingual illustrative sentences in advance, searches them for the illustrative sentence most analogous to an input sentence, and modifies the bilingual illustrative sentence thus found to generate a translation of the input sentence. In such a method, the analogousness between the input sentence and a bilingual illustrative sentence is calculated on the basis of the analogousness between the respective words constituting the input sentence and the corresponding words constituting the bilingual illustrative sentence (word analogousness).
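The retrieval step described above can be sketched as follows. This is a minimal illustration only: the table WORD_ANALOGOUSNESS, the function names, the position-by-position word correspondence, and the averaging of word analogousness values are all assumptions made for the example and are not taken from the method itself. The translations are placeholder strings, and the subsequent modification of the retrieved sentence is not shown.

```python
# Hypothetical word-analogousness table (values are illustrative only).
WORD_ANALOGOUSNESS = {
    frozenset(["dog", "cat"]): 0.5,
    frozenset(["eats", "drinks"]): 0.5,
}


def word_analogousness(w1, w2):
    """Return 1.0 for identical words, else a table value (0.0 if absent)."""
    if w1 == w2:
        return 1.0
    return WORD_ANALOGOUSNESS.get(frozenset([w1, w2]), 0.0)


def sentence_analogousness(input_words, example_words):
    """Average word analogousness over position-aligned words.

    Assumption: words correspond position by position; positions without
    a partner contribute 0 because we divide by the longer length.
    """
    longest = max(len(input_words), len(example_words))
    if longest == 0:
        return 0.0
    total = sum(word_analogousness(a, b)
                for a, b in zip(input_words, example_words))
    return total / longest


def most_analogous_example(input_sentence, examples):
    """Return the (source, translation) pair most analogous to the input."""
    input_words = input_sentence.split()
    return max(
        examples,
        key=lambda pair: sentence_analogousness(input_words, pair[0].split()),
    )


# Placeholder bilingual illustrative sentences (translations are stand-ins).
EXAMPLES = [
    ("the dog eats meat", "<translation of 'the dog eats meat'>"),
    ("the car needs fuel", "<translation of 'the car needs fuel'>"),
]

best = most_analogous_example("the cat eats meat", EXAMPLES)
print(best[0])
```

In this sketch the input differs from the first illustrative sentence in only one word, and that word pair has a non-zero table value, so the first sentence is retrieved.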
Known methods of calculating the word analogousness include a method using a thesaurus in the form of a tree or network and a method using co-occurrence information of words in sentences.
In the method using a thesaurus, for example, the number of arcs constituting the shortest path in the thesaurus between the nodes corresponding to the two words whose word analogousness is to be calculated is determined, and the reciprocal of that number of arcs is taken as the word analogousness. In the method using co-occurrence information, co-occurrence information of the words appearing in a large number of sentences is registered, and the word analogousness is determined on the basis of a statistical quantity obtained from that co-occurrence information (a statistical quantity of the words that readily co-occur with each of the two words whose word analogousness is to be calculated).
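Both calculations can be sketched briefly. The thesaurus THESAURUS, the sample sentences, and all function names below are hypothetical. The text does not specify which statistical quantity is used for co-occurrence, so the cosine of co-occurrence count vectors is substituted here purely as a simple stand-in. Returning 1.0 for identical words (zero arcs) is likewise an assumption, since the reciprocal is undefined in that case.

```python
from collections import Counter, deque
from math import sqrt

# Hypothetical toy thesaurus as an undirected graph (node -> neighbours).
THESAURUS = {
    "entity": ["animal", "artifact"],
    "animal": ["entity", "dog", "cat"],
    "artifact": ["entity", "car"],
    "dog": ["animal"],
    "cat": ["animal"],
    "car": ["artifact"],
}


def arc_count(graph, start, goal):
    """Breadth-first search: number of arcs on the shortest path, or None."""
    if start not in graph or goal not in graph:
        return None  # at least one word is unregistered
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path between the two nodes


def thesaurus_analogousness(graph, w1, w2):
    """Reciprocal of the shortest-path arc count; None if not computable."""
    n = arc_count(graph, w1, w2)
    if n is None:
        return None
    if n == 0:
        return 1.0  # assumption: identical words get the maximum value
    return 1.0 / n


def cooccurrence_vectors(sentences):
    """Register, for each word, counts of words sharing a sentence with it."""
    vectors = {}
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            context = vectors.setdefault(word, Counter())
            for j, other in enumerate(words):
                if i != j:
                    context[other] += 1
    return vectors


def cooccurrence_analogousness(vectors, w1, w2):
    """Cosine of co-occurrence vectors (stand-in statistic); None if unregistered."""
    if w1 not in vectors or w2 not in vectors:
        return None
    v1, v2 = vectors[w1], vectors[w2]
    dot = sum(v1[k] * v2[k] for k in v1)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0


VECTORS = cooccurrence_vectors(["dog eats meat", "cat eats meat", "car needs fuel"])
print(thesaurus_analogousness(THESAURUS, "dog", "cat"))   # 2 arcs -> 0.5
print(cooccurrence_analogousness(VECTORS, "dog", "car"))  # no shared context -> 0.0
```

Both functions return None for a word that is absent from the thesaurus or from the registered co-occurrence information, so in this sketch no value is available for such words.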
The method of calculating the word analogousness using a thesaurus is described in detail in, e.g., Jin Iida, “Cancellation of use initiative ambiguity of modification destination of English prepositional phrase”, Bulletin of Electronic Information Communication Society D-II, Vol. J77-D-II, No. 3, pp. 557–565, 1994. The method of calculating the word analogousness using co-occurrence information is described in detail in, e.g., Donald Hindle, “Noun classification from predicate-argument structures”, Proceedings of Annual meeting of the Association for Computational Linguistics, pp. 268–275, 1990.
In the methods using a thesaurus or co-occurrence information, however, the word analogousness cannot be calculated for words that are not registered in the thesaurus or for which co-occurrence information is not registered (hereinafter referred to as unregistered words as appropriate). Accordingly, to realize rich linguistic ability in a language processing system that carries out language processing by using a thesaurus or co-occurrence information, it is necessary to carry out learning with a vast number of learning samples in order to generate a dictionary in which the thesaurus or the co-occurrence information is registered.
It is desirable, however, for a language processing system to learn flexibly and efficiently from only a small number of learning samples while still realizing rich linguistic ability. To achieve this, the word analogousness must be calculable between unregistered words and learned words, and the analogousness between a word train including unregistered words and a word train obtained from the learned grammatical rules (word train analogousness) must also be calculable.
On the other hand, it is described in, e.g., Naoki Fukui, “Development of minimum model-oriented to explanatory theory of language”, Iwanami lecture, Science of language 6, generation grammar, Chapter 4, Iwanami bookstore, 1998, that the operation by which a human being arranges a set of plural words in a suitable order in conformity with grammar lies at the root of the mental operations underlying human language ability. Elucidating the mechanism of that mental function is treated as an important research theme in theoretical linguistics.
In addition, a function that generates arrangements of words similar to those produced by human beings is also desired in the development of systems that realize (simulate) all or part of the human language function.
At present, however, the word analogousness and the word train analogousness cannot be calculated at the initial stage of learning in a language processing system, i.e., at a stage where the grammatical rules have been learned only insufficiently or where the words to be processed have not been given as learning samples. It is therefore difficult to obtain a suitable arrangement of words (word train). In addition, if the system can output only word trains obtained from insufficient grammatical rules, its linguistic expressive ability is restricted.
For these reasons, a technique is required that calculates the word analogousness between unregistered words, i.e., words not registered in the dictionary obtained by learning, and the words registered in that dictionary (hereinafter referred to as registered words as appropriate), and that permits clustering of words on the basis of that word analogousness, so that the learned grammar acquires greater generality.