The present invention relates generally to natural language processing techniques for use in computer application systems such as word processors, machine translation systems and interactive systems, and more particularly to an apparatus and method of building and updating a semantic analysis co-occurrence dictionary and an apparatus and method of analyzing co-occurrences and meanings.
Recently, various computer application systems have been researched and developed on the basis of natural language processing techniques, and some of these systems are gradually becoming established in our language culture. Particularly, in Japan, the progress of the kana-kanji conversion technique allows easy input of sentences, comprising a mixture of kanji and kana, to computers, whereby text processing software on Japanese word processors and personal computers is widely used. However, we still do not have an effective means to represent and process the meaning of words and the semantic relation between words for selecting the correct word from among homonyms in the kana-kanji conversion. At the present stage, it is common practice in machine translation or the like to process the meaning of words in accordance with the semantic analysis technique based on the case grammar described by C. J. Fillmore and to use semantic labels in the co-occurrence analysis. A description will be made hereinbelow with reference to FIGS. 8 to 13 in terms of a conventional co-occurrence analysis method using the semantic label, a conventional semantic analysis method using this conventional co-occurrence analysis method, and a conventional co-occurrence dictionary building and updating method necessary for these analyses.
FIG. 8 is a block diagram showing one example of a Japanese sentence analysis apparatus based on the conventional semantic analysis method. In FIG. 8, numeral 701 represents an inputting means for inputting a sentence to be analyzed, 702 designates a morphological analysis means for dividing the inputted sentence into a list (string) of morphemes, 703 denotes a morpheme dictionary to be retrieved by the morphological analysis means 702 when performing the morphological segmentation, 704 depicts a connection rule to be used by the morphological analysis means 702 when performing the connection test between the morphemes, 705 indicates a syntactic analysis means for inputting the list (string) of morphemes from the morphological analysis means 702 to analyze the syntactic structure and output the syntactic tree, 706 represents a context-free grammar rule to be used by the syntactic analysis means 705 when performing the syntactic structure analysis, 707 designates a semantic analysis means for inputting the syntactic tree from the syntactic analysis means 705 to perform the case analysis and output the semantic structure, 708 denotes a verbal case dictionary to be used by the semantic analysis means 707, 709 depicts a noun semantic label dictionary to be used by the semantic analysis means 707, and 710 indicates a semantic structure storing means for storing a semantic structure centered on the case frame produced by the semantic analysis means 707, which is referred to and operated on by an external apparatus. The noun semantic label dictionary 709 used for the semantic analysis describes the meaning of each noun in the morpheme dictionary 703 with one or more semantic labels in accordance with the semantic classification standard shown in FIG. 11, and has the contents shown in FIG. 12.
Further, the verbal case dictionary 708 divides the meaning of each of the verbs in the morpheme dictionary 703 into one or more case patterns and describes them as illustrated in FIG. 13. As in the noun semantic label dictionary 709, the meaning of the noun co-occurring with each case slot is described with one or more semantic labels in accordance with the semantic classification standard shown in FIG. 11.
The operation of the conventional sentence analysis apparatus thus arranged will be described hereinbelow for the case of analyzing the typed sentence "A B C V ". First, the typed sentence "A B C V " is supplied as a character string through the inputting means 701 to the morphological analysis means 702. The morphological analysis means 702 performs the morphological segmentation process from the beginning of the sentence toward the end of the sentence. If a morpheme coincident with a portion of the inputted character string is found by the retrieval of the morpheme dictionary 703, the possibility of its connection to the immediately preceding morpheme is checked against the connection rule 704. If the connection is possible, the morphological segmentation process is continued on the portion of the inputted character string subsequent to the found portion. If a plurality of coincident morphemes are found by the retrieval of the morpheme dictionary 703, priority is given among them in accordance with a heuristic method such as the maximum coincidence (longest match) and the minimum number of clauses. Thus, the following list (string) of morphemes up to the end of the sentence can be obtained.
"A (noun), (case post-positional particle), B (noun), (case post-positional particle), C (noun), (case post-positional particle), V (verb), (ending of verb), (ending of verb), (past auxiliary verb)"
The aforementioned morpheme string is supplied to the syntactic analysis means 705, which analyzes the syntactic structure to obtain a syntactic tree as illustrated in FIG. 14. From this syntactic tree, it is understood that all of the three post-positional phrases "A ", "B " and "C " are connected or applied to the verb phrase "V ".
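The morphological segmentation step described above (dictionary retrieval, connection test against the preceding morpheme, and priority by maximum coincidence) can be sketched as follows. This is only a minimal illustration under stated assumptions: the romanized dictionary entries, the part-of-speech names, the connection pairs and the function name `segment` are all hypothetical, and backtracking is used here as one simple way to recover when a longest match leads to a dead end; the conventional apparatus may resolve such cases differently.

```python
# Hypothetical toy morpheme dictionary: surface form -> part of speech.
MORPHEME_DICT = {"a": "noun", "ab": "noun", "ga": "particle", "v": "verb", "ta": "aux"}

# Hypothetical connection rule: allowed (previous POS, next POS) pairs.
# "BOS" marks the beginning of the sentence.
CONNECTION_RULE = {
    ("BOS", "noun"),
    ("noun", "particle"),
    ("particle", "noun"),
    ("particle", "verb"),
    ("verb", "aux"),
}


def segment(text, start=0, prev_pos="BOS"):
    """Split text[start:] into (surface, POS) pairs, or return None on failure."""
    if start == len(text):
        return []
    # Maximum coincidence: try the longest candidate first.
    for end in range(len(text), start, -1):
        surface = text[start:end]
        pos = MORPHEME_DICT.get(surface)
        # Skip candidates missing from the dictionary or failing the connection test.
        if pos is None or (prev_pos, pos) not in CONNECTION_RULE:
            continue
        rest = segment(text, end, pos)
        if rest is not None:
            return [(surface, pos)] + rest
    return None


print(segment("abgavta"))
```

With the toy data, "ab" is preferred over "a" at the start of the string because the longer match is tried first, and a string that cannot be segmented under the connection rule yields `None`.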
The syntactic tree illustrated in FIG. 14 is supplied to the semantic analysis means 707, which performs the semantic analysis of the inputted sentence in accordance with the procedure illustrated in FIG. 9, which shows a procedure for the semantic analysis of a sentence "A B C V ". First, the case patterns of the verb "V" are obtained by retrieving the verbal case dictionary 708, and the semantic labels respectively corresponding to the nouns "A", "B" and "C" are obtained by retrieving the noun semantic label dictionary 709 (step 801). Next, for each of the case patterns of the verb V, it is checked, in accordance with the co-occurrence analysis procedure illustrated in FIG. 10, whether the noun of each of the post-positional phrases co-occurs with the corresponding case slot on the basis of its semantic labels. That is, only the case patterns with which all the three nouns co-occur are selected as candidates, the best case pattern is then selected on the basis of the priority between the case patterns, the filling degree of the case slots and other criteria, and information such as the tense and the voice is added to the selected case pattern, which is in turn outputted as the semantic structure (steps 802 to 812).
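The candidate selection just described can be sketched as follows, as a simplified illustration rather than the procedure of FIG. 9 itself. The data layout, the label and case names, and the function name `select_case_pattern` are hypothetical. The ranking (filling degree first, then dictionary order as the priority) is an assumption, since the text does not state how the criteria are combined, and optional cases are omitted for brevity.

```python
def select_case_pattern(case_patterns, phrases):
    """Return the index of the best case pattern, or None if no pattern fits.

    case_patterns: list of dicts mapping case name -> set of semantic labels,
                   listed in dictionary priority order (assumed).
    phrases: list of (case name, set of noun semantic labels) pairs,
             one per post-positional phrase.
    """
    candidates = []
    for index, pattern in enumerate(case_patterns):
        # Keep only the patterns with which every noun co-occurs.
        if all(case in pattern and pattern[case] & labels for case, labels in phrases):
            # Filling degree: fraction of the pattern's case slots that are filled.
            filled = len({case for case, _ in phrases}) / max(len(pattern), 1)
            candidates.append((-filled, index))
    if not candidates:
        return None
    # Highest filling degree wins; ties fall back to dictionary order.
    return min(candidates)[1]


# Hypothetical case patterns for one verb.
patterns = [
    {"agent": {"HUMAN"}, "object": {"FOOD"}, "goal": {"PLACE"}, "instrument": {"TOOL"}},
    {"agent": {"HUMAN"}, "object": {"FOOD"}, "goal": {"PLACE"}},
    {"agent": {"HUMAN"}, "object": {"ABSTRACT"}},
]
phrases = [("agent", {"HUMAN"}), ("object", {"FOOD"}), ("goal", {"PLACE"})]
print(select_case_pattern(patterns, phrases))
```

In this toy data the first two patterns both accept all three nouns, and the second is chosen because all of its slots are filled.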
In the co-occurrence analysis procedure, as illustrated in FIG. 10 which shows a procedure of the analysis as to whether or not the noun N, being the C case, co-occurs with the case pattern P of the verb V, it is first checked whether the C case is among the cases of the case pattern P (step 901). If the C case exists therein, it is checked whether there is a common semantic label between the group of semantic labels in the case slot of the C case of the case pattern P and the group of semantic labels of the noun N (step 902). If a common semantic label exists, the decision of co-occurrence is made (step 903), and if not, the decision of no co-occurrence is made (step 904). Further, if there is no C case among the cases of the case pattern P, it is checked whether the C case can be taken as an optional case such as the time and the place (step 905). If not, the decision of no co-occurrence is made (step 904). If so, the case slot information of the optional case, which does not depend on the verb, is retrieved so as to check whether there is a common semantic label between the group of semantic labels in the optional case slot and the group of semantic labels of the noun (step 906). If a common semantic label exists, the decision of co-occurrence is made (step 903). Otherwise, the decision of no co-occurrence is made (step 904).
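The decision procedure of steps 901 to 906 can be sketched as a short function. This is a minimal sketch under assumptions: case slots and noun meanings are modeled as sets of label strings, and the label names, the case names and the table `OPTIONAL_CASE_SLOTS` are hypothetical placeholders rather than the contents of FIGS. 11 to 13.

```python
# Hypothetical verb-independent slot information for optional cases.
OPTIONAL_CASE_SLOTS = {
    "time": {"TIME"},
    "place": {"PLACE"},
}


def co_occurs(noun_labels, case, case_pattern):
    """Decide whether a noun filling `case` co-occurs with `case_pattern`.

    noun_labels: set of semantic labels of the noun N.
    case: name of the C case.
    case_pattern: dict mapping case name -> set of semantic labels (pattern P).
    """
    if case in case_pattern:                     # step 901
        slot_labels = case_pattern[case]         # step 902
    elif case in OPTIONAL_CASE_SLOTS:            # step 905
        slot_labels = OPTIONAL_CASE_SLOTS[case]  # step 906
    else:
        return False                             # step 904
    # A shared label means co-occurrence (step 903); none means no co-occurrence (step 904).
    return bool(slot_labels & noun_labels)


pattern = {"agent": {"HUMAN"}, "object": {"FOOD"}}
print(co_occurs({"HUMAN", "ANIMAL"}, "agent", pattern))
print(co_occurs({"TIME"}, "time", pattern))
```

Because the decision reduces to a set intersection, its accuracy depends entirely on how finely and consistently the labels were assigned when the dictionaries were built.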
The above-mentioned verbal case dictionary 708 and noun semantic label dictionary 709 used in the sentence analysis apparatus are paired so as to constitute the co-occurrence dictionary. Conventionally, this dictionary is built entirely by hand. A description will be made hereinbelow in terms of the typical procedure for building the co-occurrence dictionary. First, one or more specialists determine the semantic classification standard, as illustrated in FIG. 11, with reference to dictionaries, past systems and other sources. Secondly, one or more workers give one or more semantic labels to each of the nouns in the morpheme dictionary 703 on the basis of the determined semantic classification standard. Further, one or more workers classify each of the verbs in the morpheme dictionary 703 into one or more subsheets that differ in the case pattern and in the regulation information such as the rule, voice and phase, and successively state the case pattern information and the other regulation information for every case subsheet as shown in FIG. 13. If a deficiency in the semantic classification standard is found at the stage of building the co-occurrence dictionary, additions and changes to the semantic classification standard can be made. Further, a customary and special co-occurrence relation such as " " is directly stated as an exception in the verbal case dictionary and is processed as an exception either prior to the aforementioned semantic analysis or after a failure of the aforementioned semantic analysis. The updating of the co-occurrence dictionary is also performed by hand, so as to remain consistent with the existing constituents of the co-occurrence dictionary while taking into account, as a whole, the semantic classification standard and the contents of the co-occurrence dictionary built so far. For a large-scale update, additions and changes to the semantic classification standard are generally made.
A problem arises with such a conventional method, however, in that there is no systematic and objective method for building and updating the co-occurrence dictionary, and hence the building and updating of the co-occurrence dictionary depend greatly upon the know-how and skill of a language specialist or the like. That is, since the method of building the semantic label system is not clear, the kinds and interpretations of the semantic labels must be set by hand by a specialist before the noun semantic label dictionary and the verbal case dictionary are built, and because the resulting semantic label system is coarse and has too few kinds of labels, additions and changes to the system become necessary during the actual dictionary construction and analysis. Further, since the interpretation of each of the semantic labels cannot be made clear, when a large-scale dictionary is built by a plurality of persons it is difficult to give an adequate set of semantic labels to each word, and discrepancies of interpretation occur between the workers. In addition, when an end user uses a computer application system including a semantic analysis system and registers an unknown word, it is difficult for the end user to understand the semantic label system of that system well enough to give adequate semantic labels, so that the end user cannot easily update the co-occurrence dictionary.
In addition, there are several problems in the accuracy of the co-occurrence analysis and the semantic analysis. First, since it is difficult to build the co-occurrence dictionary accurately, the semantic labels are coarse, and in particular the accuracy of the co-occurrence analysis between an abstract noun and its case slot deteriorates. For example, there are more than 20 words pronounced as " ", all of which are abstract nouns, and hence it is difficult to convert them into kanji in accordance with the conventional co-occurrence analysis. Moreover, it is difficult to accurately determine the case frame, which is a principal portion of the semantic analysis, and the priority thereof.