1. Field of the Invention
The present invention relates to an anaphora analyzing apparatus for use in natural language analysis, and in particular, to an anaphora analyzing apparatus for automatically estimating an anaphora referential relation or an antecedent of a noun used in a natural language sentence, namely, estimating what a pronoun, a demonstrative pronoun or the like specifically indicates in the natural language sentence. In the present invention, the targets of the anaphora analysis are nouns in a broad sense, namely nouns, pronouns and demonstrative pronouns.
2. Description of the Prior Art
To estimate an anaphora referential relation of a noun in a natural language analyzing apparatus, the conventional approach has been for a person to interpret, in advance, sentences in the field to be analyzed and to make anaphora rules manually.
Moreover, because of a recent provision of an environment allowing use of a tagged corpus obtained after morphological analysis and parsing analysis, a method employing a decision tree obtained by applying a machine training method to the tagged corpus is disclosed in, for example, a prior art document 1, D. Conolly et al., “A Machine Training Approach to Anaphora relation” in Proceeding of NeMLaP, 1994 (referred to as a first prior art hereinafter), and a prior art document 2, C. Aone et al., “Evaluating Automated and Manual Acquisition of Anaphora Resolution Strategies” in Proceeding of ACL, pp. 122-129, 1995 (referred to as a second prior art hereinafter).
After manually making the anaphora rules as described above, when an actual application of the anaphora rule results in an improper estimation, it is necessary to analyze a cause of the erroneous estimation and then add or improve the anaphora rules. Thus, only an expert in the technology of natural language analyzing apparatus can, in fact, make the anaphora rules.
Moreover, the first prior art employs the method in which a better candidate is selected by comparing two candidates in sequence by use of the decision tree for selection of an antecedent. In this case, there is a possibility that the selected antecedents may be different from each other depending on the entry order of the antecedent candidates. Therefore, the first prior art has a problem in that it cannot ensure that the candidate having truly high priority is selected. Furthermore, the second prior art utilizes the decision tree for selecting the antecedent, but consideration is not given to an integration of frequency statistics and location information upon giving preferences. Therefore, the second prior art has a problem in that the accuracy of an anaphora analysis is relatively low.
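The order dependence described above can be illustrated with a minimal sketch. The pairwise outcomes below are hypothetical and merely stand in for a trained decision tree; they are not taken from the first prior art. The relation is deliberately non-transitive, which is the situation in which sequential pairwise comparison yields different winners for different entry orders.

```python
from itertools import permutations

# Hypothetical pairwise outcomes standing in for a learned decision tree.
# The relation is non-transitive: A beats B, B beats C, C beats A.
BEATS = {("A", "B"), ("B", "C"), ("C", "A")}

def better(current, challenger):
    """Return the preferred candidate of the pair."""
    return current if (current, challenger) in BEATS else challenger

def select_sequentially(candidates):
    """Keep a running winner while scanning the candidates in entry order."""
    winner = candidates[0]
    for challenger in candidates[1:]:
        winner = better(winner, challenger)
    return winner

# Run the same selection over every entry order of the same three candidates.
results = {order: select_sequentially(list(order)) for order in permutations("ABC")}
for order, winner in results.items():
    print(order, "->", winner)
```

In this toy setting, each of the three candidates is selected for some entry order, so the result reflects the ordering rather than any overall priority among the candidates.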
When the above-mentioned tagged corpus is used, the decision tree is generated in accordance with the method of providing the tagged corpus, and thus the amount of data may affect the estimation accuracy. Moreover, the estimation may fail under the influence of a subtle difference in nature between the input sentence and the tagged corpus. In other words, the conventional natural language analyzing apparatus has the following problems. When constructing the anaphora analysis rules, it is necessary for an expert in the composition of the natural language analyzing apparatus to make the rules or verify the analyzed results, and thus the time and cost required for making the rules are increased. On the other hand, although the use of machine training does not cause these problems of time and cost, it has the problem that the estimation may be unsuccessful depending on the amount of data or on the differences in nature between the input sentence and the tagged corpus.
An object of the present invention is therefore to provide an anaphora analyzing apparatus capable of performing anaphora analysis with accuracy higher than that of the prior art.
In order to achieve the aforementioned objective, according to one aspect of the present invention, there is provided an anaphora analyzing apparatus comprising:
analyzing means for analyzing an input natural language sentence and outputting analyzed results;
storing means for storing the analyzed results outputted from the analyzing means;
antecedent candidate generating means for detecting a target component in the input natural language sentence required for anaphora analysis in accordance with the current analyzed results outputted from the analyzing means and the past analyzed results stored in the storing means, and for generating antecedent candidates corresponding to the target component;
candidate rejecting means for rejecting unnecessary candidates having no potential for anaphora referential relation among the antecedent candidates generated by the antecedent candidate generating means by using a predetermined rejecting criterion, and for outputting the remaining antecedent candidates, the rejecting criterion being of a decision tree obtained by using a machine training method in accordance with a training tagged corpus to which predetermined word information is given for each word of the training tagged corpus;
preference giving means for calculating a predetermined estimated value for each of the remaining antecedent candidates outputted from the candidate rejecting means, by referring to an information table including predetermined estimation information obtained from a predetermined further training tagged corpus, for giving the antecedent candidates preference in accordance with the calculated estimated value, and for outputting preferenced antecedent candidates; and
candidate deciding means for deciding and outputting a predetermined number of antecedent candidates based on the given preference in accordance with the preferenced antecedent candidates outputted from the preference giving means.
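The flow of data through the above means may be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Word` record, the number-agreement test used in place of the decision tree, and the score table are all hypothetical, and the analyzing means is assumed to have already produced tagged words.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    surface: str
    pos: str        # part-of-speech tag, e.g. "noun" or "pronoun"
    number: str     # "sg" or "pl"
    position: int   # word index within the discourse

def generate_candidates(current, history):
    """Detect the first pronoun in `current` and list the earlier nouns."""
    target = next((w for w in current if w.pos == "pronoun"), None)
    if target is None:
        return None, []
    nouns = [w for w in history + current
             if w.pos == "noun" and w.position < target.position]
    return target, nouns

def reject_candidates(target, candidates):
    """Hypothetical stand-in for the decision tree: require number agreement."""
    return [c for c in candidates if c.number == target.number]

def give_preference(candidates, table):
    """Look up an estimated value per candidate and sort best first."""
    scored = [(table.get(c.surface, 0.0), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def decide(ranked, n=1):
    """Output the top `n` candidates."""
    return [c for _, c in ranked[:n]]

def analyze(current, history, table, n=1):
    target, candidates = generate_candidates(current, history)
    if target is None:
        return []
    remaining = reject_candidates(target, candidates)
    return decide(give_preference(remaining, table), n)

# `history` plays the role of the stored past analyzed results.
history = [Word("students", "noun", "pl", 0), Word("teacher", "noun", "sg", 1)]
current = [Word("book", "noun", "sg", 2), Word("it", "pronoun", "sg", 3)]
table = {"teacher": 0.2, "book": 0.7}   # hypothetical estimated values

print([w.surface for w in analyze(current, history, table)])
```

Keeping rejection and preference as separate stages mirrors the structure above: the rejecting stage only removes implausible candidates, and the ordering of the survivors is left entirely to the information table.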
In the above-mentioned anaphora analyzing apparatus, the candidate rejecting means selects and outputs one or more antecedent candidates when all the antecedent candidates are rejected by the candidate rejecting means.
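The fallback behavior may be sketched as below. The text does not state which candidates are selected when all are rejected, so the default policy here (keep the most recently entered candidate) is an assumption made only for illustration, as are the function and parameter names.

```python
def reject_with_fallback(candidates, is_plausible, fallback=lambda cs: cs[-1:]):
    """Filter candidates; if every one is rejected, fall back to a selection.

    `is_plausible` stands in for the rejecting criterion, and `fallback`
    is a hypothetical policy that picks one or more of the original candidates.
    """
    remaining = [c for c in candidates if is_plausible(c)]
    if remaining or not candidates:
        return remaining
    return fallback(candidates)

# All candidates rejected: the fallback keeps the last one entered.
print(reject_with_fallback(["students", "teacher", "book"], lambda c: False))
```

Without such a fallback, the later stages would receive an empty list and could not output any antecedent at all.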
In the above-mentioned anaphora analyzing apparatus, the estimation information for the information table preferably includes frequency information obtained from the predetermined further training tagged corpus.
In the above-mentioned anaphora analyzing apparatus, the estimation information for the information table preferably [...] have been known to those skilled in the art, and then generates a tagged corpus including tags that carry the analyzed results, such as information about a part of speech of a word and information about a relation between a relative and a noun. Thereafter, the analyzer 1 stores the analyzed results in an analyzed result memory 11, and outputs the analyzed results to the antecedent candidate generator 2. In the present preferred embodiment, the tagged corpus is provided with word information such as regular expression, part of speech, semantic code, like gender, person and number for each word. Next, the antecedent candidate generator 2 detects a target component in the input sentence required for anaphora analysis in accordance with the analyzed results of the input tagged corpus by referring to the tagged corpus of the past analyzed results stored in the analyzed result memory 11, generates antecedent candidates corresponding to the target component, and outputs the antecedent candidates to the candidate rejecting section 3. In short, the antecedent candidate generator 2 extracts the nouns from the input tagged corpus and the past tagged corpora by using a known method, so as to generate, as the antecedent candidates, the nouns considered to have an anaphora referential relation.
In the above-mentioned anaphora analyzing apparatus, the estimation information for the information table preferably includes predetermined information calculated in accordance with frequency information obtained from the predetermined further training tagged corpus and a distance between a target component for anaphora analysis and antecedent candidates obtained from the predetermined further training tagged corpus.
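One way frequency information and distance could be combined into a single estimated value is sketched below. The formula is not given in this passage, so the scoring function (the relative frequency of a true link divided by one plus the distance) is purely an assumption chosen for illustration, as are the training triples and all names.

```python
from collections import Counter

def build_frequency_table(training_pairs):
    """Relative frequency with which each (anaphor, candidate) pair was a true link.

    training_pairs: iterable of (anaphor, candidate, is_antecedent) triples,
    as might be read off a tagged training corpus.
    """
    positive, total = Counter(), Counter()
    for anaphor, candidate, is_antecedent in training_pairs:
        total[(anaphor, candidate)] += 1
        if is_antecedent:
            positive[(anaphor, candidate)] += 1
    return {pair: positive[pair] / total[pair] for pair in total}

def preference(table, anaphor, candidate, distance):
    """Hypothetical estimated value: frequency ratio discounted by distance."""
    return table.get((anaphor, candidate), 0.0) / (1 + distance)

def rank(table, anaphor, candidates):
    """candidates: list of (candidate, distance) pairs; best first."""
    return sorted(candidates,
                  key=lambda cd: preference(table, anaphor, cd[0], cd[1]),
                  reverse=True)

training = [
    ("it", "book", True), ("it", "book", True), ("it", "book", False),
    ("it", "teacher", True), ("it", "teacher", False),
]
table = build_frequency_table(training)

# A nearer candidate can outrank a more frequent but more distant one.
print(rank(table, "it", [("teacher", 1), ("book", 3)]))
```

With equal distances the more frequent antecedent is ranked first, whereas a sufficiently large distance can reverse the order, which is the kind of integration of frequency statistics and location information that the second prior art is said to lack.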