1. Field of the Invention
The invention relates to technology for performing language analysis of sentence data using examples of phrases and sentences. More specifically, the present invention relates to a method for predicting negative examples from positive examples, a method for detecting incorrect wording using the predicted negative examples, and a method for extracting non-case relational relative clauses (relative clauses that do not have correct case relations with their respective main clauses) from a sentence.
The present invention can be applied to determining case relationships occurring in a sentence, detecting wording errors, and syntax analysis, but it is by no means limited to these applications. The present invention can therefore be used, for example, in the detection of incorrect wording in actual Japanese sentences and can be applied together with word processor systems and OCR systems.
2. Description of the Related Art
A process for predicting negative examples from positive examples is a process that takes correct phrases or sentences as positive examples, takes incorrect phrases or sentences as negative examples, and predicts negative examples from the positive examples. Positive examples can be acquired relatively easily by utilizing a corpus (i.e., a collection of correct Japanese sentences) or the like, but negative examples cannot be easily acquired. Such negative examples can only be generated manually, so the process of creating them tends to be excessively labor-intensive.
In a simple method for predicting a negative example from positive examples, input examples that do not appear in known positive examples are all considered to be negative examples.
However, in reality, the existence of positive examples that are not yet among the known positive examples should be considered. If negative examples are predicted using this kind of simple method, there is a problem in that a large number of new positive examples are determined to be negative examples. It is therefore not possible to apply negative examples generated using this method to highly precise processing.
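The simple method and its weakness can be illustrated with a minimal sketch. The function name and the English toy phrases are assumptions made for illustration only; they do not come from any cited method.

```python
def predict_negative_naive(candidates, known_positives):
    """Treat every candidate absent from the known positives as a negative example."""
    known = set(known_positives)
    return [c for c in candidates if c not in known]


# Hypothetical toy data: two known positive examples and three inputs.
known_positives = ["read a book", "write a letter"]
candidates = ["read a book", "read a letter", "book a read"]

predicted_negatives = predict_negative_naive(candidates, known_positives)
# "read a letter" is a correct phrase that simply has not been observed yet,
# but the naive method still reports it as a negative example.
print(predicted_negatives)
```

In this sketch, an unseen but correct phrase is treated the same way as a genuinely incorrect one, which is the precision problem described above.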
A method is therefore required for predicting negative examples from a large number of positive examples. For example, assuming that all data of a large-scale existing corpus (for example, a collection of Japanese sentences) is correct, all the sentences existing in the corpus can be considered to be correct sentences (positive examples). Negative examples can then be automatically generated by using a method where each of the sentences in the corpus is used as a positive example and processing is carried out to predict negative examples that are incorrectly worded. A processing method for predicting negative examples from positive examples is useful for detecting actual incorrectly worded sentences when positive examples are available but negative examples are difficult to acquire.
For example, a process for detecting incorrect wording in Japanese sentences is extremely difficult compared to the case of English sentences. A space is left between words in English sentences, so that spell-checking of the words can be carried out with substantially high precision by basically preparing a word dictionary and rules for the changing of word endings. In Japanese sentences, however, the words are connected, and a high-precision result is difficult to achieve even when processing is limited to incorrect wording.
Further, in addition to incorrect wording, grammatical errors, such as errors in the usage of particles such as “te”, “ni”, “wo”, and “ha”, may also exist. Wording errors based on grammatical errors are difficult to detect, regardless of whether the sentences are Japanese sentences or English sentences.
The following is related prior art for detecting wording errors in the Japanese language.
Related methods for detecting incorrect wording based on word dictionaries, a dictionary listing a succession of hiragana, and a dictionary listing articulation conditions are described in the following cited references 1 to 3. In these related methods, incorrect wording is determined when a wording appears that is not listed in the word dictionary or the dictionary listing a succession of hiragana, or in the case of the appearance of an articulation that is not sufficiently covered by the articulation conditions listed in the dictionary.
[Cited reference 1: Kazuhiro Nohtom, Development of Proofreading Support Tool hsp, Information Processing Institute, Research and Development Presentation (digital documents), pp. 9-16, (1997)]
[Cited reference 2: Kawahara et al., Methods of Detecting Incorrect Wording Using a Dictionary Extracted from a Corpus, 54th National Conference of the Information Processing Society, pp. 2-21-2-22, (1997)]
[Cited reference 3: Nobuyuki Shiraki et al., Making a Japanese Spellchecker by Registering Large Volumes of Strings of Hiragana, Annual Conference of the Language Processing Society, pp. 445-448, (1997)]
Also, related art in which a probability of occurrence is obtained for each character string based on a probability model utilizing character-unit n-grams, and locations containing character strings with a low probability of occurrence are then determined to be incorrect wording, is disclosed in the following cited references 4 to 6.
The technique using n-gram probability in cited reference 5 below is used in the detection of wording errors occurring in error correction systems, mainly for optical character readers (OCRs). In the case of an OCR error correction system, the probability of appearance of incorrect wording is assumed to be high, at 5 to 10%, which is higher than the probability of a person making a mistake when writing. This is therefore a relatively straightforward problem, and the recall rate and relevance rate for the detection of wording errors can easily become high.
[Cited reference 4: Tetsuro Araki et al., Detection and Correction of Errors in Japanese Sentences Using Two Kinds of Markov Model, Information Processing Institute, Natural Language Processing Society, NL97-5, pp. 29-35, (1997)]
[Cited reference 5: Takaaki Matsuyama et al., A Thesis on Experiments Relating to Estimation of Relevance Rate and Recall Rate for Evaluating Performance in OCR Error Correction Using n-gram, Information Processing Society, Annual Conference, pp. 129-132, (1996)]
[Cited reference 6: Koichi Takeuchi et al., OCR Error Correction Using Stochastic Language Models, Information Processing Society Journal, Vol. 40, No. 6, (1999)]
The related art method by Takeuchi et al., which is considered to be the most appropriate, i.e., the related art disclosed in cited reference 6 (hereinafter referred to as related art A), is briefly described in the following.
In related art A, first, windows of three consecutive characters are extracted from the text to be checked for incorrect wording, starting at the top and shifting by one character at a time. When the probability of appearance of an extracted portion in the corpus (a collection of correct Japanese sentences) is Tp or less, −1 is assigned to each of its three consecutive characters, and characters for which the accumulated value is Ts or less are then determined to be incorrect. For example, Tp is taken to be zero, and Ts is taken to be −2. By making Tp zero, it is sufficient simply to check whether or not the three consecutive characters appear in the corpus, without it being necessary to expressly obtain the probability of appearance. When Tp>0, an error may be determined even if the extracted portion appears in the corpus. However, if the characters appear in the corpus, this is preferably taken not to be an error even if the probability of appearance is low, and it is therefore preferable to set Tp=0 rather than Tp>0.
As a supplement to related art A, a description is given of processing for carrying out error detection on the Japanese expression “fu no jirei no kenshutsu”. Here, three consecutive characters such as “fu no koto” and “no jirei” are taken in turn from the top of the Japanese expression, a check is made as to whether each is in the corpus, and −1 is assigned to the three characters if they are not present in the corpus. In this case, as there is no “nojirei” or “jireino” in the corpus, points are assigned according to the trigrams as shown in FIG. 18, and the portion for “ji” and “rei”, which is assigned “−2”, is determined to be erroneous. Related art A is therefore a method in which character 3-grams appearing in the corpus with high frequency are efficiently combined to detect errors.
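The scoring in this worked example can be sketched as follows, under the setting Tp = 0 (a trigram is penalized only when it is absent from the corpus) and a threshold of −2. This is an illustrative sketch rather than the implementation of cited reference 6: the function names are hypothetical, and ASCII letters stand in for the Japanese characters.

```python
def corpus_trigrams(sentences):
    """Collect every run of three consecutive characters found in the corpus."""
    grams = set()
    for sentence in sentences:
        for i in range(len(sentence) - 2):
            grams.add(sentence[i:i + 3])
    return grams


def score_characters(text, grams):
    """Assign -1 to each character of every trigram that is absent from the corpus."""
    scores = [0] * len(text)
    for i in range(len(text) - 2):
        if text[i:i + 3] not in grams:
            for j in range(i, i + 3):
                scores[j] -= 1
    return scores


def flag_errors(text, grams, threshold=-2):
    """Return the positions whose accumulated score has reached the threshold."""
    scores = score_characters(text, grams)
    return [i for i, s in enumerate(scores) if s <= threshold]


# Hypothetical toy corpus: it contains ABC, DEF and EFG but not BCD or CDE,
# mirroring the two missing trigrams in the worked example.
grams = corpus_trigrams(["ABCX", "DEFG"])
print(score_characters("ABCDEFG", grams))  # [0, -1, -2, -2, -1, 0, 0]
print(flag_errors("ABCDEFG", grams))       # [2, 3]
```

Only the two characters covered by both missing trigrams accumulate −2, which corresponds to the portion for “ji” and “rei” in the example.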
However, the processing in related art A is a process for determining whether or not an expression exists in the corpus. That is to say, related art A is similar to the other aforementioned related methods in that items that do not appear in the dictionary are taken to be errors.
Next, a description is given of technology for extracting non-case relational relative clauses. A non-case relational relative clause is a clause in which the verb of the attributive modifying clause and the noun subject to the modifier have no case relationship, that is, a clause in which a case relationship is not established between the verb of the embedded clause and its preceding relative noun.
The sentence “fu no jirei wo chushutsu suru koto wa muzukashii” is taken as an example. In the relative clause “fu no jirei wo chushutsu suru koto”, a case relationship such as “koto ga chushutsu suru” or “koto wo chushutsu suru” is not established between the verb “chushutsu suru” and the preceding noun “koto”. Namely, this is taken to be a non-case relational relative clause because there is no case relationship such as a “ga” case or a “wo” case between “chushutsu suru” and “koto”. Conversely, sentences for which case relationships can be established are referred to as sentences for internal relationships.
In addition to the aforementioned format, sentences also have complex structures such as “sanma wo yaku kemuri”. When an attributive modifying clause with a case relationship is taken to be a positive example, a non-case relational relative clause is taken to be a negative example. A large number of declinable words (for example, verbs) and nouns in case relationships exist within the corpus. Therefore, in the present invention, when this information is taken as positive examples and non-case relational relative clauses are predicted as negative examples, the non-case relational relative clauses taken as negative examples can be automatically extracted from the verbs and nouns in each case relationship taken as positive examples.
The methods disclosed in the following cited references 7 to 9 are also provided as related methods for extracting non-case relational relative clauses.
[Cited reference 7: Takeshi Abekawa et al., Analysis of Root Modifiers in the Japanese Language Utilizing Statistical Information, Annual Conference of the Language Processing Society, pp. 270-271, (2001)]
[Cited reference 8: Timothy Baldwin, Making Lexical Sense of Japanese-English Machine Translation: A Disambiguation Extravaganza, Technical Report, (Tokyo Institute of Technology, 2001), pp. 69-122, ISSN 0918-2802]
[Cited reference 9: Katsuji Omote, Japanese/English Translation Systems for Embedded Sentences, Tottori University graduation thesis, (2001)]
In the related art of cited reference 7, it is noted that, between the attributive modifier relationship and the case relationship, there are large differences in the distribution of the number of different verbs making up these relationships, and non-case relational relative clauses are then specified by evaluating the differences in this distribution using a K-L distance. Further, cited reference 8 starts from research in which nouns that easily form non-case relational relative clauses with respect to embedded clauses and the like are extracted with manual rules and this information is then utilized, and describes a method where non-case relational relative clauses are specified using supervised machine learning techniques that take a wide range of information, including case frame information, as attributes. The technique of cited reference 9 determines whether a clause is a non-case relational relative clause or a case relational relative clause using case frame information in order to translate embedded sentences from Japanese to English.
Further, it is well known that learning is typically difficult using just positive examples, as is described in the following cited reference 10. If the machine learning method uses both positive examples and negative examples as supervised data (teaching signals), more highly precise processing is anticipated, but the precision of processing with machine learning methods using only positive examples is considered a problem.
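As a rough illustration of comparing two verb distributions with a K-L distance, the following sketch computes the Kullback-Leibler divergence between the verbs observed with a noun in the attributive modifier relationship and those observed with it in the case relationship. The text above does not give the features, smoothing, or decision rule of cited reference 7, so the verb lists, the add-one smoothing, and the function names here are all assumptions.

```python
import math
from collections import Counter


def distribution(verbs, vocabulary, alpha=1.0):
    """Smoothed relative frequency of each vocabulary verb (additive smoothing, assumed)."""
    counts = Counter(verbs)
    total = len(verbs) + alpha * len(vocabulary)
    return {v: (counts[v] + alpha) / total for v in vocabulary}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) over a shared vocabulary."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p if p[v] > 0)


# Hypothetical observations for one noun (romanized verbs, invented counts).
modifier_verbs = ["yaku", "yaku", "moyasu"]   # verbs modifying the noun
case_verbs = ["deru", "agaru", "deru"]        # verbs taking the noun as a case element
vocabulary = sorted(set(modifier_verbs) | set(case_verbs))

p = distribution(modifier_verbs, vocabulary)
q = distribution(case_verbs, vocabulary)
print(kl_divergence(p, q))
```

A larger value indicates that the two verb distributions differ more; how such a value would be thresholded to specify a non-case relational relative clause is not stated in the text and is left open here.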
[Cited reference 10: Takashi Yokomori et al., Learning of Formal Languages Centered on Learning from Positive Examples, Information Processing Society Journal, Vol. 32, No. 3, pp. 226-235, (1991)]
As described above, in a process for predicting negative examples from positive examples, it is desirable to have a practical method for which precision is high.
In the related art methods using machine learning that takes only positive examples as teaching signals, high-precision processing is not achieved, and the acquisition of negative examples as teaching signals is difficult. Processing for detecting incorrect wording in passages is then implemented by utilizing machine learning that takes both positive examples and negative examples as teaching signals.