1. Field of the Invention
The invention relates to a system and a method for analyzing language using a supervised machine learning method. The present invention can be applied to an extremely wide range of problems, including phraseology generation processing such as ellipsis supplementation processing, sentence generation processing, machine translation processing, character recognition processing, and speech recognition processing, and thereby enables a highly practical language processing system.
2. Description of the Related Art
In the field of language analysis processing, semantic analysis processing, the phase that follows morphological analysis and syntax analysis, is becoming increasingly important. In particular, for case analysis processing, ellipsis analysis processing, and the like, which are principal elements of semantic analysis, it is desirable to reduce the workload involved in processing and to increase processing accuracy.
Case analysis processing is a type of processing for restoring a surface case that is hidden when a part of a sentence is topicalized or turned into an adnominal modifier. For example, in the sentence “ringo wa tabeta (As for the apple, I ate it)”, “ringo wa (As for the apple)” denotes the topic of the sentence. When the sentence is analyzed and transformed into the non-topicalized sentence “(watashi wa) ringo wo tabeta (I ate the apple)”, the surface case emerges as “ringo wo”. In this case, the “wa” of “ringo wa” is analyzed, and the “wo case” is obtained as the surface case.
Further, in another example, “kyou katta hon wa mou yonda (I already read the book which I bought today)”, “katta hon (the book which I bought)” is a relative clause construction in the sentence whose main verb is “mou yonda (already read . . . )”. When the relative clause is analyzed and transformed into the simple sentence “(watashi wa) hon wo katta (I bought a book)”, the relation between “katta” and “hon” turns out to be the “wo case”.
Ellipsis analysis processing is a type of processing for recovering an omitted part of a sentence or its surface case. For example, consider “mikan wo kaimashita. Soshite, tabemashita (I bought tangerines. And I ate (them))”. The clause “soshite tabemashita (And I ate (them))”, in which the object is omitted (an omission also called a “zero pronoun”), is analyzed and transformed into the sentence “soshite mikan wo tabemashita (And I ate tangerines)”. The omitted part thus turns out to be “mikan wo (tangerines as an object)”, which has the “wo case” as the elided case particle.
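The three examples above share one shape: a noun, a predicate, and a surface case particle to be restored. The following minimal sketch, given only for illustration, records them as problem and answer pairs. The class name CaseProblem, its field names, and the function restore are hypothetical and do not appear in the cited references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseProblem:
    """One case or ellipsis analysis problem (hypothetical representation)."""
    noun: str          # topicalized, modified, or omitted noun
    predicate: str     # verb to which the noun relates
    construction: str  # "topic", "relative_clause", or "ellipsis"
    answer: str        # surface case particle to be restored


# The three examples discussed in the text above.
EXAMPLES = [
    CaseProblem("ringo", "tabeta", "topic", "wo"),
    CaseProblem("hon", "katta", "relative_clause", "wo"),
    CaseProblem("mikan", "tabemashita", "ellipsis", "wo"),
]


def restore(problem: CaseProblem) -> str:
    """Rebuild the plain clause with the restored case particle."""
    return f"{problem.noun} {problem.answer} {problem.predicate}"


if __name__ == "__main__":
    for example in EXAMPLES:
        print(example.construction, "->", restore(example))
```

In this representation, case analysis and ellipsis analysis both reduce to predicting the answer field from the other fields.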
The following references are provided as related technology pertaining to the present invention.
One known case analysis method utilizes existing case frames, as shown in cited reference 1 below.
[Cited reference 1: Sadao Kurohashi and Makoto Nagao, A Method of Case Structure Analysis for Japanese Sentences based on Examples in Case Frame Dictionary, IEICE Transactions on Information and Systems, Vol. E77-D, No. 2, pp. 227-239 (1994)]
Further, as shown in cited reference 2 below, case frames are constructed from a corpus that contains no analysis target information and has no other information added to it (hereinafter referred to as a “raw corpus”), and these case frames are then utilized.
[Cited reference 2: Daisuke Kawahara and Sadao Kurohashi, Case Frame Construction by Coupling the Predicate and its Adjacent Case Component, Information Processing Institute, Natural Language Processing Society, 2000-NL-140-18 (2000)]
As shown in cited reference 3 below, case analysis can utilize frequency information from a raw corpus rather than from a corpus provided with case information, with the case then obtained through maximum likelihood estimation.
[Cited reference 3: Takeshi Abekawa, Kiyoaki Shirai, Hozumi Tanaka, Takenobu Tokunaga, Analysis of Root Modifiers in the Japanese Language Utilizing Statistical Information, Seventh Annual Conference of the Language Processing Society, pp. 270-271 (2001)]
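The idea of estimating a case from raw-corpus frequencies can be sketched as follows. This is a simplified illustration under stated assumptions, not the model of cited reference 3: it assumes that (noun, particle, verb) triples have already been extracted from a raw corpus, and it takes the relative frequency of each particle as its maximum likelihood estimate. The function name estimate_case and the toy counts are invented for illustration.

```python
from collections import Counter


def estimate_case(triples, noun, verb):
    """Return the most likely case particle for (noun, verb), with probabilities.

    The maximum likelihood estimate of P(case | noun, verb) is the relative
    frequency of that case among the observed (noun, case, verb) triples.
    """
    counts = Counter(case for n, case, v in triples if n == noun and v == verb)
    total = sum(counts.values())
    if total == 0:
        return None, {}  # the pair was never observed
    probs = {case: count / total for case, count in counts.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Toy counts (invented): "hon wo katta" seen 3 times, "hon ga katta" once.
TOY_TRIPLES = (
    [("hon", "wo", "katta")] * 3
    + [("hon", "ga", "katta")]
    + [("mise", "de", "katta")] * 2
)

if __name__ == "__main__":
    print(estimate_case(TOY_TRIPLES, "hon", "katta"))
```

With these toy counts, the relative clause “katta hon” would be assigned the “wo case”, because “hon wo katta” is the most frequent non-embedded form for that noun and verb.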
As shown in cited reference 4 below, the TiMBL technique (refer to cited reference 5), which is one type of k-nearest-neighbor method, is used as a machine learning method employing a corpus with case information.
[Cited reference 4: Timothy Baldwin, Making Lexical Sense of Japanese-English Machine Translation: A Disambiguation Extravaganza, Technical Report, Tokyo Institute of Technology, ISSN 0918-2802, pp. 69-122 (2001)]
[Cited reference 5: Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch, TiMBL: Tilburg Memory Based Learner, version 3.0, Reference Guide, ILK Technical Report ILK 00-01 (1995)]
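For orientation, a k-nearest-neighbor classifier over symbolic features can be sketched as follows. This is a generic illustration and is not the TiMBL software or its interface: it uses a plain, unweighted overlap distance (the number of mismatching feature values) and a majority vote. The function names, the feature layout (noun, verb, construction), and the training examples are assumptions made for illustration.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the feature positions at which two examples differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(train, query, k=3):
    """Label the query by majority vote among its k nearest training examples."""
    ranked = sorted(train, key=lambda item: overlap_distance(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy corpus with case information (invented): features -> case particle.
TRAIN = [
    (("ringo", "tabeta", "topic"), "wo"),
    (("mikan", "tabeta", "topic"), "wo"),
    (("hon", "katta", "relative"), "wo"),
    (("watashi", "tabeta", "topic"), "ga"),
    (("kare", "katta", "topic"), "ga"),
]

if __name__ == "__main__":
    print(knn_classify(TRAIN, ("pan", "tabeta", "topic"), k=3))
```

Because every training example must carry its case label, a method of this kind depends on a corpus with case information, which is the labor burden discussed in this section.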
The research of Abekawa shown in cited reference 3 and the research of Baldwin shown in cited reference 4 handle only case analysis processing for transformations into embedded sentences.
Conventionally, when case analysis is performed on Japanese, the case information in the corpus with case information that serves as examples is added manually. However, manually adding analysis rules and analysis information raises a serious problem in terms of the human resources and labor required to expand and adjust the rules. This point justifies the use of supervised machine learning methods in language analysis processing. However, conventional supervised machine learning methods use a corpus supplemented with analysis target information as supervised data. It is therefore also necessary to alleviate the labor burden involved in supplementing the corpus with analysis target information.
Further, a large amount of supervised data is needed in order to improve processing accuracy. The research of Abekawa in cited reference 3 and the research of Baldwin in cited reference 4 perform case analysis processing by employing a raw corpus not provided with case information. However, this case analysis processing technology handles only transformation into embedded sentences.
Therefore, there is a demand for machine learning methods that can use a raw corpus, that is, a corpus not provided with the information constituting the analysis target, in a broader range of language processing.