1. Field of the Invention
The present invention relates to an apparatus and method for constructing learning data, capable of efficiently constructing learning data required in statistical methodology used in information retrieval, information extraction, translation, natural language processing, etc.
2. Description of the Related Art
Statistical methodology is currently used in information retrieval, information extraction, translation, natural language processing, etc. The statistical methodology requires construction of learning data according to each task, and the more learning data is constructed, the higher the performance.
Examples of learning data for morpheme analysis and named entity recognition, both of which are natural language processing tasks, are shown below.
Text 1: Eoje isunsin janggungwa maleul haetda (Korean transliteration of “I had a conversation with General Sun-shin Lee yesterday”)
Morpheme analysis: Eoje/nc isunsin/nc janggun/nc+gwa/jj mal/nc+eul/jc ha/pv+eot/ep+da/ef./s
Text 2: Hanguk∘Ilbon∘Manju∘Usurigang deungjie bunpohanda (Korean transliteration of “It is distributed in Korea∘Japan∘Manchuria∘Usuri River”)
Named Entity Recognition: <Hanguk:LCP.COUNTRY>∘<Ilbon:LCP.COUNTRY>∘<Manju:LC.OTHERS>∘<Usurigang:LCG.RIVER> deungjie bunpohanda.
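For illustration only, the two annotation formats shown above can be read into structured records with a short Python sketch. The function names and regular expressions are assumptions introduced here and are not part of the described apparatus. The tag labels (nc, jj, LCP.COUNTRY, and so on) are copied from the examples without interpreting them.

```python
import re

# Assumed format: morphemes within one word are joined by "+", and each
# morpheme is written as surface/tag with a lowercase tag (e.g. janggun/nc+gwa/jj).
MORPHEME = re.compile(r"([^/+\s]+)/([a-z]+)")

# Assumed format: a named entity is written as <surface:LABEL>.
ENTITY = re.compile(r"<([^<>:]+):([^<>]+)>")


def parse_morphemes(analysis):
    """Return one list of (morpheme, tag) pairs per space-separated word."""
    return [MORPHEME.findall(word) for word in analysis.split()]


def parse_entities(tagged):
    """Return the (surface, label) pairs found in a tagged sentence."""
    return [(s.strip(), label.strip()) for s, label in ENTITY.findall(tagged)]


def strip_entity_tags(tagged):
    """Recover the raw text by replacing each <surface:LABEL> with its surface."""
    return ENTITY.sub(lambda m: m.group(1), tagged)


morphemes = parse_morphemes(
    "Eoje/nc isunsin/nc janggun/nc+gwa/jj mal/nc+eul/jc ha/pv+eot/ep+da/ef./s"
)
entities = parse_entities("<Hanguk:LCP.COUNTRY>∘<Ilbon:LCP.COUNTRY> deungjie bunpohanda")
```

Each record pairs a surface string with its tag, which is the form of learning data that a statistical tagger would typically be trained on.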
Also, an example of learning data for information extraction is shown below.
Text 3: Hanyangdaehakgyo songsimon gyosunimeul mosigo “biochipeul iyonghan sample jeoncheori”e daehan naeyongeuro jeonmunga chocheong seminareul gaechoihagojahamnida (Korean transliteration of “We will hold an expert seminar entitled “Sample Pretreatment using Biochip” with professor Si-mon Song of Hanyang University”)
Information Extraction: <Hanyangdaehakgyo: lecturer. where the lecturer is from>∘<songsimongyosunimeul:lecturer.career> eul mosigo <“biochipeul iyonghan sample jeoncheori”:seminar. a title>e daehan naeyongeuro jeonmunga chocheong seminareul gaechoihagojahamnida
However, as the construction of learning data requires a great deal of time and effort, a learning data shortage often occurs.
Conventional methods of overcoming such learning data shortages are classified into three methodologies.
A first methodology involves using a workbench that supports an auto tagging function by means of machine learning. This method is similar to the present invention in supporting the auto tagging function; however, it does not support gradually and automatically enhancing auto tagging performance by selecting learning data candidates or by reusing error-corrected data to increase the total amount of learning data.
A second methodology includes a bootstrapping method or a co-training method. These methods are similar to the present invention in supporting a function of enhancing learning data by automatically tagging the learning data; however, they do not support functions of correcting errors in an auto tagging result or selecting a learning data candidate. In addition, because batch learning is used, it takes considerable time to repeatedly perform machine learning. These methods also suffer from performance deterioration because errors in the auto tagging results are included in the learning data.
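As a rough illustration of the second methodology, a bootstrapping (self-training) loop may be sketched as follows. The nearest-centroid model over single numeric features, the margin threshold, and all function names are assumptions chosen to keep the sketch small, and they stand in for whatever learner an actual system would use. The sketch shows the two points noted above: the model is retrained from scratch in every round, and auto tagged items are added without any error correction.

```python
def train(labeled):
    """Batch-train a nearest-centroid model: the mean feature value per label."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Return (label, margin); margin is the distance gap between the two closest centroids."""
    dists = sorted((abs(x - c), y) for y, c in model.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def bootstrap(seed, raw, rounds=3, min_margin=1.0):
    """Grow the learning data by auto tagging raw items with the current model."""
    labeled, pool = list(seed), list(raw)
    for _ in range(rounds):
        model = train(labeled)  # batch learning: full retraining every round
        remaining = []
        for x in pool:
            y, margin = predict(model, x)
            if margin >= min_margin:
                labeled.append((x, y))  # accepted as-is; a wrong tag is never corrected
            else:
                remaining.append(x)
        if len(remaining) == len(pool):  # nothing new was tagged, so stop early
            break
        pool = remaining
    return train(labeled), labeled


model, labeled = bootstrap([(0.0, "A"), (10.0, "B")], [1.0, 2.0, 8.0, 9.0, 4.9])
```

In this sketch any item whose auto tag is wrong but confident enters the learning data permanently, which corresponds to the performance deterioration described above.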
A third methodology is an active learning method. Such a method is similar to the present invention in obtaining high performance with a small quantity of learning data, which is constructed by generating learning models from initial learning data, applying the generated models to a raw corpus, and selecting optimal learning data candidates. However, it takes considerable time to repeatedly perform machine learning because batch learning is used. In particular, this method has the problem that the learning time increases with each repetition as the total amount of learning data increases.
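The growth in learning time under the third methodology can be illustrated with a minimal active learning sketch. Uncertainty sampling is used here as one common way of selecting a candidate, and the nearest-centroid model, the `oracle` function standing in for a human annotator, and all other names are assumptions made for this example only. The seed data is assumed to contain at least two labels.

```python
def train(labeled):
    """Batch-train a nearest-centroid model: the mean feature value per label."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def margin(model, x):
    """Distance gap between the two closest centroids; a small gap means an uncertain item."""
    dists = sorted(abs(x - c) for c in model.values())
    return dists[1] - dists[0]


def active_learning(seed, raw, oracle, rounds):
    """Repeatedly retrain, pick the least certain raw item, and have the oracle tag it."""
    labeled, pool = list(seed), list(raw)
    examples_per_round = []  # size of the training set processed in each round
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)  # batch learning over all data collected so far
        examples_per_round.append(len(labeled))
        candidate = min(pool, key=lambda x: margin(model, x))
        pool.remove(candidate)
        labeled.append((candidate, oracle(candidate)))
    return train(labeled), examples_per_round


def oracle(x):
    """Hypothetical annotator used only for this example."""
    return "A" if x < 5 else "B"


model, cost = active_learning([(0.0, "A"), (10.0, "B")], [1.0, 4.0, 6.5, 9.0], oracle, rounds=3)
```

The list `cost` records how many examples each retraining pass had to process, and it grows by one per round because every round retrains on the whole accumulated set.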