Techniques exist for extracting a specific type of information from a large corpus (a set of example sentences). The specific type of information may be any of various kinds, such as named entities or evaluative expressions. In the following description, it is assumed that the information to be extracted is data indicating an expression that evokes a problem (hereafter simply referred to as an “expression”).
Extracting a specific type of information is equivalent to identifying whether data belongs to the specific type (positive example) or does not belong to the specific type (negative example). In machine learning, certain amounts of positive examples and negative examples normally need to be prepared beforehand as training data, in order to create rules for identification. Preparing such data requires costly manual work.
Non Patent Literature (NPL) 1 describes a method of creating training data for named entity extraction, aimed at reducing the cost of preparing the training data, and, as a means to that end, a technique of identifying positive examples and negative examples in the training data.
In the method described in NPL 1, training data candidates for positive examples/negative examples are created from a corpus using dictionary data (hereafter simply referred to as “dictionary”). The dictionary records pairs of words and classes. A class indicates what kind of proper noun, such as a personal name, an organization name, a place name, or the like, a word belongs to. A training data candidate is created by assigning a class to a word that matches the dictionary. Even when the word matches the dictionary, however, a correct class is not necessarily assigned to the word. That is, the training data candidates include “false data” to which a false class is assigned.
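The dictionary-matching step described above can be illustrated with a minimal sketch. This is not the implementation of NPL 1; the toy dictionary, the whitespace tokenization, and the names `DICTIONARY` and `make_candidates` are assumptions introduced only for illustration. The sketch shows how a word that matches the dictionary receives the recorded class regardless of context, which is how false data arises.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary recording (word, class) pairs.
DICTIONARY: Dict[str, str] = {
    "Washington": "PERSON",
    "Tokyo": "LOCATION",
}

# A candidate is (sentence index, token position, word, assigned class).
Candidate = Tuple[int, int, str, str]


def make_candidates(sentences: List[str], dictionary: Dict[str, str]) -> List[Candidate]:
    """Assign the dictionary class to every token that matches the dictionary."""
    candidates: List[Candidate] = []
    for sent_id, sentence in enumerate(sentences):
        # Assumption: naive whitespace tokenization with punctuation stripped.
        for pos, token in enumerate(sentence.split()):
            word = token.strip(".,;:!?")
            if word in dictionary:
                candidates.append((sent_id, pos, word, dictionary[word]))
    return candidates


corpus = [
    "Washington signed the bill.",
    "She moved to Washington last year.",  # a place here, yet labeled PERSON
    "Tokyo is crowded.",
]
candidates = make_candidates(corpus, DICTIONARY)
for candidate in candidates:
    print(candidate)
```

In the second sentence, “Washington” denotes a place but is still assigned the class recorded in the dictionary, so that candidate is an instance of false data.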
In the method described in NPL 1, the following process is carried out. The training data candidates, including the false data, are clustered using surrounding words so that data belonging to the same class are included in the same cluster as much as possible. When a specific class forms a local majority within a cluster, this fact is used as identification information, and the data of that class in the cluster are determined to be positive examples.
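The majority-based selection described above can be sketched as follows. The clustering algorithm actually used in NPL 1 is not specified here, so the sketch substitutes a simple greedy grouping by shared surrounding words; that grouping, the threshold `min_share`, and all function and field names are assumptions for illustration only. Candidates that are not selected are simply left undecided in this sketch.

```python
from collections import Counter
from typing import Dict, List


def cluster_by_context(candidates: List[Dict]) -> List[List[Dict]]:
    """Greedy stand-in clustering: join the first cluster sharing a context word."""
    clusters: List[Dict] = []  # each entry: {"vocab": set of words, "members": list}
    for cand in candidates:
        for cluster in clusters:
            if cluster["vocab"] & cand["context"]:
                cluster["members"].append(cand)
                cluster["vocab"] |= cand["context"]
                break
        else:
            # No existing cluster shares a surrounding word: start a new one.
            clusters.append({"vocab": set(cand["context"]), "members": [cand]})
    return [cluster["members"] for cluster in clusters]


def select_positives(clusters: List[List[Dict]], min_share: float = 0.6) -> List[Dict]:
    """Keep candidates whose class is the local majority in their cluster."""
    positives: List[Dict] = []
    for members in clusters:
        label, count = Counter(m["cls"] for m in members).most_common(1)[0]
        if count / len(members) >= min_share:
            positives.extend(m for m in members if m["cls"] == label)
    return positives


# Hypothetical candidates: word, dictionary-assigned class, surrounding words.
candidates = [
    {"word": "Washington", "cls": "PERSON", "context": {"president", "signed"}},
    {"word": "Lincoln", "cls": "PERSON", "context": {"president", "spoke"}},
    {"word": "Tokyo", "cls": "LOCATION", "context": {"moved", "to"}},
    {"word": "Osaka", "cls": "LOCATION", "context": {"flew", "to"}},
    {"word": "Washington", "cls": "PERSON", "context": {"moved", "to"}},  # false data
    {"word": "Jackson", "cls": "PERSON", "context": {"spoke", "senate"}},
]
clusters = cluster_by_context(candidates)
positives = select_positives(clusters)
print([p["word"] for p in positives])
```

With this toy data, the falsely labeled “Washington” falls into the cluster dominated by LOCATION candidates and is therefore not selected as a positive example.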
Patent Literature (PTL) 1 describes a method of searching a set of search target documents for a set of extraction target documents by an initial search expression, and for a definite positive example which is a set of documents matching a search purpose and a definite negative example which is a set of documents not matching the search purpose.