In recent years, high expectations have been placed on the efficient utilization of large amounts of textual information. Such text contains characteristic expressions with special meanings, such as personal names, place names, and names of organizations (hereinafter, groupings based on special meanings are referred to as “classes”). Recognizing these characteristic expressions is useful in systems that utilize textual information, such as question answering systems, document classification systems, and machine translation systems. For example, suppose that a question answering system is provided with characteristic expression recognition functionality. In such a case, it is easy to see that the recognition functionality improves the accuracy of responses because, in response to the question “Who is the Prime Minister of Japan?”, the question answering system can recognize characteristic expressions of the class “personal name” and return the corresponding personal name. As used herein, the term “characteristic expression” designates an expression that has a specific meaning, and refers to nouns with specific meanings, such as personal names, place names, job titles, or animal names, as well as adjectives that serve as evaluation expressions, such as “good” and “bad”.
A well-known method in prior-art characteristic expression recognition (extraction) technology involves creating ground truth data (training data) by annotating, in a text, the expressions belonging to the classes to be extracted, and then acquiring extraction rules (characteristic expression extraction rules) from the ground truth data by machine learning. This method makes it possible to recognize characteristic expressions with excellent efficiency. However, the method is costly because the annotations must be assigned to the ground truth data accurately and without omissions. For this reason, low-cost generation of ground truth data is an important requirement in characteristic expression recognition technology.
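The general idea of acquiring extraction rules from annotated ground truth data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the class labels `PERSON` and `PLACE`, the label `O` for tokens outside any characteristic expression, the sample sentences, and the toy "rule" that predicts a class from the preceding token. It does not reproduce any particular prior-art learner.

```python
from collections import Counter, defaultdict

# Hypothetical ground truth data: each token is paired with a class label,
# where "O" marks tokens outside any characteristic expression.
GROUND_TRUTH = [
    [("Mr.", "O"), ("Tanaka", "PERSON"), ("visited", "O"), ("Osaka", "PLACE")],
    [("Mr.", "O"), ("Suzuki", "PERSON"), ("visited", "O"), ("Kyoto", "PLACE")],
]


def learn_rules(sentences):
    """Learn toy rules mapping the preceding token to its most frequent class."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        previous = "<s>"  # sentence-start marker
        for token, label in sentence:
            counts[previous][label] += 1
            previous = token
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def extract(rules, tokens):
    """Apply the learned rules; return (token, class) pairs not labeled "O"."""
    found = []
    previous = "<s>"
    for token in tokens:
        label = rules.get(previous, "O")  # unseen contexts default to "O"
        if label != "O":
            found.append((token, label))
        previous = token
    return found


rules = learn_rules(GROUND_TRUTH)
result = extract(rules, ["Mr.", "Sato", "visited", "Nagoya"])
print(result)
```

Even in this toy setting, a single missing or incorrect label in `GROUND_TRUTH` changes the learned rules, which is why accurate, omission-free annotation is costly.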
For instance, Patent Document 1 discloses an example of a conventional system capable of creating ground truth data at low cost. The system according to Patent Document 1 includes a ground truth data storage unit that stores ground truth data, a ground truth expansion unit, and a rule learning unit. The ground truth expansion unit retrieves ground truth data from the ground truth data storage unit and applies operations such as word order manipulation, syntactic representation conversion, and specific representation conversion, thereby generating expanded data from the ground truth data. The rule learning unit learns extraction rules using both the ground truth data and the generated expanded data as training data.
Thus, in the system according to Patent Document 1, new ground truth data (expanded data) are created in large quantities by modifying the word order of the ground truth data, changing its representations, and so on. Accordingly, it is believed that using the system according to Patent Document 1 allows the amount of training data to be increased at low cost.
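The expansion of ground truth data described above can be sketched roughly as follows. This is only an illustration of one kind of representation conversion, namely swapping an annotated expression for another expression of the same class. The lexicon, the class labels, and the function name `expand` are hypothetical, and the actual operations used in Patent Document 1 are not specified here.

```python
# Hypothetical lexicon of interchangeable expressions per class (an assumption).
LEXICON = {
    "PERSON": ["Tanaka", "Suzuki"],
    "PLACE": ["Osaka", "Kyoto"],
}


def expand(sentence, lexicon):
    """Generate expanded sentences by swapping one annotated expression at a time."""
    expanded = []
    for index, (token, label) in enumerate(sentence):
        for alternative in lexicon.get(label, []):
            if alternative == token:
                continue  # skip the expression already present
            variant = list(sentence)
            variant[index] = (alternative, label)  # annotation is preserved
            expanded.append(variant)
    return expanded


seed = [("Tanaka", "PERSON"), ("visited", "O"), ("Osaka", "PLACE")]
expanded_data = expand(seed, LEXICON)

# Rule learning would then use both the original and the expanded data.
training_data = [seed] + expanded_data
print(len(training_data))
```

Because each variant inherits its annotations from the seed sentence, no additional manual labeling is needed, which is the source of the cost reduction.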