The present invention relates to a technique for preparing the cases that a learning device uses as learning targets, where the learning device extracts statistical information from correct answer cases and makes inferences about unknown cases. Here, a correct answer case is a case whose characteristic to be inferred is already known, whereas an unknown case is a case whose characteristic to be inferred is not known.
In particular, the present invention relates to a case accumulating apparatus and method for preparing correct answer cases in situations where a human being can readily tell whether an inference result is correct but a machine cannot, such as text classified by field, tagged text, a correspondence between an image file and a character in optical character recognition (OCR), a name of an object represented by an image, etc.
A variety of methods that learn from correct answer cases with a statistical method and make inferences about unknown cases have been proposed, as described below.
(1) A method to automatically classify a document, preparing a group of correct answer documents whose fields have been determined, generating a statistical standard (inference rule) for classification from the prepared group by using a statistical estimation method (learning) based on, for example, the appearance frequency of words, and estimating the field of an unknown document by using the standard. The statistical standard for classification is not necessarily human-readable; it may be, for example, the weights of a neural network or a combination of keywords obtained by principal component analysis.
(2) A method to filter documents, using a process that separates documents required by a user from documents not required by the user, generating a statistical standard for classification by using information about words considered to be clues for the determination made at that time, and filtering a new document by using the standard.
(3) A method to automatically tag a text, preparing tagged correct answer text, generating a standard for tagging by using the information about a word in the vicinity of a tag, etc., and tagging an untagged document by using the standard.
(4) A method to implement OCR with high accuracy, preparing a correspondence between an image file and a correct answer character, generating a standard for recognition from the correspondence by using the information about a line element, etc., and determining to which character an unknown image file corresponds by using the standard.
(5) A method to determine a name or a characteristic such as a color, etc. of an object represented by an image, preparing a pair of an image file and a determination result of a correct answer, generating a determination standard by using pixel information from the pair, and determining to which determination result an unknown image belongs by using the standard.
These methods can all be regarded as frameworks that treat each correct answer case as belonging to a certain category, extract a correspondence between the characteristics of a case and the category of its correct answer, and infer the category of an unknown case by using the correspondence. For such frameworks, diverse techniques have been proposed to improve the accuracy of an inference.
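By way of illustration only, the common framework described above (learn a correspondence between characteristics and categories from correct answer cases, then infer the category of an unknown case) may be sketched as follows. The sketch assumes that the characteristics are word appearance frequencies and that the statistical standard is a multinomial naive Bayes score with add-one smoothing; neither choice is prescribed by the methods cited here, and the function names `learn` and `infer` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def learn(correct_cases):
    """Generate a statistical standard from (text, category) correct answer cases.

    Assumption: the standard is simply per-category word counts plus the
    number of documents seen in each category.
    """
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in correct_cases:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def infer(standard, text):
    """Estimate the category of an unknown text by using the standard."""
    word_counts, doc_counts = standard
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        counts = word_counts[category]
        total_words = sum(counts.values())
        # Log prior of the category plus smoothed log likelihood of each word.
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Illustrative data only; not taken from any cited publication.
correct_cases = [
    ("stock market prices rise", "finance"),
    ("bank interest rates", "finance"),
    ("team wins football match", "sports"),
    ("player scores goal", "sports"),
]
standard = learn(correct_cases)
print(infer(standard, "football player goal"))
```

With only four correct answer cases the standard is fragile, which is the situation the problems described below are concerned with.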
By way of example, as far as automatic document classification is concerned, Japanese Patent Application Publications Nos. 5-54037, 5-233706, 5-324726, 6-131225, 6-348755, 7-36897, 7-36767, 7-49875, 7-78186, 7-114572, 7-19202, 8-153121, etc. are cited.
However, the above described conventional inference methods have the following problems.
These methods assume that a sufficiently large number of correct answer cases exist and that significant information for categorization can be extracted from them. However, when Web or in-house documents are classified, for example, the number of categories sometimes ranges from several hundred to several thousand. Preparing a sufficiently large quantity of correct answer cases (at least 100 cases for each category) to generate an inference rule with sufficiently high accuracy for these categories requires a considerable amount of labor.
Additionally, apparatuses such as those recited in Japanese Patent Application Publications Nos. 9-22414, 9-153049, etc. exist as frameworks for presenting information that appears to be a clue for an inference and for making an inquiry to a user. However, these are not frameworks for efficiently generating correct answer cases through cooperation between a learning device and a user. With these apparatuses, correct answer cases cannot be accumulated with simple operations.
Furthermore, for a tagged corpus (a database of tagged texts), it is difficult to prepare a sufficiently large quantity of text examples for generating a tagging rule with high accuracy. Similarly, for Japanese character recognition in OCR, the number of character types reaches several thousand. Therefore, it is difficult to prepare, for each character, a sufficiently large quantity of correct answers from which a rule for recognition can be generated.
Normally, if a sufficiently large quantity of correct answer cases does not exist, a good inference algorithm or a good characteristic with which the correct answer rate becomes as high as possible is searched for in many cases. However, without a sufficiently large quantity of correct answer cases, an inference with high accuracy cannot be made with any method in most cases. Correct answer cases must then be accumulated manually. Accordingly, it is vital to find a way of efficiently performing the process for accumulating correct answer cases.
An object of the present invention is to provide a case accumulating apparatus and method that efficiently accumulate a sufficiently large quantity of correct answer cases starting from a small number of correct answer cases, so that an inference rule with high accuracy can be generated even when only a small number of correct answer cases exist.
A case accumulating apparatus according to the present invention comprises a storage device, a learning device, an inquiry device, and a control device.
The storage device stores information about a set of correct answer cases. The learning device generates an inference rule while referencing the information stored in the storage device, and infers a target characteristic from a known characteristic of a case to be inferred in compliance with the inference rule.
The inquiry device inquires of a user as to whether or not an inference result of the learning device is correct, and receives a response from the user. The control device determines the target characteristic of the case to be inferred based on the response, and adds information about the case to be inferred including the determined target characteristic to the information about the set of correct answer cases.
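Purely as an illustrative sketch, the cooperation among the storage device, the learning device, the inquiry device, and the control device described above may be pictured as the loop below. Several points are assumptions made for the example and are not stated in this description: the storage device is modeled as a plain list of (text, category) pairs, the learning device is a simple word-vote classifier, the inference rule is regenerated before every inquiry, and the user's response is modeled as a pair (is_correct, correction). The names `learn`, `infer`, `accumulate`, and `ask_user` are hypothetical.

```python
from collections import Counter, defaultdict


def learn(correct_cases):
    """Learning device: generate an inference rule from the stored cases.

    Assumption: the rule counts, for each word, the categories it appeared in.
    """
    rule = defaultdict(Counter)
    for text, category in correct_cases:
        for word in text.lower().split():
            rule[word][category] += 1
    return rule


def infer(rule, text):
    """Learning device: infer the target characteristic (here, a category)."""
    votes = Counter()
    for word in text.lower().split():
        if word in rule:
            votes.update(rule[word])
    if not votes:
        return None  # no known word, so no inference can be offered
    # Break ties alphabetically so that the result is deterministic.
    return min(votes, key=lambda category: (-votes[category], category))


def accumulate(storage, unknown_cases, ask_user):
    """Control device: infer, inquire of the user, and store each resolved case."""
    for text in unknown_cases:
        rule = learn(storage)  # regenerate the rule from the current set
        guess = infer(rule, text)
        # Inquiry device: ask whether the guess is correct and receive a response.
        is_correct, correction = ask_user(text, guess)
        label = guess if is_correct else correction
        if label is not None:
            storage.append((text, label))  # add to the set of correct answer cases
    return storage


# Illustrative data; a dictionary of known answers stands in for the user.
truth = {
    "market prices fall": "finance",
    "football team wins": "sports",
    "new phone released": "technology",
}
guesses = []


def ask_user(text, guess):
    guesses.append(guess)
    return guess == truth[text], truth[text]


storage = [("stock market rises", "finance"), ("football match result", "sports")]
accumulate(storage, list(truth), ask_user)
print(len(storage), guesses)
```

In this sketch the user only confirms a guess when it is right and supplies a category when it is wrong, so the amount of manual work depends on how often the learning device errs.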