The present disclosure relates to an information processing apparatus, an information processing method, and a program, and more particularly, to an information processing apparatus, an information processing method, and a program capable of executing a process of constructing and extending a database (relevance database) that describes a relevance.
Since an abundance of document data has recently been digitized, a large amount of document data can be browsed via document accumulation databases, the Internet, and the like. To efficiently obtain necessary information from the abundant document data, various processes of creating and using analysis data for documents have been suggested.
For example, there have been suggested processes of constructing and using a database that describes the relevance between entities such as two words appearing in a document.
The database that describes a relevance between entities such as two words is referred to as a relevance database. The relevance database can be used for various applications.
The overview of the processes of constructing and using the relevance database will be described.
For example, the following information is registered in the relevance database:
(a) [Taro, Tokyo] as entities such as two words; and
(b) (BIRTHPLACE) as a label (relevance label) indicating the relevance between the entities.
The two words (entities) and the relevance label are registered in correspondence with each other.
In this way, the following data are registered in correspondence with each other in the relevance database:
(a) a plurality of entities (words); and
(b) a label (relevance label) indicating the relevance between the entities.
Efficient document analysis can be executed by using the relevance database having the registered information.
For example, a sentence (phrase) including two words [Taro, Tokyo] registered in the database is detected from a document to be analyzed.
The sentence found to include the registered entities can then be determined to be a descriptive text of “BIRTHPLACE.”
The birthplace [Tokyo] of [Taro] can be extracted by retrieving the relevance database using one entity [Taro] and the relevance label “BIRTHPLACE” as keys.
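The registration, detection, and retrieval described above can be sketched as follows. This is only a minimal illustration and not an implementation taken from the disclosure: the class name `RelevanceDatabase`, its method names, and the naive substring matching used to find entities in a sentence are all assumptions made for the example.

```python
from collections import defaultdict


class RelevanceDatabase:
    """Toy in-memory relevance database (hypothetical illustration)."""

    def __init__(self):
        # Maps an entity pair (entity_a, entity_b) to its set of relevance labels.
        self._labels = defaultdict(set)

    def register(self, entity_a, entity_b, label):
        """Register a pair of entities in correspondence with a relevance label."""
        self._labels[(entity_a, entity_b)].add(label)

    def lookup(self, entity, label):
        """Return the entities paired with `entity` under `label`."""
        return sorted(
            b for (a, b), labels in self._labels.items()
            if a == entity and label in labels
        )

    def describe(self, sentence):
        """Return the labels of every registered pair whose two entities
        both occur in `sentence` (naive substring matching, an assumption)."""
        found = set()
        for (a, b), labels in self._labels.items():
            if a in sentence and b in sentence:
                found |= labels
        return sorted(found)


db = RelevanceDatabase()
db.register("Taro", "Tokyo", "BIRTHPLACE")

# Retrieval using one entity and the relevance label as keys.
print(db.lookup("Taro", "BIRTHPLACE"))

# Detection of a sentence that includes the registered pair of entities.
print(db.describe("Taro was born in Tokyo."))
```

A practical system would presumably rely on proper entity recognition rather than substring matching; the dictionary keyed by entity pair is used here only to make the correspondence between entities and labels explicit.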
Various analyses of the document can be executed at high speed by using the relevance database.
However, in the process of constructing the relevance database, it is necessary to extract relevant words from the document to be processed and determine a label indicating the relevance.
Executing such a process manually is costly.
Many documents have already been incorporated into the relevance database, but abundant new documents are published on the Internet or the like every day. It is therefore difficult to extend the relevance database in a timely manner so that it keeps up with the daily increasing information.
The following related-art technique, for example, discloses a process of constructing a relevance database.
A method of granting teacher information to a document using an existing relevance database and extending the relevance database is disclosed in “Distant supervision for relation extraction without labeled data” by Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky in 2009, in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL '09), pages 1003-1011, Association for Computational Linguistics. This document describes a process of using correspondence data registered in the relevance database, namely a pair of entities and a relevance label, for example, the pair of entities [Taro, Tokyo] and the relevance label (BIRTHPLACE).
A relevance label (here, BIRTHPLACE) is granted as a teacher label to the description (for example, “Taro born in Tokyo”) extracted from a document using data registered in the relevance database. A process of resolving a classification problem using the information registered in the relevance database and extending the relevance database is disclosed.
However, when this method is executed, a problem may arise in that a wrong teacher label is granted. For example, the pair of entities [Taro, Tokyo] and the relevance label (BIRTHPLACE) registered in the relevance database may be granted to the following phrases:
(A) Taro lived in Tokyo, and
(B) Taro died in Tokyo.
Specifically, an error may arise in that the pair of entities [Taro, Tokyo] and the relevance label (BIRTHPLACE), which are correctly granted to the phrase “Taro born in Tokyo”, are also set for the phrases (A) and (B), which do not describe the birthplace.
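The labeling error described above can be reproduced with a small sketch. It is a simplified stand-in for this style of labeling, not the cited method itself: the function name `label_by_distant_supervision`, the substring matching, and the extra sentence about Hanako and Osaka are hypothetical and introduced only for illustration.

```python
def label_by_distant_supervision(sentences, database):
    """Grant a teacher label to every sentence that contains both entities
    of a registered pair, using only the entity pair as the key.

    `database` maps (entity_a, entity_b) -> relevance label.
    Entity detection is naive substring matching (an assumption).
    """
    labeled = []
    for sentence in sentences:
        for (a, b), label in database.items():
            if a in sentence and b in sentence:
                labeled.append((sentence, (a, b), label))
    return labeled


database = {("Taro", "Tokyo"): "BIRTHPLACE"}

sentences = [
    "Taro born in Tokyo",     # actually describes the birthplace
    "Taro lived in Tokyo",    # phrase (A): does not describe the birthplace
    "Taro died in Tokyo",     # phrase (B): does not describe the birthplace
    "Hanako lived in Osaka",  # hypothetical sentence with no registered pair
]

labeled = label_by_distant_supervision(sentences, database)
for sentence, pair, label in labeled:
    print(f"{sentence!r} -> {label}")
```

Because only the entity pair is used as the key, all three sentences that mention both Taro and Tokyo receive the label BIRTHPLACE, including phrases (A) and (B).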
Thus, the information provided from the existing relevance database is not complete teacher information. This is because the relevance label is granted to a description in a document by using only the pair of entities as a key, even though a pair of entities can generally have a plurality of relevances. In the method of Mintz et al. cited above, erroneous relevance labels are therefore included in the teacher data, thereby deteriorating the overall performance.