The present invention relates to a system of searching texts for keywords and a method thereof, and particularly to a system of effectively searching for keywords by using indexes prepared in advance and a method thereof.
Along with recent progress in communications networks and information processing apparatuses, many texts are now stored as digital data. Consequently, text mining has drawn attention as a technology for obtaining useful information from these texts. Text mining faces the following practical problem: “N keywords belonging to any category are detected in descending order of frequency of appearance from among a set of texts which have been narrowed down under any search condition” (refer to Yu C, Philip G, Meng W Y. Distributed top-n query processing with possibly uncooperative local systems, Proc. of the 29th Int'l Conf. on Very Large Data Bases. Berlin: Morgan Kaufmann Publishers, 2003. 117-128, hereinafter referred to as Non-patent Document 1).
A solution to the above problem can be obtained by constructing an RDB (Relational Database) with identifications of texts and identifications of keywords as primary keys. This RDB is, for example, a database which records the keywords contained in a certain text in association with that text. However, in a case of using such an RDB, if the number of texts becomes huge, search time also becomes enormous. Therefore, a technology for solving the above problem in parallel by using a plurality of information processing apparatuses has heretofore been proposed (refer to Non-patent Document 1).
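As a rough illustration of the RDB approach, the following sketch uses Python's built-in sqlite3 module. The table name doc_keyword, the column names, and the function name are hypothetical, and the search condition is represented simply by an explicit list of text identifications assumed to satisfy it.

```python
import sqlite3

def top_n_keywords_rdb(pairs, matching_docs, n):
    """Return the n most frequent keywords among the matching texts.

    pairs         -- iterable of (doc_id, keyword_id) rows (hypothetical data)
    matching_docs -- list of doc_ids assumed to satisfy the search condition
    """
    conn = sqlite3.connect(":memory:")
    # Text and keyword identifications together form the primary key.
    conn.execute(
        "CREATE TABLE doc_keyword ("
        " doc_id INTEGER, keyword_id INTEGER,"
        " PRIMARY KEY (doc_id, keyword_id))"
    )
    conn.executemany("INSERT INTO doc_keyword VALUES (?, ?)", pairs)
    placeholders = ",".join("?" for _ in matching_docs)
    # Count, per keyword, the matching texts that contain it.
    rows = conn.execute(
        "SELECT keyword_id, COUNT(*) AS freq FROM doc_keyword"
        f" WHERE doc_id IN ({placeholders})"
        " GROUP BY keyword_id"
        " ORDER BY freq DESC, keyword_id ASC LIMIT ?",
        (*matching_docs, n),
    ).fetchall()
    conn.close()
    return rows
```

The query has to group every row belonging to the matching texts, which is one way to see why search time grows as the number of texts grows.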
However, the method of the above-described Non-patent Document 1 requires parallel/distributed computing systems, and costs a huge amount of money and time. That is, for example, a plurality of information processing apparatuses have to be installed, and these information processing apparatuses have to be connected by fast communications networks. Hence, it is desired to develop an effective search technology which makes it possible to perform a search by using a single information processing apparatus. For example, a search could conceivably be sped up by applying a conventional text search technology, that is, by representing identifications of texts and keywords as numbers and by preparing, in advance, data for indexes and hash structures based on those numbers. Specifically, the following two indexes can be considered.
(1) KEY_TO_DOC Index
This index is a reference from identifications of keywords, arranged in descending order of frequency of appearance, to a list of identifications of the texts containing each keyword.
(2) DOC_TO_KEY Index
This index is a reference from identifications of texts to a list of the keywords contained in each text.
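The two indexes can be pictured with a minimal sketch such as the following. It assumes that identifications of texts and keywords are plain integers and that the input is a mapping from each text to the keywords it contains; the function name build_indexes and the use of ordinary dictionaries are illustrative choices, not structures prescribed here.

```python
from collections import defaultdict

def build_indexes(doc_keywords):
    """Build KEY_TO_DOC and DOC_TO_KEY from {doc_id: [keyword_id, ...]}."""
    # DOC_TO_KEY: text identification -> sorted list of distinct keywords.
    doc_to_key = {doc: sorted(set(keys)) for doc, keys in doc_keywords.items()}

    # Collect, for every keyword, the texts that contain it.
    postings = defaultdict(list)
    for doc in sorted(doc_to_key):
        for key in doc_to_key[doc]:
            postings[key].append(doc)

    # KEY_TO_DOC: keywords arranged in descending order of frequency of
    # appearance (ties broken by keyword id); dicts keep insertion order.
    ordered = sorted(postings, key=lambda k: (-len(postings[k]), k))
    key_to_doc = {k: postings[k] for k in ordered}
    return key_to_doc, doc_to_key
```

For example, with texts 1 to 4 containing keywords [10, 11], [10], [10, 12], and [11], keyword 10 comes first in KEY_TO_DOC because it appears in three texts.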
In a process using the index (1) described above, for example, keywords are sequentially selected in descending order of the frequency of appearance, and for each keyword it is determined which of the texts in its list satisfy a text search condition. N keywords are then selected in descending order of the number of texts satisfying the text search condition, and the selected keywords become the search result. However, in a case where there are many kinds of keywords to be search targets, the search time increases with the number of kinds of keywords.
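The process using the index (1) can be sketched as follows. The sketch assumes that KEY_TO_DOC is a dictionary from keyword identification to a list of text identifications, and that the texts satisfying the search condition are given as a collection of identifications; both the function name and these representations are hypothetical.

```python
def top_n_by_key_to_doc(key_to_doc, matching_docs, n):
    """Scan every keyword and count its texts that satisfy the condition."""
    matching = set(matching_docs)
    counts = []
    # Every keyword is visited, so the cost grows with the number of
    # kinds of keywords, as noted in the text.
    for keyword, docs in key_to_doc.items():
        hits = sum(1 for doc in docs if doc in matching)
        if hits:
            counts.append((keyword, hits))
    # Descending by number of matching texts; ties broken by keyword id.
    counts.sort(key=lambda kc: (-kc[1], kc[0]))
    return counts[:n]
```

Note that the global frequency order of the index does not by itself give the answer: a keyword that is rare overall may still be the most frequent within the narrowed-down set of texts.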
In a process using the index (2) described above, for example, texts satisfying a text search condition are selected, and a list of keywords corresponding to the identifications of the texts is obtained. Then, for each keyword, the number of texts which contain the keyword is counted. However, in a case where there are many texts to be search targets, the search time increases with the number of texts. Although it is conceivable to speed up a search by sampling some of the texts, search accuracy is reduced in a case where a sufficient number of texts is not sampled.
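The process using the index (2) can likewise be sketched in a few lines. The sketch assumes that DOC_TO_KEY is a dictionary from text identification to a list of keyword identifications; the function name is hypothetical, and the sampling variant mentioned above is not shown.

```python
from collections import Counter

def top_n_by_doc_to_key(doc_to_key, matching_docs, n):
    """Visit every matching text and count the texts containing each keyword."""
    counts = Counter()
    # Every matching text is visited, so the cost grows with the number
    # of texts that satisfy the search condition.
    for doc in matching_docs:
        # set() ensures a keyword is counted at most once per text.
        counts.update(set(doc_to_key.get(doc, ())))
    # Descending by number of texts; ties broken by keyword id.
    ranked = sorted(counts.items(), key=lambda kc: (-kc[1], kc[0]))
    return ranked[:n]
```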