The present invention relates to a database used in text mining or the like. In particular, the present invention relates to a system, method and program for creating an index for a database.
In a typical application example of text mining, it is necessary to provide a search condition to a text mining system in an interactive manner, and then find a keyword having a high correlation with the search condition.
For example, consider a case where call log records at a PC call center are the object of analysis, and a problem that frequently appears in a particular product is to be found. In this case, a search is performed by using the product number as the search condition. Then, by counting the number of appearances of each keyword in the documents found by the search, keywords that are frequently mentioned together with the product are found.
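The counting step in this example can be illustrated with a minimal sketch. The record layout, the product numbers and the function name `keyword_counts` are hypothetical and chosen only for illustration; they are not taken from any cited document.

```python
from collections import Counter

# Hypothetical call-log records: a product number plus the keywords
# extracted from the text of each call log.
call_logs = [
    {"product": "PC-100", "keywords": {"heat generation", "battery", "fan"}},
    {"product": "PC-100", "keywords": {"heat generation", "noise"}},
    {"product": "PC-200", "keywords": {"battery", "screen"}},
    {"product": "PC-100", "keywords": {"battery", "heat generation"}},
]

def keyword_counts(records, product):
    """For each keyword, count the matching documents in which it appears."""
    counts = Counter()
    for record in records:
        # The product number acts as the search condition.
        if record["product"] == product:
            counts.update(record["keywords"])
    return counts

counts = keyword_counts(call_logs, "PC-100")
# Keywords mentioned most often together with the product come first.
print(counts.most_common(2))
```

Because the search condition is supplied interactively, a naive scan such as this one must be repeated over the whole document set for every query, which is why an index structure is desirable.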
Moreover, in the text mining system, a category can be provided for a keyword in advance. For example, a category titled “problem expression” is provided for a keyword “heat generation.” Problems can then be efficiently found by counting only the keywords that belong to this category.
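A category-restricted count of this kind might look as follows. This is a sketch under assumed data: the keyword-to-category dictionary, the category "component" and the function name `category_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical dictionary, prepared in advance, mapping keywords to categories.
keyword_category = {
    "heat generation": "problem expression",
    "noise": "problem expression",
    "battery": "component",
    "fan": "component",
}

# Hypothetical keyword sets of the documents returned by a search.
found_documents = [
    {"heat generation", "battery", "fan"},
    {"heat generation", "noise"},
    {"battery", "heat generation"},
]

def category_counts(documents, category):
    """Count only the keywords that belong to the given category."""
    counts = Counter()
    for keywords in documents:
        counts.update(k for k in keywords if keyword_category.get(k) == category)
    return counts

problem_counts = category_counts(found_documents, "problem expression")
print(problem_counts)
```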
As described above, in the application example of text mining, a search condition is provided to the text mining system in an interactive manner, and the result of the search is then verified. In such text mining, it is necessary to count the numbers of keywords in a dynamically provided document set. A relational database may be utilized as an index structure for calculating the number of keywords at high speed. However, a relational database does not provide sufficiently high performance for the correlation analysis between the search condition and the frequencies of keywords.
In this respect, Japanese Patent Application No. 2005-349717 by the present applicant describes an index structure and an algorithm for executing mining at high speed for such a purpose. However, the index structure proposed in that application is difficult to build for large scale data. The primary reason for the difficulty is that the size of the data becomes too large to be retained in the main memory. To be more precise, when the relationships between the keywords and the documents included in the text mining database are mapped in a matrix structure, the size of the data becomes large. As a result, all the necessary information can no longer be retained in the main memory as the number of documents included in the database increases.
More specifically, in order to build an index at high speed, a map indicating the correspondence between the keyword character strings and their numeric IDs needs to be retained in the main memory. Moreover, in order to look up, by a keyword, the posting list (that is, an array of document IDs) of the documents corresponding to the keyword from data in a certain structure, the data must also be arranged in some order with respect to the keywords (for example, in order of frequency of keyword appearance). In this case as well, however, unless a hash structure holding the keyword set is retained in the main memory, it is difficult to merge the partial indices obtained by dividing the entire index on a document-by-document basis. It is therefore essential that the main memory be large enough to hold all the keywords required for creating an index. Since the size of the main memory can be increased only up to a certain limit, it determines the upper limit on the number of documents in the document set for which an index structure can be created.
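The memory constraint described above can be made concrete with a minimal sketch. It is an illustration only and is not the index structure of Japanese Patent Application No. 2005-349717; the function names `build_partial_index` and `merge_indices` and the sample data are hypothetical. The point to note is that the shared map `keyword_ids` must hold every keyword in the main memory for the partial indices to be merged by ID.

```python
def build_partial_index(documents, keyword_ids):
    """Build posting lists for one chunk of documents.

    documents: list of (doc_id, keywords) pairs in ascending doc_id order.
    keyword_ids: shared in-memory map from keyword string to numeric ID;
    a keyword seen for the first time is assigned the next free ID.
    """
    postings = {}
    for doc_id, keywords in documents:
        for keyword in sorted(keywords):
            kid = keyword_ids.setdefault(keyword, len(keyword_ids))
            postings.setdefault(kid, []).append(doc_id)
    return postings

def merge_indices(partials):
    """Concatenate posting lists that share the same keyword ID."""
    merged = {}
    for partial in partials:
        for kid, doc_ids in partial.items():
            merged.setdefault(kid, []).extend(doc_ids)
    return merged

keyword_ids = {}  # grows with the number of distinct keywords
chunk_a = [(0, {"heat generation", "battery"}), (1, {"battery"})]
chunk_b = [(2, {"heat generation"}), (3, {"fan", "battery"})]
partials = [build_partial_index(chunk, keyword_ids) for chunk in (chunk_a, chunk_b)]
index = merge_indices(partials)

# Posting list (array of document IDs) for one keyword.
print(index[keyword_ids["battery"]])
```

Merging works here only because both chunks resolve keywords through the same in-memory map; without it, the same keyword could receive different IDs in different partial indices.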
Incidentally, U.S. Pat. No. 6,553,385 and the UIMA Java Framework available via SourceForge describe a framework for extracting information by applying a technique such as natural language processing to each of the documents of a document set, and then storing the information in a predetermined data structure. This disclosed technique, however, does not suggest how to efficiently store large scale data while sequentially processing the information obtained from each document.
Japanese Patent Application Laid-open Publication No. Hei 9-212528 discloses a technique including a step of dividing a database into a plurality of data segments. In this technique, the data segments respectively correspond to ranges of mutually different values in a selected field in the database. In addition, this technique includes the steps of storing each of the data segments in various storage devices, storing a segment index for identifying each of the corresponding data segments, and storing a range index having entries corresponding to a plurality of ranges in the selected field. In this technique, each of the entries in the range index identifies the segment index corresponding to the range among the plurality of data segments.
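A two-level lookup of this kind, in which a range index selects a segment index, can be sketched as follows. The range boundaries, segment names, record contents and the function name `lookup` are hypothetical; the sketch only illustrates the general idea and does not reproduce the disclosed implementation.

```python
import bisect

# Hypothetical range index: the lower bound of each range of the selected
# field, with the name of the segment that covers that range.
range_starts = [0, 1000, 2000]
segment_names = ["segment_0", "segment_1", "segment_2"]

# Hypothetical segment indices: one per data segment, mapping field values
# to the records stored in that segment.
segment_indices = {
    "segment_0": {17: "record A", 512: "record B"},
    "segment_1": {1500: "record C"},
    "segment_2": {2001: "record D"},
}

def lookup(value):
    """Find a record by first consulting the range index, then the segment index."""
    pos = bisect.bisect_right(range_starts, value) - 1
    if pos < 0:
        return None  # value lies below every range
    segment = segment_names[pos]
    return segment_indices[segment].get(value)

print(lookup(1500))
```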
In the technique disclosed in Japanese Patent Application Laid-open Publication No. 2003-271648, search target documents are first divided into a plurality of groups. Then, for each of the groups, each keyword appearing in the search target documents included in the group and the number of the search target documents in which the keyword appears are stored in association with one another.
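The per-group association between keywords and document counts can be sketched as follows. The fixed group size, the sample documents and the function name `group_document_frequencies` are assumptions made for illustration; the cited publication may group documents differently.

```python
from collections import Counter

def group_document_frequencies(documents, group_size):
    """Split documents into groups and, for each group, store for every
    keyword the number of documents in that group containing it."""
    groups = []
    for start in range(0, len(documents), group_size):
        freq = Counter()
        for keywords in documents[start:start + group_size]:
            # set() ensures a keyword is counted once per document.
            freq.update(set(keywords))
        groups.append(freq)
    return groups

documents = [
    ["heat generation", "battery", "battery"],
    ["battery"],
    ["fan", "heat generation"],
    ["fan"],
]
groups = group_document_frequencies(documents, 2)
print(groups[0]["battery"], groups[1]["fan"])
```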
As described above, Japanese Patent Application Laid-open Publications Nos. Hei 9-212528 and 2003-271648 suggest techniques for achieving a faster search by dividing a database into a plurality of segments and thereby balancing the data processing loads in order to support a large scale search. The methods suggested in these documents, however, relate only to database search, and therefore cannot be applied to the creation of an index for a large scale text mining database.