In order to speed up searches of text data, a bitmap type index is known that records, for each character included in the text data and for each file, whether the character is present in the file (for example, see Patent Documents 1 to 3).
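As a minimal sketch of the kind of index described above, the following Python builds one line of bits per character and one bit per file, then narrows a search by ANDing the lines. The function names `build_bitmap_index` and `search` and the sample files are assumptions made for illustration and do not come from the cited documents.

```python
def build_bitmap_index(files):
    """Map each character to a list of 0/1 bits, one bit per file."""
    chars = sorted({ch for text in files for ch in text})
    return {ch: [1 if ch in text else 0 for text in files] for ch in chars}


def search(index, query, n_files):
    """Return the numbers of the files that contain every character of query."""
    hits = [1] * n_files
    for ch in query:
        # A character that never appears has an all-zero line.
        row = index.get(ch, [0] * n_files)
        hits = [h & b for h, b in zip(hits, row)]
    return [i for i, h in enumerate(hits) if h]


# Hypothetical sample data: three tiny "files".
files = ["cat", "act", "dog"]
index = build_bitmap_index(files)
candidates = search(index, "ca", len(files))
```

Because only presence or absence is recorded, the result is a list of candidate files (here both "cat" and "act" qualify for the query "ca"); the candidates would still have to be checked against the actual text.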
If a bitmap type index is generated for Japanese text data, the size of the index becomes large because a great many kinds of characters and words are used in the text data. In addition, because the density of such an index is low, the size of the index is reduced by using hash functions.
A related technology discloses a method of applying hash functions with a plurality of bases and selecting, based on the resulting hashed bitmaps, the hash function with the fewest collisions (conflicts). Furthermore, a related technology discloses a method of checking, by using the selected hash function and the hashed bitmap based on that hash function, whether the content at the lot number of the hashed bitmap assigned by the evaluation value of the selected hash function has already been set (for example, see Patent Document 1).
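The selection among a plurality of bases described above can be sketched as follows. This is an illustration under stated assumptions, not the method of Patent Document 1: the hash function is assumed to be the line number modulo the base, and `count_collisions`, `pick_base`, and the sample values are hypothetical.

```python
def count_collisions(row_ids, base):
    """Count the lines whose position (row_id % base) has already been set."""
    seen = set()
    collisions = 0
    for r in row_ids:
        slot = r % base
        if slot in seen:
            # The position is already occupied by an earlier line.
            collisions += 1
        else:
            seen.add(slot)
    return collisions


def pick_base(row_ids, bases):
    """Return the base that produces the fewest collisions."""
    return min(bases, key=lambda b: count_collisions(row_ids, b))


# Hypothetical line numbers that contain at least one "1" bit.
row_ids = [3, 10, 17, 24, 30]
bases = [7, 11, 13]
best = pick_base(row_ids, bases)
```

With these sample values, base 7 maps four of the five lines to the same position, whereas base 11 separates all of them, so base 11 is chosen.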
Patent Document 1: Japanese Patent No. 2753228
Patent Document 2: Japanese Patent No. 3263963
Patent Document 3: Japanese Laid-open Patent Publication No. 2012-216088
However, a bitmap type index of text data contains lines in which the bit is “1” for most of the files, and there is a problem in that a collision may occur when a hash function is applied to such an index. For example, for a high frequency word such as “the” or “on” in English, the corresponding line of the bitmap type index contains the bit “1” for most of the files, so a collision is likely to occur if a hash function is applied.
In the related technology, no action is taken when it cannot be predicted whether a collision will occur in an output value of the hash function. In contrast, when a collision can be predicted, an action is taken, such as excluding the item in question from the targets of the hash function so that no collision occurs; the handling is therefore inconsistent. Even in the example of the related technology, selecting the hash function with the fewest collisions merely reduces the number of collisions and does not deal with a collision once it has occurred. When a collision actually occurs, the bitmap cannot be correctly restored, which reduces the accuracy of the index and lowers the search speed.
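The effect of such a collision can be illustrated with a small sketch. It assumes that the index is reduced by ORing each line into the position given by the line number modulo the base; the function `fold` and the sample lines are hypothetical and are not taken from the related technology.

```python
def fold(index_rows, base):
    """OR every line into position (row_id % base) of a smaller hashed bitmap."""
    n_files = len(next(iter(index_rows.values())))
    hashed = [[0] * n_files for _ in range(base)]
    for row_id, bits in index_rows.items():
        slot = row_id % base
        hashed[slot] = [a | b for a, b in zip(hashed[slot], bits)]
    return hashed


# Hypothetical lines of a bitmap type index for four files.
rows = {
    1: [1, 1, 1, 1],  # high frequency word, present in every file
    6: [0, 0, 1, 0],  # rare word, present in one file only
}

hashed = fold(rows, 5)          # 1 % 5 == 6 % 5, so the two lines collide
restored_rare = hashed[6 % 5]   # what a lookup of the rare word now returns
```

After folding with base 5, the line read back for the rare word is all “1”, so every file appears to contain it and the original line cannot be recovered. With a base that separates the two lines (for example 7 in this sample), the rare word's line is preserved.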