A technique is known for searching a plurality of files for a file having a high degree of relevance to a search target character string. According to this technique, files including a word in the search target character string are identified by using an index. The number of times each such word appears (hereinafter, “the number of times of appearance”) is then calculated by searching the identified files, and the file having a high degree of relevance is identified based on the frequency of appearance. The index is data indicating, for each word, the one or more files containing that word. For example, in the index, one bit is kept in correspondence with each set made up of a word and a file, and the value of the bit stores whether or not the corresponding file includes the word.
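The bit-per-(word, file) index and the two-step search described above can be illustrated with a minimal Python sketch. The function names (`build_bitmap_index`, `rank_files`), the whitespace tokenization, and the scoring by summed counts are assumptions made for illustration only; they are not taken from the related technique.

```python
from collections import Counter

def build_bitmap_index(files):
    """Map each word to an int whose bit i is 1 when file i contains the word."""
    index = {}
    for file_no, text in enumerate(files):
        for word in set(text.split()):
            index[word] = index.get(word, 0) | (1 << file_no)
    return index

def rank_files(files, index, query):
    """Use the index to pick candidate files, then count appearances in them."""
    words = query.split()
    # Step 1: a file is a candidate if its bit is set for any query word.
    candidates = 0
    for word in words:
        candidates |= index.get(word, 0)
    # Step 2: search only the candidate files and count the query words.
    scores = {}
    for file_no, text in enumerate(files):
        if (candidates >> file_no) & 1:
            counts = Counter(text.split())
            scores[file_no] = sum(counts[word] for word in words)
    # Highest count first; ties broken by file number.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

files = ["apple banana apple", "banana cherry", "cherry cherry durian"]
index = build_bitmap_index(files)
ranking = rank_files(files, index, "apple banana")
print(ranking)  # [(0, 3), (1, 1)]
```

In this sketch the third file is never opened during the search, because its bit is 0 for both query words; the counting in the second step is the part that is repeated for every candidate file.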
In such an index, because one bit is kept in correspondence with each file for each of the words, the data size tends to be large when a large number of files are involved. For this reason, a technique is known by which the data size of the index is reduced by bringing multiple bits into correspondence with one bit with the use of mutually different mathematical functions.
For example, a bit array A, which the index keeps in correspondence with the files for a given word, is converted into a bit array X and a bit array Y by bringing multiple bits into correspondence with one bit using two mutually different hash functions. Because the bit array X and the bit array Y are obtained with different hash functions, a plurality of files that are kept in correspondence with the same bit in the bit array X are, for example, kept in correspondence with mutually different bits in the bit array Y. Accordingly, when the bits corresponding to a certain file in both the bit array X and the bit array Y indicate that a certain word is included, the certain file is identified as including the certain word.
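The conversion of the bit array A into the bit arrays X and Y can be sketched as follows. The concrete hash functions are an assumption: the text only requires that they differ, and this sketch uses `i % 3` and `i % 4` (coprime moduli) with colliding bits combined by OR. The names `fold_bits` and `restore_bits` are hypothetical.

```python
def fold_bits(bits, size):
    """Hash position i to i % size and OR together the bits that collide."""
    folded = [0] * size
    for i, bit in enumerate(bits):
        folded[i % size] |= bit
    return folded

def restore_bits(x, y, n):
    """Report file i as containing the word only if both hashed bits are 1."""
    return [x[i % len(x)] & y[i % len(y)] for i in range(n)]

# Files 2 and 5 share a bit in X (2 % 3 == 5 % 3) but not in Y (2 % 4 != 5 % 4).

# One file (file 7) contains the word: 12 bits shrink to 3 + 4 bits,
# and restoration is exact.
single = [0] * 12
single[7] = 1
restored_single = restore_bits(fold_bits(single, 3), fold_bits(single, 4), 12)
print(restored_single == single)  # True

# Two files (files 2 and 7) contain the word: extra files are reported.
multi = [0] * 12
multi[2] = 1
multi[7] = 1
restored_multi = restore_bits(fold_bits(multi, 3), fold_bits(multi, 4), 12)
print([i for i, bit in enumerate(restored_multi) if bit])  # [2, 7, 10, 11]
```

Under these assumptions a file that contains the word is never missed, but once several bits are set, files such as 10 and 11 can be reported even though their original bits were 0.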
Patent Document 1: International Publication Pamphlet No. WO 2013/175537
In the field of text mining, however, the related technique counts the number of times of appearance for each of the words and relevant synonyms and calculates scores based on the counts. Because the number of times of appearance is counted in this manner for each of the words in every file that the index identified as including the word, the processing may take time in some situations.
To cope with this situation, the index may instead be structured as a count-map type index that stores information about the number of times of appearance of each of the words in each of the files. For example, in parallel with a compressing process by which a code is assigned to each of the words in the character strings of text files, the number of times of appearance is stored into the index by using multiple bits kept in correspondence with each set made up of a word and a file. When multiple bits are kept in correspondence with each set made up of a word and a file in this manner, the data size of the index tends to be large. As in the related technique, the data size of the index can be reduced by using mutually different mathematical functions. However, like bit-map type indices, count-map type indices that store the number of times of appearance by using multiple bits are prone to bit conflicts in the hash functions, so a lot of noise may occur in some situations when the compressed data is restored.
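The noise that arises when a hashed count-map type index is restored can be illustrated with a small Python sketch. Several choices here are assumptions that the text does not specify: the hash functions are `i % 3` and `i % 4`, colliding counts are combined by keeping the maximum, and a count is restored as the minimum of its two hashed cells (a direct generalization of OR and AND on single bits). The names `fold_counts` and `restore_counts` are hypothetical, and counts are held as plain integers rather than fixed-width bit fields.

```python
def fold_counts(counts, size):
    """Hash position i to i % size; colliding cells keep the largest count."""
    folded = [0] * size
    for i, count in enumerate(counts):
        folded[i % size] = max(folded[i % size], count)
    return folded

def restore_counts(x, y, n):
    """Restore the count for file i as the smaller of its two hashed cells."""
    return [min(x[i % len(x)], y[i % len(y)]) for i in range(n)]

# Counts of one word across 12 files: 3 appearances in file 2, 5 in file 7.
word_counts = [0] * 12
word_counts[2] = 3
word_counts[7] = 5

count_x = fold_counts(word_counts, 3)  # [0, 5, 3]
count_y = fold_counts(word_counts, 4)  # [0, 0, 3, 5]
restored_counts = restore_counts(count_x, count_y, 12)
print(restored_counts)  # [0, 0, 3, 0, 0, 0, 0, 5, 0, 0, 3, 3]

# Files whose restored count differs from the true count are noise.
noise = [i for i in range(12) if restored_counts[i] != word_counts[i]]
print(noise)  # [10, 11]
```

In this example the true counts for files 2 and 7 survive, but files 10 and 11 are restored with a count of 3 although the word does not appear in them, which is the kind of noise caused by conflicts in the hash functions.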