A technique is known for searching for a file having a high degree of relevance to a character string serving as a search target (hereinafter, a “search-target character string”). According to this technique, one or more files each containing the words in the search-target character string are identified by using an index, and a degree of relevance to the search-target character string is calculated for each of the identified files. The index is an information bit string indicating the files that contain the words. Further, a list of candidates is displayed in a ranking format, listing the files in descending order of their degrees of relevance to the search-target character string.
Patent Document 1: Japanese Laid-open Patent Publication No. 2004-110271
Patent Document 2: Japanese Laid-open Patent Publication No. 08-147311
Patent Document 3: Japanese Laid-open Patent Publication No. 09-214352
Patent Document 4: Japanese Laid-open Patent Publication No. 05-241776
Examples of indexes that can be used in the search include an index in an N-gram format. An index in the N-gram format records, for each N-gram sequence (a sequence of N consecutive characters), whether the N-gram sequence is contained in a file. A 1-gram (N = 1) may also be referred to as a uni-gram, a 2-gram (N = 2) as a bi-gram, and a 3-gram (N = 3) as a tri-gram.
For example, when an index in a 1-gram format is prepared for Japanese text, the data size of the index can be kept small, but large search noise may occur. Let us assume that the index in the 1-gram format records information indicating whether each of the 8,000 characters that are used with high frequency is contained in a file or not. Because the index records only this per-character information, its data size can be kept small. However, because the index records only whether each individual character is contained in the file, and nothing about the order in which the characters appear, large search noise may occur. For example, when an index in the 1-gram format is generated for a file containing “kyou-to-no-tou-bu (lit. Eastern Part of Kyoto)” (in this example, “-” separates the individual Japanese characters), the index stores information indicating that the characters “kyou”, “to”, “no”, “tou”, and “bu” are contained. When this index is used to search for, for example, the word “tou-kyou (lit. Tokyo)”, the search will erroneously find that the word “tou-kyou” is contained, because the index records that the character “tou” and the character “kyou” are both contained.
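The false positive described above can be reproduced with a minimal Python sketch. It assumes, purely for illustration, that each Japanese character is represented by its romanized reading as one list element and that the index is a plain set of characters rather than a bit string; the function names are hypothetical and are not taken from the technique itself.

```python
def build_unigram_index(chars):
    """Record only which characters occur in the file (no order, no positions)."""
    return set(chars)


def unigram_lookup(index, query_chars):
    """Report a hit when every character of the query is recorded in the index."""
    return all(c in index for c in query_chars)


def really_contains(chars, query_chars):
    """Ground truth: does the query occur as a contiguous run in the file?"""
    n = len(query_chars)
    return any(chars[i:i + n] == query_chars for i in range(len(chars) - n + 1))


# Romanized stand-ins for the five characters of the example file.
file_chars = ["kyou", "to", "no", "tou", "bu"]
index = build_unigram_index(file_chars)

query = ["tou", "kyou"]  # the word "tou-kyou"
index_hit = unigram_lookup(index, query)       # both characters are recorded
true_hit = really_contains(file_chars, query)  # but never adjacent in this order
print("index says:", index_hit, "| actually present:", true_hit)
```

The gap between `index_hit` and `true_hit` is the search noise: the index answers "yes" for a word the file does not contain.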
Incidentally, for indexes in the N-gram format, the larger the value of N is, the smaller the search noise will be. However, the larger the value of N is, the more significantly the data size of the index increases. For example, an index in a 2-gram format records information indicating, for each of the 2-gram sequences obtained by combining the 8,000 characters used with high frequency, whether the 2-gram sequence is contained in a file or not. For example, when an index in the 2-gram format is generated for a file containing “kyou-to-no-tou-bu”, the index records information indicating that the two-character sequences “kyou-to”, “to-no”, “no-tou”, and “tou-bu” are contained. When this index is used to search for, for example, the word “tou-kyou”, the search will not find that the file contains the word, because the index does not record that the two-character sequence “tou-kyou” is contained. However, the index in the 2-gram format records this information for each of the 8,000 times 8,000 (that is, 64 million) possible 2-gram sequences. Thus, compared to the example in the 1-gram format, the data size of the index is increased significantly. As explained herein, with indexes in the N-gram format, there is a trade-off relationship between reduction of the search noise and reduction of the data size.
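The trade-off can be illustrated with a short Python sketch. As an assumption for illustration only, characters are again represented by romanized readings, the index is modeled as a set of N-gram tuples, and the index is taken to hold one presence entry per possible N-gram per file; all names are hypothetical.

```python
def ngrams(chars, n):
    """All runs of n consecutive characters, in order of appearance."""
    return [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_ngram_index(chars, n):
    """Record which N-gram sequences occur in the file."""
    return set(ngrams(chars, n))


def ngram_lookup(index, query_chars, n):
    """Hit when every N-gram of the query is recorded (query needs >= n characters)."""
    return all(g in index for g in ngrams(query_chars, n))


CHARSET_SIZE = 8000  # number of high-frequency characters assumed in the text


def entries_per_file(n):
    """Number of distinct N-gram sequences the index must be able to flag."""
    return CHARSET_SIZE ** n


file_chars = ["kyou", "to", "no", "tou", "bu"]
bigram_index = build_ngram_index(file_chars, 2)

tokyo_hit = ngram_lookup(bigram_index, ["tou", "kyou"], 2)  # no false positive
kyoto_hit = ngram_lookup(bigram_index, ["kyou", "to"], 2)   # genuine match

size_1gram = entries_per_file(1)
size_2gram = entries_per_file(2)
print(tokyo_hit, kyoto_hit, size_1gram, size_2gram)
```

Moving from N = 1 to N = 2 removes the false hit for “tou-kyou”, but the number of entries to track grows from 8,000 to 64,000,000 under the stated assumption.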
The same applies to English text. For example, in the phrase “This is a ball”, the 1-gram “a” is contained in both the word “a” and the word “ball”, and the 2-gram “is” is contained in both the word “This” and the word “is”. Accordingly, because large search noise may occur, an index of more than one gram is desirable. Thus, there is a trade-off relationship similar to the example with Japanese text.
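The English example can be checked with a few lines of Python. This is only an illustrative sketch: splitting the phrase on whitespace and the helper name `words_containing` are assumptions, not part of the described technique.

```python
def words_containing(text, gram):
    """Return the words of the text in which the given N-gram appears."""
    return [word for word in text.split() if gram in word]


phrase = "This is a ball"
words_with_a = words_containing(phrase, "a")    # words matched by the 1-gram "a"
words_with_is = words_containing(phrase, "is")  # words matched by the 2-gram "is"
print(words_with_a, words_with_is)
```

Each N-gram matches more than one word, so an index entry for that N-gram alone cannot tell which word the file actually contains.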
To cope with this situation, another method is also possible that focuses on words with a high frequency of appearance. In this method, the index records, for each of the high-frequency words, whether the word itself is contained in a file or not and, for each of the words other than the high-frequency words, whether each of the N-gram sequences constituting the word is contained in the file or not. However, search noise may still occur for the words other than the high-frequency words, because for those words the index records only whether each constituent N-gram sequence is present, rather than whether the word itself is contained in the file.
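A minimal Python sketch of such a hybrid index follows. It is an illustration under stated assumptions, not the method of any cited document: the high-frequency word list `HIGH_FREQ_WORDS` is made up, N is fixed at 2, words are lowercase English, and the index is modeled as two sets.

```python
# Hypothetical list of high-frequency words (an assumption for this sketch).
HIGH_FREQ_WORDS = {"this", "is", "a", "the"}


def bigrams(word):
    """The set of 2-gram sequences that make up a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def build_hybrid_index(words):
    """Whole-word entries for high-frequency words, 2-gram entries for the rest."""
    word_entries, gram_entries = set(), set()
    for word in words:
        if word in HIGH_FREQ_WORDS:
            word_entries.add(word)
        else:
            gram_entries |= bigrams(word)
    return word_entries, gram_entries


def hybrid_lookup(index, word):
    """Exact check for high-frequency words, 2-gram check for all other words."""
    word_entries, gram_entries = index
    if word in HIGH_FREQ_WORDS:
        return word in word_entries
    return bigrams(word) <= gram_entries


index = build_hybrid_index(["this", "stable"])

hit_this = hybrid_lookup(index, "this")    # high-frequency word, present
hit_is = hybrid_lookup(index, "is")        # high-frequency word, absent: no noise
hit_table = hybrid_lookup(index, "table")  # other word, absent: noise from "stable"
print(hit_this, hit_is, hit_table)
```

The lookup for “is” is exact because the word is recorded as a whole, whereas “table” is reported as present only because all of its 2-grams happen to occur inside “stable”.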