When a single file is compressed, encoding is performed using a static dictionary, which covers high-frequency words, and a dynamic dictionary, which is generated for words that are not included in the static dictionary and that appear a plurality of times in the file (for example, see Japanese Laid-open Patent Publication No. 09-214352). The static dictionary mentioned here is a dictionary that associates codes with words that appear at high frequency in a population such as a file group or a body of data, and the dynamic dictionary is a dictionary that associates codes with words appearing a plurality of times in the data to be compressed.
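As an illustration only, the two-dictionary encoding described above can be sketched as follows. The function name `compress`, the code values, the base value `0x8000` for dynamic codes, and the choice to emit once-only words as literals are all assumptions made for this sketch and are not taken from the cited publication.

```python
from collections import Counter


def compress(words, static_dict, dynamic_base=0x8000):
    """Encode a word list with a static dictionary plus a per-file dynamic dictionary.

    Assumption for this sketch: words absent from the static dictionary get a
    dynamic code only if they appear more than once in the file; words that
    appear once are passed through as literals.
    """
    # Count only the words that the static dictionary does not cover.
    counts = Counter(w for w in words if w not in static_dict)
    dynamic_dict = {}
    encoded = []
    for w in words:
        if w in static_dict:
            encoded.append(static_dict[w])
        elif counts[w] > 1:
            # Assign the next free dynamic code on first appearance.
            code = dynamic_dict.setdefault(w, dynamic_base + len(dynamic_dict))
            encoded.append(code)
        else:
            encoded.append(w)  # literal, left unencoded in this sketch
    return encoded, dynamic_dict


static_dict = {"the": 0x01, "of": 0x02}  # hypothetical static codes
encoded, dynamic_dict = compress(
    ["the", "zebra", "of", "zebra", "quagga"], static_dict
)
```

In this sketch the dynamic dictionary is built separately for each file, so it has to be stored with the compressed data to allow decoding.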
There is a known technology that, when compressing a plurality of files, generates index information indicating which of the files include predetermined character information (for example, see International Publication Pamphlet No. WO 2013/038527). The index information serves as an index indicating whether or not each of the plurality of files includes the character information to be retrieved. The character information here means a character string formed, for example, by concatenating one-gram character codes.
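A minimal sketch of such index information is given below, assuming a bitmap representation in which bit i is set when file i contains a given character string. The bitmap layout, the two-character string length, and the names `build_char_index` and `candidate_files` are assumptions for illustration; the cited publication may organize the index differently.

```python
def build_char_index(files, n=2):
    """Map each n-character string to a bitmap of the files that contain it."""
    index = {}
    for file_no, text in enumerate(files):
        # Collect the distinct n-character strings of this file.
        grams = {text[i:i + n] for i in range(len(text) - n + 1)}
        for gram in grams:
            index[gram] = index.get(gram, 0) | (1 << file_no)
    return index


def candidate_files(index, query, n_files, n=2):
    """Return the numbers of files that contain every n-character string of the query.

    The result is a candidate list: a file may contain all the strings
    without containing the query as one contiguous run.
    """
    mask = (1 << n_files) - 1
    for i in range(len(query) - n + 1):
        mask &= index.get(query[i:i + n], 0)
    return [f for f in range(n_files) if (mask >> f) & 1]


files = ["abc", "bcd", "xyz"]
index = build_char_index(files)
hits = candidate_files(index, "bcd", len(files))
```

An index of this kind is keyed by character strings rather than by words, which is the limitation discussed later in this section.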
On the other hand, there is a known technology that generates pointer table-type index information associated with words (for example, see NISHIDA KEISUKE: “Google wo sasaeru gijutsu”, Apr. 25, 2008, KABUSHIKI KAISHA GIJUTSU HYOURONSHA). This technology will be explained with reference to FIG. 1. FIG. 1 is a diagram illustrating a reference example of a pointer table-type index generating process. As illustrated in FIG. 1, this technology extracts words from each document file, generates index information that associates the corresponding document ID with word IDs and their appearance positions, collects the pieces of index information, and sorts the collected pieces on the basis of the word IDs. Thus, an inverted (transposed) index, namely, pointer table-type index information, is generated, in which each word ID is associated with the document IDs and the appearance positions of the word.
Patent Literature 2: Japanese Laid-open Patent Publication No. 2008-278258
Non-Patent Literature 2: SEKIGUCHI KOJI: “Apache Lucene nyumon”, Jun. 25, 2006, KABUSHIKI KAISHA GIJUTSU HYOURONSHA
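The extract, collect, sort, and transpose steps of the pointer table-type index generating process described with reference to FIG. 1 can be sketched as follows. This is a simplified illustration: whitespace tokenization, word IDs assigned in order of first appearance, and the name `build_inverted_index` are assumptions of the sketch and are not taken from the cited reference.

```python
from itertools import groupby


def build_inverted_index(docs):
    """Build word ID -> [(document ID, appearance position), ...] from documents."""
    word_ids = {}
    records = []
    # Extract words and generate (word ID, document ID, position) records.
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            wid = word_ids.setdefault(word, len(word_ids))
            records.append((wid, doc_id, pos))
    # Sort the collected records on the basis of the word IDs.
    records.sort()
    # Transpose: group the sorted records under each word ID.
    index = {
        wid: [(d, p) for _, d, p in group]
        for wid, group in groupby(records, key=lambda r: r[0])
    }
    return word_ids, index


docs = {1: "a b a", 2: "b c"}
word_ids, index = build_inverted_index(docs)
```

Note that the sort runs over the records of all documents at once, so adding a document means collecting and sorting again in this sketch.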
However, the conventional technologies have a problem in that, when a word is to be registered in the dynamic dictionary, index information indicating which of the plurality of files include the word cannot be generated easily. From another viewpoint, there is a problem in that, when such a word exists, the generation of index information indicating which of the plurality of files include the word cannot easily be distributed across a plurality of small-scale systems.
For example, when a plurality of files are compressed, index information can be generated for words included in the static dictionary. In contrast, for a word registered in the dynamic dictionary, the code assigned to the word can differ from file to file, so index information covering all of the plurality of files cannot be generated easily.
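The mismatch of dynamic codes between files can be illustrated with a small sketch. It assumes, purely for illustration, that each file assigns dynamic codes in order of first appearance starting from a hypothetical base value `0xA000`; the function name `assign_dynamic_codes` and the sample words are likewise made up.

```python
def assign_dynamic_codes(words, static_dict, base=0xA000):
    """Assign per-file dynamic codes in order of first appearance (assumed scheme)."""
    dynamic = {}
    for w in words:
        if w not in static_dict and w not in dynamic:
            dynamic[w] = base + len(dynamic)
    return dynamic


static_dict = {"the": 0x01, "of": 0x02}  # hypothetical static codes
file_a = ["the", "blockchain", "of", "metaverse"]
file_b = ["metaverse", "of", "the", "blockchain"]

codes_a = assign_dynamic_codes(file_a, static_dict)
codes_b = assign_dynamic_codes(file_b, static_dict)

# The same word receives different codes in the two files, and the same
# code stands for different words, so a code cannot serve as a shared index key.
same_code = codes_a["metaverse"] == codes_b["metaverse"]
```

Under this assumed scheme, an index keyed by code is valid only within the file whose dynamic dictionary produced the code.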
The index information generated by the conventional technology is index information on character information and is basically not index information on words. Moreover, the static dictionary does not include words such as new words or vogue words. Therefore, the conventional technology that generates the index information cannot easily generate index information indicating which of the plurality of files include such a word.
On the other hand, because the words included in one document file differ from the words included in another document file, the conventional technology that generates the pointer table-type index information associated with words cannot easily generate a pointer table-type index based on the word IDs of words included in a plurality of document files. Moreover, because an updated or added document file can include a new word, a vogue word, or the like, the collection process, the sort process, and the transposition (inversion) process of the index information have to be repeated. This conventional technology therefore needs a very large amount of resources for the collection and transposition processes of the index information, so that it is difficult to distribute the generation of the index information across small-scale resources.