A technique using a bitmap index is known as one method of speeding up document retrieval. Such an index represents, in a bitmap, whether each character representable by a predetermined character code is included in each of a plurality of documents. For example, for each such character, 1 is set at the position corresponding to the document number of each document including the character, and 0 is set at the position corresponding to the document number of each document not including the character. This yields a bit string for each character, indicating with 1's and 0's the presence or absence of the character in the individual documents.
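The bitmap index described above can be illustrated with a minimal sketch. The function name `build_bitmap_index` and the sample documents are assumptions introduced here for illustration only. As a simplification, the sketch creates bit strings only for characters that actually occur, rather than for every character representable by a predetermined character code.

```python
def build_bitmap_index(documents):
    """Map each character to a bit string over the document numbers.

    Position i of the bit string is '1' if documents[i] contains the
    character and '0' otherwise. (Hypothetical helper, for illustration.)
    """
    index = {}
    for doc_number, text in enumerate(documents):
        for ch in set(text):
            # Start from all zeros the first time a character is seen.
            bits = index.setdefault(ch, [0] * len(documents))
            bits[doc_number] = 1
    return {ch: "".join(str(b) for b in bits) for ch, bits in index.items()}


# Illustrative data: three short documents numbered 0, 1, and 2.
documents = ["abc", "bcd", "cab"]
index = build_bitmap_index(documents)
print(index["a"])  # bit string for 'a' across documents 0..2
```

Here the character "a" occurs in documents 0 and 2 but not in document 1, so its bit string is 101.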
There is a technique of performing Huffman compression on bit strings each indicating the presence and absence of a character. For example, a data compression apparatus has been proposed which calculates the frequency of occurrence of the character for every four bits (one digit) or every eight bits (one byte) and applies entropy coding for every four (or eight) bits.
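As a rough illustration of entropy coding applied in units of four bits, the following sketch counts how often each 4-bit pattern occurs in a bit string and derives a Huffman code from those counts. The function names, the zero padding of the final unit, and the tie-breaking order are assumptions made here; the sketch is not taken from the proposed apparatus.

```python
import heapq
from collections import Counter


def nibble_frequencies(bits):
    """Count occurrences of each 4-bit pattern (zero-padding the tail)."""
    padded = bits + "0" * (-len(bits) % 4)
    return Counter(padded[i:i + 4] for i in range(0, len(padded), 4))


def huffman_code(freqs):
    """Build a prefix code in which frequent symbols get shorter codewords."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    # Heap entries: (count, unique tie-breaker, {symbol: partial codeword}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c0, _, t0 = heapq.heappop(heap)
        c1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t0.items()}
        merged.update({s: "1" + code for s, code in t1.items()})
        heapq.heappush(heap, (c0 + c1, tie, merged))
        tie += 1
    return heap[0][2]


# Illustrative bit string dominated by the pattern 0000.
bits = "0000" * 6 + "0001" + "1000"
code = huffman_code(nibble_frequencies(bits))
encoded_length = sum(len(code[bits[i:i + 4]]) for i in range(0, len(bits), 4))
print(len(bits), encoded_length)
```

In this example the dominant pattern 0000 receives a 1-bit codeword and the two rare patterns receive 2-bit codewords, so the 32-bit input is encoded in 10 bits, not counting the cost of storing the code table.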
In addition, a bit string compression technique has been proposed which uses a special Huffman tree with leaves for sixteen types of symbol strings corresponding to all patterns represented with four bits and special symbol strings whose bit number is larger than 4.
Alternatively, bit strings may be compressed by employing run length encoding (RLE), which is also used to compress character strings. In RLE, for example, data containing sequences in each of which the same character occurs in consecutive data elements is represented by the character of each sequence and the number of its consecutive occurrences (run length). In the case where data contains only two types of values, 0 and 1, a bit string is represented by alternately arranging the run length of one type (e.g., 0) and the run length of the other type (e.g., 1).
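The alternating run-length representation of a two-valued bit string can be sketched as follows. The function names and the convention that the first run length always refers to 0 (so that a bit string starting with 1 yields a leading run length of zero) are assumptions chosen here for illustration.

```python
def rle_encode_bits(bits):
    """Encode a '0'/'1' string as alternating run lengths, starting with 0s."""
    runs = []
    current = "0"  # assumed convention: the first run length counts 0s
    count = 0
    for b in bits:
        if b not in "01":
            raise ValueError("bit string may contain only '0' and '1'")
        if b == current:
            count += 1
        else:
            runs.append(count)
            current = b
            count = 1
    runs.append(count)
    return runs


def rle_decode_bits(runs):
    """Rebuild the bit string from alternating run lengths."""
    out = []
    value = "0"
    for length in runs:
        out.append(value * length)
        value = "1" if value == "0" else "0"
    return "".join(out)


print(rle_encode_bits("0001100000"))  # [3, 2, 5]
print(rle_encode_bits("1100"))        # [0, 2, 2]
```

Because the two values strictly alternate, only the run lengths need to be stored; the value of each run follows from its position in the list.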
Japanese Laid-open Patent Publication No. 2005-260408
International Publication Pamphlet No. WO2008/146756
However, sufficient compression effects may not be achieved by compressing a bitmap including consecutive runs of the same value using a simple combination of RLE and entropy coding. The same value in such sequences is 0 in the case of, for example, a bitmap for kanji characters (Chinese characters used in the Japanese language). In RLE, for example, each individual run length may occur with low frequency, so that the probabilities of occurrence of the resulting codes deviate little from one another. On the other hand, entropy coding is a coding system offering higher compressibility when there are larger deviations in the probabilities of occurrence of codes to be compressed. Therefore, sufficient compression effects may not be achieved by applying entropy coding directly to codes obtained by RLE.
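The relationship between deviations in the probabilities of occurrence and the achievable compression can be illustrated by computing the Shannon entropy, which is a lower bound on the average number of bits per code for an entropy coder that codes each symbol independently. The function name and both lists of run lengths below are hypothetical values chosen for illustration, not measured data.

```python
from collections import Counter
from math import log2


def shannon_entropy(symbols):
    """Average information per symbol, in bits, of the empirical distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum((c / total) * log2(total / c) for c in counts.values())


# Hypothetical run lengths where every value occurs exactly once:
# the probabilities are uniform, so there is no deviation to exploit.
flat_runs = [3, 17, 42, 8, 95, 11, 60, 27]

# Hypothetical run lengths where one value dominates:
# the probabilities deviate strongly from one another.
skewed_runs = [1, 1, 1, 1, 1, 1, 2, 3]

print(shannon_entropy(flat_runs))    # 3.0 bits per run length
print(shannon_entropy(skewed_runs))  # noticeably fewer bits per run length
```

With eight equally likely run lengths, no entropy code can average fewer than 3 bits per run length, which is the same as a fixed-length code. The skewed list has lower entropy and therefore leaves room for shorter codes.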