Conventionally, a search of text data for a character string compares the text data against the search character string to determine whether the text data contains a matching character string. When the text data is compressed, for example, its representation no longer corresponds to that of the search character string, and accordingly the comparison is performed only after the compressed data has been decompressed.
There are also cases in which the text data and the character string are encoded by an encoding scheme to improve the compression ratio. When the text data and the character string are encoded under the same encoding scheme, they can be compared directly without decoding (Japanese Laid-open Patent Publication Nos. 7-287716 and 11-143877).
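As a minimal sketch of such direct comparison, assume a hypothetical shared table that assigns every word a fixed two-byte code. The table contents, the code width, and the function names below are illustrative assumptions and are not taken from the cited publications.

```python
# Hypothetical shared code table: each word maps to a fixed two-byte code.
CODE_TABLE = {
    "the": b"\x00\x01",
    "quick": b"\x00\x02",
    "brown": b"\x00\x03",
    "fox": b"\x00\x04",
}
CODE_WIDTH = 2  # bytes per code (assumed fixed for this sketch)


def encode(words):
    """Encode a list of words with the shared table."""
    return b"".join(CODE_TABLE[w] for w in words)


def find_encoded(encoded_text, encoded_query):
    """Return the word index of the first code-aligned match, or -1 if none.

    Both arguments stay encoded; nothing is decoded during the search.
    """
    last_start = len(encoded_text) - len(encoded_query)
    # Step by CODE_WIDTH so a match never straddles two codes.
    for offset in range(0, last_start + 1, CODE_WIDTH):
        if encoded_text[offset:offset + len(encoded_query)] == encoded_query:
            return offset // CODE_WIDTH
    return -1
```

Because the text and the query pass through the same `encode` function, a byte-wise comparison at code boundaries is sufficient, and the search never reconstructs the original characters.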
However, in the above conventional techniques, although the compression ratio can be improved by assigning a specific character or word to a code of a different encoding scheme, comparison under that encoding scheme cannot be performed at high speed.
Generally, the character encoding schemes used to encode text data can have a redundant structure, and a character or word can be assigned a code different from the code defined for it in a predetermined character encoding scheme. For example, a million words can be assigned three-byte codes. To further improve the compression ratio, some words and characters that appear at high frequency can be converted into one-byte or two-byte codes instead of three-byte codes.
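The arithmetic behind this can be sketched as follows. The capacity figure is the raw number of bit patterns and ignores any patterns a real scheme would reserve to tell code lengths apart. The slot counts and function names are hypothetical parameters chosen for illustration.

```python
def code_capacity(num_bytes: int) -> int:
    """Raw number of distinct bit patterns available in num_bytes bytes."""
    return 256 ** num_bytes


def code_length_for_rank(rank: int, one_byte_slots: int, two_byte_slots: int) -> int:
    """Choose a code length in bytes from a frequency rank (0 = most frequent).

    The most frequent entries receive one-byte codes, the next group
    two-byte codes, and everything else three-byte codes.
    """
    if rank < one_byte_slots:
        return 1
    if rank < one_byte_slots + two_byte_slots:
        return 2
    return 3
```

Here `code_capacity(3)` evaluates to 16,777,216, which is well above one million, so three bytes leave ample room for a million-word vocabulary.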
When some words or characters appearing at high frequency are converted into one-byte or two-byte codes as described above, the resulting text data contains a mixture of one-byte, two-byte, and three-byte codes. Such text data cannot be compared, without additional processing, against a search character string that has been encoded into three-byte codes, and this hinders high-speed processing.
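The mismatch can be illustrated with a small sketch. The code values below are hypothetical and chosen only to show the effect: the word "the" receives a one-byte code in the text data, while the search string is encoded entirely in three-byte codes.

```python
# Hypothetical three-byte codes for every word.
THREE_BYTE = {
    "the": b"\x00\x00\x01",
    "cat": b"\x00\x00\x02",
    "sat": b"\x00\x00\x03",
}
# Hypothetical one-byte code for a high-frequency word.
SHORT = {"the": b"\x81"}


def encode_fixed(words):
    """Encode every word with its three-byte code."""
    return b"".join(THREE_BYTE[w] for w in words)


def encode_mixed(words):
    """Use the short code where one exists, otherwise the three-byte code."""
    return b"".join(SHORT.get(w, THREE_BYTE[w]) for w in words)


text = encode_mixed(["the", "cat", "sat"])   # mixed-length text data
query = encode_fixed(["the", "cat"])         # search string in three-byte codes

# The byte sequences differ even though the words are the same.
direct_match = query in text
# Re-encoding the query with the mixed table is the extra processing needed.
converted_match = encode_mixed(["the", "cat"]) in text
```

In this sketch `direct_match` is `False` and `converted_match` is `True`. The conversion step is shown only to make the mismatch visible and is not a method described in the source.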