A conventional technique for compressing text data detects a string that is repeated in the text data and encodes the detected string. When a string that includes a plurality of words is encoded word by word, the string is replaced with a plurality of codes. For example, the conventional technology encodes the string “Kanagawa Prefecture Kawasaki City Nakahara Ward” word by word, assigning a separate code to each of “Kanagawa”, “Prefecture”, “Kawasaki”, “City”, “Nakahara”, and “Ward”.
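The word-by-word encoding described above can be illustrated with a minimal sketch. The function name `encode_word_by_word`, the whitespace-based word splitting, and the use of sequential integers as codes are assumptions made for illustration only; the conventional technology does not prescribe a particular code assignment.

```python
def encode_word_by_word(text, dictionary=None):
    """Replace each whitespace-separated word with an integer code.

    Hypothetical helper: codes are assigned in order of first appearance.
    """
    dictionary = {} if dictionary is None else dictionary
    codes = []
    for word in text.split():
        if word not in dictionary:
            # Assign the next unused integer as the code for a new word.
            dictionary[word] = len(dictionary)
        codes.append(dictionary[word])
    return codes, dictionary


codes, dictionary = encode_word_by_word(
    "Kanagawa Prefecture Kawasaki City Nakahara Ward"
)
print(codes)  # one code per word, six codes in total
```

Under these assumptions, the six-word string is represented by six codes, even though the string as a whole may recur many times in the text data.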
However, the conventional technology mentioned above does not always achieve sufficient compression efficiency.
For example, a technology such as ZIP uses a sliding window to search the strings included in text data for a longest match string and encodes the longest match string as a whole, processing the data byte by byte. In this case, even when a longer string later turns out to be available as the longest match string, the longest match string already held in the sliding window cannot be changed. Thus, even when a higher compression efficiency could be expected, the opportunity is not fully exploited in some cases. For example, when a short part of a string has been held in the sliding window as the longest match string at a previous stage, that match prevents a longer string from being held as the longest match string in the sliding window.
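The sliding-window search described above can be sketched as a simplified greedy LZ77-style encoder. This is not the actual ZIP (DEFLATE) implementation; the names `longest_match`, `greedy_lz77`, and `decode`, the token layout, and the parameters `window_size` and `min_match` are assumptions chosen for illustration. The sketch shows only that, once a match is emitted at a given position, the encoder moves on and does not revisit that choice.

```python
def longest_match(data, pos, window_size=32):
    """Return (offset, length) of the longest match for data[pos:] in the window."""
    start = max(0, pos - window_size)
    best_offset, best_length = 0, 0
    for candidate in range(start, pos):
        length = 0
        # Extend the match byte by byte; it may overlap the current position.
        while (pos + length < len(data)
               and data[candidate + length] == data[pos + length]):
            length += 1
        if length > best_length:
            best_offset, best_length = pos - candidate, length
    return best_offset, best_length


def greedy_lz77(data, window_size=32, min_match=3):
    """Emit literal and match tokens, committing to each match as it is found."""
    tokens = []
    pos = 0
    while pos < len(data):
        offset, length = longest_match(data, pos, window_size)
        if length >= min_match:
            tokens.append(("match", offset, length))
            pos += length  # the chosen match is never reconsidered
        else:
            tokens.append(("literal", data[pos:pos + 1]))
            pos += 1
    return tokens


def decode(tokens):
    """Rebuild the original bytes from the token stream."""
    out = bytearray()
    for token in tokens:
        if token[0] == "literal":
            out += token[1]
        else:
            _, offset, length = token
            for _ in range(length):
                out.append(out[-offset])
    return bytes(out)


data = b"abcabcabcabd"
tokens = greedy_lz77(data)
assert decode(tokens) == data
```

Because each match is selected from what the window holds at that moment, a short match committed at an earlier stage determines where later searches begin, which is the kind of missed opportunity the paragraph above describes.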
For this reason, it is preferable that a string including a plurality of words can be encoded as a whole, even after the words of the string have been individually encoded word by word.