1. Field of the Invention
The present invention generally relates to apparatus and methods for compressing/decompressing text data containing characters expressed by a plurality of bytes, and also to a program recording medium. More specifically, the present invention is directed to apparatus/methods capable of compressing/decompressing Japanese text data, and also to a program recording medium.
2. Description of the Related Art
In recent years, the amount of electronic text processed or stored by personal computers has increased considerably with the spread of electronic mail and the like. For instance, many users handle several hundred to a thousand electronic mail messages per day, and it is not rare for more than several hundred megabytes of text data to be saved within a year.
Under such circumstances, data transmission time can be shortened and data storage capacity reduced by compressing the data, that is, by removing redundant information. Various data compression methods have been proposed and put into practical use. Compression methods are available for various types of data, including character codes, vector information, and images. These compression methods employ so-called "universal coding".
Now, several coding methods classified as "universal coding" will be briefly explained. It should be noted in the following description that, following the terminology of information theory, a single unit of data is expressed as a "character", and a plurality of "characters" connected to each other is expressed as a "string".
First, the arithmetic coding method will be summarized. There are two types of arithmetic coding, namely binary arithmetic coding, and multi-valued arithmetic coding, which deals with three or more values. In multi-valued arithmetic coding, the interval of the number line from 0 inclusive to 1 exclusive (expressed as [0, 1) hereinafter) is sequentially narrowed in accordance with the occurrence probability (occurrence frequency) of each of the characters constituting the data to be coded. Then, when the process for all the characters is completed, a numerical value indicating one point within the narrowed range is output as the code.
For example, assume that the five characters to be coded are a, b, c, d, and e, and that the occurrence probabilities of these five characters are 0.2, 0.1, 0.05, 0.15, and 0.5, respectively. Each character is allocated a range whose width corresponds to its occurrence probability (see FIG. 24).
Then, in the case that the string to be coded is "abe", as schematically illustrated in FIG. 25, first, the range [0, 1) is narrowed to the range [0, 0.2) with respect to the character "a". Subsequently, this range [0, 0.2) is subdivided into ranges according to the occurrence probabilities of the respective characters, and the range [0.04, 0.06) corresponding to the range of "b" is selected as the range for the string "ab". Furthermore, this range [0.04, 0.06) is subdivided into ranges according to the occurrence probabilities of the respective characters, and the range [0.05, 0.06) corresponding to the range of the next character "e" is selected as the range for the string "abe". Thereafter, the bit string below the decimal point, obtained when the position of an arbitrary point (for instance, the lower limit) within this final range is expressed as a binary number, is output as the coded result.
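The range-narrowing procedure described above can be sketched as follows, using the example alphabet and probabilities of FIG. 24. This is an illustrative sketch only; a practical arithmetic coder works with scaled integer arithmetic rather than floating point.

```python
# Occurrence probabilities of the five characters from FIG. 24.
probs = {"a": 0.2, "b": 0.1, "c": 0.05, "d": 0.15, "e": 0.5}

# Cumulative lower bound of each character's sub-range within [0, 1).
cum = {}
acc = 0.0
for ch, p in probs.items():
    cum[ch] = acc
    acc += p

def narrow(text):
    """Return the final [low, high) range for the given string."""
    low, width = 0.0, 1.0
    for ch in text:
        low = low + width * cum[ch]   # shift to the character's sub-range
        width = width * probs[ch]     # shrink by its probability
    return low, low + width

lo, hi = narrow("abe")
print(lo, hi)  # approximately [0.05, 0.06), the range derived for "abe"
```

Any binary fraction falling inside the final range, such as 0.000011... (binary), identifies the string "abe" to a decoder that knows the same probability table.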
It should be noted that arithmetic coding methods are further classified into the static coding system, the semi-adaptive coding system, and the adaptive coding system, depending on how the range is subdivided in response to the occurrence probabilities (occurrence frequencies). In the static coding system, the range is subdivided according to preset occurrence frequencies, irrespective of the actual occurrence frequencies of the respective characters. In the semi-adaptive coding system, the range is subdivided based on occurrence frequencies obtained by scanning all the characters beforehand. In the adaptive coding system, the occurrence frequencies are recalculated every time a character appears, and the ranges are set again. This arithmetic coding system is described in, for instance, "Text Compression" written by Bell, T. C., Cleary, J. G., and Witten, I. H. (1990), published by Prentice-Hall, Inc.
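The adaptive system can be sketched in a few lines: character counts are updated as each character is coded, so the sub-range widths used to subdivide [0, 1) track the data seen so far. The uniform initial counts below are an assumption made for illustration.

```python
from collections import Counter

# Start from uniform counts so every character has a nonzero probability.
counts = Counter({ch: 1 for ch in "abcde"})

def prob(ch):
    """Occurrence probability of ch under the current counts."""
    return counts[ch] / sum(counts.values())

for ch in "abe":
    p = prob(ch)     # probability that would be used to narrow the range
    counts[ch] += 1  # adaptive update after the character is coded

print(prob("e"))  # "e" has now been seen once, so its probability has grown
```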
On the other hand, another universal coding method called the "splay coding method" is also known in this technical field. In the splay coding method, every time a character is coded, the code tree (namely, a code table with a tree structure) is rearranged so that a shorter code is allocated to a character having a higher occurrence frequency. The splay coding method is described in more detail in, for example, "Application of Splay Trees to Data Compression" written by Jones, Douglas W., Commun. ACM, Vol. 31, No. 8, pages 996 to 1007, August 1988.
Also, another coding method called the blending splay coding method is known. In the blending splay coding method, a statistical model called the blending model is adopted in the splay coding method.
In the blending splay coding method, a code tree is prepared for each context. As schematically illustrated in FIG. 26, a context is the string ("ab") existing immediately before the character to be coded ("c"). In the blending splay coding method (blending model), the number of characters used as a context (the context order) is controlled according to how often the context appears in a context tree such as that shown in FIG. 27. That is to say, in general, when data with strong correlation between characters is coded, the higher the order of the context used, the higher the compression ratio. On the other hand, when data with weak correlation between characters is coded, using a higher-order context sometimes lowers the compression ratio instead of improving it. The blending model technique was developed to avoid this problem. In the blending model, the orders of the respective contexts are adjusted to the input data in such a manner that when a certain context appears frequently, the order of this context is increased, whereas when another context appears rarely, its order remains low.
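The idea of context modelling with order fallback can be sketched as follows. The sketch predicts the next character from the longest context seen so far, falling back to shorter contexts (and finally order 0) when a long context has not yet appeared; the data structure and names are illustrative assumptions, not the patented method or an actual splay-tree implementation.

```python
from collections import defaultdict, Counter

MAX_ORDER = 2
model = defaultdict(Counter)  # context string -> counts of the next character

def update(history, ch):
    """Record ch under every suffix of history up to MAX_ORDER characters."""
    for k in range(min(MAX_ORDER, len(history)) + 1):
        model[history[len(history) - k:]][ch] += 1

def predict(history):
    """Return the counts from the longest context that has been seen."""
    for k in range(MAX_ORDER, -1, -1):
        ctx = history[len(history) - k:]
        if model[ctx]:
            return ctx, model[ctx]
    return "", Counter()

hist = ""
for ch in "abcabc":
    update(hist, ch)
    hist = (hist + ch)[-MAX_ORDER:]

ctx, counts = predict("ab")
print(ctx, counts.most_common(1))  # after "ab", 'c' is the predicted character
```

A frequently seen context such as "ab" is used at full order, while an unseen context silently degrades to a shorter one, which is the behaviour the blending model aims for.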
Since the above-described coding methods were developed in cultural areas using the alphabet, one byte is handled as one character when data is compressed by these methods. As a result, there is a problem that when a text containing characters expressed by 2 bytes, e.g. Japanese, is compressed using these techniques, the compression ratio achieved is not as high as that for English text.
In other words, in a 2-byte character, only the combination of the two bytes is meaningful, and there is no correlation between the individual bytes constituting the character. As a consequence, a conventional compression method that processes a 2-byte character in units of one byte cannot attain a high compression ratio because, from the viewpoint of information theory, it compresses the data after distorting the information source (each 2-byte character is subdivided into single bytes).
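The unit-of-coding problem can be illustrated as follows: the same byte string yields different symbol statistics depending on whether it is modelled in 1-byte or 2-byte units. The byte values below are chosen purely for illustration of a 2-byte character code.

```python
# Three 2-byte characters, of which the first and the third are identical.
data = bytes([0x93, 0xFA, 0x96, 0x7B, 0x93, 0xFA])

one_byte_symbols = [data[i:i + 1] for i in range(len(data))]
two_byte_symbols = [data[i:i + 2] for i in range(0, len(data), 2)]

print(len(set(one_byte_symbols)))  # -> 4 distinct 1-byte symbols
print(len(set(two_byte_symbols)))  # -> 2 distinct 2-byte characters
```

A model working in 1-byte units must learn statistics over byte halves that carry no meaning individually, whereas the 2-byte view directly exposes the repetition of the same character.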
There is another problem in that it is difficult to achieve a high compression ratio by using contexts. In other words, since several thousand kinds of Chinese characters (kanji) are used in an ordinary Japanese text, when texts of substantially the same size are compressed using contexts of the same order, a larger number of contexts appears in the Japanese text than in the English text. Actually, when an 8 KB Japanese text and an 8 KB English text were compressed, the total numbers of 4-byte contexts were as follows: approximately 3,000 kinds of contexts appeared in the English text, and approximately 5,000 kinds of contexts appeared in the Japanese text. Also, Japanese texts to be compressed, such as electronic mail, often have relatively small sizes (approximately several A4-sized sheets). As a result, when a Japanese text is compressed, the process sometimes ends before sufficient statistical information on the respective contexts has been gathered. This may lower the compression ratio of the Japanese text.
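The measurement described above amounts to counting the kinds of fixed-length contexts that occur in a byte string, which can be sketched as follows (the sample text and the helper name are illustrative assumptions).

```python
def context_kinds(data: bytes, order: int = 4) -> int:
    """Count the distinct contexts of the given length appearing in data."""
    return len({data[i:i + order] for i in range(len(data) - order + 1)})

print(context_kinds(b"to be or not to be"))  # -> 13 distinct 4-byte contexts
```

The more kinds of contexts a text of a given size produces, the fewer occurrences each context accumulates, which is why short Japanese texts leave the model's statistics insufficiently trained.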