(1) Field of the Invention
The present invention relates to a data compressing method and a data decompressing method, and a data compressing apparatus and a data decompressing apparatus therefor.
(2) Description of Related Art
Recent computers tend to deal with various types of data such as graphic character codes, vector information, and image pictures, and the volume of data handled by computers is rapidly increasing. When a large volume of data is handled, redundant portions of the data are omitted to compress the data, so as to decrease the required storage capacity or to transmit the data faster. A universal coding has been proposed as a method capable of compressing various kinds of data in one manner.
Incidentally, this invention is applicable to various data and is not limited to the field of character code compression. In this specification, one word unit is called a character, and data composed of an arbitrary number of words is called a character string.
There are two types of coding: a dictionary coding, which utilizes the similarity of data sequences, and a statistical coding, which utilizes the frequency of occurrence of a data string. The above-mentioned universal coding is a representative method of the statistical coding.
A representative method of the universal coding is an arithmetic coding. The arithmetic coding generates a code adapted to the occurrence probability of each character without using a code table, and is considered able to compress at maximum efficiency if the occurrence probability of each character of the information source is known. The arithmetic coding is classified into a binary arithmetic coding and a multi-value arithmetic coding that handles more than two values.
Hereinafter, the multi-value arithmetic coding will be described.
In the multi-value arithmetic coding, a number line 0 ≤ P < 1 (described as [0, 1) hereinafter) is first divided into as many sections as there are distinct characters that have appeared (each referred to as a symbol hereinafter).
The width of each section is proportional to the occurrence probability of the corresponding symbol, and the sections are arranged in descending order of occurrence probability.
Next, the section corresponding to the occurring symbol is selected. When the next symbol occurs, the selected section is divided into as many sections as there are symbols, and the section corresponding to that next symbol is selected. The selected section is thus divided recursively.
The above process will be described in more detail by reference to illustrations showing a principle of the multi-value arithmetic coding in FIGS. 70(a) and 70(b).
FIG. 70(a) shows an example of frequencies of occurrence of symbols, while FIG. 70(b) shows an example of how the section of a symbol is divided.
The following example obtains the section of the graphic character string "abe" by this division.
First, the number line [0, 1) is divided into five sections for the characters a, b, c, d and e, as shown in FIG. 70(a).
Then, the section [0, 0.2) of the symbol "a", which occurs first, is selected. The selected section [0, 0.2) is further divided into five sections for all the symbols "a" through "e".
The section [0.04, 0.06) of the symbol "b", which occurs second, is selected and then divided into five sections for all the symbols "a" through "e". The section of the symbol "e", which occurs last, is selected, so that the section [0.05, 0.06) of the character string "abe" is finally obtained.
As above, by repeating the above process on all input data, it is possible to determine a section of a character string to be encoded. An arbitrary point in the section of the character string that is finally determined is represented in a binary notation, then outputted as a compressed code.
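The division described above can be sketched as follows. This is a minimal illustration, not the algorithm of any cited document. The ranges for "a", "b" and "e" follow from the sections quoted in the text, while the split between "c" and "d" is an assumption, since FIG. 70(a) is not reproduced here. The function names narrow and shortest_binary_point are hypothetical.

```python
from fractions import Fraction

# Cumulative ranges on [0, 1). "a", "b" and "e" are implied by the sections
# quoted in the text; the "c"/"d" split is an assumption for illustration.
RANGES = {
    "a": (Fraction(0), Fraction(2, 10)),
    "b": (Fraction(2, 10), Fraction(3, 10)),
    "c": (Fraction(3, 10), Fraction(4, 10)),  # assumed
    "d": (Fraction(4, 10), Fraction(5, 10)),  # assumed
    "e": (Fraction(5, 10), Fraction(1)),
}


def narrow(text, ranges=RANGES):
    """Return the section [low, high) selected by the symbols of text."""
    low, high = Fraction(0), Fraction(1)
    for symbol in text:
        width = high - low
        sym_low, sym_high = ranges[symbol]
        # Select the sub-section of the current section for this symbol.
        low, high = low + width * sym_low, low + width * sym_high
    return low, high


def shortest_binary_point(low, high):
    """Return the bits after the binary point of a short value in [low, high)."""
    bits, value, step = "", Fraction(0), Fraction(1, 2)
    while value < low:
        if value + step < high:
            value += step
            bits += "1"
        else:
            bits += "0"
        step /= 2
    return bits


low, high = narrow("abe")
print(float(low), float(high))           # section of "abe"
print(shortest_binary_point(low, high))  # one point inside it, in binary
```

With these ranges, narrow("abe") yields the section [0.05, 0.06), and the binary fraction 0.0000111 (7/128 = 0.0546875) is one point inside it. Exact fractions are used so that the section boundaries are not disturbed by floating-point rounding.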
The term "arithmetic coding" derives from the fact that a code word is expressed as a binary numerical value after the point, such as [0.11011...], which is obtained by calculation.
As a method to divide a section according to a frequency of occurrence as above, there are a static coding in which a section is divided according to a predetermined frequency of occurrence, not depending on an actual frequency of occurrence of a character string, a semi-adaptive coding in which a section is divided according to a frequency of occurrence obtained by scanning in advance all character strings, and an adaptive coding in which a frequency is calculated whenever a character occurs to reset the section for each character.
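The difference between these methods lies only in when the frequency table changes. The following is a minimal sketch of the adaptive case, under the assumptions that every symbol starts with a count of 1 and that sections are laid out in alphabet order rather than sorted by probability. The class name AdaptiveModel and its methods are hypothetical.

```python
from fractions import Fraction


class AdaptiveModel:
    """Frequency table whose sections are reset after every coded symbol."""

    def __init__(self, alphabet):
        # Assumption: initial count of 1 so that no section has zero width.
        self.counts = {symbol: 1 for symbol in alphabet}

    def section(self, symbol):
        """Return the [low, high) share of symbol on the number line [0, 1)."""
        total = sum(self.counts.values())
        low = 0
        for candidate, count in self.counts.items():
            if candidate == symbol:
                return Fraction(low, total), Fraction(low + count, total)
            low += count
        raise KeyError(symbol)

    def update(self, symbol):
        """Adaptive step: count the symbol that has just occurred."""
        self.counts[symbol] += 1


model = AdaptiveModel("abc")
print(model.section("b"))  # before any input
model.update("a")
print(model.section("b"))  # the section of "b" has moved and narrowed
```

In this sketch, a static coding would correspond to never calling update, and a semi-adaptive coding to filling the counts in one preliminary scan of the whole input and then leaving them fixed.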
Meanwhile, a method of compressing data in units of a byte (character) using the above multi-value arithmetic coding for file compression is described in, for example, the following two documents (1) and (2).
(1) "Arithmetic Coding for Data Compression," Commun. of ACM, Vol.30, No.6, pp.520-540 (1986)
(2) "An Adaptive Dependency Source Model for Data Compression Scheme," Commun. of ACM, Vol.32, No.1, pp.77-83
Document (1) teaches an actual algorithm of the multi-value arithmetic coding. The multi-value coding in document (1) is one of the methods called entropy coding, which codes and compresses data one character at a time. In the entropy coding, the probability of occurrence of a focused character is encoded in the multi-value coding, and the probability of occurrence of each character is sequentially updated whenever a character occurs, so that the entropy coding can code various data dynamically and adaptively. The detailed process of the multi-value coding is shown in the flowchart in FIG. 71(a).
Document (2) teaches a method of expressing a focused character with a conditional probability conditioned on the immediately preceding character and encoding the conditional probability in the multi-value coding so as to obtain a high compression rate. In the method of document (2), each conditional probability is sequentially updated, so that various data can be encoded dynamically and adaptively. In such multi-value coding, the process shown in the flowchart in FIG. 70(b) is performed.
Instead of the multi-value coding, a Dynamic Huffman Coding, which is a modification of Huffman coding, has also been proposed (refer to "Variations on a Theme by Huffman", IEEE Trans. Inform. Theory, Vol.24, No.6, 1978, or "Design and Analysis of Dynamic Huffman Codes", Journal of ACM, Vol.34, No.4, 1987). The Dynamic Huffman Coding has a coding efficiency inferior to that of the multi-value coding, which requires a longer processing time. For this reason, there is no actual application of a method of encoding a conditional probability in the Dynamic Huffman Coding.
FIG. 72 shows an example of an algorithm of the multi-value arithmetic encoding and decoding.
Different from the arithmetic coding, there is also a Splay-Tree coding (refer to, for example, "Application of Splay Tree to Data Compression" by Douglas W. Jones, Commun. of ACM, Vol.31, No.8, pp.996-1007).
The Splay-Tree coding uses a code table in a tree structure (referred to as a code tree hereinafter) as shown in FIG. 73(a), in which a symbol is entered at an end of the code tree (generally called a leaf), and the path from the top of the code tree (generally called a root) to the leaf at which the input data is stored is outputted as a code word.
More specifically, assuming the path goes down from the root to the leaf of the code tree, "1" is allocated to the code word when the path branches to the right, and "0" is allocated when the path branches to the left.
In the example shown in FIG. 73(a), the code of a symbol A is [10110] and the code of a symbol B is [001].
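The rule for reading a code word off a code tree can be sketched as follows. The tree below is a hypothetical arrangement chosen only so that the symbols A and B receive the codes quoted above. The other leaves (p, q, r, s, t, u) are placeholders and are not taken from FIG. 73(a). The function name code_words is likewise an assumption.

```python
# A code tree as nested (left, right) pairs; a string is a leaf.
# Hypothetical layout: only the positions of "A" and "B" reflect the text.
TREE = ((("p", "B"), "q"), (("r", ("s", ("A", "t"))), "u"))


def code_words(tree, prefix=""):
    """Map every leaf symbol to its root-to-leaf path (0 = left, 1 = right)."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    table = code_words(left, prefix + "0")
    table.update(code_words(right, prefix + "1"))
    return table


table = code_words(TREE)
print(table["A"], table["B"])
```

Because every symbol sits at a leaf, no code word is a prefix of another, so a decoder can follow the received bits down from the root until it reaches a leaf.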
If the code length is to be changed (that is, if the code is updated), the coded leaf is exchanged with another leaf or with a branching point on the code tree (called a node).
FIG. 73(b) shows an example of the above-mentioned code updating. As shown in FIG. 73(b), as input data, symbols A, B, C and D are initially stored in leaves of the code tree.
The node of the symbol A and the node of the symbol C are exchanged, and then the higher-order node D of the symbol A and the node of a symbol E are exchanged, as shown in FIG. 73(c). As a result, the code of the symbol A is updated from [10110] to [110].
The above case dynamically encodes the probability of occurrence of each word in a variable-length coding. To further increase the compression rate, a conditional probability of occurrence, which incorporates the dependency between an input character and the immediately preceding character, is encoded in a dynamic variable-length coding.
This method is a statistical coding using a statistical characteristic of data. As shown in FIG. 74, this coding is performed in a process made up of two stages, that is, a context collecting step 511 and a dynamic variable-length coding step 512.
At the context collecting step, contexts, that is, the relations between a character and the characters before and after it in a character string, are collected from the input data, and a context tree structure as shown in FIG. 75(b) is generated. A conditional probability is then determined and encoded in the dynamic variable-length coding.
The above conditional probability is determined by counting the number of occurrences every time a character string passes through the character at each node of the context tree shown in FIG. 75(b).
Context collecting methods for determining a conditional probability are chiefly classified into the following two methods. Incidentally, the number of characters of a condition (a context) will be hereinafter referred to as a degree (refer to "Data Compression Using Adaptive Coding and Partial String Matching", John G. Cleary et al., IEEE, Vol.COM-32, No.4, April 1984, pp.396-402).
(1) Fixed-Degree Context Collecting Method
This method expresses a condition of a conditional probability with a fixed character number.
For instance, in a degree-2 context, the context of a character following the two immediately preceding characters is collected to encode a conditional probability p(y | x_1, x_2), where y is the focused character to be encoded, and x_1 and x_2 are the first and second characters immediately preceding the focused character, respectively.
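A fixed-degree collection of this kind can be sketched as follows. The estimate used here is a plain relative frequency, which is an assumption. The text does not state how the probability is derived from the counts. The function names collect_contexts and conditional_probability are hypothetical.

```python
from collections import Counter, defaultdict


def collect_contexts(text, degree=2):
    """Count how often each character follows each degree-character context."""
    counts = defaultdict(Counter)
    for i in range(degree, len(text)):
        context = text[i - degree:i]   # the preceding `degree` characters
        counts[context][text[i]] += 1  # the focused character y
    return counts


def conditional_probability(counts, context, y):
    """Estimate p(y | context) as a relative frequency (0.0 if unseen)."""
    seen = counts.get(context)
    if not seen:
        return 0.0
    return seen[y] / sum(seen.values())


counts = collect_contexts("abcabdabc", degree=2)
print(conditional_probability(counts, "ab", "c"))
print(conditional_probability(counts, "ab", "d"))
```

In the sample string, the context "ab" is followed by "c" twice and by "d" once, so the two estimates are 2/3 and 1/3. A context that has never been seen yields 0.0 in this sketch, which illustrates why a rarely occurring context gives an unreliable estimate.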
(2) Blending Context Collecting Method
In the above fixed-degree context collecting method, if the immediately preceding conditional character string rarely occurs, the estimate of the conditional probability tends to be inaccurate. Conversely, if the immediately preceding conditional character string occurs frequently, the estimate of the conditional probability tends to be accurate, and there remains room to increase the degree.
In general, data in which the correlation between characters becomes stronger as the degree increases can be compressed at a high rate with a higher degree. However, data in which the correlation between characters becomes weaker as the degree increases is compressed poorly with a higher degree.
This problem can be solved by blending the contexts (blending of the degrees). In this method, the degree of the immediately preceding character string is not fixed. The degree is increased if the context occurs frequently, and is kept low if the context occurs rarely. In this way, the degree is adapted to the input data.
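The idea of choosing the degree according to how often a context has occurred can be sketched as follows. This is a simplified threshold rule given only for illustration. It is not the blending procedure of the cited document, and the parameter min_count as well as the function names collect_all and pick_degree are assumptions.

```python
from collections import Counter, defaultdict


def collect_all(text, max_degree):
    """Count the followers of every context of degree 0 to max_degree."""
    counts = defaultdict(Counter)
    for i, ch in enumerate(text):
        for degree in range(min(i, max_degree) + 1):
            counts[text[i - degree:i]][ch] += 1
    return counts


def pick_degree(counts, history, max_degree, min_count=2):
    """Return the highest degree whose context occurred at least min_count times."""
    for degree in range(min(len(history), max_degree), 0, -1):
        context = history[len(history) - degree:]
        # .get avoids creating empty entries for unseen contexts.
        occurrences = sum(counts.get(context, Counter()).values())
        if occurrences >= min_count:
            return degree
    return 0  # fall back to the degree-0 (context-free) statistics


counts = collect_all("abcabdabc", max_degree=2)
print(pick_degree(counts, "ab", max_degree=2))  # frequent context
print(pick_degree(counts, "bd", max_degree=2))  # rare context
```

For the sample string, the frequent context "ab" keeps degree 2, whereas the rare context "bd" falls back to degree 0 because neither "bd" nor "d" has occurred twice.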
However, the statistical coding that applies an arithmetic coding as the dynamic variable-length coding needs to recompute the accumulated frequency of all data inputted so far and to divide the number line [0, 1) again every time data is inputted. This requires an enormous volume of arithmetic processing, so that it is impossible to increase the processing rate.
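The cost referred to above can be made concrete with a small sketch. It assumes a straightforward implementation in which the whole accumulated-frequency table is rebuilt for every input symbol. Symbols are represented as integer indices, every count starts at 1, and the function name adaptive_sections is hypothetical.

```python
from itertools import accumulate


def adaptive_sections(data, alphabet_size):
    """Return (sections, work) for a naive adaptive model.

    sections holds one (low, high, total) triple per input symbol, and
    work counts the table entries recomputed along the way.
    """
    counts = [1] * alphabet_size  # assumed initial count of 1 per symbol
    sections, work = [], 0
    for symbol in data:
        # The accumulated frequencies are recomputed from scratch each time.
        cumulative = [0] + list(accumulate(counts))
        work += alphabet_size
        sections.append(
            (cumulative[symbol], cumulative[symbol + 1], cumulative[-1])
        )
        counts[symbol] += 1  # adaptive update after coding the symbol
    return sections, work


sections, work = adaptive_sections([0, 0, 1], alphabet_size=3)
print(sections, work)
```

In this sketch the work grows as the product of the input length and the alphabet size, so a byte-oriented alphabet of 256 symbols already requires 256 table entries to be recomputed for each input byte.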