1. Field of the Invention
The present invention relates generally to a data compression/decompression method of encoding data in a variety of forms and decoding the compressed data, and also to a data encoding apparatus and a data decoding apparatus. The present invention relates more particularly to a data compression/decompression method, a data encoding apparatus, and a data decoding apparatus for encoding and decoding data based on a statistical compression method.
2. Description of the Related Art
With the rapid advancement of computers in recent years, large volumes of data are handled by computers, and it is common practice to compress the data in order to reduce transmission time and to use storage units efficiently.
A variety of coding methods are used for compressing data. One class of coding method, known as universal coding, is applicable to various kinds of data without being limited to particular kinds such as character codes, vector data and images. Known types of universal coding are dictionary-based coding, which makes use of similarity between character strings, and statistical coding, which translates each character's occurrence probability into a sequence of bits. Note that in the following discussion one unit of the data is expressed as a "character", and a plurality of "characters" connected to each other is expressed as a "character string".
Typical statistical coding methods are Huffman coding and arithmetic coding. Before Huffman coding is described in detail, a code tree (defined as a data structure) used when generating the Huffman codes will be explained.
FIG. 21 illustrates one example of a code tree. Nodes are the points marked with a circle (○) and a square (□). A line segment connecting the nodes is called a "branch". The node located in the highest position is called a "root". Further, a lower node Y connected via a "branch" to a certain node X is termed a "child" of the node X. Conversely, the node X is referred to as a "parent" of the node Y. A node having no "child" is called a "leaf", and a particular character corresponds to each "leaf". Further, the nodes excluding the "leaves" are referred to as "internal nodes", and the number of "branches" from the "root" down to each "node" is called a level.
When a character is encoded by use of the code tree, the path extending from the "root" down to the target "leaf" (corresponding to the character to be encoded) is outputted as a code. More specifically, at each of the nodes from the "root" down to the target "leaf", "1" is outputted when branching off to the left, while "0" is outputted when branching off to the right. For instance, in the code tree illustrated in FIG. 21, a code "00" is outputted for a character A corresponding to a "leaf" of a node number 7, and a code "011" is outputted for a character B corresponding to a "leaf" of a node number 8.
When a code is decoded, the nodes are traced from the "root" in accordance with the value of each bit of the code to be decoded, and the character corresponding to the "leaf" that is reached is outputted.
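The encoding and decoding walks described above can be sketched as follows. This is only an illustrative sketch: the tree `TREE` is hypothetical (it is not the tree of FIG. 21, whose shape is not reproduced here) and was chosen merely so that A receives "00" and B receives "011"; the characters C and D and the function names are likewise assumptions.

```python
# Hypothetical tree: a leaf is a str, an internal node is a (left, right) tuple.
# Not the tree of FIG. 21; shaped only so that A -> "00" and B -> "011".
TREE = ("C", (("B", "D"), "A"))


def build_codes(node, prefix=""):
    """Return {character: code} by walking every root-to-leaf path."""
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    codes = build_codes(left, prefix + "1")          # left branch emits "1"
    codes.update(build_codes(right, prefix + "0"))   # right branch emits "0"
    return codes


def encode(text, tree=TREE):
    """Concatenate the root-to-leaf path of each character."""
    codes = build_codes(tree)
    return "".join(codes[ch] for ch in text)


def decode(bits, tree=TREE):
    """Trace the tree bit by bit; emit a character at each leaf reached."""
    out, node = [], tree
    for bit in bits:
        left, right = node
        node = left if bit == "1" else right
        if isinstance(node, str):   # reached a leaf
            out.append(node)
            node = tree             # restart from the root
    return "".join(out)
```

Because no code is a prefix of another in such a tree, the decoder needs no separators between codes: `decode(encode("ABCD"))` returns the original string.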
According to the Huffman coding, the above-described code tree is generated by the following procedures (called a Huffman algorithm).
(1) Leaves (nodes) corresponding to the individual characters are prepared, and the occurrence frequency of the character corresponding to each leaf is recorded.
(2) One new node is created for two nodes having the minimum occurrence frequency, and this created node is connected via branches to the two nodes. Further, a sum of the occurrence frequencies of the two nodes connected via the branches is recorded as an occurrence frequency of the newly created node.
(3) The processing set forth in item (2) is executed for the remaining nodes, i.e. the nodes not having parents, until the number of remaining nodes becomes 1.
In the code tree generated by such procedures, a code is allocated to each character with a code length that becomes shorter as the occurrence frequency of the character becomes higher. Therefore, when the coding is performed by use of the code tree, the data can be compressed.
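Steps (1) to (3) can be sketched with a priority queue as follows. This is a minimal sketch, not the implementation of any particular apparatus: the function names and the tuple representation of internal nodes are assumptions, and the frequencies used in the usage note are purely illustrative.

```python
import heapq
from itertools import count


def huffman_tree(freqs):
    """Build a code tree from {character: frequency} following steps (1)-(3).

    A leaf is the character itself; an internal node is a (child, child) tuple.
    """
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    # Step (1): one leaf per character, with its frequency recorded.
    heap = [(f, next(tiebreak), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    # Step (3): repeat until only one parentless node (the root) remains.
    while len(heap) > 1:
        # Step (2): join the two lowest-frequency nodes under a new parent
        # whose frequency is the sum of the two.
        f1, _, n1 = heapq.heappop(heap)
        f2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (n1, n2)))
    return heap[0][2]


def code_lengths(node, depth=0):
    """Return {character: code length}, i.e. the level of each leaf."""
    if not isinstance(node, tuple):
        return {node: depth}
    lengths = {}
    for child in node:
        lengths.update(code_lengths(child, depth + 1))
    return lengths
```

For the illustrative frequencies `{"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}`, `code_lengths(huffman_tree(...))` gives a length of 1 for "a", 3 for "b", "c" and "d", and 4 for "e" and "f", so the more frequent characters receive the shorter codes.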
The coding using the Huffman codes is further classified into static coding, semi-adaptive coding, and adaptive coding.
According to the static coding, normally, the occurrence frequency of each character appearing within the data to be encoded is first counted and the code tree is created based on the counted occurrence frequency in the above-described procedures. Next, the relevant data is encoded by use of the code tree, and an encoded result is outputted as a piece of encoded data together with data representing a configuration of the code tree. That is, code trees having leaves which correspond to the characters to be encoded are prepared according to the static coding and the coding is then executed using those code trees. Then, on the decoding side, decoding is carried out by use of the code trees outputted together with the codes.
According to the semi-adaptive coding, as in the case of the static coding, code trees having leaves relative to all of the characters to be encoded are prepared. However, the code tree prepared first is generated by setting appropriate initial values for the occurrence frequencies of the individual characters. In the semi-adaptive coding, the code tree is then modified so as to assume a configuration corresponding to the occurrence frequency of each character, which changes in accordance with the input data.
As explained above, code trees having leaves relative to all the characters to be encoded must be prepared in both the static coding and the semi-adaptive coding. In contrast, in the adaptive coding, code trees are prepared in which not all characters have corresponding leaves, i.e., the code trees have only leaves relative to some characters and a leaf relative to non-registered characters. According to the adaptive coding, if the leaf pertaining to a character to be encoded does not exist in the code tree, the code for the non-registered characters is outputted, followed by the character itself (or a code into which the character is encoded based on a predetermined coding rule). Thereafter, a leaf relative to that character is added to the code tree.
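The handling of non-registered characters in the adaptive coding can be sketched as follows. This is a deliberately simplified sketch: it emits symbolic tokens instead of bits and keeps a plain set of registered characters in place of a code tree, and the names `ESC`, `REG`, `encode_tokens` and `decode_tokens` are hypothetical.

```python
ESC = "ESC"  # hypothetical marker: code of the non-registered-character leaf
REG = "REG"  # hypothetical marker: code of a leaf already in the tree


def encode_tokens(text):
    """Emit (ESC, ch) the first time ch occurs, (REG, ch) afterwards."""
    registered = set()  # stands in for the leaves present in the code tree
    tokens = []
    for ch in text:
        if ch in registered:
            tokens.append((REG, ch))
        else:
            # Output the non-registered code followed by the raw character,
            # then add a leaf for the character.
            tokens.append((ESC, ch))
            registered.add(ch)
    return tokens


def decode_tokens(tokens):
    """Mirror the encoder: register a character when it arrives after ESC."""
    registered = set()
    out = []
    for kind, ch in tokens:
        if kind == ESC:
            registered.add(ch)
        elif ch not in registered:
            raise ValueError(f"character {ch!r} was never registered")
        out.append(ch)
    return "".join(out)
```

Because the decoder registers each character at the same point in the stream as the encoder, both sides always hold the same set of leaves and no code tree needs to be transmitted.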
Note that, because the updating is performed frequently in the semi-adaptive coding and the adaptive coding, the code tree is normally formed so that an FGK (Faller-Gallager-Knuth) algorithm can be applied to update the configuration of the code tree. That is, as illustrated in FIG. 22, the code tree is formed so that the recorded occurrence frequency is larger at lower levels (i.e., at nodes closer to the root) and, among the nodes at the same level, is larger at more leftward nodes.
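The ordering described above can be checked with a short sketch such as the following. It assumes, hypothetically, that the tree is given as a list of levels from the root downward, each level listing the recorded frequencies from left to right; the frequencies in the usage note are illustrative and are not those of FIG. 22.

```python
def is_ordered(levels):
    """Return True if frequencies never increase when the nodes are read
    level by level from the root, and left to right within each level."""
    flat = [freq for level in levels for freq in level]
    return all(a >= b for a, b in zip(flat, flat[1:]))
```

For example, `is_ordered([[10], [6, 4], [3, 3]])` is true, whereas `is_ordered([[10], [4, 6]])` is false because the larger frequency lies to the right within the same level.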
According to the Huffman coding, when one character is encoded, a code consisting of an integral number of bits is generated. In contrast, according to the arithmetic coding, a fractional number of bits can be allocated to one character. According to the arithmetic coding, the interval on the number line that is 0 or larger but less than 1 (hereinafter written as [0, 1)) is sequentially narrowed in accordance with the occurrence probability (occurrence frequency) of each character constituting the data to be encoded. Then, when the processes for all characters are finished, a numerical value representing one point within the narrowed interval is outputted as a code.
For example, suppose that there are five characters a, b, c, d and e as encoding targets, and that the occurrence probabilities of these characters are 0.2, 0.1, 0.05, 0.15 and 0.5, respectively. In this case, as shown in FIG. 23, an interval having a width corresponding to its occurrence probability is allocated to each character. Then, if the character string to be encoded is "abe", as schematically shown in FIG. 24, the interval [0, 1) is first narrowed down to the interval [0, 0.2) for the character "a". Next, this interval [0, 0.2) is segmented into intervals corresponding to the occurrence probabilities of the respective characters, and the interval [0.04, 0.06) corresponding to the next character "b" is selected as the interval of the character string "ab". Then, this interval [0.04, 0.06) is further segmented in the same manner, and the interval [0.05, 0.06) corresponding to the next character "e" is selected as the interval of the character string "abe". Thereafter, the position of an arbitrary point within that interval, e.g., its lower limit, is expressed in binary, and the bit string following the binary point is outputted as the encoded result.
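The interval narrowing in this example can be reproduced with the following sketch. It assumes that the sub-intervals are laid out from 0 upward in the order a, b, c, d, e, which is consistent with the intervals quoted above; the names `PROBS`, `cumulative` and `narrow` are hypothetical, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction as F

# Occurrence probabilities from the example, as exact fractions.
PROBS = {"a": F(20, 100), "b": F(10, 100), "c": F(5, 100),
         "d": F(15, 100), "e": F(50, 100)}


def cumulative(probs):
    """Map each character to its [low, high) sub-interval of [0, 1).

    Assumes the sub-intervals are stacked in dictionary order from 0.
    """
    ranges, low = {}, F(0)
    for ch, p in probs.items():
        ranges[ch] = (low, low + p)
        low += p
    return ranges


def narrow(text, probs=PROBS):
    """Return the [low, high) interval that represents `text`."""
    ranges = cumulative(probs)
    low, high = F(0), F(1)
    for ch in text:
        width = high - low
        c_low, c_high = ranges[ch]
        # Keep only the part of the current interval allotted to ch.
        low, high = low + width * c_low, low + width * c_high
    return low, high
```

Calling `narrow("a")`, `narrow("ab")` and `narrow("abe")` yields [0, 1/5), [1/25, 3/50) and [1/20, 3/50), i.e. [0, 0.2), [0.04, 0.06) and [0.05, 0.06). The final width, 0.01, equals the product 0.2 × 0.1 × 0.5 of the three probabilities, so a point inside it can be identified with roughly -log2(0.01) ≈ 6.64 bits.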
According to the arithmetic coding, in order to further enhance the compression effect, it is also practiced to obtain the occurrence probability of each subject character in association with the character string (context) occurring immediately before the subject character. In this case, the coding is, as schematically illustrated in FIG. 25, attained by an apparatus including a context modeling unit and a statistical coder. The context modeling unit stores the character strings that have occurred and counts the number of occurrences by use of a context tree as shown in FIG. 26, thus obtaining a probability that depends on the preceding symbol (character). The statistical coder generates a code having a length corresponding to the probability obtained by the context modeling unit. Note that the statistical coder uses the probability before being updated when generating the code.
For instance, as schematically shown in FIG. 27, if source data with characters arranged in a sequence such as "abc" is inputted, the context modeling unit outputs to the statistical coder a probability p(c | a, b) at which "c" defined as a coding target character occurs subsequent to "ab" defined as a context. Thereafter, the context collecting unit recalculates the conditional probability of each character on the basis of the fact that "c" again occurs subsequent to "ab".
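The interplay between the context model and the statistical coder can be sketched as follows. This is a simplified sketch under several assumptions: a fixed context of two characters, a dictionary of counters in place of the context tree of FIG. 26, and hypothetical names `ContextModel` and `model_stream`. A probability of 0 is returned for a character not yet seen in a context, whereas a practical coder would need some fallback for that case.

```python
from collections import Counter, defaultdict


class ContextModel:
    """Counts how often each character follows each context string."""

    def __init__(self, order=2):
        self.order = order                   # number of context characters
        self.counts = defaultdict(Counter)   # context -> character counts

    def probability(self, context, ch):
        """Conditional probability of ch after context, from the counts."""
        seen = self.counts[context]
        total = sum(seen.values())
        return seen[ch] / total if total else 0.0

    def update(self, context, ch):
        """Record one more occurrence of ch after context."""
        self.counts[context][ch] += 1


def model_stream(text, order=2):
    """Return (context, character, probability) for each coded character.

    The probability is read BEFORE the counts are updated, as stated above.
    """
    model = ContextModel(order)
    out = []
    for i in range(order, len(text)):
        context, ch = text[i - order:i], text[i]
        out.append((context, ch, model.probability(context, ch)))
        model.update(context, ch)
    return out
```

For the input "abcabcabd", the first "c" after "ab" is reported with probability 0.0, the second with 1.0, and the final "d" after "ab" with 0.0, since only "c" had been counted in that context up to that point.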
A variety of concrete processing procedures are known for the context collecting process. Such procedures are roughly classified into a type that fixes the degree of the context (the number of characters in the context) and a type that does not fix it (blending contexts). According to the latter method, if a certain context is likely to occur, the degree of that context is increased, whereas if a certain context is unlikely to occur, the degree remains low. Thus, the degree of each context changes adaptively in accordance with the input data.
The Huffman coding, though capable of compressing data at high speed, has the defect that a high compression rate cannot be obtained when ordinary data is the target. In contrast, the arithmetic coding that makes use of a context model can attain a high compression rate. However, complicated calculation is required for performing the compression, and hence there is the defect that the data cannot be compressed at high speed. Further, the data compression rate can be enhanced by employing a context model of a higher degree, but a large storage capacity is then needed for storing the data on the respective contexts. For this reason, the prior art data compression apparatus can, as a matter of fact, hold data on only a limited number of contexts and is therefore incapable of fully exploiting the performance of the context model.